Episodes
-
PPO (Proximal Policy Optimization) is a reinforcement learning algorithm that balances simplicity, stability, sample efficiency, general applicability, and strong performance. PPO replaced TRPO (Trust Region Policy Optimization) as the default algorithm at OpenAI due to its simpler implementation and greater computational efficiency, while maintaining comparable performance. PPO approximates TRPO by clipping the probability ratio between the new and old policies in its surrogate objective and relying on first-order optimization, avoiding TRPO's computationally intensive Hessian computations and hard KL divergence constraint. The clipping mechanism constrains policy updates, prevents excessively large changes, and promotes stability during training. The clipped surrogate objective also lets each batch of experience be reused for several epochs of updates, making PPO sample efficient, especially for complex tasks.
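As a concrete illustration, here is a minimal PyTorch sketch of the clipped surrogate loss; the tensor names and the 0.2 clipping range are illustrative defaults, not tied to any particular implementation.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (to be minimized)."""
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio is clipped, the same rollout batch can safely be reused for several epochs of minibatch updates.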
-
Andrej Karpathy's tech talk (YouTube) provides a comprehensive yet accessible overview of Large Language Models (LLMs) like ChatGPT. The talk details the process of building an LLM, including pre-training, data processing, and neural network training. Key stages include downloading and filtering internet text, tokenizing the text, and training neural networks to model token relationships. The discussion covers the distinction between base models and assistants, highlighting fine-tuning to create conversational AIs. It also addresses challenges like hallucinations and mitigation strategies, such as knowledge-based refusal and tool use. The talk further explores reinforcement learning and the emergence of "thinking" in models.
-
Andrej Karpathy's talk, "Intro to Large Language Models," demystifies LLMs by portraying them as systems with two key components: a parameters file (the weights of the neural network) and a run file (the code that runs the network). The creation of these files starts with a computationally intensive training process, where a large amount of internet text is compressed into the model's parameters. The scaling laws show that LLM performance depends on the number of parameters and the amount of training data. Karpathy reviews how LLMs are evolving to incorporate external tools and multiple modalities. He presents his view of LLMs as the kernel process of an emerging operating system and also discusses the security challenges of LLMs, including jailbreak attacks, prompt injection attacks, and data poisoning.
-
DeepSeek-V2 is a Mixture-of-Experts (MoE) language model that balances strong performance with economical training and efficient inference. It uses a total of 236B parameters, with 21B activated for each token, and supports a context length of 128K tokens. Key architectural innovations include Multi-Head Latent Attention (MLA), which compresses the KV cache for faster inference, and DeepSeekMoE, which enables economical training through sparse computation. Compared to DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts maximum generation throughput by 5.76 times. It is pre-trained on 8.1T tokens of high-quality data and further aligned through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
-
Matrix calculus is essential for understanding and implementing deep learning. It provides the mathematical tools to optimize neural networks using gradient descent. The Jacobian matrix, a key concept, organizes partial derivatives of vector-valued functions. The vector chain rule simplifies derivative calculations in nested functions, common in neural networks. Automatic differentiation, used in modern libraries, relies on these principles. Grasping matrix calculus allows for a deeper understanding of model training and the implementation of custom neural networks.
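A small worked example of these ideas, assuming a toy function f(x) = relu(W·x): by the vector chain rule its Jacobian is diag(relu′(W·x))·W, which the snippet below checks against finite differences (shapes and the nonlinearity are chosen purely for illustration).

```python
import numpy as np

def f(W, x):
    # f(x) = relu(W @ x), a tiny stand-in for one neural-network layer.
    return np.maximum(W @ x, 0.0)

def analytic_jacobian(W, x):
    # Vector chain rule: J = diag(relu'(W @ x)) @ W.
    mask = (W @ x > 0).astype(float)
    return mask[:, None] * W

def numeric_jacobian(W, x, eps=1e-6):
    # Central finite differences, column by column.
    J = np.zeros((W.shape[0], x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(W, x + dx) - f(W, x - dx)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W, x = rng.normal(size=(3, 4)), rng.normal(size=4)
assert np.allclose(analytic_jacobian(W, x), numeric_jacobian(W, x), atol=1e-5)
```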
-
'S1' refers to simple test-time scaling, an efficient approach to enhance language model reasoning with minimal resources. It involves training a model on a small, carefully curated dataset like s1K and using budget forcing to control test-time compute. Budget forcing enforces a maximum number of thinking tokens by appending an end-of-thinking delimiter, or a minimum by suppressing that delimiter and appending the word "Wait" so the model keeps reasoning. The s1-32B model, developed using this method, outperforms other models on competition math questions. The approach combines a curated dataset with a straightforward test-time technique, leading to strong reasoning performance and effective test-time scaling.
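A hedged sketch of the budget-forcing loop; the `generate` callable and the end-of-thinking delimiter are assumptions for illustration, not the s1 reference implementation.

```python
END_THINK = "</think>"  # assumed end-of-thinking delimiter

def budget_forced_generate(generate, prompt, min_think_tokens, max_think_tokens):
    # Phase 1: let the model "think" until the delimiter or the maximum budget.
    trace = generate(prompt, stop=END_THINK, max_new_tokens=max_think_tokens)
    # Minimum budget: if thinking ended too early, suppress the delimiter and
    # append "Wait" so the model keeps reasoning.
    if len(trace.split()) < min_think_tokens:
        trace = generate(prompt + trace + " Wait",
                         stop=END_THINK, max_new_tokens=max_think_tokens)
    # Maximum budget: append the delimiter and let the model write its answer.
    return generate(prompt + trace + END_THINK)
```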
-
Reinforcement Learning from Human Feedback (RLHF) incorporates human preferences into AI systems, addressing problems where specifying a clear reward function is difficult. The basic pipeline involves training a language model, collecting human preference data to train a reward model, and optimizing the language model with an RL optimizer using the reward model. Techniques like KL divergence are used for regularization to prevent over-optimization. RLHF is a subset of preference fine-tuning techniques. It has become a crucial technique in post-training to align language models with human values and elicit desirable behaviors.
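A minimal sketch of the KL-regularized objective commonly used at the RL stage; the function names and the beta value are illustrative.

```python
import torch

def rlhf_objective(reward_model_score, logprob_policy, logprob_ref, beta=0.1):
    """Reward-model score minus a KL penalty toward the reference model.

    The penalty (a sample-based KL estimate) discourages the policy from
    drifting far from the reference model and over-optimizing the reward."""
    kl_penalty = logprob_policy - logprob_ref
    return reward_model_score - beta * kl_penalty
```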
-
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that enhances mathematical reasoning in large language models (LLMs). It is like training students in a study group, where they learn by comparing answers without a tutor. GRPO eliminates the need for a critic model, unlike Proximal Policy Optimization (PPO), making it more resource efficient. It calculates advantages based on relative rewards within the group and directly adds KL divergence to the loss function. GRPO uses both outcome and process supervision, and can be applied iteratively, further enhancing performance. This approach is effective at improving LLMs' math skills with reduced training resources.
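The core of the critic-free trick fits in a few lines: sample a group of completions per prompt and normalize their rewards within the group, as in the sketch below (group size and reward values are illustrative).

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards across a group of completions
    sampled for the same prompt, so no learned value/critic model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions for one prompt, scored 1 if correct, 0 otherwise.
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```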
-
Model/Knowledge distillation is a technique to transfer knowledge from a cumbersome model, like a large neural network or an ensemble of models, to a smaller, more efficient model. The smaller model is trained using "soft targets," which are the class probabilities produced by the larger model, rather than the usual "hard targets" of correct class labels. These soft targets contain more information, including how the cumbersome model generalizes and the similarity structure of the data. A temperature parameter is used to soften the probability distributions, making the information more accessible for the smaller model to learn. This process improves the smaller model's generalization ability and efficiency. Distillation allows the smaller model to achieve performance comparable to the larger model with less computation.
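A minimal PyTorch sketch of the soft-target loss with a temperature (T = 2 is illustrative); in practice it is usually combined with a standard hard-label cross-entropy term.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy against temperature-softened teacher probabilities.

    The T**2 factor keeps gradient magnitudes comparable across temperatures,
    as suggested in Hinton et al.'s distillation paper."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return -(soft_teacher * log_student).sum(dim=-1).mean() * (T * T)
```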
-
Qwen2.5 is a series of large language models (LLMs) with significant improvements over previous models, focusing on efficiency, performance, and long sequence handling. Key architectural advancements include Grouped Query Attention (GQA) for better memory management, Mixture-of-Experts (MoE) for enhanced capacity, and Rotary Positional Embeddings (RoPE) for effective long-sequence modeling. Qwen2.5 uses two-phase pre-training and progressive context length expansion to enhance long-context capabilities, along with techniques like YARN, Dual Chunk Attention (DCA), and sparse attention. It also features an expanded tokenizer and uses SwiGLU activation, QKV bias and RMSNorm for stable training.
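For intuition, here is a generic sketch of Grouped Query Attention, not Qwen's actual implementation: several query heads share each key/value head, which shrinks the KV cache (shapes are illustrative).

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: [B, Hq, T, D]; k, v: [B, Hkv, T, D] with Hq a multiple of Hkv."""
    group = q.shape[1] // k.shape[1]
    # Broadcast each KV head to the query heads in its group.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)
```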
-
The Qwen2 series of large language models introduces several key enhancements over its predecessors. It employs Grouped Query Attention (GQA) and Dual Chunk Attention (DCA) for improved efficiency and long-context handling, using YARN to rescale attention weights. The models utilize fine-grained Mixture-of-Experts (MoE) and have a reduced KV cache size. Pre-training data was significantly increased to 7 trillion tokens with more code, math, and multilingual content, and post-training involves supervised fine-tuning (SFT) and direct preference optimization (DPO). These changes yield enhanced performance, especially in coding, mathematics, and multilingual tasks, as well as better behavior in long-context scenarios.
-
Qwen-1, also known as QWEN, is a series of large language models that includes base pretrained models, chat models, and specialized models for coding and math. These models are trained on a massive dataset of 3 trillion tokens using byte pair encoding for tokenization, and they feature a modified Transformer architecture with untied embeddings and rotary positional embeddings. The chat models (QWEN-CHAT) are aligned to human preferences using Supervised Finetuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). QWEN models have strong performance, outperforming many open-source models, but they generally lag behind models like GPT-4.
-
OpenAI's o1 is a generative pre-trained transformer (GPT) model designed for enhanced reasoning, especially in science and math. It uses a 'chain of thought' approach, spending more time "thinking" before answering, making it better at complex tasks. While not a successor to GPT-4o, o1 excels in scientific and mathematical benchmarks and is trained with a new optimization algorithm. Different versions, such as o1-preview and o1-mini, are available. Limitations include high computational cost, occasional "fake alignment," a hidden reasoning process, and potential replication of training data.
-
GPT-4o is a multilingual, multimodal model that can process and generate text, images, and audio, and it represents a significant advancement over previous models like GPT-4 and GPT-3.5. GPT-4o is faster and more cost-effective, has improved performance in multiple areas, and natively supports voice-to-voice interaction. GPT-4o's knowledge is limited to what was available up to October 2023, and it has a context length of 128k tokens. For comparison, training GPT-4 reportedly cost more than $100 million, and it is rumored to have on the order of a trillion parameters.
-
Kimi k1.5 is a multimodal LLM trained with reinforcement learning (RL). Key aspects include: long context scaling to 128k, improving performance with increased context length; improved policy optimization using a variant of online mirror descent; and a simple framework that enables planning and reflection without complex methods. It uses a reference policy in its off-policy RL approach, and long2short methods such as model merging and DPO to transfer knowledge from long-CoT to short-CoT models, achieving state-of-the-art reasoning performance. The model is jointly trained on text and vision data.
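Of the long2short methods, model merging is the simplest to picture; a hedged sketch (the 0.5 mixing weight is an assumption) is just a weighted average of the two checkpoints' parameters.

```python
def merge_checkpoints(long_cot_state, short_cot_state, weight=0.5):
    """Average a long-CoT and a short-CoT checkpoint, parameter by parameter."""
    return {name: weight * long_cot_state[name] + (1 - weight) * short_cot_state[name]
            for name in long_cot_state}
```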
-
DeepSeek-R1 is a language model focused on enhanced reasoning, employing reinforcement learning (RL) and building upon the DeepSeek-V3-Base model. It uses Group Relative Policy Optimization (GRPO) to reduce computational costs by eliminating the need for a separate critic model, which is commonly used in other algorithms such as PPO. The model uses a multi-stage training pipeline: an initial fine-tuning with cold-start data, followed by reasoning-oriented RL, supervised fine-tuning (SFT) on data collected with rejection sampling, and a final RL stage. A rule-based reward system avoids reward hacking. DeepSeek-R1 also employs a language consistency reward during RL to address language mixing. The model's reasoning capabilities are then distilled into smaller models. DeepSeek-R1 achieves performance comparable to, and sometimes surpassing, OpenAI's o1 series on various reasoning, math, and coding tasks.
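A hedged sketch of what such a rule-based reward can look like: an accuracy check on the final answer plus a format check on the thinking tags. The tag names, answer format, and weights here are assumptions, not DeepSeek-R1's exact rules.

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    # Format reward: the reasoning must be wrapped in <think>...</think> tags.
    format_ok = bool(re.search(r"<think>.*</think>", completion, re.DOTALL))
    # Accuracy reward: the final answer (assumed boxed) must match the gold answer.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    accuracy_ok = match is not None and match.group(1).strip() == gold_answer.strip()
    return 1.0 * accuracy_ok + 0.1 * format_ok
```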
-
Claude 3 is a family of large multimodal AI models developed by Anthropic, with a focus on safety, interpretability, and user alignment. The models, which include Opus, Sonnet, and Haiku, excel in reasoning, math, coding, and multilingual understanding. They are designed to be helpful, honest, and harmless assistants and can process both text and visual inputs. Claude 3 models use Constitutional AI principles, aiming for more ethical and reliable responses. They have improved abilities in long context comprehension, and have shown strong performance in various tests, often outperforming previous Claude models and sometimes matching or exceeding GPT models in some benchmarks.
-
GPT-4, or Generative Pre-trained Transformer 4, is a large multimodal language model created by OpenAI, and the fourth in the GPT series. It is a significant advancement over previous models such as GPT-3, with improvements in model size, performance, contextual understanding, and safety. GPT-4 uses a Transformer architecture, a deep learning model that has revolutionized natural language processing. It can process both text and images, and it has a larger context window than GPT-3, enabling it to handle longer documents and more complex tasks. GPT-4 was trained using a combination of publicly available data and licensed third-party data, and then fine-tuned using reinforcement learning from human feedback (RLHF). It also has increased reasoning and generalization abilities, making it more reliable for advanced and specialized applications.
-
Training large language models (LLMs) is challenging due to the large amount of GPU memory and long training times required. Several parallelism paradigms enable model training across multiple GPUs, and various model architecture and memory-saving designs make it possible to train very large neural networks. The optimal model size and number of training tokens should be scaled equally, with a doubling of model size requiring a doubling of training tokens. Current large language models are significantly under-trained. Techniques such as data parallelism, model parallelism, pipeline parallelism, and tensor parallelism can be used to distribute the training workload. Other strategies include CPU offloading, activation recomputation, mixed-precision training, and compression to save memory.
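As a rough worked example of the equal-scaling rule, assuming the common approximations C ≈ 6·N·D for training FLOPs and roughly 20 training tokens per parameter at the compute-optimal point (both figures come from the scaling-law literature, not from this summary):

```python
def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Solve C ~ 6 * N * D together with D ~ tokens_per_param * N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(1e23)  # example budget of 1e23 FLOPs
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.2f}T tokens")
```

Doubling the compute budget under this rule grows both the parameter count and the token count by a factor of about 1.4, which is why model size and training tokens should be scaled together.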
-
MiniMax-01 is a series of large language and vision-language models that use lightning attention and a mixture of experts (MoE) to achieve long context processing. The models, MiniMax-Text-01 and MiniMax-VL-01, match the performance of top-tier models, like GPT-4o and Claude-3.5-Sonnet, while offering 20-32 times longer context windows, reaching up to 4 million tokens during inference. The models use a hybrid architecture, with linear and softmax attention mechanisms, and are trained on large datasets of text, code, and image-caption pairs. They also use a multi-stage training process with supervised fine-tuning and reinforcement learning to optimize their capabilities in long-context and real-world scenarios.
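For intuition on why linear attention enables such long contexts, here is a generic (non-causal) kernelized linear-attention sketch; it is not the lightning-attention kernel itself, and the ELU-based feature map is an assumption.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Cost grows linearly with sequence length: the D x Dv matrix K^T V is
    built once instead of the full T x T attention-score matrix.
    Shapes (illustrative): q, k -> [B, H, T, D]; v -> [B, H, T, Dv]."""
    phi = lambda x: F.elu(x) + 1.0                       # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bhtd,bhtv->bhdv", k, v)           # K^T V
    normalizer = torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + eps
    return torch.einsum("bhtd,bhdv->bhtv", q, kv) / normalizer.unsqueeze(-1)
```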