Llama 3.1 405b
by Stephen M. Walker II, Co-Founder / CEO
Llama 3.1 405B is an open-source large language model (LLM) developed by Meta AI. It is the largest member of the Llama 3.1 family of pre-trained and instruction-tuned models, which is available in 8 billion, 70 billion, and 405 billion parameter sizes. Llama 3.1 is the successor to Llama 3 and includes models optimized for dialogue, known as Llama 3.1 Instruct.
The Llama 3.1 405b models were trained on over 15 trillion tokens from publicly available sources. The models are designed to predict the most plausible follow-on text using a neural network with billions of variables (parameters).
Benchmarks
Llama 3.1 introduces a new 405B-parameter model, and the 3.1 models show improved performance over Llama 3 across a range of benchmarks, particularly in the instruction-tuned evaluations. The largest gains appear on HumanEval, MATH (CoT), and Multilingual MGSM, indicating stronger code generation, mathematical reasoning, and multilingual capabilities.
Base Pretrained Models
Benchmark | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |
---|---|---|---|---|---|
MMLU | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |
MMLU PRO (CoT) | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |
AGIEval English | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |
CommonSenseQA | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |
Winogrande | - | 60.5 | - | 83.3 | 86.7 |
BIG-Bench Hard | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |
ARC-Challenge | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |
TriviaQA-Wiki | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |
SQuAD | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |
QuAC (F1) | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |
BoolQ | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |
DROP (F1) | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |
Instruction Tuned Models
Benchmark | Llama 3 8B Instruct | Llama 3.1 8B Instruct | Llama 3 70B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |
---|---|---|---|---|---|
MMLU | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |
MMLU (CoT) | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |
MMLU PRO (CoT) | 45.5 | 48.3 | 63.4 | 66.4 | 73.3 |
IFEval | 76.8 | 80.4 | 82.9 | 87.5 | 88.6 |
ARC-C | 82.4 | 83.4 | 94.4 | 94.8 | 96.9 |
GPQA | 34.6 | 30.4 | 39.5 | 41.7 | 50.7 |
MuSR | 56.3 | 45.7 | 55.1 | 58.1 | 60.7 |
HumanEval | 60.4 | 72.6 | 81.7 | 80.5 | 89.0 |
MBPP ++ base version | 70.6 | 72.8 | 82.5 | 86.0 | 88.6 |
MultiPL-E HumanEval | 50.8 | 64.0 | 65.5 | 75.2 | |
MultiPL-E MBPP | 52.4 | 62.0 | 62.0 | 65.7 | |
GSM-8K (CoT) | 80.6 | 84.5 | 93.0 | 95.1 | 96.8 |
MATH (CoT) | 29.1 | 51.9 | 51.0 | 68.0 | 73.8 |
API-Bank | 48.3 | 82.6 | 85.1 | 90.0 | 92.0 |
Berkeley Function Calling | 60.3 | 76.1 | 83.0 | 84.8 | 88.5 |
Gorilla Benchmark API Bench | 1.7 | 8.2 | 14.7 | 29.7 | 35.3 |
Nexus | 18.1 | 38.5 | 47.8 | 56.7 | 58.7 |
Multilingual MGSM | 68.9 | 85.6 | 86.9 | 91.6 |
Llama 3.1 405B and the broader Llama 3.1 family represent significant advancements in Meta's open-source large language models.
Llama 3.1 405B
Meta released Llama 3.1 405B, its largest open-source language model to date, on July 23, 2024. The model has 405 billion parameters and a 128K-token context window; it is a text-only model, with multimodal variants reserved for later releases. The release aims to democratize advanced AI technology, providing developers and startups with unprecedented access to powerful AI capabilities. By making the model's weights openly accessible, Meta is fostering transparency and encouraging innovation in the AI community.
Key features and implications include:
- Long-Context Capability: The released 3.1 models are text-only but support a 128K-token context window, making them suitable for long documents, extended conversations, and retrieval-heavy workloads.
- Open-Source Accessibility: Providing access to the model's weights allows developers to build specialized models and high-quality training datasets, potentially accelerating innovation in niche fields.
- Efficiency and Ecosystem: Llama 3.1 405B is expected to integrate within a layered system of models, where smaller models handle basic tasks and the 405B model steps in for complex verifications, optimizing computational resources and response times; a minimal sketch of this pattern follows this list.
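A minimal sketch of this layered cascade pattern, assuming entirely hypothetical model callables and a placeholder confidence score (nothing here is a Meta API):

```python
# Hypothetical sketch of a layered model cascade: a cheap model answers first and
# the 405B model is consulted only for low-confidence cases. The scoring rule
# and both model callables are placeholders, not a Meta API.
def cascade(prompt, small_llm, large_llm, score, threshold=0.8):
    draft = small_llm(prompt)
    if score(prompt, draft) >= threshold:
        return draft              # cheap path: keep the small model's answer
    return large_llm(prompt)      # expensive path: escalate to the 405B model

# Example wiring with stand-in functions:
if __name__ == "__main__":
    print(cascade("2 + 2 = ?",
                  small_llm=lambda p: "4",
                  large_llm=lambda p: "4 (verified)",
                  score=lambda p, a: 0.9))
```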
Llama 3.1
Llama 3.1 builds on the foundations of Llama 3, with improvements focused on efficiency, accuracy, and usability. Beyond the headline 405B model, the 3.1 release also refreshes the 8B and 70B models, extending the context window to 128K tokens and improving multilingual and tool-use performance. Meta's ongoing commitment to open-source development ensures that iterations like Llama 3.1 continue to push the boundaries of what is achievable with openly available large language models.
Llama 3 has two main variants, 8B and 70B, and Llama 3.1 adds a third at 405B. The variants differ in speed and hardware requirements, but all are capable of generating coherent text responses to user prompts.
The Llama 3 models are trained with a header-based chat format built on a small set of special tokens: <|begin_of_text|> marks the start of the sequence, <|start_header_id|> and <|end_header_id|> wrap the role of each message (system, user, or assistant), and <|eot_id|> marks the end of each turn. This replaces the Llama 2 format based on <s>, <<SYS>>, and [INST] tags.
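As an illustration, a single-turn prompt under this format looks like the string below; in practice, the Hugging Face tokenizer's apply_chat_template method assembles it for you:

```python
# Sketch of a single-turn Llama 3 prompt assembled from the special tokens above.
system = "You are a helpful assistant."
user = "Summarize rotary positional embeddings in one sentence."

prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"  # the model completes from here
)
print(prompt)
```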
Llama 3 adopts the model architecture of Llama 2 with a few modifications. It uses a RoPE scheme for positional embeddings, which balances the absolute and relative position of each token in a sequence. This approach encodes the absolute position with a rotation matrix and adds relative position information directly into the self-attention operation.
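For intuition, here is a minimal RoPE sketch in PyTorch using the common rotate-half formulation; the base frequency of 10000 is an assumption for illustration (Llama 3.1 uses a larger base with additional frequency scaling), and this is not Meta's exact code:

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim) query or key tensor, head_dim even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq_len, half)
    cos = angles.cos()[:, None, :]   # (seq_len, 1, half)
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```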
Llama 3 is trained with a context length of 8K tokens, double the 4K context of Llama 2, and Llama 3.1 extends the context window further to 128K tokens. Additionally, Llama 3 adopts grouped query attention (GQA) within each of its layers, letting several query heads share a single key/value head to shrink the key/value cache at inference time.
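A minimal sketch of the grouped-query idea, where several query heads share one key/value head (the head counts below are illustrative, not Llama 3's actual configuration):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_kv_heads):
    # q: (batch, num_q_heads, seq_len, head_dim); k, v: (batch, num_kv_heads, seq_len, head_dim)
    num_q_heads = q.shape[1]
    group_size = num_q_heads // num_kv_heads
    # Repeat each key/value head so every query head in a group attends to the same KV
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

# Illustrative sizes: 8 query heads sharing 2 KV heads (group size 4)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v, num_kv_heads=2)
```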
Llama 3 was pre-trained solely on publicly available data, a deliberate choice intended to make the pre-training process reproducible by anyone with sufficient compute resources. Compared to Llama 2, Llama 3 adopts a new mixture of pre-training data, sampling high-quality and factual sources more heavily, and increases the size of the pre-training dataset significantly.
Llama 3 was fine-tuned using a large dataset in a similar manner to proprietary models, producing the Llama 3-Instruct model that is optimized for dialogue-based applications. The alignment process, which teaches the model the correct output style or format, was performed with a goal of reducing hallucinations, avoiding unsafe questions, following detailed instructions, and more.
The Llama 3 models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the Hub use torch_dtype = 'float16', which the AutoModel API uses to cast the checkpoints from torch.float32 to torch.float16.
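For instance, the dtype can be overridden at load time; a minimal sketch, assuming access to the gated meta-llama/Meta-Llama-3-8B repository and a recent Transformers version:

```python
import torch
from transformers import AutoModelForCausalLM

# Override the checkpoint's stored float16 dtype and load in bfloat16 instead
# (the dtype Llama 3 was trained in); access to the gated Meta repo is assumed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
)
```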
To use Llama 3, you can use Hugging Face Transformers. The most flexible approach is to load the largest Llama 3 model fine-tuned for chat, Meta-Llama-3-70B-Instruct. Note, however, that the 70B model needs roughly 140 GB of GPU memory in 16-bit precision (around 35 GB with 4-bit quantization), which limits it to high-end GPUs such as the A100 or H100, typically in multi-GPU or quantized configurations; the 8B Instruct model is a much lighter starting point.
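A minimal chat-generation sketch with the smaller 8B Instruct checkpoint; the model id, generation settings, and use of device_map (which requires the accelerate package) are illustrative, and access to the gated repo is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo; requires an accepted license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is grouped query attention?"},
]
# apply_chat_template builds the special-token prompt format shown earlier
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```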
Transformer Architecture
The foundation of Llama 3, a classic architecture used in many AI models.
RMSNorm
Short for Root Mean Square Normalization, the layer-normalization variant used throughout Llama 3's transformer blocks. It rescales activations by their root mean square without mean-centering, which helps stabilize training on the 15-trillion-token dataset at lower cost than standard LayerNorm.
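A minimal RMSNorm sketch in PyTorch; the epsilon value is illustrative, and this mirrors the commonly published formulation rather than Meta's exact code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x):
        # Normalize by the root mean square of the last dimension; no mean subtraction
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```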
SwiGLU Activation Function
The gated activation function used in Llama 3's feed-forward layers: one linear projection is passed through a SiLU (swish) nonlinearity and multiplied element-wise with a second projection, which in practice trains better than plain ReLU feed-forward blocks.
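A minimal sketch of a SwiGLU feed-forward block; the layer names and hidden size are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SiLU-gated projection multiplied element-wise with a linear projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```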
RoPE
Short for Rotary Positional Embedding, a method used in Llama 3 that encodes each token's position by rotating its query and key vectors, so the model can account for word order and relative distance between tokens.
Ghost Attention (GAtt)
A fine-tuning technique Meta introduced in the Llama 2 paper and applicable to Llama 3 chat tuning. It helps control dialogue flow over multiple turns by synthetically concatenating the 'act as' instruction to all user messages in a conversation.
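A rough sketch of the data-augmentation half of this trick; the function and data layout are hypothetical, and the full recipe also zeroes the loss on the injected instruction in later turns:

```python
# Hypothetical sketch: prepend the persistent instruction to every user turn
# when synthesizing multi-turn fine-tuning samples (loss masking not shown).
def gatt_augment(instruction, turns):
    # turns: list of (user_message, assistant_message) pairs
    return [(f"{instruction}\n{user}", assistant) for user, assistant in turns]

turns = [("Who are you?", "Arr, I be a pirate."), ("What do you eat?", "Hardtack and rum.")]
print(gatt_augment("Act as a pirate in every reply.", turns))
```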
Context Length
The amount of previous input the model can consider. Llama 3 has a context length of 8,192 tokens, twice the context length of Llama 2, and the Llama 3.1 models extend this to a 128K-token context window.
Grouped Query Attention (GQA)
A technique in Llama 3's attention layers in which groups of query heads share a single key/value head, shrinking the key/value cache and improving inference scalability (see the GQA sketch earlier in this article).
Fine-tuning
The process of adjusting the weights of a pre-trained model to make it perform better on the desired task. Llama 3 uses fine-tuning methods like reinforcement learning with human feedback (RLHF), supervised fine-tuning (SFT), and initial and iterative reward modeling.
Fill-in-the-middle (FIM) Capability
A capability of the Code Llama models (rather than the base Llama chat models) that allows them to insert code into existing code by conditioning on both a prefix and a suffix, supporting tasks like code completion.
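A schematic of how an infilling prompt is typically constructed; the sentinel token names below are placeholders for illustration, not the exact tokens used by Code Llama:

```python
# Hypothetical infilling prompt: the model sees the code before and after a gap
# and is asked to generate the missing middle. Sentinel names are placeholders.
prefix = "def mean(values):\n    total = sum(values)\n"
suffix = "    return total / count\n"
prompt = f"<PREFIX>{prefix}<SUFFIX>{suffix}<MIDDLE>"
# The model would be expected to produce something like: "    count = len(values)\n"
print(prompt)
```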
Instruction Tuning
A training process where the model is fed a natural language instruction input and the expected output. This makes it better at understanding what people expect out of their prompts. Used in Llama 3 Instruct.
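For illustration, a single record in an instruction-tuning dataset might look like the following; the field names are hypothetical, not Meta's internal schema:

```python
# Hypothetical instruction-tuning example: the model is trained to map the
# instruction (plus optional input) to the expected output.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Llama 3.1 is a family of open models in 8B, 70B, and 405B sizes...",
    "output": "Llama 3.1 is Meta's open model family spanning 8B to 405B parameters.",
}
```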
Llama 3-Instruct
A chatbot version of Llama 3, fine-tuned for chat-style interactions through supervised fine-tuning and reinforcement learning with human feedback (RLHF).
Code Llama
A code generation model built on Llama 2 and further trained on 500B tokens of code. It supports common programming languages including Python, C++, Java, PHP, TypeScript (JavaScript), C#, and Bash.