Llama 3

by Stephen M. Walker II, Co-Founder / CEO

Llama 3 is an open-source large language model (LLM) developed by Meta AI. It is a family of pre-trained and instruction-tuned models, available in 8 billion and 70 billion parameter sizes. Llama 3 is the successor to Llama 2 and includes models optimized for dialogue, known as Llama 3 Instruct. As with any new model, these advances come with potential risks, such as unpredictable outputs, and deployments need safety measures tailored to the use case.

The Llama 3 models were trained on over 15 trillion tokens from publicly available sources. The models are designed to predict the most plausible follow-on text using a neural network with billions of variables (parameters).
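To make "predict the most plausible follow-on text" concrete, here is a toy sketch of a single decoding step. The model produces one score (logit) per vocabulary entry, softmax turns the scores into probabilities, and greedy decoding picks the largest. The four-word vocabulary and the logits are invented for illustration; a real Llama 3 step scores roughly 128K tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(vocab, logits):
    """Return the most probable token and its probability."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Hypothetical vocabulary and logits for the prompt "The capital of France is"
vocab = ["Paris", "London", "banana", "the"]
logits = [5.0, 2.0, -1.0, 0.5]
token, prob = greedy_next_token(vocab, logits)
```

In practice, sampling strategies such as temperature or top-p are often used instead of always taking the single most probable token.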

Model Overview

The Meta Llama 3 model is a state-of-the-art large language model developed by Meta AI. As a foundation model, it has been pre-trained on a massive dataset of text and fine-tuned for specific tasks such as chat and question-answering. Available in two sizes — 8 billion and 70 billion parameters — Meta Llama 3 is versatile enough to cater to a wide range of applications. Its open-source nature allows developers to access and modify the model weights and code, making it a valuable resource for both research and commercial use. The Meta Llama 3 model stands out for its advanced capabilities and the flexibility it offers to the AI community.

Base pretrained foundation models

Benchmark	Llama 3 8B	Llama 2 7B	Llama 2 13B	Llama 3 70B	Llama 2 70B
MMLU (5-shot)	66.6	45.7	53.8	79.5	69.7
AGIEval English (3-5 shot)	45.9	28.8	38.7	63.0	54.8
CommonSenseQA (7-shot)	72.6	57.6	67.6	83.8	78.7
Winogrande (5-shot)	76.1	73.3	75.4	83.1	81.8
BIG-Bench Hard (3-shot, CoT)	61.1	38.1	47.0	81.3	65.7
ARC-Challenge (25-shot)	78.6	53.7	67.6	93.0	85.3
TriviaQA-Wiki (5-shot)	78.5	72.1	79.6	89.7	87.5
SQuAD (1-shot)	76.4	72.2	72.1	85.6	82.6
QuAC (1-shot, F1)	44.4	39.6	44.9	51.1	49.4
BoolQ (0-shot)	75.7	65.5	66.9	79.0	73.1
DROP (3-shot, F1)	58.4	37.9	49.8	79.7	70.2

When deploying a new model like Llama 3, it is important to pair it with system-level safeguards rather than rely on the model alone. A separate safeguard model, such as Meta's Llama Guard, can filter inputs and outputs, adding system-level safety on top of model-level safety when developers integrate Llama 3 into their applications.

Instruction tuned models

Benchmark	Llama 3 8B	Llama 2 7B	Llama 2 13B	Llama 3 70B	Llama 2 70B
MMLU (5-shot)	68.4	34.1	47.8	82.0	52.9
GPQA (0-shot)	34.2	21.7	22.3	39.5	21.0
HumanEval (0-shot)	62.2	7.9	14.0	81.7	25.6
GSM-8K (8-shot, CoT)	79.6	25.7	41.2	93.0	57.5
MATH (4-shot, CoT)	30.0	3.8	6.7	50.4	11.6

Llama 3 has several improvements over the original Llama models. It uses an optimized transformer architecture with a tokenizer that has a vocabulary of 128K tokens. The context length in Llama 3 has been increased to 8192 tokens, and it employs Grouped-Query Attention (GQA) for improved inference efficiency.

Llama 3 itself does not ship a separate code generation model. Meta's Code Llama, which supports common programming languages such as Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash, was built on Llama 2; the Llama 3 Instruct models nonetheless score well on coding benchmarks such as HumanEval (see the table above). The testing conducted so far has evaluated Llama 3's performance and safety, but it has not been comprehensive. Further safety testing is recommended for specific applications to address potential risks and limitations.

Llama 3 is freely available for almost anyone to use for research and commercial purposes, with some restrictions. Companies with more than 700 million monthly users have to ask for special permission to use Llama, so it’s off limits for the likes of Apple, Google, and Amazon.

In terms of performance, Llama 3 outperforms other open-source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests. However, it’s worth noting that Llama 3 may not perform as well as GPT-4 or PaLM 2 on some benchmarks.

Llama 3

Llama 3: The third iteration of Meta's open-source LLM. It is a collection of models in 8B and 70B sizes, with instruction-tuned variants optimized for dialogue that outperform many open-source chat models on industry benchmarks. It uses a neural network with billions of parameters and is built on the transformer architecture, the same family of designs behind counterparts such as OpenAI's GPT-4 and Google's PaLM 2.

Llama 3 comes in two sizes, 8B and 70B. The smaller model is faster and cheaper to run while the larger one scores higher on benchmarks, but both can generate coherent text in response to user prompts.

The Llama 3 Instruct models were trained with a specific prompt structure that relies on four special tokens: <|begin_of_text|> at the start of the entire sequence, <|start_header_id|> and <|end_header_id|> around the role name of each message (system, user, or assistant), and <|eot_id|> at the end of each message. This replaces the <s>, <<SYS>>, and [INST] markers used by Llama 2.
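For reference, the chat layout Meta documents for Llama 3 Instruct wraps each message in <|start_header_id|>role<|end_header_id|>, ends it with <|eot_id|>, and opens the whole sequence with <|begin_of_text|>. The sketch below renders that layout by hand. The function name format_llama3_prompt is hypothetical and exists only for illustration; in practice the chat template bundled with the model's tokenizer would normally do this work.

```python
def format_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts as a Llama 3 Instruct prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each message starts with a role header followed by a blank line.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # The end-of-turn token closes the message.
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    # Open an assistant turn so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_prompt([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
```

Building the string manually is mainly useful for understanding or debugging; a mismatch of even one newline against the trained format can degrade output quality.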

Llama 3 adopts the model architecture of Llama 2 with a few modifications. It uses a RoPE scheme for positional embeddings, which balances the absolute and relative position of each token in a sequence. This approach encodes the absolute position with a rotation matrix and adds relative position information directly into the self-attention operation.

Llama 3 is trained with a longer context length of 8K tokens, compared to Llama 2 which was trained with a 4K context length. Additionally, Llama 3 adopts grouped query attention (GQA) within each of its layers.

Llama 3 was trained using solely public sources of data for pre-training, with a deliberate choice to ensure that the pre-training process can be openly replicated by anyone with sufficient compute resources. Compared to Llama 2, Llama 3 adopts a new mixture of pre-training data, with sources known to be high-quality and factual sampled more heavily, and increases the size of the pre-training dataset significantly.

Llama 3 was fine-tuned using a large dataset in a similar manner to proprietary models, producing the Llama 3-Instruct model that is optimized for dialogue-based applications. The alignment process, which teaches the model the correct output style or format, was performed with a goal of reducing hallucinations, declining unsafe requests, following detailed instructions, and more.

The Llama 3 models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the Hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

To use Llama 3, you can use Hugging Face Transformers. The most capable chat option is the largest instruction-tuned checkpoint, published on the Hub as meta-llama/Meta-Llama-3-70B-Instruct. Note that the 70B weights alone take roughly 140GB in 16-bit precision, or about 35GB when quantized to 4 bits, which limits the model to high-end GPUs like the A100. The 8B model needs about 16GB in 16-bit precision.
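One way to sanity-check memory figures like the 35GB mentioned above is to multiply the parameter count by the bytes stored per parameter. The helper below is a rough sketch under that assumption: it counts the weights only and ignores the key-value cache, activations, and framework overhead, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_param

llama3_70b_fp16 = weight_memory_gb(70, 16)  # 16-bit weights
llama3_70b_int4 = weight_memory_gb(70, 4)   # 4-bit quantized weights
llama3_8b_fp16 = weight_memory_gb(8, 16)
```

By this estimate a 35GB budget corresponds to the 70B model quantized to 4 bits, while the 8B model fits on a single 24GB consumer GPU in 16-bit precision.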

Transformer Architecture

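The foundation of Llama 3. Like most current LLMs, Llama 3 is a decoder-only transformer: a stack of self-attention and feed-forward layers that processes the prompt and generates output one token at a time.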

RMSNorm

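Short for Root Mean Square Normalization, the normalization layer used in Llama 3. It rescales each layer's input activations by their root mean square, which keeps values in a stable range during training and is cheaper than standard layer normalization because it skips the mean-centering step.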
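A minimal sketch of RMSNorm on a single vector, written in plain Python for readability. The eps value and the function name are illustrative assumptions, and production code operates on batched tensors rather than lists.

```python
import math

def rms_norm(x, gain, eps=1e-5):
    """Scale x to unit root-mean-square, then apply a learned per-feature gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # eps guards against division by zero
    return [g * v * inv_rms for v, g in zip(x, gain)]

normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
```

Unlike layer normalization, no mean is subtracted, so the direction of the vector is preserved and only its scale changes.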

SwiGLU Activation Function

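The activation function chosen by Meta for the feed-forward layers of Llama 3. It combines the Swish (SiLU) activation with a gated linear unit: one linear projection of the input passes through Swish and acts as a gate that scales a second projection element by element.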
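The sketch below shows a SwiGLU feed-forward block in plain Python, assuming the common three-matrix layout (gate, up, and down projections without biases). The weight names and the tiny matrices are made up for illustration.

```python
import math

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)

# Hypothetical weights: model dimension 2, hidden dimension 3
x = [1.0, 2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]
w_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = swiglu_ffn(x, w_gate, w_up, w_down)
```

The gate lets the network modulate each hidden unit smoothly rather than switching it fully on or off.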

RoPE

Short for Rotary Positional Embedding, the method Llama 3 uses to encode token positions. It rotates each query and key vector by an angle proportional to the token's position, so that attention scores depend on the relative distance between tokens.
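A small sketch of the rotation idea, assuming the original RoPE formulation with adjacent dimension pairs and a base of 10000. Both are illustrative choices; actual implementations may pair dimensions differently and use another base. The final lines check the key property: the dot product of a rotated query and key depends only on how far apart the two positions are.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
score_near_start = dot(rope(q, 5), rope(k, 3))    # positions 5 and 3
score_shifted = dot(rope(q, 12), rope(k, 10))     # same distance, shifted by 7
```

Because both pairs of positions are two tokens apart, the two scores agree up to floating-point error.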

Ghost Attention (GAtt)

A fine-tuning method introduced by Meta with Llama 2. It helps control dialogue flow over multiple turns by synthetically concatenating an instruction, such as an 'act as' persona, to all user messages in a conversation when generating training data.
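The data-construction step can be sketched as follows. The function names are hypothetical, and the second helper reflects the step described in the Llama 2 paper, where the instruction is kept only on the first turn of the final training sample so the model learns to honor it throughout the dialogue. This is an illustration of the idea, not Meta's code.

```python
def attach_instruction(dialogue, instruction):
    """Synthetic step: prepend the instruction to every user message."""
    return [
        {**m, "content": f"{instruction} {m['content']}"} if m["role"] == "user" else dict(m)
        for m in dialogue
    ]

def keep_instruction_on_first_turn(dialogue, instruction):
    """Training-time view: the instruction stays only on the first user message."""
    result, seen_user = [], False
    for m in dialogue:
        if m["role"] == "user" and not seen_user:
            result.append({**m, "content": f"{instruction} {m['content']}"})
            seen_user = True
        else:
            result.append(dict(m))
    return result

dialogue = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Weather?"},
    {"role": "assistant", "content": "Sunny"},
]
augmented = attach_instruction(dialogue, "Act as a pirate.")
training_view = keep_instruction_on_first_turn(dialogue, "Act as a pirate.")
```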

Context Length

The amount of information Llama 3 can consider from previous inputs. Llama 3 has a context length of 8192 tokens, twice the context length of Llama 2.
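In practice the context length acts as a budget shared by the prompt and the generated reply. The sketch below, with a hypothetical helper name, keeps only the most recent tokens so that the prompt plus the requested number of new tokens fits in 8192. Dropping the oldest tokens is one simple policy; applications may instead summarize or selectively retrieve earlier content.

```python
CONTEXT_LENGTH = 8192  # Llama 3 context window, in tokens

def fit_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent tokens so prompt + generation fits in the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-budget:]

long_prompt = list(range(10000))  # stand-in for real token ids
trimmed = fit_to_context(long_prompt, max_new_tokens=192)
```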

Grouped Query Attention (GQA)

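An attention variant used in Llama 3 for improved inference scalability. Several query heads share a single key and value head, which shrinks the key-value cache that must be kept in memory during generation.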
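The sharing pattern can be sketched for a single query position in plain Python. The shapes and the toy inputs (four query heads, two key-value heads) are assumptions chosen for readability; real models use many more heads and batched tensor operations.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(queries, keys, values):
    """
    queries: [n_q_heads][head_dim]            one query vector per query head
    keys:    [n_kv_heads][seq_len][head_dim]  cached keys, shared within a group
    values:  [n_kv_heads][seq_len][head_dim]  cached values, shared within a group
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key-value heads")
    group_size = n_q_heads // n_kv_heads
    outputs = []
    for h, q in enumerate(queries):
        kv = h // group_size  # query heads in the same group reuse one K/V head
        scale = 1.0 / math.sqrt(len(q))
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys[kv]])
        head_dim = len(values[kv][0])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values[kv]))
            for d in range(head_dim)
        ])
    return outputs

queries = [[1.0, 0.0]] * 4
keys = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
values = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
outputs = grouped_query_attention(queries, keys, values)
```

With four query heads and two key-value heads, the cache stores half as many keys and values as standard multi-head attention would.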

Fine-tuning

The process of adjusting the weights of a pre-trained model to make it perform better on the desired task. Llama 3 uses fine-tuning methods like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF); Meta describes its post-training as a combination of SFT, rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).

Fill-in-the-middle (FIM) Capability

A feature that allows a model to insert code into the middle of existing code, supporting tasks like code completion. It was introduced with Code Llama rather than the general-purpose Llama 3 models.
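The general shape of a fill-in-the-middle request can be sketched as below: the code before and after the cursor is rearranged into a prefix-suffix-middle prompt, and the model's completion is spliced back at the cursor. The sentinel strings and function names here are placeholders invented for illustration, not the special tokens any particular model expects.

```python
# Placeholder sentinels; a real model defines its own special tokens.
PREFIX, SUFFIX, MIDDLE = "<PREFIX>", "<SUFFIX>", "<MIDDLE>"

def build_infill_prompt(code, cursor):
    """Split code at the cursor and arrange it as prefix, suffix, then middle."""
    before, after = code[:cursor], code[cursor:]
    return f"{PREFIX}{before}{SUFFIX}{after}{MIDDLE}"

def splice_completion(code, cursor, completion):
    """Insert the model's completion back at the cursor position."""
    return code[:cursor] + completion + code[cursor:]

code = "def add(a, b):\n    return \n"
cursor = code.index("return ") + len("return ")
infill_prompt = build_infill_prompt(code, cursor)
# "a + b" stands in for what a model might generate after the middle sentinel.
patched = splice_completion(code, cursor, "a + b")
```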

Instruction Tuning

A training process where the model is fed a natural language instruction input and the expected output. This makes it better at understanding what people expect out of their prompts. Used in Llama 3 Instruct.

Llama 3-Instruct

A chatbot version of Llama 3, fine-tuned for chat-style interactions through supervised fine-tuning and reinforcement learning with human feedback (RLHF).

Code Llama

A code generation model from Meta, built on Llama 2 rather than Llama 3 and trained on 500B tokens of code. It supports common programming languages including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash.

Safety and Responsibility

At Meta AI, safety and responsibility are paramount. Meta recognizes that AI models like Meta Llama 3 have the potential to bring significant benefits to society, but they also come with risks and limitations that must be carefully managed.

To this end, Meta has developed a comprehensive set of guidelines and best practices for the safe and responsible development and deployment of AI models, including Meta Llama 3. These guidelines cover crucial aspects such as testing and evaluation, data quality and bias, and transparency and explainability. By adhering to these principles, Meta aims to ensure that Meta Llama 3 is used in a manner that is both ethical and beneficial.

Licensing and Usage

The Meta Llama 3 model is licensed under the Meta Llama 3 Community License Agreement, which allows both non-commercial and commercial use of the model. Users are required to provide attribution to Meta AI and adhere to specific terms and conditions, including restrictions on using the model for malicious or harmful purposes. The model is readily available for download on the Meta Llama 3 GitHub page, and users can also access it through the Hugging Face platform. This accessibility ensures that a wide range of users can leverage the capabilities of Meta Llama 3 for their projects.

Community and Support

The development and deployment of AI models like Meta Llama 3 are best achieved through a collaborative community effort. Meta encourages developers, researchers, and users to contribute to the Meta Llama 3 project by reporting issues, suggesting new features, and sharing their experiences with the model. Support is available through the GitHub page, where users can ask questions and receive assistance from Meta's team and the broader community. Additionally, Meta offers a suite of resources and tools, including model cards, testing and evaluation frameworks, and security guidelines, to help users develop and deploy AI models responsibly and safely. By fostering a strong community, Meta aims to advance the field of AI collectively.
