GPT-4 – 2023's State-of-the-Art LLM
GPT-4 represents a major leap forward in large language model capabilities. Developed by OpenAI, it builds on the architecture and strengths of GPT-3 while achieving new levels of scale and performance.
With GPT-4, OpenAI's objective was to create a model over 10x larger than GPT-3. This required not just greater compute for training, but entirely new approaches to model architecture and inference serving.
Some key facts about GPT-4:
- Total parameters: ~1.8 trillion (over 10x more than GPT-3)
- Architecture: Uses a mixture of experts (MoE) model to improve scalability
- Training compute: Trained on ~25,000 Nvidia A100 GPUs over 90-100 days
- Training data: Trained on a dataset of ~13 trillion tokens
- Inference compute: Runs on clusters of 128 A100 GPUs for efficient deployment
- Context length: Supports up to 32,000 tokens of context
GPT-4 Model Card
|Attribute|Details|
|---|---|
|Model type|Transformer with Mixture-of-Experts (MoE)|
|Context window|8,000-32,000 tokens|
|Launch date|March 2023|
|Current version|1.1 (Release 06.13)|
|Training dataset|~13 trillion tokens (web text, books, other)|
|Training|90-100 days on ~25,000 Nvidia A100 GPUs|
|Inference|Clusters of 128 A100 GPUs|
|Data sources|CommonCrawl, WebText2, books, Wikipedia, Reddit, Amazon reviews|
|Data volume|~13 trillion tokens|
|Data prep|Deduplication, cleaning, filtering|
|Potential biases|Language, gender, race representation|
API and Data Format
- Chat Completion API with multi-turn messages
- Message roles: System, Function, User, Assistant
- JSONL fine-tuning format based on message arrays (see the sketch below)
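As a rough illustration of that message-array format, the Python sketch below builds a single JSONL training record. The role names mirror the Chat Completion API's documented format; the message contents are purely hypothetical.

```python
import json

# One hypothetical training example in the chat message-array format.
# Each JSONL line holds a list of messages, each with a role and content.
example = {
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the mixture-of-experts idea in one sentence."},
        {"role": "assistant", "content": "A router sends each token to a small subset of expert networks, so only part of the model runs per input."},
    ]
}

# Fine-tuning files are JSON Lines: one serialized example per line.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```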
Intended uses
- Text generation
- Question answering
- Conversational agents
Model details
- Language: English
- Capabilities: Text generation, question answering, text classification
- Modalities: Text
- Ethical considerations: Potential for bias, harmful outputs, misuse
Evaluation metrics
- Perplexity: Unknown
- F1: Unknown
- Accuracy: Unknown
Limitations
- Fine-tuning unavailable (GA release targeted for October 2023)
- Potential for harmful, biased outputs
- Lack of grounded reasoning
- Factually incorrect outputs
- May present model mistakes as truth
Sampling methods (see the sketch below)
- Top-k sampling
- Top-p (nucleus) sampling
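As a minimal sketch of how these two sampling strategies filter a next-token distribution (illustrative only, not OpenAI's implementation):

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    kept = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]           # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

# Toy next-token distribution over a 5-token vocabulary.
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
next_token = np.random.choice(len(probs), p=top_p_filter(probs, p=0.9))
```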
GPT-4 was tested on a translated version of the MMLU benchmark in 26 different languages. It outperformed GPT-3.5 and other LLMs in 24 out of the 26 languages tested, including low-resource languages like Latvian, Welsh, and Swahili. The Datacamp and MakeUseOf articles also note GPT-4's multilingual capabilities, with support for translation between English, French, German, Spanish, Chinese, Japanese, Korean and more. Translated Labs points out that GPT-4 has disparities in performance between English and other languages due to the predominance of English in its training data. Their T-LM product helps address this by translating prompts to enhance GPT-4's capabilities in 200 languages.
GPT-4 has potential for misuse and harmful societal impacts. Review outputs carefully before use and do not treat them as factual statements. For questions or concerns, contact [email protected]
The model architecture of GPT-4 moves away from a standard dense transformer approach. Instead, it uses a mixture-of-experts (MoE) design.
In the MoE architecture, there are separate expert neural networks that specialize in certain tasks or data types. For each inference query, the appropriate expert models are selected to handle that specific input.
This provides two major advantages:
- The overall model can scale up significantly in size, while only routing inference through a small subset of expert parameters for any given query. This keeps inference costs practical.
- The mixture of experts can develop specialized knowledge, improving overall capabilities.
Specifically, GPT-4 consists of:
- 16 expert models, each with ~111B parameters
- 2 experts are activated per inference query
- 55B shared parameters for attention
- Results in ~280B parameters used per inference pass
This architecture allows GPT-4 to reach over 1.8 trillion parameters in total, while only utilizing several hundred billion per query.
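A quick back-of-the-envelope check, using only the figures quoted above, shows how sparse activation keeps the per-query parameter count low:

```python
# Parameter accounting for the reported GPT-4 MoE configuration.
num_experts = 16
params_per_expert = 111e9        # ~111B parameters per expert
shared_attention_params = 55e9   # ~55B shared attention parameters
experts_per_query = 2            # top-2 routing

total_params = num_experts * params_per_expert + shared_attention_params
active_params = experts_per_query * params_per_expert + shared_attention_params

print(f"Total parameters: {total_params / 1e12:.2f} trillion")  # ~1.83 trillion
print(f"Active per query: {active_params / 1e9:.0f} billion")   # ~277 billion, i.e. the ~280B cited above
```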
Training a model as large as GPT-4 requires extensive computational resources. It pushed the limits of existing infrastructure.
Key facts about the GPT-4 training process:
- Trained on ~25,000 Nvidia A100 GPUs simultaneously
- The batch size increased over time, eventually reaching 60 million tokens
- Trained for a total of 90-100 days continuously
- Required 2.15e25 floating point operations (FLOPs) in total
- Trained on a dataset of ~13 trillion tokens
To make this feasible, extensive parallelism techniques were used:
- 8-way tensor parallelism to distribute the model across GPUs
- 15-way pipeline parallelism to split the model into sequential stages, with batches streamed through as micro-batches
- Various clustering topologies to maximize inter-GPU bandwidth
The result was one of the largest compute jobs ever for an AI model.
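As a rough sanity check, the quoted FLOP count is consistent with the other figures above if we assume the common ~6 × parameters × tokens estimate for training compute, applied to the ~280B parameters active per pass:

```python
# Rough training-FLOPs estimate using the common 6 * N * D rule of thumb,
# where N is the parameter count exercised per token and D is the token count.
active_params = 280e9       # ~280B parameters active per forward pass
training_tokens = 13e12     # ~13 trillion training tokens

estimated_flops = 6 * active_params * training_tokens
print(f"Estimated training FLOPs: {estimated_flops:.2e}")  # ~2.18e+25, close to the reported 2.15e25
```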
Deploying GPT-4 for inference at scale is a significant challenge due to its size and mixture of experts architecture. Efficient inference directly impacts costs.
Key facts about GPT-4 inference:
- Runs on clusters of 128 A100 GPUs
- Leverages 8-way tensor parallelism and 16-way pipeline parallelism
- Carefully balances latency, throughput, and utilization
- May use speculative decoding to improve throughput by 2-3x
- Multi-query attention reduces memory needs for long contexts
Inference clusters are designed to maximize throughput and hardware utilization. This keeps costs lower per query.
There are still challenges around consistently batching queries for diverse expert models. But overall, the infrastructure can effectively deploy GPT-4 without pricing becoming prohibitive.
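To make the multi-query attention point concrete, the sketch below compares KV-cache memory for standard multi-head attention against a single shared key/value head. The layer count, head count, and head dimension are illustrative assumptions, not published GPT-4 values.

```python
# KV-cache size comparison: multi-head attention (MHA) vs. multi-query attention (MQA).
# All model dimensions below are hypothetical, chosen only to show the scaling.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    # Two tensors (keys and values) are cached per layer, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

layers, heads, head_dim = 96, 96, 128   # assumed dimensions (fp16 values)
seq_len, batch = 32_000, 8              # long-context batch

mha = kv_cache_bytes(layers, n_kv_heads=heads, head_dim=head_dim, seq_len=seq_len, batch=batch)
mqa = kv_cache_bytes(layers, n_kv_heads=1,     head_dim=head_dim, seq_len=seq_len, batch=batch)

print(f"MHA KV cache: {mha / 2**30:.0f} GiB")   # grows with the number of attention heads
print(f"MQA KV cache: {mqa / 2**30:.1f} GiB")   # one shared K/V head per layer
```

The roughly head-count-fold reduction is what makes 32,000-token contexts practical to serve in memory.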
Understanding Token Dropping in GPT-4
The mixture-of-experts (MoE) architecture used in GPT-4 relies on a token routing mechanism to determine which experts process each token. This can lead to certain tokens being "dropped" or unprocessed.
GPT-4 uses a simple top-2 token routing approach, where each token is sent to the 2 most likely experts according to the router. The experts themselves have a set capacity limit on how many tokens they can process per batch.
When aggregated across long input sequences and large batch sizes, the expert capacity is often exceeded, resulting in tokens being dropped. Counterintuitively, some level of dropping is actually beneficial for model performance and efficiency, as it prevents overloading experts.
The drops are non-deterministic - running the same prompt twice can lead to different drops each time. This is because the tokens are dropped differently across batches depending on capacity. The model itself remains deterministic.
While OpenAI could tweak expert capacity and reduce drops, this would substantially increase inference time and cost. The current tradeoff enables inexpensive deployment at scale. Dropping is inherent to sparse MoE designs.
Understanding how routing leads to drops provides insight into observations of randomness in GPT-4. The drops vary across usages, but the model logic itself does not.
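A minimal, illustrative simulation of capacity-limited top-2 routing (not OpenAI's router) shows how drops arise naturally from batch composition:

```python
import numpy as np

def route_top2_with_capacity(router_logits: np.ndarray, capacity: int):
    """Assign each token to its top-2 experts, dropping assignments once an expert is full.

    router_logits: (num_tokens, num_experts) scores from the routing network.
    capacity:      maximum tokens each expert may accept in this batch.
    """
    num_tokens, num_experts = router_logits.shape
    load = np.zeros(num_experts, dtype=int)
    assignments, dropped = [], []
    for token in range(num_tokens):
        top2 = np.argsort(router_logits[token])[-2:][::-1]   # two highest-scoring experts
        for expert in top2:
            if load[expert] < capacity:
                load[expert] += 1
                assignments.append((token, expert))
            else:
                dropped.append((token, expert))               # expert is full: assignment is dropped
    return assignments, dropped

# Toy batch: 64 tokens routed across 16 experts, each capped at 6 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 16))
assignments, dropped = route_top2_with_capacity(logits, capacity=6)
print(f"{len(dropped)} token-expert assignments dropped out of {64 * 2}")
```

Because the same token can land in a different position within a different batch, which assignments exceed capacity (and are therefore dropped) varies from run to run, matching the non-determinism described above.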
GPT-4 demonstrates impressive progress in language model foundations. However, future models will likely need to expand beyond a purely text-based approach.
Some areas of focus moving forward:
- Architectures that natively support vision, audio, speech, and text together
- Training models end-to-end across different data modalities
- Expanding beyond mixtures of experts for greater scalability
- Increasing training data diversity and size by orders of magnitude
- Advancing multi-modal capabilities for complex reasoning
- Optimizing model designs for real-world task performance
With each generation, OpenAI pushes closer towards artificial general intelligence. While it is further along than any other LLM/AI research company, current models remain far from true general intelligence, lacking key attributes such as volition, decision-making, memory, and real-time knowledge synthesis.
GPT-4 shows OpenAI has the technical capability to make massive leaps forward with each iteration.
The future capabilities of these models remain incredibly exciting.