𝕏 Grok-2 Beta Release
by Stephen M. Walker II, Co-Founder / CEO
Top tip
Grok-2 fine-tuning is now available, letting you customize the model for your specific use cases.
What is Grok-2?
Grok-2 represents a significant advancement in x.ai's language model offerings, building upon the capabilities of its predecessor, Grok-1.5. Released on the 𝕏 platform, Grok-2 and its smaller counterpart, Grok-2 mini, are chat assistants with stronger reasoning, conversational, and coding capabilities than Grok-1.5.
Benchmark | Grok-1.5 | Grok-2 mini‡ | Grok-2‡ |
---|---|---|---|
GPQA | 35.9% | 51.0% | 56.0% |
MMLU | 81.3% | 86.2% | 87.5% |
MMLU-Pro | 51.0% | 72.0% | 75.5% |
MATH§ | 50.6% | 73.0% | 76.1% |
HumanEval¶ | 74.1% | 85.7% | 88.4% |
MMMU | 53.6% | 63.2% | 66.1% |
MathVista | 52.8% | 68.1% | 69.0% |
DocVQA | 85.6% | 93.2% | 93.6% |
Both models are in beta. Grok-2 mini is available now on x.com, and both models are slated for release through the Enterprise API in the coming weeks.
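The improvement over Grok-1.5 can be read off the first table as percentage-point differences. The following is a minimal sketch of that arithmetic; the scores are copied from the table, while the `SCORES` layout and the `gains_over_baseline` helper are illustrative names, not part of any x.ai tooling.

```python
# Scores (%) copied from the table above, in the order
# (Grok-1.5, Grok-2 mini, Grok-2). The dict layout is an assumption
# made for this sketch.
SCORES = {
    "GPQA": (35.9, 51.0, 56.0),
    "MMLU": (81.3, 86.2, 87.5),
    "MMLU-Pro": (51.0, 72.0, 75.5),
    "MATH": (50.6, 73.0, 76.1),
    "HumanEval": (74.1, 85.7, 88.4),
    "MMMU": (53.6, 63.2, 66.1),
    "MathVista": (52.8, 68.1, 69.0),
    "DocVQA": (85.6, 93.2, 93.6),
}


def gains_over_baseline(scores):
    """Return {benchmark: (mini_gain, full_gain)} in percentage points."""
    return {
        name: (round(mini - base, 1), round(full - base, 1))
        for name, (base, mini, full) in scores.items()
    }


if __name__ == "__main__":
    for name, (mini_gain, full_gain) in gains_over_baseline(SCORES).items():
        print(f"{name:<10} Grok-2 mini: +{mini_gain:.1f} pp   Grok-2: +{full_gain:.1f} pp")
```

With these numbers, Grok-2's largest gain over Grok-1.5 is on MATH (25.5 points) and its smallest is on MMLU (6.2 points).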
Performance Benchmarks
Grok-2 and Grok-2 mini were evaluated on a series of academic benchmarks covering reasoning, reading comprehension, math, science, and coding. Both models improve on Grok-1.5 on every benchmark reported here. They are competitive with other frontier models in graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH).
Grok-2 is also strong on vision-based tasks. In the comparison below it posts the highest visual math reasoning score (MathVista, 69.0%) and scores 93.6% on document-based question answering (DocVQA), behind only Claude 3.5 Sonnet at 95.2%.
Benchmark | Grok-2 | Gemini Pro 1.5 | Llama 3 405B | GPT-4o | Claude 3.5 Sonnet |
---|---|---|---|---|---|
GPQA | 56.0% | 46.2% | 51.1% | 53.6% | 59.6% |
MMLU | 87.5% | 85.9% | 88.6% | 88.7% | 88.3% |
MMLU-Pro | 75.5% | 69.0% | 73.3% | 72.6% | 76.1% |
MATH§ | 76.1% | 67.7% | 73.8% | 76.6% | 71.1% |
HumanEval¶ | 88.4% | 71.9% | 89.0% | 90.2% | 92.0% |
MMMU | 66.1% | 62.2% | 64.5% | 69.1% | 68.3% |
MathVista | 69.0% | 63.9% | — | 63.8% | 67.7% |
DocVQA | 93.6% | 93.1% | 92.2% | 92.8% | 95.2% |
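To see where Grok-2 stands on each benchmark, the comparison table can be ranked row by row. This is a small sketch under stated assumptions: the scores are copied from the table, the missing Llama 3 405B MathVista entry is represented as `None`, and `rank_of` and `leader` are hypothetical helper names introduced for illustration.

```python
# Scores (%) copied from the comparison table above; None marks the
# entry the table leaves blank.
MODELS = ["Grok-2", "Gemini Pro 1.5", "Llama 3 405B", "GPT-4o", "Claude 3.5 Sonnet"]
TABLE = {
    "GPQA": [56.0, 46.2, 51.1, 53.6, 59.6],
    "MMLU": [87.5, 85.9, 88.6, 88.7, 88.3],
    "MMLU-Pro": [75.5, 69.0, 73.3, 72.6, 76.1],
    "MATH": [76.1, 67.7, 73.8, 76.6, 71.1],
    "HumanEval": [88.4, 71.9, 89.0, 90.2, 92.0],
    "MMMU": [66.1, 62.2, 64.5, 69.1, 68.3],
    "MathVista": [69.0, 63.9, None, 63.8, 67.7],
    "DocVQA": [93.6, 93.1, 92.2, 92.8, 95.2],
}


def rank_of(model, scores):
    """1-based rank of `model` among the models that report a score."""
    own = scores[MODELS.index(model)]
    return 1 + sum(1 for s in scores if s is not None and s > own)


def leader(scores):
    """Name of the model with the highest reported score."""
    reported = [(s, m) for s, m in zip(scores, MODELS) if s is not None]
    return max(reported)[1]


if __name__ == "__main__":
    for bench, scores in TABLE.items():
        print(f"{bench:<10} Grok-2 rank: {rank_of('Grok-2', scores)}   leader: {leader(scores)}")
```

Ranks computed this way ignore evaluation differences between vendors (prompting, shot count, sampling), so they are best read as a rough ordering rather than a controlled comparison.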
Real-Time Information Integration
Grok-2 integrates real-time information from the 𝕏 platform, enhancing its ability to provide accurate and timely responses. This feature is particularly beneficial for users seeking up-to-date insights and solutions.
Enterprise API Access
Later this month, Grok-2 and Grok-2 mini will be accessible through x.ai's enterprise API, offering developers a robust platform for integrating advanced AI capabilities into their applications. The API promises low-latency access and enhanced security features, making it a valuable tool for businesses worldwide.
Future Developments
x.ai is committed to continuous improvement and innovation. The Grok-2 release marks a pivotal moment in AI development, with plans to expand its capabilities further. Users can expect ongoing enhancements and new features that will push the boundaries of what AI can achieve.
Grok-2 is a testament to x.ai's dedication to advancing AI technology, providing users with a powerful tool for a wide range of applications, from everyday tasks to complex problem-solving scenarios.
†GPT-4-Turbo and GPT-4o scores are from the May 2024 release.
††Claude 3 Opus and Claude 3.5 Sonnet scores are from the June 2024 release.
‡ Grok-2 MMLU, MMLU-Pro, MMMU and MathVista were evaluated using 0-shot CoT.
§ For MATH, we present maj@1 results.
¶ For HumanEval, we report pass@1 benchmark scores.
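For readers unfamiliar with the metric names in the footnotes, here is a minimal sketch of how pass@k and maj@k are commonly computed. It follows the standard definitions (the unbiased pass@k estimator introduced with HumanEval, and majority voting over sampled answers); the source does not describe its evaluation harness, so the function names and signatures are assumptions for illustration only.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the unit tests, k: budget.
    For k = 1 this reduces to the fraction of passing samples, c / n.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers, reference):
    """True if the most common of the k sampled answers matches the reference.

    With a single sample (maj@1) this is plain single-answer accuracy.
    """
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common == reference


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(maj_at_k(["42"], "42"))
```

A benchmark-level score is the mean of these per-problem values over all problems in the set.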