MTEB: Massive Text Embedding Benchmark

by Stephen M. Walker II, Co-Founder / CEO

What is MTEB (Massive Text Embedding Benchmark)?

The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark designed to evaluate the performance of text embedding models across a wide range of tasks and datasets. It was introduced to address the issue that text embeddings were commonly evaluated on a limited set of datasets from a single task, making it difficult to track progress in the field and to understand whether state-of-the-art embeddings on one task would generalize to others.

MTEB encompasses 8 embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. It covers a total of 58 datasets spanning 112 languages, making it one of the most extensive text embedding benchmarks to date.

The benchmark has been used to evaluate 33 different models, revealing that no single text embedding method consistently outperforms others across all tasks. This suggests that the field has not yet converged on a universal text embedding method.

MTEB is designed to be massive, multilingual, and extensible. It includes a large number of datasets and summarizes thousands of results on its leaderboard. The benchmark is also open to contributions, such as new tasks, datasets, metrics, or leaderboard additions.

The benchmark comes with open-source code and a public leaderboard, which can be found on the Hugging Face platform. This allows researchers and practitioners to compare the performance of various models and to submit their own models for evaluation.

The MTEB project is hosted on GitHub, where the code and resources for the benchmark are available under the Apache-2.0 license. The benchmark has been discussed in academic conferences and papers, highlighting its role in providing clarity on model performance across various embedding tasks and serving as a gateway to finding universal text embeddings applicable to a variety of tasks.

Current Leaderboard

As of January 15, 2024, the leaderboard is led by voyage-lite-02-instruct.

| Rank | Model | Model Size (GB) | Embedding Dimensions | Max Tokens | Average (56 datasets) | Classification Average (12 datasets) | Clustering Average (11 datasets) | Pair Classification Average (3 datasets) | Reranking Average (4 datasets) | Retrieval Average (15 datasets) | STS Average (10 datasets) | Summarization Average (1 dataset) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | voyage-lite-02-instruct | | 1024 | 4000 | 67.13 | 79.25 | 52.42 | 86.87 | 58.24 | 56.6 | 85.79 | 31.01 |
| 2 | e5-mistral-7b-instruct | 14.22 | 4096 | 32768 | 66.63 | 78.47 | 50.26 | 88.34 | 60.21 | 56.89 | 84.63 | |
| 3 | UAE-Large-V1 | 1.34 | 1024 | 512 | 64.64 | 75.58 | 46.73 | 87.25 | 59.88 | 54.66 | 84.54 | |
| 4 | text-embedding-3-large | | 3072 | 8191 | 64.59 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
| 5 | voyage-lite-01-instruct | | 1024 | 4000 | 64.49 | 74.79 | 47.4 | 86.57 | 59.74 | 55.58 | 82.93 | 30.97 |
| 6 | Cohere-embed-english-v3.0 | | 1024 | 512 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55 | 82.62 | 30.18 |
| 7 | bge-large-en-v1.5 | 1.34 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | |
| 8 | Cohere-embed-multilingual-v3.0 | | 1024 | 512 | 64.01 | 76.01 | 46.6 | 86.15 | 57.86 | 53.84 | 83.15 | 30.99 |
| 9 | bge-base-en-v1.5 | 0.44 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | |
| 10 | ember-v1 | 1.34 | 1024 | 512 | 63.54 | 75.99 | 45.58 | 87.37 | 60.04 | 51.92 | 83.34 | |

How does MTEB work?

MTEB provides a standardized framework for evaluating text embedding models across its full set of tasks, covering 58 datasets and 112 languages. The datasets contain varying text lengths and are grouped into three categories: sentence-to-sentence (S2S), paragraph-to-paragraph (P2P), and sentence-to-paragraph (S2P).

Models are benchmarked on these tasks and datasets, and the results are made publicly available on a leaderboard. This allows for a comparison of the performance of different models across a wide range of tasks and languages. The benchmarking process has revealed that no single text embedding method consistently outperforms others across all tasks, indicating that the field has not yet converged on a universal text embedding method.

MTEB comes with open-source code, allowing researchers and developers to use the benchmark for their own models and contribute to the project. This makes MTEB not only a tool for evaluation but also a platform for collaboration and progress in the field of text embeddings.
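For example, benchmarking a Hugging Face sentence-transformers model takes only a few lines with the `mteb` Python package (installable via `pip install mteb`). The sketch below follows the README-style usage of the package; the model and task names are illustrative, and any MTEB task can be substituted.

```python
# Minimal sketch: evaluate a sentence-transformers model on one MTEB task.
# Requires: pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Illustrative model choice; any model that produces embeddings works.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate on a single classification task; pass more task names
# (or omit the argument to run the full benchmark) as needed.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```

The results are written as JSON files to the output folder, in the format used for leaderboard submissions.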

The Massive Text Embedding Benchmark (MTEB) covers the following 8 embedding tasks:

  1. Bitext mining
  2. Classification
  3. Clustering
  4. Pair classification
  5. Reranking
  6. Retrieval
  7. Semantic Textual Similarity (STS)
  8. Summarization
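Rather than naming individual datasets, the open-source package also lets you restrict an evaluation to a subset of these task types or languages. A minimal sketch, assuming the `task_types` and `task_langs` selectors documented in the project README (exact argument names may differ between package versions):

```python
# Sketch: select MTEB tasks by type and language instead of by name.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Restrict the run to English clustering and STS tasks only.
evaluation = MTEB(task_types=["Clustering", "STS"], task_langs=["en"])
evaluation.run(model, output_folder="results/clustering-sts-en")
```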

When to use MTEB

The Massive Text Embedding Benchmark (MTEB) is used when you want to evaluate the performance of text embedding models across a diverse range of tasks and datasets. It is particularly useful when you want to compare your model's performance with other models in the field.

MTEB is beneficial when you want to:

  1. Benchmark your model: MTEB allows you to benchmark any model that produces embeddings and add its results to the public leaderboard. This can help you understand how your model performs relative to other models on a variety of tasks and datasets (a minimal custom-model sketch follows this list).

  2. Choose a model for a specific task: Since model performance varies significantly depending on the task and dataset, MTEB can help you decide which model to use for a specific task. By checking the various tabs of the leaderboard, you can identify the models that perform best on the task you are interested in.

  3. Contribute to the field: MTEB is open-source and encourages contributions from the community. If you have developed a new task, dataset, metric, or model, you can contribute it to MTEB and help advance the field of text embeddings.

  4. Research: If you are a researcher in the field of text embeddings, MTEB can provide you with a comprehensive benchmark for your studies. It can help you understand the current state of the art and identify areas where further research is needed.
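As noted in item 1 above, any model that produces embeddings can be benchmarked: MTEB only expects an object exposing an `encode(sentences, **kwargs)` method that returns one vector per input sentence. A minimal sketch, with a placeholder hashing "model" standing in for a real embedding pipeline:

```python
# Sketch: benchmarking a custom model that is not a sentence-transformer.
import hashlib

import numpy as np
from mteb import MTEB


class MyEmbeddingModel:
    """Placeholder model; MTEB only needs `encode` to map sentences to vectors."""

    def __init__(self, dim: int = 384):
        self.dim = dim

    def encode(self, sentences, batch_size: int = 32, **kwargs):
        # Deterministic pseudo-embeddings derived from a hash of each sentence.
        # Purely illustrative; replace with a real forward pass through your model.
        vectors = []
        for text in sentences:
            seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
            vectors.append(np.random.default_rng(seed).standard_normal(self.dim))
        return np.stack(vectors)


model = MyEmbeddingModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/my-embedding-model")
```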

Limitations of MTEB

The Massive Text Embedding Benchmark (MTEB) is a comprehensive tool for evaluating the performance of text embedding models across a variety of tasks. However, it does have several limitations:

  1. Lack of Long Document Datasets: MTEB covers multiple text lengths but no very long documents; its longest datasets contain only a few hundred words, while much longer texts are relevant for use cases like retrieval.

  2. Task Imbalance: The number of datasets per task varies widely; summarization, for instance, consists of only a single dataset. As a result, MTEB average scores, which are computed over all datasets, are biased toward tasks with many datasets, notably retrieval, classification, and clustering.

  3. Limited Multilinguality: While MTEB contains multilingual classification, STS, and bitext mining datasets, retrieval and clustering are English-only. This limits the benchmark's ability to comprehensively evaluate models geared towards multilingual retrieval datasets.

  4. Absence of Code Datasets: MTEB does not contain any code datasets that could be used to benchmark code models.

  5. Limited Dataset Diversity: Some tasks are underrepresented in MTEB, such as summarization or pair classification. As MTEB grows, there are plans to add more datasets to these underrepresented tasks.

Despite these limitations, MTEB remains a valuable tool for benchmarking text embedding models, and its extensibility allows for the potential addition of new tasks, datasets, and metrics.

