
What is BLEU?

by Stephen M. Walker II, Co-Founder / CEO

What is the BLEU Score (Bilingual Evaluation Understudy)?

The BLEU (Bilingual Evaluation Understudy) score is an algorithm used for evaluating the quality of text that has been machine-translated from one natural language to another. It was invented at IBM in 2001 and is one of the first metrics to claim a high correlation with human judgments of quality.

The BLEU score is calculated by comparing the machine-translated text (candidate) with one or more professionally human-translated texts (references). The quality is considered to be the correspondence between a machine's output and that of a human. The central idea behind BLEU is that "the closer a machine translation is to a professional human translation, the better it is".

How is the BLEU score calculated?

The BLEU score assesses machine translation quality by measuring how closely it matches human translation. It combines n-gram precision with a brevity penalty. N-gram precision counts the n-grams (sequences of 1 to 4 words) that the machine translation shares with the human translations, divided by the total number of n-grams in the machine translation. Each shared n-gram is credited at most as many times as it occurs in a reference, so repeating a correct word does not inflate the score. The brevity penalty lowers the score when the machine translation is shorter than the reference; the penalty factor is 1 (no penalty) when the machine translation is equal to or longer than the reference.
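The two components described above can be written out as follows. This is a sketch in notation that does not appear in the text itself: $C$ is the candidate translation, $G_n(C)$ its set of distinct n-grams, $\mathrm{count}_X(g)$ the number of times n-gram $g$ occurs in text $X$, $R$ ranges over the reference translations, $c$ is the candidate length, and $r$ is the reference length.

$$
p_n = \frac{\sum_{g \in G_n(C)} \min\left(\mathrm{count}_C(g),\ \max_{R} \mathrm{count}_R(g)\right)}{\sum_{g \in G_n(C)} \mathrm{count}_C(g)}
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c \ge r \\
e^{\,1 - r/c} & \text{if } c < r
\end{cases}
$$

The $\min$ term caps the credit for each n-gram at the number of times it occurs in any single reference, which is why a candidate cannot raise its precision by repeating a matching word.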

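Candidate and reference texts are compared segment by segment, generally sentence by sentence, and the results are combined over the whole corpus to estimate the translation's overall quality. In the original formulation, the matched and total n-gram counts are pooled across all segments before the score is computed, rather than averaging a separate score for each sentence. Neither intelligibility nor grammatical correctness is taken into account.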

The BLEU score is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. A BLEU score of 1 means that the candidate sentence perfectly matches one of the reference sentences. In practice, even human translators rarely achieve a perfect score of 1.0, because a correct translation can be worded differently from the references.

The BLEU score is calculated by comparing the n-grams of machine-translated sentences to the n-grams of human-translated sentences. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The BLEU score also incorporates a brevity penalty so that a system cannot score well simply by producing very short translations.
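To make the n-gram comparison concrete, here is a minimal Python sketch of n-gram extraction and clipped n-gram precision. The function names `ngrams` and `clipped_precision` are illustrative choices, not part of any library, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Share of candidate n-grams found in the references.

    Each candidate n-gram is credited at most as many times as it
    occurs in any single reference (clipping).
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # For every n-gram, the largest count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# A degenerate candidate that repeats one common word.
candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

print(clipped_precision(candidate, references, 1))  # 2 of 7 unigrams are credited
print(clipped_precision(candidate, references, 2))  # no bigram matches
```

Without clipping, every "the" in the candidate would count as a match and the unigram precision would be 7/7. With clipping, only two are credited, because no reference contains "the" more than twice.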

What is a good BLEU score?

What counts as a good BLEU score varies with the complexity of the text and the language pair involved in the translation. Broadly speaking, a score closer to 1 indicates higher correspondence with the human reference translations. In practice, scores of about 0.3 to 0.4 already indicate understandable translations, and scores above 0.5 are very strong, as the table below shows. Even a high BLEU score doesn't guarantee a good translation, because the metric only compares n-grams and has no semantic understanding.

| BLEU Score | Interpretation |
| --- | --- |
| < 0.10 | Barely useful |
| 0.10 - 0.19 | Difficult to understand |
| 0.20 - 0.29 | Understandable, but with significant errors |
| 0.30 - 0.40 | Good, understandable translations |
| 0.40 - 0.50 | High-quality translations |
| 0.50 - 0.60 | Fluent, very high-quality translations |
| > 0.60 | Often surpasses human translation quality |

What are the key features of the BLEU Score?

The BLEU Score (Bilingual Evaluation Understudy) is a key metric in machine translation, assessing the quality of text by measuring its similarity to human reference translations. It evaluates the precision of n-grams (sequences of n items from the text) and incorporates a brevity penalty to discourage overly short translations. Despite its utility, the BLEU score has limitations: it does not account for semantic meaning and may overrate translations that are grammatically correct but contextually flawed. Nevertheless, its efficiency and simplicity make it a widely used standard in natural language processing for tasks like translation and text summarization.

How does it work?

The BLEU score evaluates machine translation by calculating the precision of n-grams, that is, the share of word sequences in the machine-generated translation that also appear in the human reference translations. Because precision alone would favor very short outputs, a brevity penalty lowers the score of translations that are shorter than the reference. Extra words in an overly long translation already reduce precision, so padding is not rewarded either.

A geometric mean of the precision scores across different n-gram lengths is computed and then multiplied by the brevity penalty to produce the final BLEU score.
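The combination step can be sketched in Python as follows. This is a minimal sentence-level illustration under stated assumptions, not a reference implementation: the names `ngram_counts` and `sentence_bleu` are hypothetical, the four n-gram lengths are weighted equally, no smoothing is applied (so a single zero precision gives a score of 0), and the reference length used for the brevity penalty is taken to be the one closest to the candidate length.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU for one tokenized candidate (illustrative sketch)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())

        # Counter union keeps, per n-gram, the largest count in any reference.
        max_ref = Counter()
        for ref in references:
            max_ref |= ngram_counts(ref, n)

        # Counter intersection keeps the smaller count: clipped matches.
        matched = sum((cand & max_ref).values())
        if total == 0 or matched == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        precisions.append(matched / total)

    # Geometric mean of the precisions, computed in log space.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)

    # Brevity penalty against the closest reference length
    # (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c >= r else math.exp(1 - r / c)

    return bp * geo_mean


reference = "the cat sat on the mat".split()

print(sentence_bleu(reference, [reference]))                      # 1.0
print(sentence_bleu("the cat sat on the".split(), [reference]))   # penalized for brevity
print(sentence_bleu("a dog".split(), [reference]))                # 0.0
```

In the second call every n-gram of the candidate appears in the reference, so all four precisions equal 1 and the score is determined entirely by the brevity penalty. Library implementations typically add smoothing and pool counts over a whole corpus, so their numbers can differ from this sketch.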

What are its benefits and limitations?

The BLEU score offers several advantages, such as providing a quantitative measure for comparing machine and human translations, being computationally efficient, and incorporating n-gram precision to reflect certain linguistic structures.

However, it has limitations, including a lack of semantic analysis, potential overrating of grammatically correct but contextually inaccurate translations, dependency on reference translations that may not cover all correct possibilities, and not always being the most suitable metric for every application.

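Additionally, the maximum n-gram length and the way the brevity penalty is applied (for example, which reference length is used when there are several references) are choices that change the result, so scores computed with different settings are not directly comparable.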
