What is BLEU?

by Stephen M. Walker II, Co-Founder / CEO

What is the BLEU Score (Bilingual Evaluation Understudy)?

The BLEU (Bilingual Evaluation Understudy) score is an algorithm used for evaluating the quality of text that has been machine-translated from one natural language to another. It was invented at IBM in 2001 and is one of the first metrics to claim a high correlation with human judgments of quality.

The BLEU score is calculated by comparing the machine-translated text (candidate) with one or more professionally human-translated texts (references). The quality is considered to be the correspondence between a machine's output and that of a human. The central idea behind BLEU is that "the closer a machine translation is to a professional human translation, the better it is".

How is the BLEU score calculated?

The BLEU score assesses machine translation quality by measuring how closely the output matches human translation. It combines n-gram precision with a brevity penalty. N-gram precision is the number of n-grams (sequences of 1 to 4 words) in the machine translation that also appear in a human translation, divided by the total number of n-grams in the machine translation. The brevity penalty lowers the score when the machine translation is shorter than the reference; the factor is 1 (no penalty) when the machine translation is as long as or longer than the reference.
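The n-gram precision described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `ngrams` and `modified_precision` are hypothetical, tokenization is a plain whitespace split, and the "clipping" step (capping each n-gram's count at the largest count found in any single reference) follows the standard BLEU formulation rather than anything spelled out in this article.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7, about 0.286
```

Without clipping, the degenerate candidate above would score a unigram precision of 7/7, because every "the" appears somewhere in a reference.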

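The score is calculated over individual translated segments, generally sentences, and then combined over the whole corpus to reach an estimate of the translation's quality. In the original formulation the n-gram counts are pooled across segments rather than averaging per-sentence scores. Intelligibility and grammatical correctness are not directly taken into account.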
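One common way to combine segments into a corpus-level score is to pool the matched and total n-gram counts across all sentences before taking the geometric mean. The sketch below assumes a single reference per candidate and whitespace-tokenized input; `corpus_bleu` here is a hypothetical helper written for this page, not the function of the same name in any library.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with pooled counts; one reference per candidate (assumption)."""
    matched = [0] * max_n   # clipped matches per n-gram order
    total = [0] * max_n     # candidate n-grams per order
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            matched[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matched) == 0:
        return 0.0  # the geometric mean is zero if any order has no matches
    log_mean = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    bp = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_mean)

cands = ["the cat sat on the mat".split(), "hello world again today".split()]
refs = ["the cat is on the mat".split(), "hello world again today".split()]
print(corpus_bleu(cands, refs))
```

Pooling matters for short sentences: the first candidate has no matching 4-gram on its own, yet the corpus as a whole still receives a non-zero score.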

The BLEU score is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. A BLEU score of 1 means that the candidate sentence perfectly matches one of the reference sentences. However, even human translators do not achieve a perfect score of 1.0.

In short, the BLEU score compares the n-grams of machine-translated sentences with the n-grams of human-translated sentences. An n-gram is a contiguous sequence of n items from a given sample of text or speech; for BLEU the items are usually words. The brevity penalty keeps a system from scoring well simply by producing very short translations.
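The brevity penalty can be sketched as follows. The article only states that the factor is 1 when the candidate is at least as long as the reference; the exponential form used for shorter candidates is taken from the standard BLEU definition, and the function name `brevity_penalty` is illustrative.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Return the BLEU brevity penalty for the given lengths (in tokens)."""
    if candidate_len == 0:
        return 0.0  # an empty candidate gets no credit
    if candidate_len >= reference_len:
        return 1.0  # no penalty for equal or longer output
    # Shorter output is penalized exponentially in the length ratio.
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(10, 10))  # 1.0
print(brevity_penalty(5, 10))   # exp(-1), about 0.368
```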

What is a good BLEU score?

What counts as a good BLEU score varies with the complexity of the text and the language pair involved in the translation. Broadly speaking, a score closer to 1 indicates higher correspondence with the human reference translation. In practice, scores well below 1 can still reflect good translations, because a correct translation rarely matches a reference word for word; the table below gives a rough guide. Even a high BLEU score does not guarantee a perfect translation, as the metric only compares n-grams and lacks semantic understanding.

| BLEU Score | Interpretation |
| --- | --- |
| < 0.10 | Barely useful |
| 0.10 - 0.19 | Difficult to understand |
| 0.20 - 0.29 | Understandable, but with significant errors |
| 0.30 - 0.40 | Good, understandable translations |
| 0.40 - 0.50 | High-quality translations |
| 0.50 - 0.60 | Fluent, very high-quality translations |
| > 0.60 | Often surpasses human translation quality |

What are the key features of the BLEU Score?

The BLEU Score (Bilingual Evaluation Understudy) is a key metric in machine translation that assesses the quality of text by measuring its similarity to human reference translations. It evaluates the precision of n-grams (sequences of n items from the text) and incorporates a brevity penalty to discourage overly short translations. Despite its utility, the BLEU score has limitations: it does not account for semantic meaning and may overrate grammatically correct but contextually flawed translations. Nevertheless, its efficiency and simplicity make it a widely used standard in natural language processing for tasks like translation and text summarization.

How does it work?

The BLEU score evaluates machine translation by calculating the precision of n-grams, the word sequences that appear in both the machine-generated and human reference translations. Precision on its own would reward very short outputs, so BLEU applies a brevity penalty to translations that are shorter than the reference.

A geometric mean of the precision scores across different n-gram lengths is computed and then multiplied by the brevity penalty to produce the final BLEU score.
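Putting the pieces together, a sentence-level score can be sketched as below. This is a simplified illustration under several assumptions: uniform weights over n-gram orders 1 to `max_n`, whitespace tokenization, no smoothing, and the reference closest in length to the candidate used for the brevity penalty. All function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(bleu(cand, [ref], max_n=2))  # sqrt(5/6 * 3/5), about 0.707
print(bleu(cand, [ref]))           # 0.0: no 4-gram matches and no smoothing
```

The second call shows why sentence-level BLEU on short segments is often reported with smoothing or computed over a whole corpus: a single missing 4-gram match drives the unsmoothed score to zero.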

What are its benefits and limitations?

The BLEU score offers several advantages, such as providing a quantitative measure for comparing machine and human translations, being computationally efficient, and incorporating n-gram precision to reflect certain linguistic structures.

However, it has limitations, including a lack of semantic analysis, potential overrating of grammatically correct but contextually inaccurate translations, dependency on reference translations that may not cover all correct possibilities, and not always being the most suitable metric for every application.

Additionally, choices such as the maximum n-gram length and the way the brevity penalty is applied differ between implementations, so scores produced by different tools are not always directly comparable.
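The lack of semantic analysis is easy to demonstrate with a toy example. The sketch below uses unigram precision only, with made-up sentences and a hypothetical `unigram_precision` helper; it is meant to illustrate the limitation, not to measure it.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate against a single reference."""
    cand = Counter(candidate)
    ref = Counter(reference)
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(len(candidate), 1)

reference = "the film was excellent".split()
paraphrase = "the movie was great".split()         # same meaning, different words
negation = "the film was not excellent".split()    # opposite meaning, similar words

print(unigram_precision(paraphrase, reference))  # 0.5
print(unigram_precision(negation, reference))    # 0.8
```

The translation that reverses the meaning scores higher than the faithful paraphrase, because only surface word overlap is counted.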
