
NLP – BLEU Score for Evaluating Neural Machine Translation – Python

Neural Machine Translation (NMT) is a standard task in NLP that involves translating text from a source language to a target language. BLEU (Bilingual Evaluation Understudy) is a score used to evaluate the translations produced by a machine translator. In this article, we'll see the mathematics behind the BLEU score and its implementation in Python.

BLEU Score

As stated above, the BLEU score is an evaluation metric for machine translation tasks. It is calculated by comparing the n-grams of a machine-translated sentence to the n-grams of one or more human (reference) translations. It has commonly been observed that the BLEU score decreases as sentence length increases, although this can vary with the model used for translation.

Mathematical Expression for BLEU Score

Mathematically, the BLEU score is given as follows:

BLEU Score = BP * exp(\sum_{n=1}^{N} w_n \log P_n)

Here,

BP stands for Brevity Penalty, which penalizes the score when the Machine Translation is too short compared to the Reference (Correct) translations.

n ∈ {1, 2, 3, 4} (i.e. N = 4), and w_n = \frac{1}{N} = 0.25 is the uniform weight assigned to each n-gram order.

Pn is the n-gram modified precision score.

The mathematical expression for the Brevity Penalty is given as follows:

Brevity Penalty (BP) = 1, if c > r
Brevity Penalty (BP) = e^{(1 - r/c)}, if c ≤ r

Here, c is the length (in tokens) of the Machine Translation output and r is the length of the Reference translation closest to it in length. When the machine translation is at least as long as the reference, no penalty is applied; shorter translations are penalized.
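To make the definition concrete, here is a minimal Python sketch of the brevity penalty (the function name and structure are illustrative, not taken from any library):

import math

def brevity_penalty(candidate_len, reference_len):
    # No penalty when the candidate is at least as long as the reference;
    # shorter candidates are penalized exponentially.
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

For example, brevity_penalty(6, 6) returns 1.0, while brevity_penalty(3, 6) returns roughly 0.37.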

Pn can be defined as follows:

P_n = \frac{\sum Clipped\,n\text{-}gram\,counts\,in\,Machine\,Translated\,Text}{\sum n\text{-}gram\,counts\,in\,Machine\,Translated\,Text}

That is, the sum of clipped n-gram counts in the machine-translated (candidate) text divided by the total number of n-grams in the candidate text. The clipped count caps the count of each n-gram at the maximum number of times it occurs in any single reference translation.
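The clipped counts can be computed directly with collections.Counter. The sketch below is my own illustration of the definition, assuming whitespace-tokenized sentences; the names are not part of any library API:

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a tokenized sentence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # n-gram counts in the machine-translated (candidate) text
    cand_counts = Counter(ngrams(candidate, n))

    # For each n-gram, the clipped count is capped by the maximum number of
    # times the n-gram occurs in any single reference translation
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total > 0 else 0.0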

For a better understanding of how the BLEU score is calculated, let us take an example. The following is a case of French-to-English translation:

  • Source Text (French): cette image est cliqué par moi
  • Machine Translated Text: the picture the picture by me
  • Reference Text-1: this picture is clicked by me
  • Reference Text-2: the picture was clicked by me

We can clearly see that the translation done by the machine is not accurate. Let’s calculate the BLEU score for the translation.

For n = 1, we’ll calculate the Unigram Modified Precision:

Unigram    Count    Clipped Count
the        2        1
picture    2        1
by         1        1
me         1        1

Here the unigrams (the, picture, by, me) are taken from the machine-translated text. Count refers to the frequency of the n-gram in the machine-translated text, and Clipped Count is that count clipped (capped) at the maximum number of times the n-gram occurs in any single reference text.

P_1 = \frac{1+1+1+1}{2+2+1+1} =0.666

For n = 2, we’ll calculate the Bigram Modified Precision:

Bigram         Count    Clipped Count
the picture    2        1
picture the    1        0
picture by     1        0
by me          1        1

P_2 = \frac{1+0+0+1}{2+1+1+1} =0.4

For n = 3, we’ll calculate the Trigram Modified Precision:

Trigram                Count    Clipped Count
the picture the        1        0
picture the picture    1        0
the picture by         1        0
picture by me          1        0

P_3 = \frac{0+0+0+0}{1+1+1+1} =0.0

For n = 4, we'll calculate the 4-gram Modified Precision:

4-gram                     Count    Clipped Count
the picture the picture    1        0
picture the picture by     1        0
the picture by me          1        0

P_4 = \frac{0+0+0}{1+1+1} =0.0
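The four modified precisions above can be reproduced with a few lines of Python. This is only a sanity check of the same counting, written from scratch with collections.Counter:

from collections import Counter

candidate = "the picture the picture by me".split()
references = ["this picture is clicked by me".split(),
              "the picture was clicked by me".split()]

for n in range(1, 5):
    # n-gram counts for the candidate and for each reference
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    refs = [Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1)) for r in references]

    # Clip each candidate n-gram count at its maximum count in any single reference
    clipped = sum(min(c, max(rc[g] for rc in refs)) for g, c in cand.items())
    print(f"P_{n} = {clipped}/{sum(cand.values())}")

This prints P_1 = 4/6, P_2 = 2/5, P_3 = 0/4 and P_4 = 0/3, matching the values computed by hand.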

Now that we have computed all the precision scores, let's find the Brevity Penalty for the translation:

Brevity Penalty (BP) = 1 if c > r, and e^{(1 - r/c)} if c ≤ r

  • Machine Translation Output Length c = 6 (Machine Translated Text: the picture the picture by me)
  • Closest Reference Length r = 6 (Reference Text-2: the picture was clicked by me)

Brevity Penalty (BP) = e^{(1 - 6/6)} = e^{0} = 1

Finally, the BLEU score for the above translation is given by:

BLEU Score = BP * exp(\sum_{n=1}^{4} \frac{1}{4} \log P_n)

On substituting the values, we get:

BLEU Score = 1 * exp(\frac{1}{4} \log 0.666 + \frac{1}{4} \log 0.4 + \frac{1}{4} \log 0 + \frac{1}{4} \log 0)

Since P_3 and P_4 are 0, their logarithms tend to −∞, so the exponent tends to −∞ and the exponential tends to 0. Equivalently, the geometric mean of the four precisions is 0 whenever any one of them is 0.

BLEU Score = 0.0

Finally, we have calculated the BLEU score for the given translation. 
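As a small sanity check (my own illustration, not library code), the brevity penalty and the four precisions can be combined in Python, taking care of the logarithm of zero:

import math

precisions = [4 / 6, 2 / 5, 0 / 4, 0 / 3]   # P_1 .. P_4 from the worked example
bp = 1.0                                    # candidate and closest reference are both 6 tokens long

# Uniform weights of 1/4; any zero precision drives the geometric mean,
# and hence the whole BLEU score, to zero
if min(precisions) > 0:
    bleu = bp * math.exp(sum(0.25 * math.log(p) for p in precisions))
else:
    bleu = 0.0

print(bleu)   # 0.0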

BLEU Score Implementation in Python 

Having calculated the BLEU score manually, we are now familiar with the mathematical working of the BLEU score. However, we don't have to implement it by hand: the Hugging Face datasets library (used below) and NLTK both provide ready-made BLEU implementations. Let's calculate the BLEU score for the same translation example as above, this time using the datasets library.

Installation Requirements:

pip install datasets

Code:

Python3
from datasets import load_metric

# Load the BLEU metric from the Hugging Face datasets library
bleu = load_metric("bleu")

# Tokenized machine-translated (candidate) sentence
predictions = [["the", "picture", "the", "picture", "by", "me"]]

# Tokenized reference translations for the candidate sentence
references = [
    [["this", "picture", "is", "clicked", "by", "me"],
     ["the", "picture", "was", "clicked", "by", "me"]]
]

print(bleu.compute(predictions=predictions, references=references))


Output:

{'bleu': 0.0, 
 'precisions': [0.6666666666666666, 0.4, 0.0, 0.0], 
 'brevity_penalty': 1.0, 
 'length_ratio': 1.0, 
 'translation_length': 6, 
 'reference_length': 6}

We can see that the BLEU score computed using Python is the same as the one computed manually. Thus, we have successfully calculated the BLEU score and understood the mathematics behind it. 
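Since BLEU is also available in NLTK, the same sentence-level calculation can be sketched with nltk.translate.bleu_score.sentence_bleu (assuming nltk is installed). With no smoothing and zero 3-gram/4-gram overlaps, NLTK typically emits a warning and reports a score of essentially 0:

from nltk.translate.bleu_score import sentence_bleu

references = [
    "this picture is clicked by me".split(),
    "the picture was clicked by me".split(),
]
candidate = "the picture the picture by me".split()

# Default weights (0.25, 0.25, 0.25, 0.25) correspond to the uniform 1/4 weights used above
print(sentence_bleu(references, candidate))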
