In past blog posts, we discussed different models, objective functions, and hyperparameter choices that allow us to learn accurate word embeddings. However, these models are generally restricted to capturing representations of words in the language they were trained on. The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on the English language and to the neglect of the plethora of other languages spoken around the world.
In our globalised society, where national borders increasingly blur and the Internet gives everyone equal access to information, it is thus imperative that we not only seek to eliminate the biases pertaining to gender or race inherent in our representations, but also aim to address our bias towards language.
To remedy this and level the linguistic playing field, we would like to leverage our existing knowledge in English to equip our models with the capability to process other languages.
Perfect machine translation (MT) would allow this. However, we do not actually need to translate examples, as long as we are able to project them into a common subspace such as the one in Figure 1.
Ultimately, our goal is to learn a shared embedding space between words in all languages. Equipped with such a vector space, we are able to train our models on data in any language. By projecting examples available in one language into this space, our model simultaneously obtains the capability to perform predictions in all other languages (we are glossing over some considerations here; for these, refer to this section). This is the promise of cross-lingual embeddings.
Over the course of this blog post, I will give an overview of models and algorithms that have been used to come closer to this elusive goal of capturing the relations between words in multiple languages in a common embedding space.
Note: While neural MT approaches implicitly learn a shared cross-lingual embedding space by optimizing for the MT objective, we will focus on models that explicitly learn cross-lingual word representations throughout this blog post. These methods generally do so at a much lower cost than MT and can be considered to be to MT what word embedding models (word2vec, GloVe, etc.) are to language modelling.
Types of cross-lingual embedding models
In recent years, various models for learning cross-lingual representations have been proposed. In the following, we will order them by the type of approach that they employ.
Note that while the nature of the parallel data used is an equally useful way to distinguish between models and has been shown to account for inter-model performance differences [1], we consider the type of approach more conducive to understanding the assumptions a model makes and, consequently, its advantages and deficiencies.
Cross-lingual embedding models generally use four different approaches:
- Monolingual mapping: These models initially train monolingual word embeddings on large monolingual corpora. They then learn a linear mapping between monolingual representations in different languages to enable them to map unknown words from the source language to the target language.
- Pseudo-cross-lingual: These approaches create a pseudo-cross-lingual corpus by mixing contexts of different languages. They then train an off-the-shelf word embedding model on the created corpus. The intuition is that the cross-lingual contexts allow the learned representations to capture cross-lingual relations.
- Cross-lingual training: These models train their embeddings on a parallel corpus and optimize a cross-lingual constraint between embeddings of different languages that encourages embeddings of similar words to be close to each other in a shared vector space.
- Joint optimization: These approaches train their models on parallel (and optionally monolingual) data. They jointly optimise a combination of monolingual and cross-lingual losses.
In terms of parallel data, methods may use different supervision signals that depend on the type of data used. These are, from most to least expensive:
- Word-aligned data: A parallel corpus with word alignments that is commonly used for machine translation; this is the most expensive type of parallel data to use.
- Sentence-aligned data: A parallel corpus without word alignments. If not otherwise specified, models use the Europarl corpus, which consists of sentence-aligned text from the proceedings of the European Parliament and is commonly used for training statistical machine translation models.
- Document-aligned data: A corpus containing documents in different languages. The documents can be topic-aligned (e.g. Wikipedia) or label/class-aligned (e.g. sentiment analysis and multi-class classification datasets).
- Lexicon: A bilingual or cross-lingual dictionary with pairs of translations between words in different languages.
- No parallel data: No parallel data whatsoever. Learning cross-lingual representations from only monolingual resources would enable zero-shot learning across languages.
To make the distinctions clearer, we provide the following table, which serves both as a table of contents and as a springboard to delve deeper into the different cross-lingual models:
After the discussion of cross-lingual embedding models, we will additionally look into how to incorporate visual information into word representations, discuss the challenges that still remain in learning cross-lingual representations, and finally summarize which models perform best and how to evaluate them.
Monolingual mapping
Methods that employ monolingual mapping train monolingual word representations independently on large monolingual corpora. They then seek to learn a transformation matrix that maps representations in one language to the representations of the other language. They usually employ a set of source word-target word pairs that are translations of each other, which are used as anchor words for learning the mapping.
Note that all of the following methods presuppose that monolingual embedding spaces have already been trained. If not stated otherwise, these embedding spaces have been learned using the word2vec variants, skip-gram with negative sampling (SGNS) or continuous bag-of-words (CBOW) on large monolingual corpora.
Linear projection
Mikolov et al. have popularised the notion that vector spaces can encode meaningful relations between words. In addition, they notice that the geometric relations that hold between words are similar across languages [2]: for instance, numbers and animals in English show a geometric constellation similar to that of their Spanish counterparts in Figure 2.
This suggests that it might be possible to transform one language's vector space into the space of another simply by utilising a linear projection with a transformation matrix \(W\).
In order to achieve this, they translate the 5,000 most frequent words from the source language and use these 5,000 translation pairs as their bilingual dictionary. They then learn \(W\) using stochastic gradient descent by minimising the distance between the previously learned monolingual representation of the source word \(x_i\) transformed using \(W\) and the representation of its translation \(z_i\) in the bilingual dictionary:
\(\min\limits_W \sum\limits^n_{i=1} \|Wx_i - z_i\|^2 \).
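As a rough illustration, here is a minimal numpy sketch of this idea. The data and variable names are hypothetical, and for brevity \(W\) is obtained with a closed-form least-squares solver rather than the stochastic gradient descent used by Mikolov et al.:

```python
import numpy as np

# Hypothetical seed dictionary: row i of X and Z holds the monolingual vectors
# of the i-th frequent source word and of its translation, respectively.
X = np.random.randn(5000, 300)   # source-language embeddings
Z = np.random.randn(5000, 300)   # target-language embeddings

# Solve min_W ||XW - Z||^2 in closed form (Mikolov et al. instead minimise the
# per-pair distances with stochastic gradient descent).
W, *_ = np.linalg.lstsq(X, Z, rcond=None)       # shape (300, 300); rows map as x @ W

def nearest_translation(x_src, target_matrix, W):
    """Project a source vector with W and return the index of the most
    cosine-similar target word."""
    projected = x_src @ W
    sims = (target_matrix @ projected) / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(projected) + 1e-9)
    return int(np.argmax(sims))
```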
Projection via CCA
Faruqui and Dyer [3] propose to use another technique to learn the linear mapping. They use canonical correlation analysis (CCA) to project words from two languages into a shared embedding space. In contrast to linear projection, CCA learns a transformation matrix for every language, as can be seen in Figure 3, where the transformation matrix \(V\) is used to project word representations from the embedding space \(\Sigma\) to a new space \(\Sigma^\ast\), while \(W\) transforms words from \(\Omega\) to \(\Omega^\ast\). Note that \(\Sigma^\ast\) and \(\Omega^\ast\) can be seen as the same shared embedding space.
Similar to linear projection, CCA also requires a number of translation pairs in \(\Sigma'\) and \(\Omega'\) whose correlation can be maximised. Faruqui and Dyer obtain these pairs by selecting, for each source word, the target word to which it has been aligned most often in a parallel corpus. Alternatively, they could also have used a bilingual dictionary.
As CCA sorts the correlation vectors in \(V\) and \(W\) in descending order, Faruqui and Dyer perform experiments using only the top \(k\) correlated projection vectors and find that using the \(80\%\) of projection vectors with the highest correlation generally yields the highest performance.
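A hedged sketch of this projection step using scikit-learn's CCA implementation is shown below; the embedding matrices are placeholders, and keeping only the top \(k\) components loosely mirrors the 80% setting reported by Faruqui and Dyer:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Placeholder translation pairs: row i of Sigma_pairs and Omega_pairs holds the
# monolingual vectors of a source word and its most frequently aligned target word.
Sigma_pairs = np.random.randn(2000, 100)   # source-language embeddings
Omega_pairs = np.random.randn(2000, 100)   # target-language embeddings

# Keep only the top-k most correlated projection vectors (roughly the
# "80% of projection vectors" setting that worked best for Faruqui and Dyer).
k = int(0.8 * 100)
cca = CCA(n_components=k)
cca.fit(Sigma_pairs, Omega_pairs)

# Project both languages into the shared space (Sigma* and Omega* above).
Sigma_star, Omega_star = cca.transform(Sigma_pairs, Omega_pairs)
```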
Interestingly, they find that using multilingual projection helps to separate synonyms and antonyms in the source language, as can be seen in Figure 4, where the unprojected synonyms and antonyms of “beautiful” are intermingled at the top, whereas their CCA-projected vectors form two distinct clusters at the bottom.
Normalisation and orthogonal transformation
Xing et al. [33] notice inconsistencies in the linear projection method by Mikolov et al. (2013), which they set out to resolve. Recall that Mikolov et al. initially learn monolingual word embeddings. For this, they use the skip-gram objective, which is the following:
\(\dfrac{1}{N} \sum\limits_{i=1}^N \sum\limits_{-C \leq j \leq C, j \neq 0} \text{log} \: P(w_{i+j} \: | \: w_i) \)
where \(C\) is the context length and \(P(w_{i+j} \: | \: w_i)\) is computed using the softmax:
\(P(w_{i+j} \: | \: w_i) = \dfrac{\text{exp}(c_{w_{i+j}}^T c_{w_i})}{\sum_w \text{exp}(c_w^T c_{w_i})}\).
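For concreteness, a tiny numpy sketch of this softmax is given below (simplified notation: a single matrix \(C\) holds the vectors \(c_w\); real implementations replace the full softmax with hierarchical softmax or negative sampling):

```python
import numpy as np

def skipgram_softmax(center_idx, context_idx, C):
    """P(w_{i+j} | w_i) under the softmax above; C[w] holds the vector c_w.

    A didactic sketch only: practical implementations avoid the full softmax
    via hierarchical softmax or negative sampling (SGNS).
    """
    logits = C @ C[center_idx]          # c_w^T c_{w_i} for every word w
    logits -= logits.max()              # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[context_idx]
```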
They then learn a linear transformation between the two monolingual vector spaces with:
\(\min\limits_W \sum\limits_i \|Wx_i - z_i\|^2 \)
where \(W\) is the projection matrix that should be learned and \(x_i\) and \(z_i\) are word vectors in the source and target language respectively that are similar in meaning.
Xing et al. argue that there is a mismatch between the objective function used to learn word representations (maximum likelihood based on inner product), the distance measure for word vectors (cosine similarity), and the objective function used to learn the linear transformation (mean squared error), which may lead to degradation in performance.
They subsequently propose a method to resolve each of these inconsistencies. In order to fix the mismatch between the inner-product similarity measure \(c_w^T c_{w'}\) used during training and the cosine similarity measure \(\dfrac{c_w^T c_{w'}}{\|c_w\| \|c_{w'}\|}\) used for testing, the inner product could also be used for testing. Cosine similarity, however, is the conventional evaluation measure in NLP and generally performs better than the inner product. For this reason, they propose to normalise the word vectors to unit length during training, which makes the inner product identical to cosine similarity and, as a side-effect, places all word vectors on a hypersphere, as can be seen in Figure 5.
They resolve the inconsistency between the cosine similarity now used in training and the mean squared error employed for learning the transformation by also using cosine similarity to learn the mapping, which yields:
\(\max\limits_W \sum\limits_i (Wx_i)^T z_i \).
Finally, in order to also normalise the projected vector \(Wx_i\) to unit length, they constrain \(W\) to be an orthogonal matrix, which they obtain by solving a separate optimisation problem.
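The sketch below illustrates the effect of such an orthogonality constraint. It uses the closed-form orthogonal Procrustes solution on unit-normalised dictionary vectors rather than Xing et al.'s own optimisation procedure, and all data is randomly generated:

```python
import numpy as np

def orthogonal_mapping(X, Z):
    """Orthogonal W such that (x / ||x||) @ W approximates z / ||z||.

    X, Z: (n_pairs, dim) seed-lexicon vectors. This uses the closed-form
    orthogonal Procrustes solution purely to illustrate the constraint;
    Xing et al. solve their own optimisation problem instead.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length source vectors
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit-length target vectors
    U, _, Vt = np.linalg.svd(Xn.T @ Zn)                 # SVD of the cross-covariance
    return U @ Vt                                       # orthogonal by construction

X, Z = np.random.randn(5000, 300), np.random.randn(5000, 300)
W = orthogonal_mapping(X, Z)
assert np.allclose(W @ W.T, np.eye(300), atol=1e-6)     # W preserves norms and dot products
```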
Max-margin and intruders
Lazaridou et al. [28] identify another issue with the linear transformation objective of Mikolov et al. (2013): they discover that using least squares as the objective for learning a projection matrix leads to hubness, i.e. some words tend to appear as nearest neighbours of many other words. To resolve this, they use a margin-based (max-margin) ranking loss (Collobert et al. [34]) to train the model to rank the correct translation vector \(y_i\) of a source word \(x_i\), which is projected to \(\hat{y_i}\), higher than any other target words \(y_j\):
\(\sum\limits^k_{j \neq i} \max \{ 0, \gamma + \cos(\hat{y_i}, y_j) - \cos(\hat{y_i}, y_i) \} \)
where \(k\) is the number of negative examples and \(\gamma\) is the margin.
They show that choosing the max-margin loss over the least-squares loss consistently improves performance and reduces hubness. In addition, the choice of the negative examples, i.e. the target words compared to which the model should rank the correct translation higher, is important. They hypothesise that an informative negative example is an intruder (“truck” in the example), i.e. one that is near the current projected vector \(\hat{y_i}\) but far from the actual translation vector \(y_i\) (“cat”), as depicted in Figure 6.
These intruders should help the model identify cases where it fails considerably to approximate the target function and should thus allow it to correct its behaviour. At every step of gradient descent, they compute \(s_j = \cos(\hat{y_i}, y_j) - \cos(y_i, y_j) \) for every vector \(y_j\) in the target embedding space with \(j \neq i\) and choose the vector with the largest \(s_j\) as the negative example for \(x_i\). Using intruders instead of random negative examples yields a small improvement of 2 percentage points on their comparison task.
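A rough numpy sketch of the intruder-based loss for a single source word is given below; the values of \(k\) and \(\gamma\) and all variable names are illustrative rather than the authors' settings:

```python
import numpy as np

def unit(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-9)

def intruder_ranking_loss(y_hat, y_true, Y_target, true_idx, k=10, gamma=0.5):
    """Max-margin loss for one source word with intruder negative examples."""
    Yn = unit(Y_target)
    y_hat_n, y_true_n = unit(y_hat), unit(y_true)
    # Intruder score s_j: close to the projected vector, far from the gold translation.
    s = Yn @ y_hat_n - Yn @ y_true_n
    s[true_idx] = -np.inf                       # never pick the gold translation itself
    intruders = np.argsort(-s)[:k]              # the k most "intrusive" candidates
    margins = gamma + Yn[intruders] @ y_hat_n - y_hat_n @ y_true_n
    return np.maximum(0.0, margins).sum()
```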
Alignment-based projection
Guo et al. [4] propose another projection method that solely relies on word alignments. They count the number of times each word in the source language is aligned with each word in the target language in a parallel corpus and store these counts in an alignment matrix \(\mathcal{A}\).
In order to project a word \(w_i\) from its source representation \(v(w_i^S)\) to its representation \(v(w_i^T)\) in the target embedding space, they simply take the average of the embeddings of its translations \(v(w_j^T)\), weighted by their alignment probability with the source word:
\(v(w_i^T) = \sum\limits_{i, j \in \mathcal{A}} \dfrac{c_{i, j}}{\sum_j c_{i,j}} \cdot v(w_j^T)\)
where \(c_{i,j}\) is the number of times the \(i^{th}\) source word has been aligned to the \(j^{th}\) target word.
The problem with this method is that it only assigns embeddings to words that are aligned in the reference parallel corpus. Guo et al. thus propagate alignments from in-vocabulary to OOV words, using edit distance as a proxy for morphological similarity. They set the projected vector \(v(w_{OOV}^T)\) of an OOV source word to the average of the projected vectors of source words that are similar to it in edit distance:
\(v(w_{OOV}^T) = \text{Avg}(v(w^T)), \: w \in C\)
where \(C = \{ w \: | \: \text{EditDist}(w_{OOV}, w) \leq \tau \} \). They set the threshold \(\tau\) empirically to \(1\).
Even though this approach seems simplistic, they actually observe significant improvements over projection via CCA in their experiments.
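A minimal sketch of the alignment-weighted projection might look as follows, assuming a dense count matrix \(\mathcal{A}\) and a target embedding matrix (OOV words would additionally be handled via the edit-distance average described above):

```python
import numpy as np

def project_source_word(i, A, target_embeddings):
    """Project source word i into the target space as the alignment-count-weighted
    average of its translations' embeddings (a rough sketch).

    A: (src_vocab, tgt_vocab) matrix of alignment counts c_ij.
    target_embeddings: (tgt_vocab, dim) matrix of target-language vectors.
    """
    counts = A[i].astype(float)
    if counts.sum() == 0:
        raise ValueError("unaligned word; fall back to the edit-distance average")
    weights = counts / counts.sum()        # alignment probabilities c_ij / sum_j c_ij
    return weights @ target_embeddings     # weighted average of the translation vectors
```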
Multilingual CCA
Ammar et al. [5] extend the bilingual CCA projection method of Faruqui and Dyer (2014) to the multilingual setting, using the English embedding space as the foundation for their multilingual embedding space.
They learn the two CCA projection matrices for every other language paired with English. The transformation from each target-language space \(\Omega\) to the English embedding space \(\Sigma\) can then be obtained by projecting the vectors in \(\Omega\) into the CCA space \(\Omega^\ast\) using the transformation matrix \(W\), as in Figure 3. As \(\Omega^\ast\) and \(\Sigma^\ast\) lie in the same space, vectors in \(\Sigma^\ast\) can be projected into the English embedding space \(\Sigma\) using the inverse of \(V\).
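Assuming the two CCA transformation matrices have already been learned (e.g. with a procedure like the CCA sketch above), chaining them might look as follows; the use of a pseudo-inverse for \(V\) is an assumption to cover the case where \(V\) is not square:

```python
import numpy as np

def to_english_space(x_foreign, W, V):
    """Map a foreign-language vector into the English embedding space.

    W projects the foreign space Omega into the shared CCA space Omega*;
    V projects the English space Sigma into Sigma*. Since Omega* and Sigma*
    coincide, applying the (pseudo-)inverse of V maps the result back into
    Sigma. Column-vector convention; W and V are assumed to come from CCA.
    """
    x_shared = W @ x_foreign                  # Omega -> Omega* (= Sigma*)
    return np.linalg.pinv(V) @ x_shared       # Sigma* -> Sigma (English space)
```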
Hybrid mapping with symmetric seed lexicon
The previous mapping approaches used a bilingual dictionary as an inherent component of their models, but did not pay much attention to the quality of the dictionary entries, relying either on automatic translations of frequent words or on word alignments of all words.
Vulić and Korhonen [6] in turn emphasise the role of the seed lexicon used for learning the projection matrix. They propose a hybrid model that first learns an initial shared bilingual embedding space with an existing cross-lingual embedding model. They then use this initial vector space to obtain translations for a list of frequent source words by projecting them into the space and taking the nearest neighbour in the target language as the translation. With these translation pairs as seed words, they learn a projection matrix analogously to Mikolov et al. (2013).
In addition, they propose a symmetry constraint, which only includes a pair of words if their projections are nearest neighbours of each other in the initial embedding space. Additionally, one can retain only those pairs whose second nearest neighbour is less similar than the first nearest neighbour by at least some threshold.
They run experiments showing that their model with the symmetry constraint outperforms comparison models and that a small threshold of \(0.01\) or \(0.025\) leads to slightly improved performance.
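The symmetry constraint amounts to keeping only mutual nearest neighbours in the initial shared space. A small sketch of this selection step, with placeholder embedding matrices, could look like this:

```python
import numpy as np

def symmetric_seed_lexicon(S, T):
    """Return (source, target) index pairs that are mutual nearest neighbours.

    S: (n_src, dim) and T: (n_tgt, dim) are frequent-word embeddings already
    projected into the initial shared space by a first-stage model (a sketch).
    """
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    sims = Sn @ Tn.T                           # cosine similarities
    s2t = sims.argmax(axis=1)                  # best target for each source word
    t2s = sims.argmax(axis=0)                  # best source for each target word
    return [(i, int(s2t[i])) for i in range(len(S)) if t2s[s2t[i]] == i]
```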
Orthogonal transformation, normalisation, and mean centering
The previous approaches have introduced models that imposed different constraints for mapping monolingual representations of different languages to each other. The relation between these methods and constraints, however, is not clear.
Artetxe et al. [32] thus propose to generalise previous work on learning a linear transformation between monolingual vector spaces: starting with the basic optimisation objective, they propose several constraints that should intuitively help to improve the quality of the learned cross-lingual representations. Recall that the linear transformation learned by Mikolov et al. (2013) aims to find a parameter matrix \(W\) that satisfies:
\(\DeclareMathOperator*{\argmin}{argmin} \argmin\limits_W \sum\limits_i \|Wx_i - z_i\|^2 \)
where \(x_i\) and \(z_i\) are the vectors of similar words in the source and target language respectively.
If the performance of the embeddings on monolingual evaluation tasks should not be degraded, the dot products between vectors need to be preserved after the mapping. This can be guaranteed by requiring \(W\) to be an orthogonal matrix.
Secondly, in order to ensure that all embeddings contribute equally to the objective, embeddings in both languages can be normalised to be unit vectors:
\(\argmin\limits_W \sum\limits_i \| W \dfrac{x_i}{\|x_i\|} - \dfrac{z_i}{\|z_i\|} \|^2 \).
As multiplication by an orthogonal matrix does not change the norm of a vector, i.e. \(\|Wx_i\| = \|x_i\|\) if \(W\) is orthogonal, we can equivalently write the projected, normalised vector as \(\dfrac{Wx_i}{\|Wx_i\|}\):
\(\argmin\limits_W \sum\limits_i \| \dfrac{Wx_i}{\|Wx_i\|} - \dfrac{z_i}{\|z_i\|} \|^2 \).
Expanding the squared norm, we obtain:
\(\argmin\limits_W \sum\limits_i \| \dfrac{Wx_i}{\|Wx_i\|} \|^2 + \| \dfrac{z_i}{\|z_i\|} \|^2 - 2 \left( \dfrac{Wx_i}{\|Wx_i\|} \right)^T \dfrac{z_i}{\|z_i\|} \).
As the norm of a unit vector is \(1\), the first two terms each reduce to \(1\), which leaves us with the following:
\(\argmin\limits_W \sum\limits_i 2 - 2 \left( \dfrac{Wx_i}{\|Wx_i\|} \right)^T \dfrac{z_i}{\|z_i\|} \).
The latter term is now just the cosine similarity of \(Wx_i\) and \(z_i\):
\(\argmin\limits_W \sum\limits_i 2 - 2 \: \text{cos}(Wx_i, z_i) \).
As we are interested in finding parameters \(W\) that minimise our objective, we can remove the constants above:
\(\argmin\limits_W \sum\limits_i - \text{cos}(Wx_i, z_i) \).
Minimising the negative cosine similarity is in turn equivalent to maximising the cosine similarity between \(Wx_i\) and \(z_i\), which is precisely the cosine objective used by Xing et al. for learning the transformation.
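As a quick numeric sanity check of this derivation, the snippet below verifies on random unit-normalised vectors and a random orthogonal \(W\) that the summed squared error and the summed cosine similarity differ only by constants (row-vector convention, purely illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 300
X = rng.normal(size=(n, dim))
Z = rng.normal(size=(n, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-length source vectors
Z /= np.linalg.norm(Z, axis=1, keepdims=True)       # unit-length target vectors

W, _ = np.linalg.qr(rng.normal(size=(dim, dim)))    # a random orthogonal matrix

squared_error = ((X @ W - Z) ** 2).sum()            # sum_i ||W x_i - z_i||^2
cosine_sum = ((X @ W) * Z).sum()                    # sum_i cos(W x_i, z_i)

# For unit vectors and orthogonal W: ||W x_i - z_i||^2 = 2 - 2 cos(W x_i, z_i).
assert np.isclose(squared_error, 2 * n - 2 * cosine_sum)
```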