Introduction to Word Embedding
“You shall know a word by the company it keeps,” insisted John R. Firth, a British linguist who performed pioneering work on collocational theories of semantics. What Firth meant by his 1957 quote was that interrogating the context in which a word is found offers clues to the word’s use and purpose — effectively, its meaning. Nestled within this notion is the entire field of distributional semantics, which seeks to “derive a model of meaning from observable uses of language.” By examining the various linguistic contexts that a word occupies in a large corpus of textual data, we can approximate the word’s meaning through a combination of raw word counts and statistically sound association measures. This approximation supports a number of natural language processing tasks by corralling words with similar meaning — thus enabling feature engineering for predictive NLP models and deep learning tools.
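To make the idea concrete, here is a minimal sketch (not drawn from any of the papers discussed here) of how raw co-occurrence counts can be turned into a standard association measure, positive pointwise mutual information (PPMI). The toy corpus, window size, and word pairs are illustrative assumptions only.

```python
# Minimal sketch: build a word-word co-occurrence matrix from a toy corpus,
# then convert raw counts into positive pointwise mutual information (PPMI).
# The corpus and window size are invented for illustration.
from collections import Counter

import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]
window = 2  # treat words as co-occurring if at most 2 positions apart

word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    for i, w in enumerate(sentence):
        word_counts[w] += 1
        for j in range(i + 1, min(i + 1 + window, len(sentence))):
            pair_counts[(w, sentence[j])] += 1
            pair_counts[(sentence[j], w)] += 1

vocab = sorted(word_counts)
index = {w: k for k, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pair_counts.items():
    counts[index[w], index[c]] = n

# PPMI: log of observed vs. expected co-occurrence, negatives clipped to zero
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log(counts * total / (row * col))
ppmi = np.maximum(pmi, 0.0)

print(ppmi[index["cat"], index["sat"]])  # "cat" and "sat" keep company
print(ppmi[index["cat"], index["rug"]])  # "cat" and "rug" do not
```

Words that keep the same company end up with similar rows in this matrix, which is exactly the kind of signal later embedding methods exploit at scale.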
While the name “distributional semantics” captures the nature of this academic arena, other terminology is not uncommon. “Vector semantics” emphasizes vectors as the mathematical tool of choice for modeling relations between words, a vector being a single row or column of numbers. Words with similar meanings sit closer to each other in the vector space. In 2003, “word embedding” emerged as a related term that became increasingly popular in the deep learning community. Over the course of this article, we will rely on “word embedding” as our phrase of choice, but all of the aforementioned are viable descriptions of the work being discussed. To better understand modern word embedding, it’s important to know how we got to where we are today. This two-part series will draw upon five research papers to investigate how word embedding grew from theory into a full-fledged technique whose applications span from part-of-speech tagging to automatic summarization.
The Beginnings of Word Embedding
Throughout the 1950s, Firth published a number of influential pieces, alongside contemporaries like Ludwig Wittgenstein, Zellig Harris, and Margaret Masterman. Wittgenstein wrote in 1953 that “the meaning of a word is its use in the language.” Harris noted astutely in 1954 that the words “oculist and eye-doctor…occur in almost the same environments.” Masterman’s research group at the Cambridge Language Research Unit eschewed the dictionary in the 1950s and took a thesaurus-based approach to machine translation, concentrating on clusters of words. These trailblazers in linguistics and philosophy laid the theoretical groundwork for word embedding long before vectors entered the picture.
Finding Feature Representations in the 1960s
The 1960s bore witness to the application of “feature representations to quantify (semantic) similarity,” typically using hand-crafted features. To take a noteworthy example, psychologist Charles Osgood introduced the Semantic Differential, which sought to map the meaning of words onto scales of polar adjectives according to each word’s linguistic connotations. Chiefly, Osgood wanted to quantify meaning in a reproducible way. Osgood’s 1952 paper, [1] The Nature and Measurement of Meaning, gives a detailed overview of the Semantic Differential. Before doing so, however, the paper closely examines semiotics and how humans make sense of words as “signs” that elicit behavioral and psychological responses. Osgood hints at a budding idea: there is some ineffable quality about words that makes meaning difficult to pin down, a trait so hard to observe directly that inferences must be made instead.
“The vast majority of signs used in ordinary communication are what we may term assigns — their meanings are literally ‘assigned’ to them via association with other signs rather than via direct association with the objects represented,” wrote Osgood (1952).
This associative property led Osgood to the work of fellow psychologists Sigmund Freud and Carl Jung, both of whom had essentially arrived at the conclusion that word associations are never really “free,” but rather “semantically determined” (Osgood 1952). Needle and thread. Grass and green. Certain words flock together, shedding light on the minutiae of meaning. Today’s word vectors even permit analogies to be made through basic mathematical operations, as discovered by Mikolov et al. in a paper that we’ll explore in the second installment of this series. What does this look like? In a meaningful vector space, computing vec(“green”) − vec(“grass”) + vec(“sky”) would likely produce a vector close to that of “blue.” The insights of Osgood, Freud, and Jung intuitively anticipated the relevance of association in discerning meaning.
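With pretrained vectors on hand, this kind of analogy arithmetic is a one-liner. The sketch below assumes the gensim library and one of the GloVe vector sets it can download; whether “blue” actually tops the list depends on the particular embeddings.

```python
# Hedged sketch: analogy arithmetic over pretrained vectors, assuming gensim
# is installed and its "glove-wiki-gigaword-50" dataset can be downloaded.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

# grass : green :: sky : ?   i.e.  vec("green") - vec("grass") + vec("sky")
print(vectors.most_similar(positive=["green", "sky"], negative=["grass"], topn=3))
```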
Furthermore, Osgood found that the farther apart two words appeared in text or dialogue, the less influence they exerted on one another. With this in mind, close association could be viewed as a potential proxy for meaning. Although word meaning can hypothetically differ in infinite ways, Osgood designed the Semantic Differential as a constrained quantitative method for locating meaning along graded adjectival dimensions. Marking one of the first attempts to make the unmeasured measurable, the Semantic Differential asserted that semantics was not beyond the scope of quantification.
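As a rough illustration of the idea (the scales and the numbers below are invented, not Osgood’s actual data), each word can be represented as a small vector of ratings on polar adjective scales, and semantic distance becomes ordinary geometric distance:

```python
# Illustrative only: hand-assigned ratings on three polar adjective scales,
# echoing Osgood's evaluation / potency / activity dimensions.
# The words and numbers are invented for the sake of the example.
import numpy as np

scales = ["good-bad", "strong-weak", "active-passive"]
ratings = {
    "hero":    np.array([ 2.8,  2.5,  2.0]),
    "villain": np.array([-2.6,  2.2,  1.8]),
    "kitten":  np.array([ 2.4, -1.5,  1.3]),
}

def semantic_distance(a, b):
    """Euclidean distance between two words in the semantic-differential space."""
    return float(np.linalg.norm(ratings[a] - ratings[b]))

print(semantic_distance("hero", "villain"))  # large: they differ sharply on good-bad
print(semantic_distance("hero", "kitten"))   # smaller: they differ mainly on strong-weak
```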
A Shift Towards Automatic Feature Generation: Latent Semantic Analysis
In the 1980s, representations akin to Osgood’s Semantic Differential continued to find utility in early artificial intelligence research. Closer to 1990, however, automatically generated contextual features appeared on the scene, introduced in a variety of concurrent but diverse models. Latent semantic analysis (LSA) was considered “one of the most influential early models,” and it paved the way for the later development of topic modeling. Here is where the vectors come in.
For insight into LSA, look no further than Thomas Hofmann’s [2] Probabilistic Latent Semantic Analysis (1999), a paper that roots the technique in statistics using a generative model and then observes its performance in a series of experiments. LSA takes high-dimensional word-count vectors, typically the rows or columns of a word-document co-occurrence matrix, and applies singular value decomposition to map them onto a “latent semantic space” of lower dimensionality. This mapping unveils meaning relations. To put it simply, words that appear in similar contexts tend to be more closely related to one another, and LSA capitalizes on this phenomenon. Hofmann’s paper offers scholars a deep dive into LSA while adding a previously unseen probabilistic component, resulting in a more statistically rigorous method as well as considerable performance gains.
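A minimal sketch of that pipeline, echoing Harris’s oculist / eye-doctor observation from earlier, can be built with scikit-learn’s TruncatedSVD, a standard way to perform LSA. The toy corpus, the choice of two latent dimensions, and the word pairs compared are assumptions made purely for illustration.

```python
# Minimal LSA sketch (illustrative corpus and dimensionality, not from the paper):
# word-count vectors are reduced via truncated SVD to a low-dimensional
# "latent semantic space", where related words end up with similar vectors.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the oculist examined the patient and treated the eye infection",
    "the eye doctor examined the patient and prescribed glasses",
    "the pitcher threw a fastball past the batter",
    "the batter hit a home run off the pitcher",
]

vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(docs).T    # rows = words, columns = documents

svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(term_doc)     # each word mapped to 2 latent dimensions

vocab = list(vectorizer.get_feature_names_out())

def similarity(w1, w2):
    i, j = vocab.index(w1), vocab.index(w2)
    return cosine_similarity(word_vectors[i:i + 1], word_vectors[j:j + 1])[0, 0]

# "oculist" and "doctor" never co-occur, but they appear in similar documents,
# so LSA should place them closer together than "oculist" and "pitcher".
print(similarity("oculist", "doctor"), similarity("oculist", "pitcher"))
```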
The next part of this series will trace the development of word embedding through the rest of the 1990s and beyond, further enriching the discussion with three additional research papers in the field.