This article is the second in a two-part series about the history of NLP, told through the lens of five research papers. It picks up in the midst of the 1970s. To view the first article, click here.
Corpus resource development
The relation-driven academic era that spilled into the late ’70s laid the groundwork for NLP’s grammatico-logical stage. As researchers sought logical representations of meaning and knowledge, this stage produced many formal grammars, defined by Maggie Johnson and Julie Zelenski as “a set of rules by which valid sentences in a language are constructed.” These grammars also improved computational parsing, since their context-free rules could generate all of the potential grammatical strings in a formalized language.
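To make the idea concrete, here is a minimal sketch, not drawn from any of the papers discussed, of a toy context-free grammar and parser built with Python’s NLTK library. The grammar rules and vocabulary are illustrative placeholders.

```python
# A toy context-free grammar: a handful of rules that define which
# word sequences count as valid sentences, plus a chart parser that
# recovers the structure of a sentence licensed by those rules.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V NP
    Det -> 'the' | 'a'
    N   -> 'linguist' | 'parser'
    V   -> 'builds' | 'fires'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the linguist builds a parser".split()):
    print(tree)
# (S (NP (Det the) (N linguist)) (VP (V builds) (NP (Det a) (N parser))))
```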
Along with the proliferation of grammars came an increase in the resources available for both research and commercial purposes, driven largely by a surge in machine-readable text as more of our lives were lived on the computer. This resource development continued through the end of the century, aided by growing government funding. As statistical language processing gained popularity, corpus data came to be seen as a major boon, and massive databases like WordNet (lexical) and the Penn Treebank (syntactic) were launched in 1985 and 1989, respectively.
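As a quick illustration of the kind of lexical resource WordNet provides, the sketch below queries it through NLTK. It assumes NLTK and its WordNet data (via nltk.download('wordnet')) are installed; the word “bank” is just an example lookup.

```python
# Query WordNet's lexical database: each synset groups words that share
# a sense and carries a short gloss (definition) for that sense.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())
# e.g. bank.n.01 - sloping land (especially the slope beside a body of water)
```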
[3] “Building a Large Annotated Corpus of English: The Penn Treebank,” published in 1993 by Mitchell P. Marcus et al., outlines the workflow of the researchers who constructed the Penn Treebank from over 4.5 million words of American English. With a tagset of 36 part-of-speech (POS) tags, the Treebank’s creators cut down on some of the tag redundancy found in the earlier Brown Corpus. This paper, though not statistically or computationally profound, is noteworthy for its thorough discussion of the numerous factors that must be accounted for in an annotation task, particularly one of this scope. The Treebank has been instrumental in a variety of research undertakings, such as parsing disambiguation and lexicon design, and its continued use is a testament to the value of corpus resources and their role in NLP.
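For a feel of what the annotation looks like, the sketch below loads the small Penn Treebank sample that ships with NLTK (a fraction of the Wall Street Journal section) and prints the POS tags of its first sentence. It assumes nltk.download('treebank') has been run.

```python
# Inspect the Penn Treebank sample bundled with NLTK: each token is
# paired with one of the Treebank's part-of-speech tags.
from nltk.corpus import treebank

print(treebank.tagged_sents()[0])
# [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
#  ('years', 'NNS'), ('old', 'JJ'), ...]
```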
The history of NLP as Artificial Intelligence
Looking from the turn of the 21st century to where we are today, NLP has witnessed the introduction of both machine learning and deep learning as computing power has expanded, allowing researchers to build robust systems and run experiments quickly and efficiently. While the current best language systems tend to operate within highly specific domains, the creation of language models underpinned by probability has resulted in widespread improvements.
A particular paper that communicates the impact of AI-driven advancements in the field is [4] “A Neural Probabilistic Language Model” (Bengio et al., 2003). It discusses the “curse of dimensionality,” which plagues many data-driven NLP tasks due to the seemingly infinite number of discrete variables in natural language: with a vocabulary of 100,000 words, even a sequence of just 10 words is one of 100,000^10, or 10^50, possible combinations. Because of this, a test set will inevitably differ from the training set, often to a vast degree, which raises the question of how to produce a model that generalizes.
The proposed solution: the language model learns a feature vector for each word to capture similarity, and simultaneously learns a probability function over word sequences expressed in terms of those vectors. A neural network models the probability function and, in the authors’ experiments, is combined with a trigram model. In this setup, even a never-before-seen sentence can obtain a high probability if its words are similar to those of a previously observed sentence. Combining neural networks with vector semantics, this implementation illustrates a merging of statistical language data and AI-style knowledge representation to develop a less brittle linguistic system.
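The sketch below is a minimal, illustrative version of that architecture in PyTorch, not Bengio et al.’s original implementation: an embedding table supplies the per-word feature vectors, and a small feed-forward network maps the concatenated context vectors to a probability distribution over the next word. The toy vocabulary, layer sizes, and context length are placeholders.

```python
# A minimal neural probabilistic language model sketch (illustrative only).
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=30, context_size=2, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)         # learned feature vector per word
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)          # one score per vocabulary word

    def forward(self, context_ids):                              # shape: (batch, context_size)
        vectors = self.embed(context_ids).flatten(start_dim=1)   # concatenate context word vectors
        hidden = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.output(hidden), dim=-1)    # log P(next word | context)

# Toy usage: predict the third word from the previous two.
vocab = {'<s>': 0, 'the': 1, 'linguist': 2, 'parses': 3, 'sentences': 4}
model = NeuralProbabilisticLM(vocab_size=len(vocab))
context = torch.tensor([[vocab['the'], vocab['linguist']]])
log_probs = model(context)                                       # shape: (1, vocab_size)
print(log_probs.exp().sum())                                     # probabilities sum to 1
```

Because similar words end up with similar feature vectors during training, sequences that were never observed can still inherit reasonable probabilities from sequences of related words, which is the source of the generalization described above.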
What’s Next?
In the NLP space, the most salient dichotomy is perhaps that between formal theory and statistical data. Renowned researcher Frederick Jelinek once bitingly said, “Every time I fire a linguist, the performance of our speech recognizer goes up.”
To divorce NLP from its basis in linguistics is short-sighted, but finding the optimal balance between the intricacies of theory and the pragmatics of probability continues to elude many. Emily M. Bender and Jeff Good’s [5] “A Grand Challenge for Linguistics: Scaling Up and Integrating Models” looks at the field’s future from a perspective rooted in linguistics but with an eye for data management and the requisite cyberinfrastructure.
Saying where we are going is far harder than describing where we have come from. Nevertheless, the framework Bender and Good offer provides insight into the kinds of computational tools and human collaboration we should expect if NLP is to achieve its potential in accurately modeling natural language.