Natural language processing (NLP), including conversational AI, is arguably one of the most exciting technology fields today. NLP is important because it resolves ambiguity in language and adds useful analytical structure to data for a plethora of downstream applications such as speech recognition and text analytics. NLP helps computers communicate with humans in their own language and scales other language-centric tasks. For example, NLP makes it possible for computers to read text, listen to speech, interpret conversations, measure sentiment, and determine which segments are important. Even though budgets were hit hard by the pandemic, 53% of technical leaders said their NLP budget was at least 10% higher compared to 2019. In addition, many recent NLP breakthroughs are moving from research to production.
The last couple of years have been big for NLP, with a number of high-profile research efforts involving generative pre-training (GPT), transfer learning, pre-trained contextual models (e.g., BERT, ELMo), multilingual NLP, training models with reinforcement learning, automating customer service with a new era of chatbots, NLP for social media monitoring, fake news detection, and much more.
In this article, I’ll help get you up to speed with current NLP research efforts by curating a list of top recent papers from a variety of research venues, including arXiv.org, the International Conference on Learning Representations (ICLR), the Stanford NLP Group, NeurIPS, and KDD. Enjoy!
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. At some point, however, further increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, this paper presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that the proposed methods lead to models that scale much better compared to the original BERT. The paper also uses a self-supervised loss that focuses on modeling inter-sentence coherence and shows that it consistently helps downstream tasks with multi-sentence inputs. As a result, the best model from this NLP research establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The GitHub repo associated with this paper can be found HERE.
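To make the two parameter-reduction techniques concrete, here is a minimal PyTorch sketch (not the authors' code; all sizes are illustrative): a factorized embedding that passes through a small dimension before being projected up to the hidden size, and a single transformer layer whose weights are reused at every depth.

```python
import torch
import torch.nn as nn

class FactorizedSharedEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        # Factorized embedding: V x E plus E x H instead of one V x H table.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.project = nn.Linear(embed_dim, hidden_dim)
        # Cross-layer parameter sharing: one layer reused at every depth.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, token_ids):
        x = self.project(self.embed(token_ids))
        for _ in range(self.num_layers):   # same parameters on every pass
            x = self.shared_layer(x)
        return x
```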
CogLTX: Applying BERT to Long Texts
BERT is incapable of processing long texts due to its quadratically increasing memory and time consumption. Attempts to address this problem, such as slicing the text with a sliding window or simplifying transformers, suffer from insufficient long-range attention or require customized CUDA kernels. The limited text length of BERT is reminiscent of the limited capacity (5–9 chunks) of human working memory – so how do human beings “Cognize Long TeXts?” Founded on the cognitive theory stemming from Baddeley, the CogLTX framework described in this NLP research paper identifies key sentences by training a judge model, concatenates them for reasoning, and enables multi-step reasoning via rehearsal and decay. Since relevance annotations are usually unavailable, the paper proposes using treatment experiments to create supervision. As a general algorithm, CogLTX outperforms or matches SOTA models on NewsQA, HotpotQA, and multi-class and multi-label long-text classification tasks, with memory overhead independent of the text length.
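As a rough illustration of the selection step, here is a hypothetical sketch of how a judge model's relevance scores might be used to pick key blocks that fit within BERT's length budget; `judge_score` stands in for the trained judge, and everything here is simplified relative to the paper.

```python
def select_key_blocks(blocks, judge_score, budget=512):
    """blocks: list of token lists; judge_score: block -> relevance score."""
    ranked = sorted(blocks, key=judge_score, reverse=True)
    selected, used = [], 0
    for block in ranked:
        if used + len(block) <= budget:   # greedily fill BERT's token budget
            selected.append(block)
            used += len(block)
    selected.sort(key=blocks.index)       # restore original document order
    return [tok for block in selected for tok in block]
```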
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, this NLP research paper proposes a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, the new approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, the new approach trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by this approach substantially outperform the ones learned by BERT given the same model size, data, and compute.
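The discriminator half of this setup can be tried directly through Hugging Face Transformers, which ships pre-trained ELECTRA checkpoints. The snippet below feeds a corrupted sentence to the small discriminator and reads off its per-token replaced/original predictions; the example sentence is illustrative.

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

corrupted = "the chef ate the meal"            # "cooked" swapped for a plausible token

inputs = tokenizer(corrupted, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits    # one replaced/original score per token
predictions = (logits > 0).long()[0, 1:-1]     # drop [CLS]/[SEP]; 1 = "replaced"
print(list(zip(tokenizer.tokenize(corrupted), predictions.tolist())))
```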
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Large pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but they have so far only been investigated for extractive downstream tasks. This NLP research paper explores a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. RAG models are introduced where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. Two RAG formulations are compared: one conditions on the same retrieved passages across the whole generated sequence, while the other can use different passages per token.
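RAG is available in Hugging Face Transformers, so a usage sketch is short. Note that the full Wikipedia index is large, so this example loads the small dummy index purely for illustration:

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who wrote the declaration of independence?",
                   return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```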
ConvBERT: Improving BERT with Span-based Dynamic Convolution
Pre-trained language models like BERT and its variants have recently achieved impressive performance on various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and computation cost. Although all its attention heads query the whole input sequence to generate the attention map from a global perspective, some heads only need to learn local dependencies, which means there is computation redundancy. This NLP research paper proposes a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. Equipping BERT with this mixed attention design yields the ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants on various downstream tasks, with lower training cost and fewer model parameters.
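Here is a simplified, hypothetical PyTorch sketch of the span-based dynamic convolution idea: each position's convolution kernel is generated from a local span of the input (rather than a single token) and then applied over a sliding window of values. The real ConvBERT heads differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConv(nn.Module):
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.kernel_size = kernel_size
        # Summarize a local span around each position (depthwise conv)...
        self.span_conv = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # ...and turn that span summary into a per-position kernel.
        self.to_kernel = nn.Linear(dim, kernel_size)
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, T, D)
        span = self.span_conv(x.transpose(1, 2)).transpose(1, 2)
        kernels = F.softmax(self.to_kernel(span), dim=-1)  # (B, T, K)
        v = self.value(x)
        pad = self.kernel_size // 2
        v = F.pad(v, (0, 0, pad, pad))                     # pad the time axis
        windows = v.unfold(1, self.kernel_size, 1)         # (B, T, D, K)
        return torch.einsum("btdk,btk->btd", windows, kernels)
```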
The Lottery Ticket Hypothesis for Pre-trained BERT Networks
In NLP, enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. The work in this paper combines these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, matching subnetworks at 40% to 90% sparsity are found. These subnetworks exist at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion, if at all. As large-scale pre-training becomes an increasingly central paradigm in deep learning, the results demonstrate that the main lottery ticket observations remain relevant in this context. The GitHub repo associated with this paper can be found HERE.
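As a hedged sketch of the general recipe, the snippet below uses PyTorch's pruning utilities to carve a candidate subnetwork out of a pre-trained BERT by global magnitude pruning; the paper's full procedure (including rewinding weights and evaluating task performance) is omitted.

```python
import torch
from torch.nn.utils import prune
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Globally prune the smallest-magnitude weights across all linear layers.
parameters_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, torch.nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.6)

# The binary masks define the candidate subnetwork; the surviving weights
# would then be rewound to pre-trained values and fine-tuned on the task.
zeros = sum((m.weight == 0).sum().item() for m, _ in parameters_to_prune)
total = sum(m.weight.numel() for m, _ in parameters_to_prune)
print(f"global sparsity: {zeros / total:.0%}")
```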
BERT Loses Patience: Fast and Robust Inference with Early Exit
This NLP research paper proposes Patience-based Early Exit, a straightforward yet effective inference method that can be used as a plug-and-play technique to simultaneously improve the efficiency and robustness of a pretrained language model (PLM). To achieve this, the approach couples an internal classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers do not change for a pre-defined number of steps. The approach improves inference efficiency because it allows the model to predict with fewer layers. Meanwhile, experimental results with an ALBERT model show that the method can improve the accuracy and robustness of the model by preventing it from overthinking and by exploiting multiple classifiers for prediction, yielding a better accuracy-speed trade-off compared to existing early exit methods.
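The inference loop itself is easy to picture. Below is a minimal sketch, assuming lists of per-layer modules and internal classifiers are available (both hypothetical here): inference stops once the intermediate prediction has been stable for `patience` consecutive layers.

```python
import torch

def patient_inference(layers, classifiers, x, patience=3):
    """layers[i] maps hidden states (B, T, D); classifiers[i] maps to logits."""
    last_pred, streak = None, 0
    for layer, clf in zip(layers, classifiers):
        x = layer(x)
        pred = clf(x.mean(dim=1)).argmax(dim=-1)     # pooled prediction
        if last_pred is not None and torch.equal(pred, last_pred):
            streak += 1
            if streak >= patience:                   # prediction is stable: exit
                return pred
        else:
            streak = 0
        last_pred = pred
    return last_pred                                 # fell through all layers
```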
The Curious Case of Neural Text Degeneration
Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are used as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as a training objective leads to high-quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. This NLP research paper reveals surprising distributional differences between human text and machine text. In addition, it finds that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. The findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text more closely matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
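Nucleus sampling is simple to implement for a single decoding step: sort the vocabulary by probability, keep the smallest set whose cumulative mass exceeds p, renormalize, and sample. A compact version, following the paper's description rather than its code:

```python
import torch

def nucleus_sample(logits, p=0.9):
    """Sample one token id from the top-p 'nucleus' of a logits vector."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds p
    # (the top token is always kept).
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    sorted_probs /= sorted_probs.sum()       # renormalize within the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]
```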
Encoding word order in complex embeddings
Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. This NLP research paper presents a novel and principled solution for modeling both the global absolute positions of words and their order relationships. The solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions, so word representations in different positions can correlate with each other in a continuous function. The general solution of these functions is extended to the complex-valued domain to allow for richer representations, and CNN, RNN, and Transformer networks are extended to complex-valued versions to incorporate the complex embeddings.
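A hedged sketch of the core construction: each embedding dimension j of a word becomes a continuous complex-valued function of position, roughly r_j * exp(i(w_j * pos + theta_j)), with per-word amplitude, frequency, and phase parameters. Module and parameter names below are mine, not the paper's.

```python
import torch
import torch.nn as nn

class ComplexWordEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.amplitude = nn.Embedding(vocab_size, dim)  # r_j per word, per dim
        self.frequency = nn.Embedding(vocab_size, dim)  # w_j controls the period
        self.phase = nn.Embedding(vocab_size, dim)      # theta_j initial offset

    def forward(self, token_ids):                       # (batch, seq)
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        pos = pos.view(1, -1, 1).float()                # (1, seq, 1)
        angle = self.frequency(token_ids) * pos + self.phase(token_ids)
        # torch.polar builds a complex tensor from magnitude and angle,
        # so the representation rotates smoothly as position increases.
        return torch.polar(self.amplitude(token_ids).abs(), angle)
```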
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
This paper introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Stanza was trained on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and the paper shows that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. The GitHub repo associated with this NLP research paper, along with source code, documentation, and pretrained models for 66 languages, can be found HERE.
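Getting started with Stanza takes only a few lines (this follows the project's documented API):

```python
import stanza

stanza.download("en")        # fetch the English models once
nlp = stanza.Pipeline("en")  # tokenize, POS, lemma, depparse, NER
doc = nlp("Barack Obama was born in Hawaii.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.head, word.deprel)
print(doc.ents)              # named entities found in the document
```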
Mogrifier LSTM
Many advances in NLP have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modeling language. This NLP research paper proposes an extension to the venerable Long Short-Term Memory (LSTM) in the form of mutual gating of the current input and the previous output. This mechanism affords the modeling of a richer space of interactions between inputs and their context. Equivalently, the model can be viewed as making the transition function given by the LSTM context-dependent.
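A hedged sketch of the mutual gating, with names of my own choosing: before the usual LSTM update, the input and the previous hidden state take turns scaling each other through sigmoid gates for a few alternating rounds (five rounds worked well in the paper's experiments).

```python
import torch
import torch.nn as nn

class Mogrifier(nn.Module):
    def __init__(self, dim, rounds=5):
        super().__init__()
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(rounds)])

    def forward(self, x, h):
        for i, gate in enumerate(self.gates):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(gate(h)) * x   # h modulates the input
            else:
                h = 2 * torch.sigmoid(gate(x)) * h   # input modulates h
        return x, h

# usage sketch: x, h = mogrifier(x, h); h, c = lstm_cell(x, (h, c))
```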
DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling
For sequence models with large vocabularies, a majority of network parameters lie in the input and output layers. This NLP research paper describes a new method, DeFINE, for learning deep token representations efficiently. The architecture uses a hierarchical structure with novel skip-connections that allow for the use of low-dimensional input and output layers, reducing total parameters and training time while delivering similar or better performance versus existing methods. DeFINE can be incorporated easily into new or existing sequence models. Compared to state-of-the-art methods including adaptive input representations, this technique results in a 6% to 20% drop in perplexity.
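As a simplified, hypothetical sketch of the idea: start from a small input embedding and grow it through a hierarchy of inexpensive linear maps with skip-connections back to the original embedding, instead of a single huge vocabulary-by-hidden table. The real DeFINE uses hierarchical group transformations; this stripped-down version only conveys the shape of the approach.

```python
import torch
import torch.nn as nn

class DefineStyleEmbedding(nn.Module):
    def __init__(self, vocab_size, small_dim=64, dims=(128, 256, 512)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, small_dim)  # small input table
        stages, in_dim = [], small_dim
        for out_dim in dims:
            # Each stage also sees the original embedding (skip-connection).
            stages.append(nn.Linear(in_dim + small_dim, out_dim))
            in_dim = out_dim
        self.stages = nn.ModuleList(stages)

    def forward(self, token_ids):
        e = self.embed(token_ids)
        x = e
        for stage in self.stages:
            x = torch.relu(stage(torch.cat([x, e], dim=-1)))
        return x
```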
FreeLB: Enhanced Adversarial Training for Natural Language Understanding
Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. This paper proposes a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, it is applied to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the fine-tuning stage, it is able to improve the overall test score of the BERT-base model from 78.3 to 79.4, and that of the RoBERTa-large model from 88.5 to 88.8. The GitHub repo associated with this paper can be found HERE.
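A hedged sketch of FreeLB's inner loop, simplified from the paper (no projection-radius bookkeeping, and `model` is assumed to be a callable that accepts `inputs_embeds` and returns logits): perturb the word embeddings, take several ascent steps on the perturbation, and accumulate parameter gradients from every step before a single optimizer update.

```python
import torch

def freelb_step(model, embeds, labels, loss_fn, optimizer,
                ascent_steps=3, adv_lr=1e-1, init_eps=1e-2):
    # Start from a small random perturbation of the word embeddings.
    delta = torch.zeros_like(embeds).uniform_(-init_eps, init_eps)
    delta.requires_grad_()
    optimizer.zero_grad()
    for _ in range(ascent_steps):
        loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
        # Accumulate parameter gradients from every ascent step ("free").
        (loss / ascent_steps).backward()
        # Ascend on the perturbation using its freshly computed gradient.
        grad = delta.grad.detach()
        delta = (delta + adv_lr * grad / (grad.norm() + 1e-12)).detach()
        delta.requires_grad_()
    optimizer.step()   # one parameter update using the accumulated gradients
```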
Dynabench: Rethinking Benchmarking in NLP
This paper introduces Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. It is argued that Dynabench addresses a critical need in the NLP community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. The paper reports on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and addresses potential objections to dynamic benchmarking as a new standard for the field.
Causal Effects of Linguistic Properties
This paper considers the problem of using observational data to estimate the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? The paper addresses two technical challenges before developing a practical method. First, it formalizes the causal quantity of interest as the effect of a writer’s intent and establishes the assumptions necessary to identify this from observational data. Second, in practice only noisy proxies for the linguistic properties of interest are available—e.g., predictions from classifiers and lexicons. An estimator is proposed for this setting, and it is proven that its bias is bounded when an adjustment is performed for the text. Based on these results, TextCause is introduced, an algorithm for estimating causal effects of linguistic properties. The method leverages (1) distant supervision to improve the quality of noisy proxies, and (2) a pre-trained language model (BERT) to adjust for the text. It is shown that the proposed method outperforms related approaches when estimating the effect of Amazon review sentiment on semi-simulated sales figures.
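To give a flavor of the "adjust for the text" step, here is a much-simplified, illustrative plug-in estimator: fit an outcome model on the proxy treatment plus a text representation, then average the difference between its predictions with the treatment switched on and off. All names are mine, and the actual TextCause algorithm adds distant supervision and a BERT-based adjustment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def estimate_effect(texts, proxy_treatment, outcomes):
    """texts: list[str]; proxy_treatment: 0/1 array; outcomes: 0/1 array."""
    X_text = TfidfVectorizer(max_features=2000).fit_transform(texts).toarray()
    X = np.column_stack([proxy_treatment, X_text])
    model = LogisticRegression(max_iter=1000).fit(X, outcomes)
    X1 = np.column_stack([np.ones(len(texts)), X_text])   # everyone treated
    X0 = np.column_stack([np.zeros(len(texts)), X_text])  # no one treated
    # Average difference in predicted outcomes, adjusting for the text.
    return (model.predict_proba(X1)[:, 1] - model.predict_proba(X0)[:, 1]).mean()
```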
LM-Critic: Language Models for Unsupervised Grammatical Error Correction
Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical/grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. This paper shows how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. This LM-Critic and BIFI are applied, along with a large set of unlabeled sentences, to bootstrap realistic ungrammatical/grammatical pairs for training a corrector.
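The critic itself is easy to sketch with an off-the-shelf LM. Below, GPT-2 scores a sentence and its local perturbations, and the sentence passes the critic only if it outscores every perturbation; the perturbation function is a stand-in for the paper's edit-based one.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(sentence):
    """Approximate total log-probability of a sentence under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return -model(ids, labels=ids).loss.item() * ids.size(1)

def lm_critic(sentence, perturb):
    """perturb: sentence -> list of locally perturbed variants."""
    return all(log_prob(sentence) > log_prob(p) for p in perturb(sentence))

# e.g. lm_critic("He likes cats.", swap_adjacent_words) -> True or False
```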
Generative Adversarial Transformers
This paper introduces the GANformer, a novel and efficient type of transformer, and explores it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency, allowing it to readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and to encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. The model’s strength and robustness are demonstrated through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity while enjoying fast learning and better data efficiency. The GitHub repo associated with this paper can be found HERE.
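A hedged sketch of the bipartite (duplex) attention pattern, with illustrative dimensions: a small set of latents attends over the image features, and the features then attend back over the latents, keeping cost linear in the number of features for a fixed number of latents. The real GANformer uses multiplicative integration rather than this additive residual form.

```python
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.feat_to_lat = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lat_to_feat = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents, features):
        # Latents gather information from the whole image grid...
        latents = latents + self.feat_to_lat(latents, features, features)[0]
        # ...then broadcast it back to refine every feature.
        features = features + self.lat_to_feat(features, latents, latents)[0]
        return latents, features

# cost is O(num_latents * num_features): linear in image size for fixed latents
```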
Learn More About NLP and NLP Research at ODSC West 2021
ODSC West 2021, our upcoming event this November 16th-18th in San Francisco, will feature a plethora of talks, workshops, and training sessions on NLP and NLP research. You can register now for 30% off all ticket types before the discount drops to 20% in a few weeks. Some highlighted sessions on NLP and NLP research include:
- Transferable Representation in Natural Language Processing: Kai-Wei Chang, PhD | Director/Assistant Professor | UCLA NLP/UCLA CS
- Build a Question Answering System using DistilBERT in Python: Jayeeta Putatunda | Data Scientist | MediaMath
- Introduction to NLP and Topic Modeling: Zhenya Antić, PhD | NLP Consultant/Founder | Practical Linguistics Inc
- NLP Fundamentals: Leonardo De Marchi | Lead Instructor | ideai.io
Sessions on Deep Learning and Deep Learning Research:
- GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
- Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
- Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
- Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google
Sessions on Machine Learning:
- Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
- Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
- Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder, The Crosstab Kite
- Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University