A lot goes into NLP. Languages, dialects, unstructured data, and unique business needs all demand constant innovation from the field. Beyond NLP platforms and skills alone, expertise in novel processes and staying abreast of the latest research are becoming pivotal for effective NLP implementation. We looked at a number of NLP sessions coming to ODSC East 2022 this April 19th-21st that highlight changes in the growing field and ways to perform NLP better.
1. The Appeal of Named-Entity Recognition Models for NLP
Named Entity Recognition (NER) and Relationship Extraction (RE) are foundational for many downstream NLP tasks such as Information Retrieval and Knowledge Base construction. While pre-trained models exist for both NER and RE tasks, they are usually specialized for some narrow application domain. If your application domain is different, your best bet is to train your own models. However, the costs associated with training, specifically generating training data, can be a significant deterrent. Fortunately, pre-trained Transformer language models learn a lot about the language of the domain they are trained and fine-tuned on, so NER and RE models built on them require fewer training examples to deliver the same level of performance.
Session: Transformer Based Approaches to Named Entity Recognition (NER) and Relationship Extraction (RE): Sujit Pal | Technology Research Director | Elsevier Labs
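To make the training-data cost above concrete: NER data is commonly annotated with one label per token, often in the BIO scheme (B- begins an entity, I- continues it, O is outside any entity). The sketch below is a minimal illustration in plain Python of turning such labels back into entity spans. The function name, the example sentence, and the PER/ORG label set are assumptions chosen for illustration, not details taken from the session.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # closes whatever entity is currently open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical annotated sentence, one tag per token.
tokens = ["Sujit", "Pal", "works", "at", "Elsevier", "Labs", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O"]
print(bio_to_spans(tokens, tags))
# [('Sujit Pal', 'PER'), ('Elsevier Labs', 'ORG')]
```

Every token in every training sentence needs a label like this, which is why reducing the number of required examples matters so much.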
2. Improved Ease of Use for Practitioners
We are witnessing the rapid adoption of Pandas as the main library for representing and manipulating structured data in Python. In contrast, Natural Language Processing (NLP) applications usually rely on various NLP libraries with complex and, more importantly, incompatible output structures. That makes integrating NLP features and solutions into the machine learning and data science pipeline difficult and time-consuming. To resolve this issue, the Center for Open Source Data and AI Technologies (CODAIT) has developed Text Extensions for Pandas, an open-source library of extensions that turns Pandas DataFrames into a universal data structure for NLP and hence offers transparency, simplicity, and compatibility.
Session: Towards Data Scientist-Friendly Natural Language Processing: IBM Team
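As a rough illustration of the idea of holding NLP output in a DataFrame, the sketch below stores one token per row together with its character offsets. It uses only plain Pandas and a simple regular-expression tokenizer. It does not use the Text Extensions for Pandas API, and the column names `begin`, `end`, and `token` are assumptions made for this example.

```python
import re

import pandas as pd


def tokens_to_frame(text: str) -> pd.DataFrame:
    """Return one row per token with character offsets into `text`."""
    rows = [
        {"begin": m.start(), "end": m.end(), "token": m.group()}
        # Words, or single punctuation characters (a deliberately naive tokenizer).
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]
    return pd.DataFrame(rows, columns=["begin", "end", "token"])


text = "Pandas data frames can hold NLP output."
df = tokens_to_frame(text)

# Once tokens live in a DataFrame, ordinary Pandas operations apply,
# for example keeping only tokens that start with an uppercase letter.
capitalized = df[df["token"].str[0].str.isupper()]
print(capitalized["token"].tolist())
# ['Pandas', 'NLP']
```

Because each row keeps its offsets, the output of different NLP tools can in principle be joined or compared on `begin` and `end` like any other tabular data.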
3. Building NLP at Scale
NLP has many practical applications, and healthcare sees countless benefits, given that much of its data is text-based. Despite ongoing efforts to use NLP for information extraction from electronic health records (EHRs), current solutions require healthcare AI practitioners to make unacceptable trade-offs between delivering state-of-the-art accuracy, generalizing over unseen data points, and preventing the sharing of personal data or intellectual property. Spark NLP for Healthcare aims to bridge this gap by providing an accurate, scalable, private, and tunable software library that helps healthcare and pharma organizations build longitudinal patient records and knowledge graphs on real-world EHR data.
Session: Spark NLP for Healthcare: Modular Approach to Solve Problems at Scale in Healthcare NLP: Veysel Kocaman, PhD | Lead Data Scientist | John Snow Labs
4. Improving Performance Via Self-Supervised Learning
Self-supervised representation learning methods promise a single universal model that benefits a collection of tasks and domains. They have recently succeeded in the NLP and computer vision domains, reaching new performance levels while reducing the labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech.
Session: Self-supervised Representation Learning for Speech Processing: Abdel-rahman Mohamed, PhD | Research Scientist | Facebook AI Research
5. Better Identification with Unstructured Text Data
Recent advances in NLP have revolutionized the process of identifying products, people, and places in unstructured text data. This task is referred to as Named Entity Recognition (NER) and forms the basis of many downstream NLP applications (e.g. AI assistants, search engines). Where previously extracting this type of information relied on complex rules engines fit to specific data sources, open-source libraries leveraging pre-trained large language models can achieve state-of-the-art performance on even noisy social media data.
Session: Neural Named-Entity Recognition pipelines with spaCy: Benjamin Batorsky, PhD | Biomedical Data Scientist | Ciox Health
6. More Real-World Functionality for NLP is Still Needed
With innovations like Alexa and Siri now part of our daily lives, there is a growing appreciation for NLP, which is no longer a nascent facet of AI technology. From sentiment analysis in retail to named entity recognition and virtual assistants in financial markets, NLP is gaining relevance across industries. According to recent research, the global NLP market size is expected to reach $35.1 billion by 2026, driven by a significant shift from human-computer interaction to human-computer conversation. With AI-powered interfaces augmented with NLP, enterprises can now use these technological advancements to drive business growth. However, to understand the full scope of NLP, enterprises must be mindful of unsupervised models that generate biased outcomes and hinder the capabilities of NLP systems.
Session: Natural Language Processing in Accelerating Business Growth: Sameer Maskey, PhD | Founder & CEO | Fusemachines
7. Improving Neural Machine Translation
Machine translation (MT) has become ubiquitous as a technology that gives individuals on-demand, real-time access to content written in languages they do not speak. However, contrary to recent press releases claiming that it has surpassed human quality, results in practice suggest that it has a long way to go. One of the biggest challenges facing current-generation neural MT (NMT) is that its engines are not easily adaptable and cannot respond to the context or extra-linguistic knowledge that human translators routinely deal with. In addition, NMT’s improvements have largely been in fluency (how natural the output sounds) rather than accuracy (how well the translated text represents the content of the source text). This discrepancy actually increases the risk that critical errors go undetected simply because the output is readable and sounds plausible. The next step forward is to build “responsive MT”: systems that can take advantage of embedded metadata about a wide variety of topics and use it to prioritize the most relevant training data.
Session: How Can We Make Machine Translation Responsive and Responsible?: Arle Lommel | Senior Analyst | CSA Research
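One very simplified way to picture the "responsive MT" idea of preferring relevant training data is to rank training segments by how much of their metadata matches an incoming request. The sketch below is a toy illustration only: the segment structure, the tag vocabulary, and the overlap-count scoring are all hypothetical and are not drawn from the session or from any particular MT system.

```python
from typing import Dict, List, Set


def rank_by_metadata(segments: List[Dict], request_tags: Set[str]) -> List[Dict]:
    """Order training segments by how many metadata tags they share with the request."""
    return sorted(
        segments,
        key=lambda seg: len(seg["tags"] & request_tags),
        reverse=True,  # most overlapping tags first
    )


# Hypothetical training segments, each carrying metadata tags.
segments = [
    {"source": "Take two tablets daily.", "tags": {"medical", "instructions"}},
    {"source": "The court dismissed the appeal.", "tags": {"legal"}},
    {"source": "Store below 25 degrees.", "tags": {"medical", "packaging"}},
]

ranked = rank_by_metadata(segments, {"medical", "instructions"})
print([seg["source"] for seg in ranked])
# ['Take two tablets daily.', 'Store below 25 degrees.', 'The court dismissed the appeal.']
```

A real system would need far richer metadata and a way to feed the selected data into the engine; the point here is only that metadata gives the system something to adapt to.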
8. Better Emotion Detection
Emotion detection is an active research area in artificial intelligence. AI researchers are using facial expressions, speech signals, and textual data as inputs to detect user emotions. Of these sources, textual data contains the least information, since it does not capture facial expressions or audio, which are generally more direct representations of emotion. That said, text can still be rich in emotional content, and many approaches have therefore been proposed to detect and identify emotions from it.
Session: Emotion Detection with Natural Language Inference: Serdar Cellat, PhD | Lead Machine Learning Scientist | Y Meadows
Perform NLP Better with Training at ODSC East 2022
We just listed quite a few skills, platforms, topics, and frameworks. You aren't expected to know every single thing mentioned above, but knowing a good chunk of them, and how to apply them in business settings, will help you get a job or become better at your current one. At ODSC East 2022, we have an entire track devoted to NLP. Learn NLP skills and platforms like the ones listed above!