As data scientists, we’ve seen a rapid improvement over recent decades in the tools available for working with structured data (be it tabular, graph, or sensor data). Yet the vast majority of our data (Merrill Lynch puts the figure at roughly 90%) is *unstructured*, living in documents, emails, reviews, reports, and chat logs. Many of us are far less familiar with how to analyze and understand this trove of unstructured data.
This talk by Alex Peattie focuses on language models, one of the most fundamental tools for working with unstructured data. Language models are all around us (although we’re probably unaware of them), underpinning everything from Word’s spellchecker to home assistants like Alexa. While plenty of “out of the box” language modeling libraries exist, the first part of the talk focuses on building a thorough understanding of what a language model is and how it works. We touch on key ideas from statistics and information theory, and see how Alan Turing, in developing techniques to break Nazi codes at Bletchley Park, created smoothing techniques that remain widely used in language models today. We then proceed to the present day, looking at how techniques like word vectors and transfer learning have yielded an improved generation of tools. In the second half of the talk, we look at how we can practically use language models to understand unstructured data.
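To make the idea concrete before you watch, here is a minimal sketch (not code from the talk) of a bigram language model in Python. It uses simple add-one smoothing as a stand-in for the Turing-era smoothing techniques mentioned above; the toy corpus and function names are illustrative only.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one (Laplace) smoothed P(word | prev); a simple stand-in
    for the Good-Turing-style smoothing discussed in the talk."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Tiny illustrative corpus
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_lm(corpus)
print(bigram_prob("the", "cat", uni, bi))   # seen bigram: higher probability
print(bigram_prob("the", "bird", uni, bi))  # unseen bigram: small but non-zero, thanks to smoothing
```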
Specifically, this video explores:
– Classification: the canonical application of language models. They can help us identify spam, analyze sentiment, or perform unsupervised clustering (a minimal sketch follows this list). We look at a famous case where language models successfully identified a Shakespeare forgery.
– Predictive modeling: if I were to look at your tweets (and nothing else), could I guess your gender? It turns out state-of-the-art techniques can predict it with better than 80% accuracy. We look at how language models can enrich your datasets with additional demographic or contextual data.
– Information retrieval: finally, we see how language models have been used extensively (for example in the legal sector) to extract targeted insights from enormous datasets.
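As a companion to the classification bullet above, here is a minimal, hypothetical sketch (not code from the talk) of a spam classifier built from per-class unigram word counts, using scikit-learn; a multinomial Naive Bayes model amounts to comparing smoothed language models trained for each class. The tiny training set is made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up examples for illustration only
train_texts = [
    "win a free prize now", "cheap meds limited offer",     # spam
    "meeting rescheduled to friday", "see attached report",  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer builds unigram counts; MultinomialNB fits per-class
# word distributions with additive smoothing (alpha=1.0 by default).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free offer just for you", "report from the friday meeting"]))
```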