Editor’s note: Sanghamitra Deb is a speaker for ODSC East 2022. Be sure to check out her talk, “Intro to NLP: Text Categorization and Topic Modeling,” there!
Natural Language Processing (NLP) is a core building block of machine intelligence. At its heart, NLP is the process of bringing structure to free-form, unstructured text.
I will start with topic modeling, an unsupervised learning technique for recognizing hidden structure in data. A common approach is Latent Dirichlet Allocation (LDA), a generative probabilistic model. It assumes that each document is a mixture of topics and that each topic is a distribution over the underlying vocabulary.
Topic modeling is a good way to get some insights into your data in the absence of training data.
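To make the LDA assumption concrete, here is a minimal sketch of a collapsed Gibbs sampler written with only the Python standard library. It is not taken from the lecture material: the function name `lda_gibbs`, the hyperparameter values, and the toy recipe documents are all assumptions chosen for illustration. In practice a library implementation (for example in scikit-learn or gensim) would normally be used instead.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0, top_n=3):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (theta, top_words):
    theta[d][k] is the estimated share of topic k in document d,
    top_words[k] lists the top_n most frequent words assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed document-topic proportions; each row sums to 1.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(vocab_size), key=lambda i: -topic_word[k][i])[:top_n]]
        for k in range(num_topics)
    ]
    return theta, top_words


# Hypothetical toy corpus: three baking-style and three grilling-style recipes.
docs = [
    "flour sugar butter oven bake cake".split(),
    "flour yeast oven bake bread".split(),
    "sugar butter flour bake cookies oven".split(),
    "grill steak charcoal smoke marinade".split(),
    "grill chicken marinade charcoal".split(),
    "steak grill smoke chicken".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

No labels are supplied anywhere: the sampler only sees which words co-occur in which documents. Which topic index ends up corresponding to which theme depends on the random seed, so the output should be read as groupings rather than fixed labels.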
Here is a sneak peek into the topic modeling of recipes. Recipes relevant to baking are under the same topic.
Next, I will talk about text categorization, also called text classification, which is one of the most common applications of NLP. Categories can be predefined or derived from topic modeling. Because this is supervised learning, it requires labeled training data. I will go through simple techniques such as TF-IDF as well as some deep learning models.
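As a rough illustration of the TF-IDF idea, the sketch below weights words by how rare they are across the training texts and then assigns a new text to the category whose averaged TF-IDF vector it most resembles. The nearest-centroid classifier, the smoothed IDF formula, the function names, and the toy training sentences are assumptions made for this example, not the approach used in the talk.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def fit_idf(token_docs):
    """Smoothed inverse document frequency for every training word."""
    n_docs = len(token_docs)
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse, unit-length TF-IDF vector; words unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(texts, labels):
    """Sum the TF-IDF vectors of each category into one centroid per label."""
    token_docs = [tokenize(t) for t in texts]
    idf = fit_idf(token_docs)
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(token_docs, labels):
        for w, v in tfidf_vector(tokens, idf).items():
            sums[label][w] += v
    return idf, {label: dict(vec) for label, vec in sums.items()}


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = tfidf_vector(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical labeled training data.
texts = [
    "bake the cake in the oven",
    "knead the dough and bake bread",
    "grill the steak over charcoal",
    "marinate chicken then grill it",
]
labels = ["baking", "baking", "grilling", "grilling"]

idf, centroids = train_centroids(texts, labels)
print(predict("bake bread in a hot oven", idf, centroids))  # prints "baking"
```

Unlike topic modeling, this needs the `labels` list up front, which is the practical cost of supervised learning. A common word such as "the" gets a lower IDF weight than a rarer word such as "chicken", so it contributes less to the similarity score.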
I will conclude my session by talking about performance metrics and how to interpret them. I will also talk about real-world use cases and discuss combining numerical, categorical, and text features in the same deep learning model. Here is an example of such an architecture.
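The core idea of combining feature types can be sketched without any deep learning framework: encode each kind of feature as numbers, then concatenate the pieces into one input vector. In a deep learning model, each group would typically pass through its own learned layers (such as an embedding for the text) before being merged, so the fixed encodings below are a simplification. The record fields (`minutes`, `cuisine`, `title`), the statistics, and the vocabulary are hypothetical.

```python
def standardize(value, mean, std):
    """Scale a numerical feature to zero mean and unit variance."""
    return (value - mean) / std if std else 0.0


def one_hot(value, categories):
    """Encode a categorical feature as a 0/1 indicator vector."""
    return [1.0 if value == c else 0.0 for c in categories]


def bag_of_words(text, vocab):
    """Encode text as word counts over a fixed vocabulary."""
    tokens = text.lower().split()
    return [float(tokens.count(w)) for w in vocab]


def combine_features(record, stats, categories, vocab):
    """Concatenate numerical, categorical, and text encodings into one vector."""
    numeric = [standardize(record["minutes"], stats["mean"], stats["std"])]
    categorical = one_hot(record["cuisine"], categories)
    text = bag_of_words(record["title"], vocab)
    return numeric + categorical + text


# Hypothetical example record and encoding settings.
record = {"minutes": 45, "cuisine": "italian", "title": "Baked pasta"}
stats = {"mean": 30, "std": 15}
categories = ["french", "italian", "mexican"]
vocab = ["baked", "pasta", "soup"]

vector = combine_features(record, stats, categories, vocab)
print(vector)  # [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```

The first entry is the scaled cooking time, the next three are the cuisine indicator, and the last three are the title word counts. A single model can then consume this vector regardless of which source each entry came from.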
This type of modeling becomes important when we want to make personalized predictions and user interactions need to be taken into account.
Lecture material can be found here: https://github.com/sangha123/Intro-to-NLP-Topic-Modeling-and-Text-Categorization
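On the performance metrics mentioned for the end of the session: the text does not say which ones will be covered, so the sketch below assumes three common choices for classification (precision, recall, and F1) and computes them for a single category. The function name and the example labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted positive, how much was correct?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything truly positive, how much was found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical true and predicted categories for five recipes.
y_true = ["baking", "baking", "grilling", "grilling", "baking"]
y_pred = ["baking", "grilling", "grilling", "baking", "baking"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="baking")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three recipes predicted as baking really are baking (precision 2/3), and two of the three true baking recipes were found (recall 2/3). Looking at both numbers matters because a model can score well on one while doing poorly on the other.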
About the author/ODSC East 2022 Speaker:
Sanghamitra Deb is a Staff Data Scientist at Chegg, where she works on problems related to school and college education to sustain and improve the learning process. Her work involves recommendation systems, computer vision, graph modeling, deep NLP analysis, data pipelines, and machine learning. Previously, Sanghamitra was a data scientist at Accenture, where she worked on a wide variety of problems related to data modeling, architecture, and visual storytelling. She is an avid fan of Python and has been programming for more than a decade.