Part 8: Step by Step Guide to Master NLP – Useful Natural Language Processing Tasks

This article was published as a part of the Data Science Blogathon

Introduction

This article is part of an ongoing blog series on Natural Language Processing (NLP). Through part 7 of this series, we covered the most useful concepts in NLP. Before going further, let’s discuss some useful NLP tasks so that you have a clear picture of what you can do once you learn NLP. After this part, we will discuss Syntactic and Semantic Analysis in detail, including the concepts of Grammar and Parsing.

So, in this part of the blog series, we will discuss some very useful tasks of Natural Language Processing in detail.

This is part-8 of the blog series on the Step by Step Guide to Natural Language Processing.

Table of Contents

1. Text Classification

  • Sentiment Analysis
  • Fighting Spam

2. Text Matching or Similarity

  • Levenshtein Distance
  • Phonetic Matching
  • Flexible String Matching
  • Cosine Similarity

3. Machine Translation

4. Coreference Resolution

  • Text Summarization
  • Question-Answering

5. Other Important tasks of NLP

Text Classification

Text classification is one of the classical problems of NLP. Some examples are listed below:

  • Email Spam Identification,
  • Topic classification of news,
  • Sentiment classification,
  • Organization of web pages by search engines, etc.

In simple words, text classification is a technique to systematically classify a text object (a document or a sentence) into one of a fixed set of categories. It becomes really helpful when we work with large volumes of data for the purposes of organizing, filtering, and storing information.

Typically, a natural language classifier consists of the following two parts:

  • Training
  • Prediction

In the training phase, the input text is preprocessed and features are created from the preprocessed text. These features are then given to a machine learning model. In the prediction phase, the trained model is used to predict the category of new text.

While building such applications, keep in mind that text classification models depend heavily on the quality and quantity of the features, so it is good practice to train any machine learning model on as much representative data as possible.
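
The two phases can be sketched in a few lines of Python. This is only a minimal illustration, not a production classifier: the function names, the tiny training set, and the overlap-based scoring rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def extract_features(text):
    """Turn raw text into bag-of-words features (lowercased word counts)."""
    return Counter(text.lower().split())

def train(labeled_texts):
    """Training phase: accumulate word counts for each category."""
    model = defaultdict(Counter)
    for text, label in labeled_texts:
        model[label].update(extract_features(text))
    return model

def predict(model, text):
    """Prediction phase: pick the category whose training words overlap most."""
    features = extract_features(text)

    def score(label):
        total = sum(model[label].values())
        return sum(count * model[label][word] / total
                   for word, count in features.items())

    return max(model, key=score)

# Hypothetical toy training set, for illustration only.
training_data = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the market fell on weak earnings", "business"),
    ("shares rose after the earnings report", "business"),
]
model = train(training_data)
print(predict(model, "the striker scored in the match"))  # sports
```

With so little data, common words like "the" dominate the score, which is one reason real systems remove stop words and train on far more examples.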

Sentiment Analysis

Sentiment Analysis is another important application of NLP. As the name suggests, it is used to identify the sentiment expressed in a collection of documents. It also helps identify sentiment where emotions are not expressed explicitly.

Product-based companies like Amazon use sentiment analysis to identify the opinions and sentiments of their customers online. This helps them understand what customers think about their products and services.

With the help of sentiment analysis, companies can judge their overall reputation from customer posts. Beyond determining simple polarity, sentiment analysis interprets sentiment in context to help us better understand what lies behind an expressed opinion.
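
To make the idea of polarity concrete, here is a minimal lexicon-based sketch. The word lists are hypothetical and tiny, and the negation rule only flips the word that immediately follows; real sentiment systems rely on much larger lexicons or trained models.

```python
# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Return a polarity score: > 0 positive, < 0 negative, 0 neutral."""
    score = 0
    negate = False
    for word in text.lower().replace(".", " ").replace(",", " ").split():
        if word in NEGATIONS:
            negate = True  # flip the polarity of the next word only
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return score

print(sentiment_score("I love this phone, the camera is great"))  # 2
print(sentiment_score("The delivery was not good"))               # -1
```

A sentence such as "It arrived on Tuesday" scores 0 here, which also shows the limit of this approach: sentiment that is not expressed through explicit opinion words is missed.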

Fighting Spam

In today’s digital era, one of the most common problems is unwanted email. This makes spam filters all the more important, as they are the first line of defense against this problem.

A spam filtering system can be built with NLP techniques, keeping in mind its two main kinds of error: false positives (legitimate email marked as spam) and false negatives (spam that reaches the inbox).

Existing NLP models for spam filtering

Some of the existing NLP models used for spam filtering are as follows:

N-gram Modeling

An N-gram is an N-character slice of a longer string. In this model, several N-grams of different lengths are used simultaneously to process emails and detect spam.

To know more about N-Gram, refer to our previous articles.
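
As a quick illustration, character N-grams of several lengths can be extracted as follows. The function names are chosen for this sketch only.

```python
def char_ngrams(text, n):
    """Return every n-character slice of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def multi_length_ngrams(text, lengths=(2, 3)):
    """Collect N-grams of several lengths at once."""
    return {n: char_ngrams(text, n) for n in lengths}

print(char_ngrams("spam", 2))       # ['sp', 'pa', 'am']
print(multi_length_ngrams("spam"))  # {2: ['sp', 'pa', 'am'], 3: ['spa', 'pam']}
```

A spam detector would then use these slices as features, for example by counting how often each one appears in known spam and ham messages.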

Word Stemming

Spammers often change one or more characters of the telltale words in their messages so that they can slip past content-based spam filters. For this reason, content-based filters are not useful if they cannot understand the meaning of the words or phrases in the email.

To eliminate such problems in spam filtering, a rule-based word-stemming technique can be used that matches words that look alike and sound alike.
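
The exact rules of such a technique are not given here, so the following is only a rough sketch of the look-alike half of the idea. The substitution table and the blocked-word list are hypothetical, and this is a simple normalization step rather than classical suffix stemming; sound-alike matching is not covered.

```python
# Hypothetical table mapping look-alike characters to the letters they imitate.
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                            "5": "s", "@": "a", "$": "s"})

def normalize_word(word):
    """Lowercase, undo look-alike substitutions, and collapse repeated letters."""
    word = word.lower().translate(LOOKALIKES)
    collapsed = []
    for ch in word:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)

# Hypothetical list of words a filter wants to catch, stored in normalized form.
BLOCKED = {normalize_word(w) for w in ("viagra", "free", "lottery")}

def contains_blocked_word(text):
    """True if any token matches a blocked word after normalization."""
    return any(normalize_word(token.strip(".,!?;:")) in BLOCKED
               for token in text.split())

print(normalize_word("V1agra"))                   # viagra
print(contains_blocked_word("Win the l0ttery!"))  # True
print(contains_blocked_word("Meeting at noon"))   # False
```

Because both the blocked words and the incoming tokens go through the same normalization, "FR33" and "free" end up as the same string even though the normalized form is not a dictionary word.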

Bayesian Classification

This method of building spam filters has become a widely used technology. It uses a statistical technique to measure the incidence of the words in an email against their typical occurrence in a database of spam and ham (not spam) email messages.
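
One common statistical technique of this kind is a Naive Bayes classifier. The sketch below is a minimal version with Laplace smoothing; the training messages are made up for the example, and the code assumes that both labels appear at least once in the training data.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """messages: list of (text, label) pairs, where label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocabulary

def classify(text, word_counts, doc_counts, vocabulary):
    """Return the label with the higher log-probability for this text."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior: how common this label is in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the probability.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages, for illustration only.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch with the team monday", "ham"),
]
params = train_naive_bayes(training)
print(classify("claim your free prize", *params))  # spam
print(classify("team meeting monday", *params))    # ham
```

Working with log-probabilities instead of multiplying raw probabilities avoids numerical underflow on long emails.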

Text Matching or Similarity

Matching text objects to find similarity is one of the important areas of NLP. Some of the important applications of text matching are as follows:

  • Automatic Spelling Correction,
  • Data de-duplication,
  • Genome analysis, etc.

Depending on the requirement, a number of text-matching techniques are available, but in this article we describe only the important ones in detail:

Levenshtein Distance

The Levenshtein distance between two strings is defined as the minimum number of edits required to transform one string into the other, where the allowable edit operations are:

  • Insertion,
  • Deletion, or
  • Substitution of a single character.
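
The definition above translates directly into a dynamic-programming routine. Here is a minimal sketch that keeps only two rows of the edit table in memory:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Here "kitten" becomes "sitting" through two substitutions (k to s, e to i) and one insertion (g), which is why the distance is 3.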

Phonetic Matching

A phonetic matching algorithm takes a keyword as input (such as a person’s name or a location name) and generates a character string that identifies a set of words that are (roughly) phonetically similar. Some useful applications are:

  • Searching large text corpora,
  • Correcting spelling errors,
  • Matching relevant names, etc.

Two widely used algorithms for this purpose are Soundex and Metaphone.
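
As a rough illustration of phonetic matching, here is a minimal sketch of the classic Soundex encoding, one well-known phonetic algorithm. It follows the common American Soundex rules: keep the first letter, map the remaining consonants to digits, merge adjacent equal codes, and pad to four characters. Variants of the algorithm differ in small details, so treat this as one reasonable version rather than a reference implementation.

```python
# Consonant groups and their Soundex digits; vowels, y, h, and w get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Encode a name as one letter followed by three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    previous_code = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != previous_code:
            encoded += code
        previous_code = code  # a vowel resets the run, so a repeated code counts again
    return (encoded + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Names that sound similar receive the same code, so matching on the code finds them even when the spelling differs.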

Flexible String Matching

A complete text matching system pipelines different algorithms together to handle a variety of text variations. Regular expressions are really helpful for this purpose as well. Some other common techniques are:

  • Exact string matching,
  • Lemmatized matching,
  • Compact matching (takes care of spaces, punctuation, etc).
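
Two of these techniques, plus a regular expression, can be sketched with the Python standard library alone. Lemmatized matching is left out because it needs a lemmatizer from an NLP library; the helper names and the sample pattern below are assumptions made for this example.

```python
import re
import string

def exact_match(a, b):
    """Exact string matching: the strings must be identical."""
    return a == b

def compact(text):
    """Lowercase the text and drop whitespace and punctuation."""
    drop = string.punctuation + string.whitespace
    return "".join(ch for ch in text.lower() if ch not in drop)

def compact_match(a, b):
    """Compact matching: compare the strings after compaction."""
    return compact(a) == compact(b)

# Regular expression that accepts "NLP" written with optional dots and spaces.
ACRONYM = re.compile(r"n\.?\s*l\.?\s*p\.?", re.IGNORECASE)

print(exact_match("New-York", "new york"))     # False
print(compact_match("New-York", "new york"))   # True
print(bool(ACRONYM.fullmatch("N.L.P.")))       # True
```

In a pipeline, the cheap exact check usually runs first, and the looser matchers are tried only when it fails.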

Cosine Similarity

When texts are represented as vectors, cosine similarity can be applied to measure how similar the vectors are. Cosine similarity measures the closeness between two texts.
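
For two vectors A and B, the cosine similarity is their dot product divided by the product of their lengths. Below is a minimal sketch that assumes a simple word-count representation of each text; other vectorizations such as TF-IDF would work the same way.

```python
import math
from collections import Counter

def text_to_vector(text):
    """Represent a text as a vector of lowercased word counts."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a, vec_b):
    """Dot product of the two vectors divided by the product of their norms."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[word] * vec_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

a = text_to_vector("the cat sat")
b = text_to_vector("the cat ran")
print(round(cosine_similarity(a, b), 3))  # 0.667
```

The two sample texts share two of their three words, so the similarity is 2/3. Identical texts score 1, and texts with no words in common score 0.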

Machine Translation

Machine Translation is an automatic system that translates text from one human language to another while taking care of grammar, semantics, information about the real world, etc.

In simple words, Machine Translation is the process of translating text from a source language into another language.

The following flowchart illustrates the process of machine translation:

[Figure: flowchart of the machine translation process. Image Source: Google Images]

Types of Machine Translation Systems

Mainly, there are two different types of machine translation systems.

Bilingual Machine Translation System

These systems produce translations between two particular languages.

Multilingual Machine Translation System

These systems produce translations between any pair of languages. They can be either uni-directional in nature or bi-directional in nature.

Approaches to Machine Translation (MT)

Let’s now discuss some of the important approaches to Machine Translation.

Direct Approach

It is the oldest approach to Machine Translation and is now less popular. Systems that use this approach translate the source language directly into the target language. Such systems are bilingual and uni-directional in nature.
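
In its simplest form, direct translation amounts to a word-for-word dictionary lookup. The sketch below assumes a tiny, hypothetical English-to-Spanish dictionary and ignores word order and agreement entirely.

```python
# Hypothetical bilingual dictionary, for illustration only.
EN_TO_ES = {"the": "el", "cat": "gato", "drinks": "bebe", "milk": "leche"}

def translate_direct(sentence, dictionary=EN_TO_ES):
    """Translate word by word; unknown words are passed through unchanged."""
    return " ".join(dictionary.get(word, word)
                    for word in sentence.lower().split())

print(translate_direct("The cat drinks milk"))  # el gato bebe leche
```

The weakness shows up quickly: "the milk" would come out as "el leche" instead of "la leche", because a plain lookup knows nothing about grammatical gender. The dictionary also works in one direction only, which matches the bilingual, uni-directional nature of such systems.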

Interlingua Approach

Systems that use the Interlingua approach first translate the source language into an intermediate language, known as Interlingua (IL), and then translate the Interlingua into the target language. The Machine Translation Pyramid below helps us understand the Interlingua approach:

[Figure: Machine Translation Pyramid illustrating the Interlingua approach. Image Source: Google Images]

Transfer Approach

There are three stages involved in this approach of Machine translation:

  • In the first stage, source language texts are converted into abstract source language-oriented representations.
  • In the second stage, the source language-oriented representations are converted into equivalent target language-oriented representations.
  • In the third stage, the final text is produced.

Empirical MT Approach

This is an emerging approach to Machine Translation. It uses a large amount of raw data in the form of parallel corpora, where the raw data consists of texts and their translations. The following machine translation techniques use this approach:

  • Analogy-based,
  • Example-based,
  • Memory-based.
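
As a rough sketch of the memory-based idea, a translation memory can be modeled as a lookup over a parallel corpus that returns the stored translation of the closest known sentence. The tiny corpus, the function name, and the use of `difflib` for fuzzy matching are all assumptions made for this illustration.

```python
import difflib

# Hypothetical parallel corpus: source sentences paired with their translations.
TRANSLATION_MEMORY = {
    "good morning": "buenos días",
    "thank you very much": "muchas gracias",
    "where is the station": "dónde está la estación",
}

def translate_from_memory(sentence, memory=TRANSLATION_MEMORY, cutoff=0.6):
    """Return the stored translation of the most similar known sentence, if any."""
    matches = difflib.get_close_matches(sentence.lower(), list(memory),
                                        n=1, cutoff=cutoff)
    return memory[matches[0]] if matches else None

print(translate_from_memory("Thank you very much"))    # muchas gracias
print(translate_from_memory("where is the station?"))  # dónde está la estación
```

Real example-based systems go further and recombine fragments of several stored examples, but the dependence on the size and quality of the parallel corpus is the same.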

Coreference Resolution

Coreference resolution is the process of finding relational links among the words (or phrases) within a text. Consider the following sentences:

“Chirag went to Kshitiz’s office to see the new pen. He looked at it for an hour.”

After reading the above sentences, humans can easily figure out that “he” denotes Chirag (and not Kshitiz), and that “it” denotes the pen (and not Kshitiz’s office).

So, Coreference Resolution is the component of NLP that does this job automatically.

It is used in many applications including:

  • Automatic Text Summarization,
  • Question answering,
  • Information extraction, etc.

Stanford CoreNLP provides coreference resolution and can be called from Python through a wrapper.

Automatic Text Summarization

Text Summarization: Given a text article or paragraph, summarize it automatically to produce the most important and relevant sentences in order.

In this digital era, the most valuable thing is data or information. Now, the question that comes to mind is:

Do we really get the useful information we need, in the amount we need?

No. Information is overloaded, and our access to knowledge and information far exceeds our capacity to understand it. So, we are in serious need of automatic text summarization, as the flood of information over the internet is not going to stop.

In simple words, text summarization is the technique of creating a short and accurate summary of a longer text document. It helps us extract the relevant information in less time. Therefore, NLP plays an important role in developing automatic text summarization systems.
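
One simple way to pick "the most important and relevant sentences in order" is frequency-based extractive summarization. The sketch below is only one possible approach: it scores each sentence by the average frequency of its words in the whole text and does no stop-word removal, which a real system would add.

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequencies = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(frequencies[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(top))

article = ("NLP helps computers process text. "
           "NLP models summarize text and translate text. "
           "The weather was nice yesterday.")
print(summarize(article, max_sentences=1))
# NLP models summarize text and translate text.
```

The off-topic sentence about the weather gets the lowest score because none of its words recur elsewhere in the text.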

Question-Answering

Question-answering is another important task of NLP. Search engines put the information of the world at our fingertips, but they still fall short when it comes to answering questions that human beings ask in their own natural language. Because this is a valuable application, big tech companies like Google, IBM, and Microsoft are also working in this direction.

In Computer Science, question-answering falls within the fields of AI and NLP. The aim is to build systems that automatically answer questions asked by human beings in their own natural language. A system that understands natural language can translate sentences written by humans into an internal representation, so that valid answers can be generated. It generates exact answers by performing syntactic and semantic analysis of the questions. However, NLP faces some challenges in building a good question-answering system, such as:

  • Lexical gap,
  • Ambiguity,
  • Multilingualism, etc.
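
A full question-answering system performs the syntactic and semantic analysis described above. The toy sketch below swaps that for something much simpler, plain word overlap between the question and each sentence of a passage, just to show the overall shape of the task. The stop-word list and function names are assumptions made for this example.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "what", "who",
             "where", "when", "which", "was", "did"}

def tokenize(text):
    """Lowercased content words of a text, as a set."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def answer(question, passage):
    """Return the passage sentence sharing the most words with the question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    question_terms = tokenize(question)
    best = max(sentences, key=lambda s: len(question_terms & tokenize(s)), default=None)
    if best is None or not question_terms & tokenize(best):
        return None  # nothing in the passage overlaps with the question
    return best

passage = ("Paris is the capital of France. "
           "Berlin is the capital of Germany. "
           "The Rhine flows through several countries.")
print(answer("What is the capital of Germany?", passage))
# Berlin is the capital of Germany.
```

This sketch also makes the lexical gap visible: when a question uses different vocabulary from the passage, there is no overlap to match on and the function returns None.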

Other Important NLP tasks

Some other important tasks of NLP are as follows:

Natural Language Generation and Understanding

Natural Language Generation is the process of converting information from computer databases or semantic intents into a language that is easily readable by humans.

Natural Language Understanding is the process of converting chunks of text into more logical structures that are easier for computer programs to manipulate.

Optical Character Recognition

In Optical Character Recognition (OCR), given an image that represents printed text, we have to determine the corresponding text.

Document to Information

This includes parsing textual data that is present in documents such as websites, files, and images to an analyzable and clean format.

This ends our Part-8 of the Blog Series on Natural Language Processing!

Other Blog Posts by Me

You can also check my previous blog posts.

Previous Data Science Blog posts.

LinkedIn

Here is my Linkedin profile in case you want to connect with me. I’ll be happy to be connected with you.

Email

For any queries, you can mail me on Gmail.

End Notes

Thanks for reading!

I hope that you have enjoyed the article. If you like it, share it with your friends too. Is something not mentioned, or do you want to share your thoughts? Feel free to comment below and I’ll get back to you. 😉

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

CHIRAG GOYAL

22 Jun 2021

I am currently pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the Indian Institute of Technology Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence. Feel free to connect with me on LinkedIn.
