Hacks to perform faster Text Mining in R

23 July 2024


Introduction

Data science demands versatility. Move away from your regular methods, challenge your ways of working, and explore new ways of doing things more efficiently. Looking back at my initial years in data science, I too was trapped by the devil of ‘complacency’. At one point, I was not challenging myself enough. I wasn’t experimenting with ways of doing my work. I accepted things as they were, until I realized that ‘Complacency is a state of mind that exists only in retrospective: it has to be shattered before being ascertained’. Now, whenever possible, I challenge my ways of working with the aim of doing things faster and more efficiently. It helps me discover new ways of working in data science.

Text mining is one of the most frequent yet challenging exercises faced by data science beginners and analytics experts alike. The biggest challenge is that one needs to thoroughly assess the underlying patterns in the text, and do so manually. For example, it is pretty common to delete numbers from the text before doing any kind of text mining. But what if we want to extract something like “24/7”? Hence, the text cleansing exercise is highly specific to the objective of the exercise and the type of text patterns.
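As a minimal sketch of the “24/7” case, the snippet below strips digits from a text but first protects any token that matches a keep-pattern. The sample strings, the pattern and the helper name remove_numbers_except are illustrative assumptions, not part of the original workflow.

```r
# Hypothetical helper: remove digits, except inside tokens matching `keep_pattern`
remove_numbers_except <- function(x, keep_pattern = "[0-9]+/[0-9]+") {
  vapply(x, function(s) {
    tokens <- strsplit(s, " ", fixed = TRUE)[[1]]
    keep <- grepl(keep_pattern, tokens)
    # Strip digits only from the tokens we are not protecting
    tokens[!keep] <- gsub("[0-9]+", "", tokens[!keep])
    # Drop tokens that became empty and rebuild the sentence
    paste(tokens[nzchar(tokens)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

sample_text <- c("Open 24/7 since 1999", "Call 555 now")
cleaned <- remove_numbers_except(sample_text)
print(cleaned)
```

Here “24/7” survives while “1999” and “555” are dropped. A different objective would call for a different keep-pattern.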

Broadly, we work on two aspects of text mining:

Sentiment Mining: Here, we are more concerned with deciphering the sentiment of the author.
Subject Extraction: Here, we wish to pull out the main subject of the chosen text. This is done prior to sentiment mining.

You can find numerous ways on the internet to do sentiment analysis. However, subject extraction is very specific to the context. In this article, I share the top 4 hacks applied in the industry to do subject extraction in R. For ease, I’ve also highlighted the strengths and weaknesses associated with each trick.

Top 4 Hacks in R

1. Keyword Match Algorithm

This is the most powerful tool to do text mining. Let’s first look at the R code to execute this step:

# Import the list of keywords: the Keywords column holds the keyword you wish to match
# and the Merchant_Name column holds the tag you need to populate
ss <- read.csv("keywords.csv", stringsAsFactors = FALSE)
Keywords <- as.character(ss$Keywords)
tags <- as.character(ss$Merchant_Name)

# Data1 is the complete data from which you are trying to extract the text.
# We will look at the text line by line.
for (i in seq_along(Keywords)) {
  for (j in seq_len(nrow(Data1))) {
    # Here is where you do the actual search
    if (grepl(Keywords[i], Data1[j, 1])) {
      Data1[j, 2] <- tags[i]
      # Flag 1 to those observations where you find a match
      Data1[j, 4] <- 1
    }
  }
}

Now let’s try to see the strengths and weaknesses of this algorithm.

Strengths

It is highly effective in extracting keywords from words that are not well separated. For instance, this algorithm can pull out “Tavish” from “#DataScientistTavishSrivastava”.
It allows a priority order among the keywords. For instance, if I need to give “Tavish” higher priority than “Srivastava” in the above hashtag, it can easily be done.
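One way to make the priority order explicit is to return the tag of the first keyword that matches, with the keyword list sorted from highest to lowest priority. The sketch below assumes made-up tags and a hypothetical helper named first_keyword_tag. It uses a literal (fixed) substring match rather than a regular expression.

```r
# Hypothetical helper: return the tag of the first keyword (in priority order) found in `text`
first_keyword_tag <- function(text, keywords, tags) {
  hits <- vapply(keywords, function(k) grepl(k, text, fixed = TRUE), logical(1))
  if (any(hits)) tags[which(hits)[1]] else NA_character_
}

keywords <- c("Tavish", "Srivastava")  # earlier entries have higher priority
tags <- c("FirstName", "LastName")     # made-up tags for illustration

hashtags <- c("#DataScientistTavishSrivastava", "#AnalystSrivastava", "#NoMatch")
result <- vapply(hashtags, first_keyword_tag, character(1),
                 keywords = keywords, tags = tags, USE.NAMES = FALSE)
print(result)
```

In the nested loop shown earlier, a later keyword overwrites an earlier match, so there the highest-priority keyword would have to come last in the list.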

Weaknesses

It needs a pre-defined list of keywords to search for.
It can produce many misclassified cases. For instance, if you want to search for “APE” in the text, you will also erroneously tag “CAPE” as “APE”.
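The “APE” versus “CAPE” problem is easy to reproduce. The sketch below uses made-up strings and contrasts a plain substring search with a word-boundary pattern, which is one possible safeguard and not something the keyword match code above does.

```r
texts <- c("APE HOUSE", "CAPE TOWN", "GRAPE JUICE")

# Plain substring match: every string contains "APE"
substring_hit <- grepl("APE", texts, fixed = TRUE)

# Word-boundary match: only the stand-alone word "APE" counts
whole_word_hit <- grepl("\\bAPE\\b", texts, perl = TRUE)

print(data.frame(texts, substring_hit, whole_word_hit))
```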

2. Word Match Algorithm

This is the fix for the second weakness (misclassified cases) of the previous algorithm. In this algorithm, we try to match whole words instead of keywords. Here is the R code:

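library(stringr)  # provides word()

words <- read.csv("word_match.csv", stringsAsFactors = FALSE)
match_words <- as.character(words$Keywords)
tags <- as.character(words$Tag)

for (i in seq_along(match_words)) {
  for (j in seq_len(nrow(Data1))) {
    # Compare the first word of the text with the word from the list
    if (isTRUE(word(as.character(Data1[j, 1]), 1) == match_words[i])) {
      Data1[j, 2] <- tags[i]
      Data1[j, 4] <- 1
    }
  }
}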

Strengths

It works very well for finding well separated words. For instance, this algorithm can effortlessly pull out “Tavish” from “Tavish Srivastava”.
This algorithm also allows a priority order among the words. For instance, if I need to give “Tavish” higher priority than “Srivastava” in the above name, it can easily be done.

Weaknesses

It needs a pre-defined list of keywords to search for.
It only captures the first well separated word. The algorithm can be modified to search among all words, though.
It misses words that are not well separated.
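The modification mentioned above, searching among all words rather than only the first, could look like the sketch below. It splits the text on whitespace in base R. The helper name match_any_word and the tags are assumptions made for illustration.

```r
# Hypothetical helper: tag `text` with the first listed word that appears as a whole word
match_any_word <- function(text, words_to_match, tags) {
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  idx <- which(words_to_match %in% tokens)
  if (length(idx) > 0) tags[idx[1]] else NA_character_
}

words_to_match <- c("Tavish", "Srivastava")
tags <- c("FirstName", "LastName")  # made-up tags

# "Tavish" is the second word here, so a first-word check would miss it
separated <- match_any_word("Dr Tavish Srivastava", words_to_match, tags)

# No stand-alone word matches inside a hashtag, so the third weakness remains
joined <- match_any_word("#DataScientistTavishSrivastava", words_to_match, tags)

print(c(separated, joined))
```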

3. Regular Expressions

This method needs extensive research on the sentence structures. For ease of understanding, I’ve taken a simple example: “www.dummyvalue.com”. Here is the code:

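library(stringr)  # provides str_locate()

for (i in seq_len(nrow(Data1))) {
  text <- as.character(Data1[i, 1])
  # Only process rows that contain both markers (the text is assumed to be upper case)
  if (grepl("WWW", text, fixed = TRUE) && grepl("COM", text, fixed = TRUE)) {
    start <- str_locate(text, "WWW")[2]  # end position of "WWW"
    end <- str_locate(text, "COM")[1]    # start position of "COM"
    # Take the text in between and drop any dots around it
    domain <- gsub("^\\.+|\\.+$", "", substr(text, start + 1, end - 1))
    Data1[i, 2] <- paste("www", tolower(domain), "com", sep = ".")
    Data1[i, 4] <- 1
  }
}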

Strengths

It does not need any kind of list to start with.
Usually, it turns out to be highly accurate if you are able to find a strong regular expression.

Weaknesses

It needs deep research to create a regular expression.
With poorly structured data, this method is able to tag only a very small number of observations.
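For comparison, the same extraction can be written as a single pattern with base R’s regexpr() and regmatches(). This is a sketch under the assumption that the address always has the form “www.”, a run of letters, digits or hyphens, then “.com”. The sample strings are made up.

```r
texts <- c("PAYMENT TO WWW.DUMMYVALUE.COM REF 123",
           "CASH WITHDRAWAL",
           "visit www.example.com today")

# Assumed pattern: "www.", then letters/digits/hyphens, then ".com"
pattern <- "www\\.[a-z0-9-]+\\.com"

# regexpr() returns -1 for strings without a match
m <- regexpr(pattern, texts, ignore.case = TRUE)

urls <- rep(NA_character_, length(texts))
urls[m != -1] <- tolower(regmatches(texts, m))
print(urls)
```

The second string has no match and stays NA, which illustrates the second weakness: text that does not follow the assumed structure is simply not tagged.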

4. Word Association

I bet this method is good enough to challenge you intellectually. So that you can work on it yourself, instead of giving away the entire code, I’ve provided the step-by-step method. If you still find it difficult, mention your request for code in the comment section below.

Step 1: Find the most frequent words that could possibly be what you are looking for.

Step 2: Find the words most associated with these frequently occurring words.

Step 3: Among these pairs, find the best frequency-association pair (this will need a number of iterations).
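The three steps can be sketched in base R as follows. This is only one possible reading of them: the sample documents are made up, “frequent” is assumed to mean appearing in at least two documents, and “association” is assumed to be the correlation between word-occurrence columns. The tm package offers findFreqTerms() and findAssocs() for the first two steps on a real corpus.

```r
docs <- c("payment to acme store", "refund from acme store",
          "payment to globex online", "acme store purchase",
          "globex online refund")

# Build a binary document-term matrix (rows = documents, columns = words)
tokens <- strsplit(tolower(docs), "[[:space:]]+")
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens, function(tk) as.numeric(vocab %in% tk),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Step 1: most frequent words (assumed threshold: at least 2 documents)
freq <- colSums(dtm)
frequent <- names(freq[freq >= 2])

# Step 2: for each frequent word, the other word with the highest correlation
assoc <- cor(dtm)
diag(assoc) <- NA  # ignore the correlation of a word with itself
best_partner <- vapply(frequent, function(w) names(which.max(assoc[w, ])),
                       character(1))

# Step 3: candidate frequency-association pairs to review and iterate on
pairs <- data.frame(word = frequent,
                    partner = best_partner,
                    frequency = freq[frequent],
                    association = assoc[cbind(frequent, best_partner)],
                    row.names = NULL)
print(pairs)
```

On this toy data, “acme” pairs with “store” and “globex” with “online”, which look like subject names, while “refund” pairs with “from”, a pairing that means nothing. Thresholds on frequency and association would need tuning to filter out such pairs.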

Strengths

No dictionary is required.
If parameters are optimized well, it can be highly predictive.
It can act as a feedback to other algorithms.
You can use this algorithm even if you don’t know the language of the text.

Weaknesses

It is sometimes not very precise on the subject name. It tends to capture even those trends which do not mean anything significant.

End Notes

I hope you find these 4 hacks useful enough to speed up your text mining process. I’d encourage you to take a shot at coding the last algorithm and share it in the comment box below. This list is in no way exhaustive of all that can be done in subject extraction.

All these algorithms can be used together on the same text to boost performance. However, in those cases you need to create decision points for when to use which algorithm.
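One hypothetical set of decision points is a simple cascade: try the strict whole-word match first, fall back to the looser keyword match, and only then try a regular expression. The function name extract_subject, the keyword list, the tags and the ordering itself are all assumptions for illustration.

```r
# Hypothetical cascade: whole-word match, then substring match, then a URL pattern
extract_subject <- function(text, keywords, tags) {
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  word_hit <- which(keywords %in% tokens)
  if (length(word_hit) > 0) {
    return(c(tag = tags[[word_hit[[1]]]], method = "word"))
  }

  sub_hit <- which(vapply(keywords, grepl, logical(1), x = text, fixed = TRUE))
  if (length(sub_hit) > 0) {
    return(c(tag = tags[[sub_hit[[1]]]], method = "keyword"))
  }

  # The text is assumed to be upper case, as in the earlier examples
  m <- regmatches(text, regexpr("WWW\\.[A-Z0-9-]+\\.COM", text))
  if (length(m) > 0) {
    return(c(tag = tolower(m), method = "regex"))
  }

  c(tag = NA_character_, method = "none")
}

keywords <- c("ACME", "GLOBEX")          # made-up keyword list
tags <- c("Acme Corp", "Globex Inc")     # made-up tags

r1 <- extract_subject("POS ACME STORE 0042", keywords, tags)
r2 <- extract_subject("PAYPAL*GLOBEXONLINE", keywords, tags)
r3 <- extract_subject("WWW.DUMMYVALUE.COM LONDON", keywords, tags)
print(rbind(r1, r2, r3))
```

Recording which method produced each tag makes it easier to audit the results and to reorder the decision points later.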

Did you find this article helpful? Please share your opinions / thoughts in the comments section below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Tavish Srivastava

11 Dec 2015

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

