Gone are the days when data came mostly in row-column, or structured, format. Today, much of the data being collected is unstructured: text, images, audio and so on, and the ratio of structured to unstructured data has fallen over the years. By some estimates, unstructured data is growing at 55-65% every year.
Thus, we need to learn how to work with unstructured data so that we can extract relevant information from it and make it useful. When working with text data, it is very important to pre-process it before using it for prediction or analysis.
In this article, we will walk through various text data cleaning techniques using Python.
Let’s take a tweet for example:
I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN
We will be performing data cleaning on this tweet step-wise.
Steps for Data Cleaning
1) Clear out HTML characters: HTML entities such as &amp;, &lt; and &gt; can be found in most of the data available on the web. We need to get rid of these from our data. You can do this in two ways:
- By using specific regular expressions or
- By using modules or packages available (such as Python's built-in html module)
We will be using the module already available in Python.
Code:
import html

# Escaping out HTML characters
# (HTMLParser().unescape was removed in Python 3.9; use html.unescape instead)
tweet = "I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN"
tweet = html.unescape(tweet)
print("After removing HTML characters the tweet is:-\n{}".format(tweet))
Output:
2) Encoding & Decoding Data: This is the process of converting information from simple, understandable characters to complex symbols and vice versa. Different encodings such as "UTF-8" and "ASCII" are available for text data. We should keep our data in a standard encoding format; the most common is UTF-8.
The given tweet is already in UTF-8, so we encode it to ASCII and then decode it back to UTF-8 to illustrate the process.
Code:
# Encode from UTF-8 to ascii
encode_tweet = tweet.encode('ascii', 'ignore')
print("encode_tweet = \n{}".format(encode_tweet))

# decode from ascii back to UTF-8
decode_tweet = encode_tweet.decode(encoding='UTF-8')
print("decode_tweet = \n{}".format(decode_tweet))
Output:
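Note that the 'ignore' error handler silently drops any character the target encoding cannot represent, rather than raising an error. A quick illustration (the sample string is my own, not from the tweet):

```python
# Characters outside the ASCII range (an accented letter and an emoji here)
# are silently dropped when encoding with errors='ignore'
text = "café 😀 rocks"
ascii_text = text.encode('ascii', 'ignore').decode('ascii')
print(ascii_text)  # → "caf  rocks"
```

If you would rather fail loudly on non-ASCII input, use the default errors='strict' instead.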
3) Removing URLs, Hashtags and Styles: A text dataset may contain hyperlinks, hashtags, or style markers such as the old-style retweet prefix "RT" in Twitter data. These provide no relevant information and can be removed. For hashtags, only the hash sign '#' is removed so the word itself is kept. We will use the re library to perform regular expression operations.
Code:
# library for regular expressions
import re

# remove hyperlinks
tweet = re.sub(r'https?:\/\/.\S+', "", tweet)

# remove hashtags
# (only removing the hash # sign from the word)
tweet = re.sub(r'#', '', tweet)

# remove old style retweet text "RT"
tweet = re.sub(r'^RT[\s]+', '', tweet)

print("After removing Hashtags, URLs and Styles the tweet is:-\n{}".format(tweet))
Output:
4) Contraction Replacement: The text may contain apostrophes used for contractions, for example "didn't" for "did not". These can change the sense of a word or sentence, so we need to expand them into their standard forms. To do so, we can use a dictionary that maps each contraction to its replacement.
A few of the contractions used are:
n't --> not
'll --> will
's --> is
'd --> would
'm --> am
've --> have
're --> are
Code:
# dictionary mapping each contraction to its expansion
Apos_dict = {"'s": " is", "n't": " not", "'m": " am", "'ll": " will",
             "'d": " would", "'ve": " have", "'re": " are"}

# replace the contractions
for key, value in Apos_dict.items():
    if key in tweet:
        tweet = tweet.replace(key, value)

print("After Contraction replacement the tweet is:-\n{}".format(tweet))
Output:
5) Split attached words: Some words are joined together, for example "ForTheWin". These need to be separated to extract their meaning; after splitting, it becomes "For The Win".
Code:
import re

# separate words written in CamelCase
tweet = " ".join([s for s in re.split("([A-Z][a-z]+[^A-Z]*)", tweet) if s])
print("After splitting attached words the tweet is:-\n{}".format(tweet))
Output:
6) Convert to lower case: Convert the text to lower case to avoid case-sensitivity related issues.
Code:
# convert to lower case
tweet = tweet.lower()
print("After converting to lower case the tweet is:-\n{}".format(tweet))
Output:
7) Slang lookup: Many slang words are in common use and can be found in text data, so we need to replace them with their meanings. We can use a dictionary of slang words as we did for contraction replacement, or we can keep the slang words in a file. Examples of slang words are:
asap --> as soon as possible
b4 --> before
lol --> laugh out loud
luv --> love
wtg --> way to go
Here we use a file, slang.txt, which stores one entry per line in the form word=meaning.
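If you don't have slang.txt handy, you can generate a minimal version yourself. This sketch writes a file in the word=meaning format the lookup code expects, using only the example entries listed above:

```python
# Write a minimal slang.txt, one word=meaning entry per line
slang_entries = [
    "asap=as soon as possible",
    "b4=before",
    "lol=laugh out loud",
    "luv=love",
    "wtg=way to go",
]
with open("slang.txt", "w") as f:
    f.write("\n".join(slang_entries))
```

A larger file in the same format works without any change to the lookup code.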
Code:
# read the file slang.txt
with open("slang.txt", "r") as file:
    slang = file.read()

# separating each line present in the file
slang = slang.split('\n')

tweet_tokens = tweet.split()
slang_word = []
meaning = []

# store the slang words and meanings in different lists
for line in slang:
    temp = line.split("=")
    slang_word.append(temp[0])
    meaning.append(temp[-1])

# replace each slang word with its meaning
for i, word in enumerate(tweet_tokens):
    if word in slang_word:
        idx = slang_word.index(word)
        tweet_tokens[i] = meaning[idx]

tweet = " ".join(tweet_tokens)
print("After slang replacement the tweet is:-\n{}".format(tweet))
Output:
8) Standardizing and Spell Check: The text might contain spelling errors or non-standard forms, for example "drivng" for "driving" or "I misssss this" for "I miss this". We can correct these using the autocorrect library for Python (other libraries are available as well). First, install the library with the command:
# install the autocorrect library
pip install autocorrect
Code:
import itertools

# one letter in a word should not appear more than twice in a row
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
print("After standardizing the tweet is:-\n{}".format(tweet))

from autocorrect import Speller
spell = Speller(lang='en')

# spell check
tweet = spell(tweet)
print("After Spell check the tweet is:-\n{}".format(tweet))
Output:
9) Remove Stopwords: Stop words are words that occur frequently in text but add no significant meaning to it. We will use the nltk library, which provides modules for pre-processing data, including a list of stop words. You can also create your own stopword list to suit your use case.
First, make sure you have the nltk library installed. If not, install it with the command:
# install the nltk library
pip install nltk
Code:
import nltk

# download the stopwords corpus from nltk
nltk.download('stopwords')

# import the english stopwords list from nltk
from nltk.corpus import stopwords
stopwords_eng = stopwords.words('english')

tweet_tokens = tweet.split()
tweet_list = []

# remove stopwords
for word in tweet_tokens:
    if word not in stopwords_eng:
        tweet_list.append(word)

print("tweet_list = {}".format(tweet_list))
Output:
10) Remove Punctuations: Punctuation consists of characters such as !, <, @, #, & and $; Python's string.punctuation constant provides the full list.
Code:
# for string operations
import string

clean_tweet = []

# remove tokens that are pure punctuation
for word in tweet_list:
    if word not in string.punctuation:
        clean_tweet.append(word)

print("clean_tweet = {}".format(clean_tweet))
Output:
These are some of the data cleaning techniques we usually perform on text data. You can also apply more advanced cleaning, such as grammar checking.
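The steps above can be chained into a single helper. The sketch below is my own arrangement, not part of the article's code: it covers the steps that need no external data or third-party packages (it skips the file-based slang lookup and the autocorrect spell check), so it runs on the standard library alone.

```python
import html
import itertools
import re
import string

def clean_text(text):
    # 1) unescape HTML entities such as &amp;
    text = html.unescape(text)
    # 2) normalize to ASCII, dropping characters it cannot represent
    text = text.encode('ascii', 'ignore').decode('ascii')
    # 3) strip URLs, hash signs and old-style retweet markers
    text = re.sub(r'https?:\/\/.\S+', '', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r'^RT[\s]+', '', text)
    # 4) split CamelCase words apart
    text = ' '.join(s for s in re.split(r'([A-Z][a-z]+[^A-Z]*)', text) if s)
    # 5) lower-case everything
    text = text.lower()
    # 6) collapse letters repeated more than twice in a row
    text = ''.join(''.join(s)[:2] for _, s in itertools.groupby(text))
    # 7) drop tokens that are pure punctuation
    tokens = [t for t in text.split() if t not in string.punctuation]
    return ' '.join(tokens)

print(clean_text("I lovdddd itttt ! #HadFun"))  # → "i lovdd itt had fun"
```

Slang replacement, stopword removal and spell checking can be slotted in as extra steps once their data files and libraries are in place.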