Sentiment analysis is one of the techniques of NLP (Natural Language Processing), and more specifically of NLU (Natural Language Understanding). It lets us classify the sentiment of a text as positive or negative according to the words it contains.
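To make the idea of classifying a text "according to the words it contains" concrete, here is a minimal sketch of a word-counting classifier. It is not how VADER works internally: the two word lists are hypothetical and hand-picked, the function name is made up for this illustration, and punctuation is not stripped.

```python
# Hypothetical, hand-picked word lists (an assumption for this sketch);
# a real sentiment lexicon is far larger and also weights each word.
POSITIVE_WORDS = {'love', 'happy', 'good', 'great', 'bright'}
NEGATIVE_WORDS = {'hate', 'sad', 'bad', 'pain', 'dark'}


def classify_sentiment(text):
    """Labels a text by comparing its counts of positive and negative words."""
    words = text.lower().split()
    score = (sum(word in POSITIVE_WORDS for word in words)
             - sum(word in NEGATIVE_WORDS for word in words))
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'  # no sentiment words, or a tie


print(classify_sentiment('What a great and happy day'))  # positive
print(classify_sentiment('So much pain in the dark'))    # negative
```

A tie or a text without any listed word comes out as 'neutral' here, which is a design choice of the sketch rather than something the two-class description above requires.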
This blog post has three parts. The first part is Data Collection: web scraping with the BeautifulSoup and urllib2 libraries. The second part is Text Analysis, where we use the NLTK Python library to compute some statistics on the lyrics of the selected artist. The third part is Sentiment Analysis, where we use the VADER library (yes, as in Star Wars) and plot the number of positive and negative songs per album.
Once we have collected the artist's discography, we can plot it in Wordle. There is a button to do it, but you need Java installed.
You can learn more about NLP in these articles: NLP with NLTK Part 1 and Part 2. I can also recommend this article if you want to hack your music preferences using the features of the songs. In this blog post we will only use lyrics (text).
You can reuse the functions from this code to extend this article or to create your own projects.
from bs4 import BeautifulSoup
import urllib2
import re
import pandas as pd
from IPython.display import display, HTML
from wordcloud import WordCloud # to plot wordclouds
import matplotlib.pyplot as plt
# this line indicates the graphs are displayed in the notebook and not in a new window
%matplotlib inline
1. Data Collection
We will use http://lyrics.wikia.com because its HTML is simple to parse and its database holds a vast number of lyrics. Also, it places no limit on requests.
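To illustrate what "simple to parse" means in practice, here is a minimal sketch that pulls the text out of a div with class lyricbox, the element that get_lyrics() looks for further down. It uses only the standard library's HTMLParser rather than BeautifulSoup, and both the sample page and the names LyricBoxParser and extract_lyrics are made up for this illustration; the real pages may be structured differently.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class LyricBoxParser(HTMLParser):
    """Collects the text found inside <div class="lyricbox"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0    # greater than 0 while we are inside a lyricbox div
        self.chunks = []  # pieces of text collected so far

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and (self.depth or 'lyricbox' in classes):
            self.depth += 1
        elif tag == 'br' and self.depth:
            self.chunks.append(' ')  # a line break separates two words

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_lyrics(html):
    """Returns the lyricbox text with whitespace normalised to single spaces."""
    parser = LyricBoxParser()
    parser.feed(html)
    parser.close()
    return ' '.join(''.join(parser.chunks).split())


# Made-up page fragment, only loosely modelled on a lyrics page.
SAMPLE_PAGE = (
    '<html><body><h1>Some Song</h1>'
    '<div class="lyricbox">First line<br/>Second line<br/>Third line</div>'
    '<div class="footer">Not lyrics</div></body></html>'
)

print(extract_lyrics(SAMPLE_PAGE))  # First line Second line Third line
```

Turning each line break into a space matters: without it the last word of one line would be glued to the first word of the next.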
The main function for data collection is get_lyrics(). It takes an artist's name and downloads that artist's discography. I recommend playing with its parameters: by changing the boolean flags you can see the album covers and word clouds while the lyrics are being acquired.
The function plot_word_cloud() plots a word cloud of the text passed to it.
prefix = 'http://lyrics.wikia.com'

def plot_word_cloud(corpus, max_words=42, width=600, height=400, fig_size=(8, 6)):
    """Plots a word cloud of the text in corpus."""
    try:
        if len(corpus) == 0:
            corpus = 'no words'  # placeholder so an empty text still produces a plot
        wordcloud = WordCloud(max_words=max_words, width=width, height=height,
                              background_color="black").generate(corpus)
        plt.figure(figsize=fig_size, dpi=80)
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()
    except Exception as e:
        # a failed plot should not interrupt the download
        print 'Could not plot the word cloud:', e
def get_lyrics(band_name=None,
               display_cover=False,
               show_song_word_cloud=False,
               show_album_word_cloud=False,
               verbose=False):
    """
    Asks for the artist's name and downloads all of the artist's lyrics.
    """
    def get_artist_link():
        """
        Searches for the artist (asking for a name if none was given)
        and returns the first result.
        """
        url_search = prefix + '/wiki/Special:Search?query='
        name = band_name
        if name is None:
            name = raw_input("Artist's name ?: ")
        site = urllib2.urlopen(url_search + name.replace(' ', '+')).read()
        soup = BeautifulSoup(site, 'html.parser')
        links = []
        for link in soup.find_all("a", class_='result-link'):
            if link.get('href') is not None:
                links.append(link.get('href'))
        print 'Getting lyrics from...', links[0]
        return links[0]

    def display_thumbnail(soup):
        """Displays the album cover images found in the page."""
        images = soup.find_all("img", class_='thumbborder')
        for image in images:
            display(HTML(str(image)))

    def get_album_links(artist_link):
        """Returns the links to the albums listed on the artist's page."""
        link_discs = []
        site = urllib2.urlopen(artist_link).read()
        soup = BeautifulSoup(site, 'html.parser')
        discs = soup.find_all("span", class_="mw-headline")
        for d in discs:
            for element in d.find_all('a'):
                link_discs.append(prefix + element.get('href'))
        return link_discs

    def camel_case_split(text):
        # inserts a space between a lowercase letter and the uppercase
        # letter that directly follows it ("oneTwo" becomes "one Two")
        return re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", text)

    def get_text(lyric):
        text = ''
        for line in lyric:
            text += line
        return camel_case_split(text)

    def get_song_lyrics(url):
        """Returns the lyrics found at url, or None if they cannot be retrieved."""
        try:
            site = urllib2.urlopen(url).read()
            soup = BeautifulSoup(site, 'html.parser')
            lyric = soup.find_all("div", class_="lyricbox")
            if len(lyric) > 0:
                # line breaks become spaces so that lines do not run together
                html = str(lyric[0]).replace('<br/>', ' ')
                return camel_case_split(BeautifulSoup(html, 'html.parser').get_text())
        except Exception:
            pass  # skip songs that cannot be downloaded or parsed
        return None

    def get_list_of_links(url, link_filter):
        """Returns the links in the page at url that belong to the artist."""
        links = []
        site = urllib2.urlopen(url).read()
        soup = BeautifulSoup(site, 'html.parser')
        if display_cover:
            display_thumbnail(soup)  # displays the album's image
        for link in soup.find_all("a"):
            href = link.get('href')
            if href is not None and '/wiki/' + link_filter + ':' in href and '?' not in href:
                links.append(prefix + href)
        return links

    def download_lyrics(album_links):
        lyrics = []  # list with all the lyrics
        songs = []   # list with scanned links
        discography = []
        i = 1
        for album_link in album_links:
            album = []
            album_title = album_link.split(':')[-1].replace('_', ' ')
            print 'Downloading:', i, 'out of', len(album_links), 'albums -', album_title
            i += 1
            for link in get_list_of_links(album_link, link_filter):
                if link in songs:
                    continue
                lyric = get_song_lyrics(link)  # fetch each song only once
                if lyric is not None:
                    lyrics.append(lyric)
                    album.append(lyric)
                    if verbose:
                        print link.split(':')[-1].replace('_', ' ')  # print song title
                    if show_song_word_cloud:
                        plot_word_cloud(lyric.lower(), max_words=50, width=400, height=200)
                    songs.append(link)
            if show_album_word_cloud:
                plot_word_cloud(' '.join(album).lower(), max_words=50, width=800, height=500)
            discography.append((album_title, album))
        print '\nDone!', len(songs), 'lyrics acquired from', len(album_links), 'albums.'
        return discography

    artist_link = get_artist_link()
    link_filter = artist_link.split('/')[-1]
    album_links = get_album_links(artist_link)
    lyrics = download_lyrics(album_links)
    return lyrics
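The string handling inside get_lyrics() is easier to follow in isolation. The sketch below reproduces three of its steps as standalone functions, without any network access: keeping only the links that belong to the artist, turning a link into a readable title, and separating words that were glued together when line breaks were removed. The function names is_artist_page and title_from_link, the sample links and the sample text are made up for this illustration.

```python
import re

PREFIX = 'http://lyrics.wikia.com'


def is_artist_page(href, link_filter):
    """True for links such as /wiki/Artist:Song, False for edit or query links."""
    return (href is not None
            and '/wiki/' + link_filter + ':' in href
            and '?' not in href)


def title_from_link(link):
    """Turns '.../wiki/Artist:Some_Song' into 'Some Song'."""
    return link.split(':')[-1].replace('_', ' ')


def camel_case_split(text):
    """Inserts a space wherever a lowercase letter is glued to an uppercase one."""
    return re.sub(r'([a-z])([A-Z])', r'\g<1> \g<2>', text)


# Hypothetical href values, as they might be read from the <a> tags of a page.
hrefs = [
    '/wiki/Metallica:Some_Song',
    '/wiki/Metallica:Some_Song?action=edit',
    '/wiki/Special:Search',
    None,
]

song_links = [PREFIX + h for h in hrefs if is_artist_page(h, 'Metallica')]
print(song_links)                    # ['http://lyrics.wikia.com/wiki/Metallica:Some_Song']
print(title_from_link(song_links[0]))  # Some Song
print(camel_case_split('end of the first lineStart of the second line'))
# end of the first line Start of the second line
```

Note that the regular expression needs the backslashes in the replacement string: r'\g<1> \g<2>' refers to the two captured letters, whereas a plain 'g<1> g<2>' would be inserted literally.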
Demo
I’ll use Metallica’s lyrics for the demonstration because I love Metallica; you can try it with your favorite band.
Arguments of get_lyrics()
The arguments of the function are as follows:
band_name='metallica': A string with the artist's name, to avoid having to input it manually.
display_cover = False: Boolean variable to display the album cover while the album is being processed.
verbose = False: Boolean variable to show the name of the song that is being processed.
show_album_word_cloud = False: Boolean variable to show a word cloud with the tokens of each album.
show_song_word_cloud = False: Boolean variable to show a word cloud with the tokens of each song.
corpus = get_lyrics(display_cover = False # displays the album cover while the album is being processed
                    #, band_name='metallica' # name of the artist
                    , verbose = False # prints the song titles
                    , show_album_word_cloud = False # shows a word cloud per album
                    , show_song_word_cloud = False) # shows a word cloud per song
# raw will contain all the text of the lyrics
raw = ''
for title, songs in corpus:
    for song in songs:
        raw += song + ' '  # the space keeps the words of consecutive songs apart
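The value returned by get_lyrics() is a list of (album title, list of song lyrics) tuples. As a rough sketch of how that structure can be flattened and summarised without downloading anything, the example below uses a made-up corpus of the same shape and counts the most frequent words, which is the information a word cloud visualises. The album titles, the lyrics and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the value returned by get_lyrics():
# a list of (album_title, [lyrics_of_song_1, lyrics_of_song_2, ...]) tuples.
sample_corpus = [
    ('First Album', ['fire and ice', 'ice and stone']),
    ('Second Album', ['stone and fire and ice']),
]

# Join with a space so that words from consecutive songs stay separate.
raw_sample = ' '.join(song for title, songs in sample_corpus for song in songs)

word_counts = Counter(raw_sample.lower().split())
songs_per_album = dict((title, len(songs)) for title, songs in sample_corpus)

print(word_counts.most_common(2))  # [('and', 4), ('ice', 3)]
print(songs_per_album)
```

Very common words such as "and" dominate the raw counts, which is why word cloud tools usually filter out stop words before plotting.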
Wordle is a widely known service for plotting word clouds. We can send all the words in the discography to Wordle and get a word cloud of our artist. Instructions to do that are here: