We can scrape the IMDb movie ratings and their details with the help of the BeautifulSoup library of Python.
Modules Needed:
Below is the list of modules required to scrape data from IMDb.
- requests: Requests is a third-party Python library for making HTTP requests to a specified URL. It is widely used for both REST APIs and web scraping. When a request is made to a URI, it returns a response object containing the status code, headers, and body.
- html5lib: A pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. The code below uses Python's built-in "html.parser"; html5lib can be passed to BeautifulSoup as an alternative parser.
- bs4: Beautiful Soup is a Python library for parsing HTML and XML documents; its BeautifulSoup object represents the parsed page. Web scraping is the process of extracting data from websites with automated tools rather than by hand.
- pandas: Pandas is a library built on top of NumPy that provides data structures such as the DataFrame, along with operations for manipulating tabular data.
- re: Python's built-in regular expression module, used here to extract the release year.
Approach:
Steps to implement web scraping in Python to extract IMDb movie ratings and their details:
- Import the required modules.
Python3
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
- Access the HTML content of the webpage by assigning the URL, requesting the page, and creating a soup object.
Python3
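# Downloading IMDb top 250 movies' data
# URL of the IMDb Top 250 chart (an assumption; adjust it if the page has moved)
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")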
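The request step can be made a little more defensive. The sketch below is one possible approach rather than part of the original walkthrough: the helper names `fetch_html` and `make_soup`, the User-Agent string, and the 10-second timeout are all illustrative assumptions. It sends a browser-like header (some sites reject the default one), sets a timeout, and raises an error on a failed response instead of silently parsing an error page.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header values; any browser-like User-Agent string would do.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; imdb-tutorial-sketch)",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text


def make_soup(html):
    """Parse HTML text with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline check of the parsing half, using a tiny hand-written page.
sample_soup = make_soup("<html><head><title>Top 250</title></head></html>")
print(sample_soup.title.string)
```

With these helpers, `soup = make_soup(fetch_html(url))` would stand in for the two lines that download and parse the page.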
- Extract the movie ratings and their details. Here, we extract data from the BeautifulSoup object using CSS selectors and HTML attributes such as title and data-value.
Python3
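# Each table cell with class 'titleColumn' holds the rank, title, and year
movies = soup.select('td.titleColumn')
# The 'title' attribute of each link lists the director and main cast
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
# The rating is stored in the 'data-value' attribute of the span named 'ir'
ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]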
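To see what these selectors return without making a network request, they can be run against a small hand-written table. The HTML below is invented sample data that only imitates the layout the selectors expect (the film names, people, and ratings are placeholders), so treat it as a sketch of the technique rather than a copy of the real page.

```python
from bs4 import BeautifulSoup

# Invented sample markup: two rows shaped the way the selectors expect.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.2"></span></td>
    <td class="titleColumn">
      1.
      <a href="/title/x1/" title="Director One (dir.), Actor A, Actor B">Example Film A</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
  </tr>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.1"></span></td>
    <td class="titleColumn">
      2.
      <a href="/title/x2/" title="Director Two (dir.), Actor C, Actor D">Example Film B</a>
      <span class="secondaryInfo">(1972)</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same three selections as in the step above.
movies = soup.select("td.titleColumn")
crew = [a.attrs.get("title") for a in soup.select("td.titleColumn a")]
ratings = [b.attrs.get("data-value")
           for b in soup.select("td.posterColumn span[name=ir]")]

# get_text() keeps the source whitespace, so collapse it before printing.
for cell, cast, rating in zip(movies, crew, ratings):
    print(" ".join(cell.get_text().split()), "|", cast, "|", rating)
```

Each `td.titleColumn` cell yields one string containing the rank, title, and year, which is why the next step has to split that string apart.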
- After extracting the movie details, create an empty list, store each movie's details in a dictionary, and append that dictionary to the list.
Python3
# create an empty list for storing
# movie information
list = []

# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):

    # Separating movie into: 'place',
    # 'title', 'year'
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    # The rank shown on the page is 1-based, so measure index + 1
    rank_length = len(str(index + 1))
    movie_title = movie[rank_length + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_length]
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    list.append(data)
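Slicing by character position works only while every row follows exactly the same layout, and removing every '.' also strips dots that belong to a title. As an alternative sketch (a swapped-in technique, not the method used above), a single regular expression can split the row into rank, title, and year. The helper name `parse_movie_row` and the pattern are assumptions made for illustration, and the sample strings are invented.

```python
import re

# Rank, a dot, the title, then a four-digit year in parentheses.
ROW_PATTERN = re.compile(r"^(\d+)\.\s+(.*)\s+\((\d{4})\)$")


def parse_movie_row(text):
    """Split 'N. Title (YYYY)' into its parts; return None if it doesn't match."""
    # Collapse newlines and repeated spaces left over from the HTML.
    normalized = " ".join(text.split())
    match = ROW_PATTERN.match(normalized)
    if match is None:
        return None
    place, title, year = match.groups()
    return {"place": place, "movie_title": title, "year": year}


# Invented rows, including a two-digit rank and a title containing dots.
print(parse_movie_row("  1.\n  Example Film A\n  (1994)  "))
print(parse_movie_row("10. Example Film, Inc. (1999)"))
print(parse_movie_row("not a ranked row"))
```

Because the pattern anchors on the leading rank and the trailing year, ranks of any length and titles that contain dots or parentheses are handled without adjusting slice offsets.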
- Now our list is filled with the top IMDb movies along with their details. Display the list of movie details:
Python3
for movie in list:
    print(movie['place'], '-', movie['movie_title'],
          '(' + movie['year'] + ') -',
          'Starring:', movie['star_cast'], movie['rating'])
- Using the following lines of code, the same data can be saved to a .csv file and used later as a dataset.
Python3
# saving the list as a DataFrame,
# then converting it into a .csv file
df = pd.DataFrame(list)
df.to_csv('imdb_top_250_movies.csv', index=False)
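Every value scraped from the page is a string, so it may help to convert the numeric columns before saving. The sketch below uses invented rows in the same shape as the dictionaries built earlier, and it writes to an in-memory buffer instead of a file so it can be run without touching the disk. Rounding the rating to one decimal place is an assumption about what the dataset should contain, not something the steps above require.

```python
import io

import pandas as pd

# Invented rows shaped like the dictionaries built in the loop above.
rows = [
    {"place": "1", "movie_title": "Example Film A", "rating": "9.234567",
     "year": "1994", "star_cast": "Director One (dir.), Actor A"},
    {"place": "2", "movie_title": "Example Film B", "rating": "8.987654",
     "year": "1972", "star_cast": "Director Two (dir.), Actor C"},
]

df = pd.DataFrame(rows)

# Convert the string columns that hold numbers.
df["place"] = df["place"].astype(int)
df["year"] = df["year"].astype(int)
df["rating"] = pd.to_numeric(df["rating"]).round(1)

# to_csv accepts a file-like object as well as a file path.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name such as 'imdb_top_250_movies.csv' in place of `buffer` would write the same content to disk.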
Implementation: Complete Code
Python3
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# Downloading IMDb top 250 movies' data
# URL of the IMDb Top 250 chart (an assumption; adjust it if the page has moved)
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]

# create an empty list for storing
# movie information
list = []

# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):

    # Separating movie into: 'place',
    # 'title', 'year'
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    # The rank shown on the page is 1-based, so measure index + 1
    rank_length = len(str(index + 1))
    movie_title = movie[rank_length + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_length]
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    list.append(data)

# printing movie details with its rating.
for movie in list:
    print(movie['place'], '-', movie['movie_title'],
          '(' + movie['year'] + ') -',
          'Starring:', movie['star_cast'], movie['rating'])

# saving the list as a DataFrame,
# then converting it into a .csv file
df = pd.DataFrame(list)
df.to_csv('imdb_top_250_movies.csv', index=False)
Output:
In addition to the output printed in the terminal, a .csv file named imdb_top_250_movies.csv is saved in the current working directory. It contains one row per movie with the columns place, movie_title, rating, year, and star_cast.