
Web Scraping Coronavirus Data into MS Excel

Prerequisites: Web Scraping using BeautifulSoup

Coronavirus cases are increasing rapidly worldwide. This article will guide you through web scraping Coronavirus data and exporting it into MS Excel.

What is Web Scraping?

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale. Web scraping, also known as online data mining, is the method of extracting data from a website. The data is gathered and then translated into a medium that is more useful to the user, such as a spreadsheet or an API.

Approach: 

  1. Request a response from the webpage.
  2. Parse and extract the data with the BeautifulSoup class of the bs4 module and the lxml parser.
  3. Download and export the data into Excel with pandas.

The Data Source:

We need a webpage to fetch the coronavirus data from, so we will use the Worldometer website here. Worldometer’s webpage looks something like this:

[Image: the Worldometer data source]

Programmatic Implementation

There are a few libraries you will need, so install them first. Go to your command line and run:

pip install requests
pip install lxml
pip install bs4
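
To confirm the installs worked, you can run a quick sanity check (a minimal sketch; each package exposes a standard version attribute):

Python3

# Confirm the scraping stack imports cleanly and print versions
import requests
import bs4
import pandas
from lxml import etree

print(requests.__version__)
print(bs4.__version__)
print(etree.__version__)
print(pandas.__version__)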

Now let’s see what we can do with these libraries.

Below are the steps for Web Scraping Coronavirus Data into Excel:

Step 1) Use the requests library to grab the page.

Python3

# Import required module
import requests

# URL of the Worldometer coronavirus page
url = 'https://www.worldometers.info/coronavirus/'

# Make a request to the webpage
result = requests.get(url)


The requests library that we installed goes and gets a response from the webpage: requests.get(url) sends an HTTP GET request to the given URL. If the request is successful, the response body is stored as one giant Python string, and running result.text fetches the complete webpage source code. The code will not be structured yet, though.

Note: This may fail if a firewall is blocking Python/Jupyter. Sometimes you need to run the request twice if it fails the first time.
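
If the plain requests.get() call is flaky, a slightly more defensive version can help. This is a minimal sketch: the timeout value and the User-Agent header are illustrative choices, not requirements of the site.

Python3

# Import required module
import requests

url = 'https://www.worldometers.info/coronavirus/'

# A timeout stops the call from hanging forever, and a browser-like
# User-Agent header gets past some basic bot blocks
result = requests.get(url, timeout=10,
                      headers={'User-Agent': 'Mozilla/5.0'})

# Raise an exception on 4xx/5xx responses instead of failing silently
result.raise_for_status()

# Peek at the first 500 characters of the raw HTML
print(result.text[:500])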

Step 2) Use the BeautifulSoup class to extract data from the website.

The bs4 library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). It is a Python library for pulling data out of HTML and XML files. Using the BeautifulSoup class of the bs4 module, we can create a soup object that contains all the ingredients of the webpage.

Python3

# Import required module
import bs4

# Creating soup object
soup = bs4.BeautifulSoup(result.text, 'lxml')


Importing bs4 lets us create a BeautifulSoup object. We pass two things to the constructor: the result.text string and 'lxml' as the parser. lxml goes through the HTML document and figures out the different CSS classes, ids, HTML elements, tags, and so on.
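
Before hunting for the case counters, it can help to poke around the soup object with a few standard BeautifulSoup accessors (a quick sketch; nothing here is specific to Worldometer except the soup variable from above):

Python3

# The <title> tag of the page
print(soup.title.string)

# The first <h1> heading, if the page has one
h1 = soup.find('h1')
if h1 is not None:
    print(h1.get_text(strip=True))

# How many <div> tags the page contains
print(len(soup.find_all('div')))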

To extract the data, first find the right element: right-click on the number of cases and hit Inspect. Refer to the attached snapshot below.

[Image: inspecting the website]

We need to find the right class: class_='maincounter-number' serves our purpose. Refer to the attached snapshot below.

[Image: finding the right class]

The BeautifulSoup object has been created in our Python script, and the HTML data of the website has been scraped off the page. Next, we need to get the data that we are interested in out of the HTML code.

Python3

# Searching div tags having the maincounter-number class
cases = soup.find_all('div', class_='maincounter-number')



There is still a lot of HTML code that we do not want. Our desired data entries are wrapped in div elements with class_='maincounter-number'. We can use this knowledge to further clean up the scraped data.

Step 3) Storing the data

We need to save the scraped data in some form that can be used effectively. For this project, all the data will be saved in a Python list. 

Python3

# List to store number of cases
data = []
 
# Find the span and get data from it
for i in cases:
    span = i.find('span')
    data.append(span.string)
 
# Display number of cases
print(data)


We use span to fetch the data from each div: we need the numbers only, not the surrounding tags, so span.string extracts just the text, and the values are stored in data.

Output: a list of three strings holding the total cases, deaths, and recovered counts.
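
Note that span.string returns text such as '1,234,567', not a number. If you want to do arithmetic on the counts later, here is a minimal cleaning sketch, assuming the comma-separated format Worldometer displays:

Python3

# Strip commas and whitespace, then convert the strings to integers
numeric_data = [int(value.replace(',', '').strip()) for value in data]
print(numeric_data)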

Now that we have the number of cases, we are ready to export our data into an Excel file.

Step 4) Processing the data

Next, we will organize the data for export to MS Excel, for which we are going to use the pandas module. To load the pandas module and start working with it, import the package.

Python3

# Import required module
import pandas as pd

# Creating dataframe
df = pd.DataFrame({"CoronaData": data})

# Naming the rows
df.index = ['TotalCases', 'Deaths', 'Recovered']


A DataFrame is a 2-D, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

df = pd.DataFrame({"CoronaData": data}) creates a DataFrame, names its single column "CoronaData", and maps it to the data list that we created earlier.

Next, we label the rows (TotalCases, Deaths, Recovered) with df.index.
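
To see the shape this produces without scraping anything, here is the same construction with obviously fake placeholder values:

Python3

import pandas as pd

# Placeholder strings purely for illustration -- not real case counts
sample = pd.DataFrame({"CoronaData": ['1,000,000', '50,000', '900,000']})
sample.index = ['TotalCases', 'Deaths', 'Recovered']
print(sample)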

Output: the DataFrame with the three labeled rows and the scraped counts in the CoronaData column.

Step 5) Exporting data into Excel

We are ready to export the data into Excel. We will use the df.to_csv() method for this task; it writes a CSV file, which MS Excel opens directly.

Python3

# Exporting data into Excel
df.to_csv('Corona_Data.csv')


Output: a Corona_Data.csv file created in the working directory.
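
If you prefer a native .xlsx workbook over a CSV, pandas also provides the df.to_excel() method; note that it needs the openpyxl package installed:

Python3

# Write a native .xlsx workbook instead of a CSV
# (requires the openpyxl package: pip install openpyxl)
df.to_excel('Corona_Data.xlsx')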

Below is the complete program from the above steps: 

Python3

# Import required modules
import requests
import bs4
import pandas as pd


# Make a request to the webpage
url = 'https://www.worldometers.info/coronavirus/'
result = requests.get(url)

# Creating soup object
soup = bs4.BeautifulSoup(result.text, 'lxml')

# Searching div tags having the maincounter-number class
cases = soup.find_all('div', class_='maincounter-number')

# List to store number of cases
data = []

# Find the span and get data from it
for i in cases:
    span = i.find('span')
    data.append(span.string)

# Display number of cases
print(data)

# Creating dataframe
df = pd.DataFrame({"CoronaData": data})

# Naming the rows
df.index = ['TotalCases', 'Deaths', 'Recovered']

# Exporting data into Excel
df.to_csv('Corona_Data.csv')


Final Result: the exported Corona_Data.csv opened in MS Excel.

 
