Web Scraping – Amazon Customer Reviews

By Shaida Kate Naidoo

27 July 2024

0

In this article, we are going to see how we can scrape the amazon customer review using Beautiful Soup in Python.

Module needed

bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install this type the below command in the terminal.

pip install bs4

requests: Request allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install this type the below command in the terminal.

pip install requests

To begin with web scraping, we first have to do some setup. Import all the required modules. Get the cookies data for making the request to amazon, without this you can not able to scrape. Create a header that contains your request cookies, without cookies you can not scrape amazon data it always shows some error. This website will provide you, specific user agent.

Pass the URL in the getdata() function(User Defined Function) to that will request to a URL, it returns a response. We are using get method to retrieve information from the given server using a given URL.

Syntax:

requests.get(url, args)

Convert that data into HTML code and then Parse the HTML content using bs4.

Syntax: soup = BeautifulSoup(r.content, ‘html5lib’)

Parameters:

r.content : It is the raw HTML content.

html.parser : Specifying the HTML parser we want to use.

Now filter the required data using soup.Find_all function.

Program:

Python3

# import module
import requests
from bs4 import BeautifulSoup
  
HEADERS = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
            AppleWebKit/537.36 (KHTML, like Gecko) \
            Chrome/90.0.4430.212 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})
  
# user define function
# Scrape the data
def getdata(url):
    r = requests.get(url, headers=HEADERS)
    return r.text
  
  
def html_code(url):
  
    # pass the url
    # into getdata function
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
  
    # display html code
    return (soup)
  
  
url = "https://www.amazon.in/Columbia-Mens-wind-\
resistant-Glove/dp/B0772WVHPS/?_encoding=UTF8&pd_rd\
_w=d9RS9&pf_rd_p=3d2ae0df-d986-4d1d-8c95-aa25d2ade606&pf\
_rd_r=7MP3ZDYBBV88PYJ7KEMJ&pd_rd_r=550bec4d-5268-41d5-\
87cb-8af40554a01e&pd_rd_wg=oy8v8&ref_=pd_gw_cr_cartx&th=1"
  
soup = html_code(url)
print(soup)

Output:

Note: This is only HTML code or Raw data.

Now since the core setup is done let us see how scraping for a specific requirement can be done.

Scrape customer name

Now find the customer list with span tag where class_ = a-profile-name. You can open the webpage in the browser and inspect the relevant element by pressing right-click as shown in the figure.

You have to pass the tag name and attribute with its corresponding value to the find_all() function.

Code:

Python

def cus_data(soup):
    # find the Html tag
    # with find()
    # and convert into string
    data_str = ""
    cus_list = []
  
    for item in soup.find_all("span", class_="a-profile-name"):
        data_str = data_str + item.get_text()
        cus_list.append(data_str)
        data_str = ""
    return cus_list
  
  
cus_res = cus_data(soup)
print(cus_res)

Output:

[‘Amaze’, ‘Robert’, ‘D. Kong’, ‘Alexey’, ‘Charl’, ‘RBostillo’]

Scrape user review:

Now find the customer review as same above methods. Find the unique class name with a specific tag, here we use div tag.

Code:

Python3

def cus_rev(soup):
    # find the Html tag
    # with find()
    # and convert into string
    data_str = ""
  
    for item in soup.find_all("div", class_="a-expander-content \
    reviewText review-text-content a-expander-partial-collapse-content"):
        data_str = data_str + item.get_text()
  
    result = data_str.split("\n")
    return (result)
  
  
rev_data = cus_rev(soup)
rev_result = []
for i in rev_data:
    if i is "":
        pass
    else:
        rev_result.append(i)
rev_result

Output:

Scraping Production information

Here we will scrape product information like product name, ASIN number, Weight, dimension. By doing this we will use the span tag and with a specific unique class name.

Code:

Python3

def product_info(soup):
  
    # find the Html tag
    # with find()
    # and convert into string
    data_str = ""
    pro_info = []
  
    for item in soup.find_all("ul", class_="a-unordered-list a-nostyle\
    a-vertical a-spacing-none detail-bullet-list"):
        data_str = data_str + item.get_text()
        pro_info.append(data_str.split("\n"))
        data_str = ""
    return pro_info
  
  
pro_result = product_info(soup)
  
# Filter the required data
for item in pro_result:
    for j in item:
        if j is "":
            pass
        else:
            print(j)

Output:

Scraping Review Image:

Here we will extract the image link from the review of the product using the same as the above methods. The tag name and attribute of the tag is passed to findAll() as above.

Code:

Python3

def rev_img(soup):
  
    # find the Html tag
    # with find()
    # and convert into string
    data_str = ""
    cus_list = []
    images = []
    for img in soup.findAll('img', class_="cr-lightbox-image-thumbnail"):
        images.append(img.get('src'))
    return images
  
  
img_result = rev_img(soup)
img_result

Output:

Saving details into CSV file:

Here we will save the details into the CSV file, We will convert the data into dataframe and then export it into the CSV, Let us see how to export a Pandas DataFrame to a CSV file. We will be using the to_csv() function to save a DataFrame as a CSV file.

Syntax : to_csv(parameters)
Parameters :

path_or_buf : File path or object, if None is provided the result is returned as a string.

Code:

Python3

import pandas as pd
  
# initialise data of lists.
data = {'Name': cus_res,
        'review': rev_result}
  
# Create DataFrame
df = pd.DataFrame(data)
  
# Save the output.
df.to_csv('amazon_review.csv')

Output:

Web Scraping – Amazon Customer Reviews

Module needed

Python3

Scrape customer name

Python

Scrape user review:

Python3

Scraping Production information

Python3

Scraping Review Image:

Python3

Saving details into CSV file:

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

NordVPN Not Working in China? Try These Tips by Tim Mocan

Interview with Ihor Demkovych – Chief Security Officer and Head of Engineering at Geniusee by Shauli Zacks

6 Best (REALLY FREE) iPad & iPhone Antivirus Apps in 2025 by Katarina Glamoslija

The Evolution of Phishing Attacks and How to Combat Them Copy by

Recent Comments

EDITOR PICKS

NordVPN Not Working in China? Try These Tips by Tim Mocan

Interview with Ihor Demkovych – Chief Security Officer and Head of Engineering at Geniusee by Shauli Zacks

6 Best (REALLY FREE) iPad & iPhone Antivirus Apps in 2025 by Katarina Glamoslija

POPULAR POSTS

NordVPN Not Working in China? Try These Tips by Tim Mocan

Interview with Ihor Demkovych – Chief Security Officer and Head of Engineering at Geniusee by Shauli Zacks

6 Best (REALLY FREE) iPad & iPhone Antivirus Apps in 2025 by Katarina Glamoslija

POPULAR CATEGORY

ABOUT US

FOLLOW US