Web scraping is a method of extracting useful data from a website using computer programs, without having to do it manually. The extracted data can then be exported and categorically organized for various purposes. Some common places where web scraping finds its use are market research and analysis, price-comparison tools, search engines, data collection for AI/ML projects, etc.
Let’s dive deep and scrape a website. In this article, we are going to take the Lazyroar website and extract the titles of all the articles available on the Homepage using a Python script.
Note that there are thousands of articles on the website, and to extract all of them we will have to scrape through all the pages so that we don’t miss out on any!
Scraping Multiple Pages of a Website Using Python
Now, there may arise various instances where you may want to get data from multiple pages of the same website, or from multiple different URLs, and manually writing code for each webpage is a time-consuming and tedious task. Plus, it defies the basic principle of automation. Duh!
To solve this exact problem, we will see two main techniques that will help us extract data from multiple webpages:
- The same website
- Different website URLs
Approach:
The approach of the program will be fairly simple, and it is easiest to understand as a list of steps:
- Import all the necessary libraries.
- Set up our URL strings for making a connection using the requests library.
- Parse the available data from the target page using the BeautifulSoup library’s parser.
- From the target page, identify and extract the classes and tags that contain the information valuable to us.
- Prototype the script for one page, then apply it to all the pages using a loop.
Example 1: Looping through the page numbers
Most websites have their pages numbered from 1 to N. This makes it really simple for us to loop through these pages and extract data from them, as the pages share a similar structure. On such sites, the page number appears at the end of the URL (for example, page/2/). Using this information we can easily create a for loop that iterates over as many pages as we want, by putting page/(i)/ in the URL string and iterating “i” up to N, and scrape all the useful data from them.
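As a quick illustration, here is a minimal sketch of how such page URLs can be built in a loop. The base URL below is a placeholder, not the actual site.

Python3

# Minimal sketch: building page URLs of the form .../page/<i>/ in a loop.
# BASE is a placeholder; replace it with the real site's base URL.
BASE = "https://www.example.com/page/"

for i in range(1, 6):            # first 5 pages, just as an illustration
    page_url = f"{BASE}{i}/"     # e.g. https://www.example.com/page/1/
    print(page_url)

With the URL pattern in hand, the following code shows how to scrape the data from a single page; we will then wrap it in a loop to cover all the pages.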
Python
import requests
from bs4 import BeautifulSoup as bs

# Placeholder URL: replace with the homepage you want to scrape
URL = "https://www.example.com/"

req = requests.get(URL)
soup = bs(req.text, 'html.parser')

# on this site's layout, the article titles sit inside <div class="head"> elements;
# the first few matches are not article titles, so index 4 is printed as a sample
titles = soup.find_all('div', attrs={'class': 'head'})
print(titles[4].text)
Output:
Now, using the above code, we can get the titles of all the articles by simply wrapping those lines in a loop.
Python
import requests
from bs4 import BeautifulSoup as bs

# Placeholder base URL: pages are assumed to follow .../page/<n>/
URL = "https://www.example.com/page/"

# note: the website has well over 5000 pages; we only take the
# first 10 here as this is just an example
for page in range(1, 11):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    # titles[4:19] holds the 15 article titles on each page of this layout
    for i in range(4, 19):
        # number the titles continuously across pages
        print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)
Output:
Note: The above code will fetch the first 10 pages from the website and scrape all 150 article titles that fall under those pages.
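If you want to export the scraped titles rather than just print them, as mentioned at the start of the article, you can collect them in a list and write them to a file. Below is a minimal sketch under the same assumptions as above (placeholder base URL, 15 titles per page starting at index 4); the CSV file name is illustrative.

Python3

import csv
import requests
from bs4 import BeautifulSoup as bs

# Placeholder base URL: pages are assumed to follow .../page/<n>/
URL = "https://www.example.com/page/"
all_titles = []

for page in range(1, 11):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    # collect this page's 15 article titles instead of printing them
    all_titles.extend(t.text.strip() for t in titles[4:19])

# write the collected titles to a CSV file (illustrative file name)
with open('titles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    writer.writerows([t] for t in all_titles)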
Example 2: Looping through a list of different URLs.
The above technique works wonderfully, but what if you need to scrape different pages and you don’t know their page numbers? In that case you would have to fetch those URLs one by one and manually write a script for every such webpage.
Instead, you could just make a list of these URLs and loop through them. By simply iterating over the items in the list, i.e. the URLs, we can extract the titles of those pages without having to write code for each page. Here’s an example of how you can do it.
Python
import requests
from bs4 import BeautifulSoup as bs

# Placeholder list of URLs to scrape; replace with the actual pages you need
URL = ["https://www.example.com/page/1/",
       "https://www.example.com/page/2/"]

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    # titles[4:19] holds the 15 article titles on each page of this layout
    for i in range(4, 19):
        # number the titles continuously across the URLs in the list
        print(f"{(i - 3) + url * 15} " + titles[i].text)
Output:
How to avoid getting your IP address banned?
Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with many requests within a very short span of time will most likely get your IP address blacklisted. To avoid this, we can simply carry out our crawling in short random bursts of time. In other words, we add pauses, or little breaks, between crawling periods, which helps us look like actual humans; websites can easily identify a crawler by the speed at which it visits pages compared to a human. This also helps avoid unnecessary traffic and keeps the website’s servers from being overloaded. Win-win!
Now, how do we control the crawl rate? It’s simple: by using two functions, randint() and sleep(), from the Python modules random and time respectively.
Python3
from random import randint
from time import sleep

print(randint(1, 10))
Output:

1
The randint() function will choose a random integer between the given lower and upper limits, in this case 1 and 10 respectively, for every iteration of the loop. Using the randint() function in combination with the sleep() function will help in adding short, random breaks to the crawling rate of the program. The sleep() function basically pauses the execution of the program for the given number of seconds; here, that number of seconds is chosen randomly by randint(). Use the code given below for reference.
Python3
from random import randint
from time import sleep

for i in range(0, 3):
    # pick a random pause length between 2 and 5 seconds
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')
Output:

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds
To get you a clear idea of this function in action, refer to the code given below.
Python3
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

# Placeholder base URL: pages are assumed to follow .../page/<n>/
URL = "https://www.example.com/page/"

# note: the website has well over 5000 pages; we only take the
# first 10 here as this is just an example
for page in range(1, 11):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    for i in range(4, 19):
        # number the titles continuously across pages
        print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)

    # pause for a random 2-10 seconds between pages to mimic human browsing
    sleep(randint(2, 10))
Output: