This article describes how to build a simple multithreaded web crawler in Python.
Modules Needed
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install it, run the following command in a terminal.
pip install bs4
requests: This library lets you send HTTP/1.1 requests with very little code. To install it, run the following command in a terminal.
pip install requests
Stepwise implementation
Step 1: First, import all the libraries the crawler needs. With Python 3, everything except BeautifulSoup and requests ships with the standard library, so install those two with the commands above if you have not already.
Python3
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests
Step 2: Create the main program: instantiate the MultiThreadedCrawler class, passing the seed URL to its constructor, then call its run_web_crawler() method followed by info().
Python3
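if __name__ == '__main__':
    # Placeholder seed URL (an assumption); replace it with the site to crawl.
    cc = MultiThreadedCrawler("https://example.com/")
    cc.run_web_crawler()
    cc.info()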
Step 3: Create a class named MultiThreadedCrawler and initialize its state in the constructor. Store the seed URL in the instance variable seed_url, then build the root URL from the seed URL's scheme (for example, https) and network location.
To process the crawl frontier concurrently, create a ThreadPoolExecutor with max_workers set to 5, so that at most five threads run at a time. To avoid visiting the same page twice, keep the history of visited URLs in a set.
Create a queue to hold the URLs in the crawl frontier, and put the seed URL in it as the first item.
Python3
class MultiThreadedCrawler:

    def __init__(self, seed_url):
        self.seed_url = seed_url
        # Root of the site: scheme plus network location
        self.root_url = '{}://{}'.format(
            urlparse(self.seed_url).scheme,
            urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set()
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)
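As a minimal sketch of the bookkeeping the constructor sets up, the snippet below derives a root URL from a seed and shows how a set can keep duplicate URLs out of the work queue. The seed URL and the helper name enqueue_if_new are illustrative assumptions, not part of the crawler class. Note one deliberate difference: the class records a URL when it is taken off the queue, whereas this sketch records it at enqueue time.

```python
from queue import Queue
from urllib.parse import urlparse

# Hypothetical seed URL, used only for illustration.
seed_url = "https://example.com/docs/index.html"

# Keep only the scheme and network location, as the constructor does.
parts = urlparse(seed_url)
root_url = "{}://{}".format(parts.scheme, parts.netloc)

seen = set()        # history of URLs already queued
frontier = Queue()  # thread-safe FIFO of URLs waiting to be fetched


def enqueue_if_new(url):
    """Queue a URL once; return True if it was added (hypothetical helper)."""
    if url in seen:
        return False
    seen.add(url)
    frontier.put(url)
    return True


first = enqueue_if_new(seed_url)
second = enqueue_if_new(seed_url)  # duplicate, so it is skipped
print(root_url, first, second, frontier.qsize())
```

Recording URLs at enqueue time keeps the queue itself free of duplicates, at the cost of one extra set lookup per discovered link.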
Step 4: Create a method named run_web_crawler(). It runs an infinite while loop that keeps taking links from the frontier and submitting them for scraping, and it prints the name of the currently executing process on each iteration.
Get the next URL from the crawl frontier with a timeout of 60 seconds; if the queue stays empty for that long, the Empty exception ends the crawl. If the URL has not been visited yet, add it to the scraped_pages set to record it in the history of visited pages, then submit scrape_page and the target URL to the thread pool and register post_scrape_callback to run when the job finishes.
Python3
def run_web_crawler(self):
    while True:
        try:
            print("\n Name of the current executing process: ",
                  multiprocessing.current_process().name, '\n')
            # Wait up to 60 seconds for a new URL before giving up
            target_url = self.crawl_queue.get(timeout=60)
            if target_url not in self.scraped_pages:
                print("Scraping URL: {}".format(target_url))
                self.scraped_pages.add(target_url)
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
        except Empty:
            return
        except Exception as e:
            print(e)
            continue
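The termination logic of this loop can be tried in isolation. The sketch below, which uses only the standard library, drains a queue into a thread pool and stops once Queue.get() times out. The function name drain, its parameters, and the squaring handler are assumptions made for the example, and the timeout is shortened from 60 seconds so the demonstration finishes quickly.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty


def drain(work_queue, handler, idle_timeout=0.2, max_workers=2):
    """Submit queued items to a thread pool until the queue stays empty.

    idle_timeout is deliberately short here; the crawler waits 60 seconds.
    """
    submitted = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            try:
                item = work_queue.get(timeout=idle_timeout)
            except Empty:
                # Nothing arrived within the timeout: treat the work as done.
                break
            submitted.append(pool.submit(handler, item))
    # Leaving the with-block waits for every submitted job to complete.
    return [job.result() for job in submitted]


q = Queue()
for n in (1, 2, 3):
    q.put(n)

results = drain(q, lambda n: n * n)
print(results)
```

In the crawler, worker callbacks keep adding URLs to the queue while the loop runs, so the timeout has to be long enough to cover a slow page fetch; otherwise the loop may exit while work is still in flight.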
Step 5: Create a method named scrape_page() that sends a GET request for the URL with a connect timeout of 3 seconds and a read timeout of 30 seconds. If the request succeeds, return the response; if it raises requests.RequestException, return None.
Python3
def scrape_page(self, url):
    try:
        # timeout=(connect timeout, read timeout), both in seconds
        res = requests.get(url, timeout=(3, 30))
        return res
    except requests.RequestException:
        return
Step 6: Create a method named scrape_info() and pass the page's HTML to BeautifulSoup, which repairs malformed HTML and exposes the document as an easily traversable tree. This method uses the html5lib parser, which is a separate package and must be installed as well (pip install html5lib).
Use BeautifulSoup to collect the text of every paragraph (p) tag in the document, skipping paragraphs whose text contains 'https:'.
Python3
def scrape_info(self, html):
    soup = BeautifulSoup(html, "html5lib")
    # Calling soup('p') is shorthand for soup.find_all('p')
    web_page_paragraph_contents = soup('p')
    text = ''
    for para in web_page_paragraph_contents:
        if not ('https:' in str(para.text)):
            text = text + str(para.text).strip()
    print('\n <-----Text Present in The WebPage is--->\n', text, '\n')
    return
Step 7: Create a method named parse_links() and use BeautifulSoup to extract all the anchor tags in the HTML document. soup.find_all('a', href=True) returns a list of every anchor tag on the page that has an href attribute. Store the tags in a list named Anchor_Tags. For each anchor tag in Anchor_Tags, retrieve the value of its href attribute with link['href'], then check whether the retrieved URL is absolute or relative.
- Relative URL: a URL without the protocol name and root URL.
- Absolute URL: a URL with the protocol name, root URL, and document name.
If it is a relative URL, use the urljoin() function to convert it to an absolute URL from the root URL and the relative URL. Only URLs that begin with '/' or with the root URL are kept, which confines the crawl to the seed site. Then check whether the URL has already been visited; if it has not, put it in the crawl queue.
Python3
def parse_links(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    Anchor_Tags = soup.find_all('a', href=True)
    for link in Anchor_Tags:
        url = link['href']
        # Keep only links that stay on the seed site
        if url.startswith('/') or url.startswith(self.root_url):
            url = urljoin(self.root_url, url)
            if url not in self.scraped_pages:
                self.crawl_queue.put(url)
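To see what the prefix test and urljoin() do to different kinds of links, here is a small standalone sketch that needs only the standard library. The root URL, the sample hrefs, and the helper name same_site_links are hypothetical and chosen purely for illustration.

```python
from urllib.parse import urljoin

# Hypothetical root URL and hrefs, chosen only to illustrate the filter.
root_url = "https://example.com"
hrefs = [
    "/about",                          # relative to the site root
    "https://example.com/blog/post",   # absolute, same site
    "https://other.example.org/page",  # absolute, different site
    "contact.html",                    # relative, no leading slash
]


def same_site_links(root, links):
    """Return absolute URLs for links that pass the crawler's prefix test."""
    kept = []
    for href in links:
        if href.startswith("/") or href.startswith(root):
            # urljoin resolves relative URLs and leaves absolute ones unchanged.
            kept.append(urljoin(root, href))
    return kept


print(same_site_links(root_url, hrefs))
```

Note that a document-relative link such as contact.html fails the prefix test and is skipped, so a crawler built this way only follows root-relative links and absolute links on the seed site.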
Step 8: Create a callback named post_scrape_callback() that runs when a scrape job finishes. If the job returned a response with status code 200, pass the response text to parse_links() to extract the links and to scrape_info() to extract the content.
Python3
def post_scrape_callback(self, res):
    # res is the finished Future; result() is what scrape_page returned
    result = res.result()
    if result and result.status_code == 200:
        self.parse_links(result.text)
        self.scrape_info(result.text)
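The callback mechanism can be exercised without any network access. In the sketch below, fake_fetch is a hypothetical stand-in for scrape_page that returns None to mimic a failed request, and on_done plays the role of post_scrape_callback; both names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

handled = []


def fake_fetch(value):
    """Stand-in for scrape_page: returns None to signal a failed request."""
    return None if value < 0 else value * 10


def on_done(future):
    # The callback receives the Future, not the return value itself.
    result = future.result()
    if result is not None:
        handled.append(result)


with ThreadPoolExecutor(max_workers=2) as pool:
    for value in (1, -1, 2):
        pool.submit(fake_fetch, value).add_done_callback(on_done)

# The pool has shut down here, so every callback has already run.
print(sorted(handled))
```

Because the callback usually runs on the worker thread that finished the job, any work done inside it, such as parsing HTML, occupies a pool thread until it returns.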
Below is the complete implementation:
Python3
import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests


class MultiThreadedCrawler:

    def __init__(self, seed_url):
        self.seed_url = seed_url
        # Root of the site: scheme plus network location
        self.root_url = '{}://{}'.format(
            urlparse(self.seed_url).scheme,
            urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set()
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)

    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        Anchor_Tags = soup.find_all('a', href=True)
        for link in Anchor_Tags:
            url = link['href']
            # Keep only links that stay on the seed site
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.crawl_queue.put(url)

    def scrape_info(self, html):
        soup = BeautifulSoup(html, "html5lib")
        web_page_paragraph_contents = soup('p')
        text = ''
        for para in web_page_paragraph_contents:
            if not ('https:' in str(para.text)):
                text = text + str(para.text).strip()
        print('\n <---Text Present in The WebPage is --->\n', text, '\n')
        return

    def post_scrape_callback(self, res):
        # res is the finished Future; result() is what scrape_page returned
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        try:
            # timeout=(connect timeout, read timeout), both in seconds
            res = requests.get(url, timeout=(3, 30))
            return res
        except requests.RequestException:
            return

    def run_web_crawler(self):
        while True:
            try:
                print("\n Name of the current executing process: ",
                      multiprocessing.current_process().name, '\n')
                # Wait up to 60 seconds for a new URL before giving up
                target_url = self.crawl_queue.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.current_scraping_url = "{}".format(target_url)
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

    def info(self):
        print('\n Seed URL is: ', self.seed_url, '\n')
        print('Scraped pages are: ', self.scraped_pages, '\n')


if __name__ == '__main__':
    # Placeholder seed URL (an assumption); replace it with the site to crawl.
    cc = MultiThreadedCrawler("https://example.com/")
    cc.run_web_crawler()
    cc.info()