In this article, we discuss how to avoid getting caught while web scraping. Let’s look at the available techniques in detail:
Robots.txt
- It is a text file created by the webmaster that tells search engine crawlers and other bots which pages they are allowed to crawl, so check and respect robots.txt before scraping.
- Example: GFG’s robots.txt contains “User-agent: *”, which means the section applies to all robots, and it lists a few pages that no web crawler is allowed to crawl.
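As a minimal sketch of respecting robots.txt, Python's standard `urllib.robotparser` module can answer whether a given URL may be fetched. The rules, the `example.com` URLs, and the `my-scraper` agent name below are made-up sample values, not GFG's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()


def build_parser(robots_txt_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# "my-scraper" is an assumed user-agent name for this example.
print(parser.can_fetch("my-scraper", "https://example.com/private/page"))
print(parser.can_fetch("my-scraper", "https://example.com/articles/"))
```

For a live site, the same parser can download the file itself: call `parser.set_url()` with the site's robots.txt address, then `parser.read()`, before calling `can_fetch()`.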
IP Rotation
- Sending too many requests from a single IP address is a clear indication that you are automating HTTP/HTTPS requests, and the webmaster is likely to block your IP address to stop further scraping.
- The best alternative is to use proxies and switch to a different proxy after a certain number of requests, which reduces the chance of an IP block interrupting the scraper.
- Prefer premium proxies, especially residential IP addresses, since data center IP addresses are more likely to have been flagged already because of other users and may return connection errors.
Proxy Types:
- Data Center Proxy: These proxies come from cloud service providers and are sometimes flagged because many people use them, but since they are cheaper, a pool of them can be bought for web scraping.
- Residential IP Proxy: These proxies use IP addresses from local ISPs, so the webmaster cannot easily tell whether the visitor is a scraper or a real person browsing the website. They are very expensive compared to Data Center Proxies and may raise legal concerns, as the owner may not be aware that their IP is being used for web scraping.
- Mobile IP Proxy: These proxies use the IPs of private mobile devices, provided by mobile network operators, and work similarly to Residential IP Proxies. They are very expensive and may raise legal concerns, as the device owner may not be aware that their GSM network is being used for web scraping.
Example:
- Create a pool of proxy servers and rotate or iterate them.
- Import the requests module and send a GET request through the proxy to an IP-echo URL such as “https://ipecho.net/plain”.
Syntax:
requests.get(url, proxies={"http": proxy, "https": proxy})
- If there is no connection error, the response is the IP address of the current proxy server.
Program:
Python3
# Import the required modules
import requests

# https://ipecho.net/plain returns the IP address of the
# current session when a GET request is sent to it.
url = "https://ipecho.net/plain"

# Create a pool of proxies (left empty here; add your own proxy URLs)
proxies = []

# Iterate over the proxies and check whether each one is working
for proxy in proxies:
    try:
        page = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        # Prints the proxy server's IP address if the proxy is alive
        print("Status OK, Output:", page.text)
    except OSError as e:
        # The proxy returned a connection error
        print(e)
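Rotating to a new proxy after a fixed number of requests, as described above, could be sketched as follows. The `ProxyRotator` class, the `proxies_for` helper, and the proxy addresses are hypothetical names and values chosen for illustration; this is one possible approach, not a definitive implementation.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies from a pool, switching after a fixed number of uses."""

    def __init__(self, proxies, requests_per_proxy):
        if not proxies:
            raise ValueError("the proxy pool must not be empty")
        if requests_per_proxy < 1:
            raise ValueError("requests_per_proxy must be at least 1")
        self._pool = cycle(proxies)        # loops over the pool indefinitely
        self._limit = requests_per_proxy
        self._current = next(self._pool)
        self._used = 0

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        if self._used >= self._limit:
            # The current proxy has served its share; move to the next one.
            self._current = next(self._pool)
            self._used = 0
        self._used += 1
        return self._current


def proxies_for(proxy):
    """Build the mapping that requests.get() expects in its proxies argument."""
    return {"http": proxy, "https": proxy}


# Placeholder proxy addresses, assumed for the example.
rotator = ProxyRotator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"], 2)
for _ in range(5):
    print(rotator.next_proxy())
```

Each request would then be sent with `requests.get(url, proxies=proxies_for(rotator.next_proxy()))`, so no single proxy carries more than the chosen number of consecutive requests.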
User-Agent
- The User-Agent request header is a character string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.
- Some websites require a major browser’s User-Agent and will not serve the content otherwise, so the best approach is to create a list of browser User-Agent strings and iterate over them, or to use UserAgent from the fake-useragent module, and pass the value as a header with each request.
Example:
Python3
import requests

url = "https://www.geeksforgeeks.org/"

# Option 1: set a browser User-Agent string by hand
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 "
                  "(KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2"
}
response = requests.get(url, headers=header)

# Option 2: use UserAgent from the fake_useragent module
from fake_useragent import UserAgent

ua = UserAgent()
header = {"User-Agent": str(ua.chrome)}
response = requests.get(url, headers=header)
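The "list of User-Agents" approach mentioned above could look roughly like this, picking a different string for each request. The `random_headers` helper is a hypothetical name, and the User-Agent strings are illustrative values only; in practice the list should hold strings from current browser versions.

```python
import random

# Illustrative User-Agent strings (assumed values, replace with current ones).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 "
    "(KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
]


def random_headers(user_agents=USER_AGENTS):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(user_agents)}


print(random_headers())
```

The returned dictionary can be passed directly as the `headers` argument of `requests.get()`.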
Referrer Header
- The Referer header (the HTTP specification spells it with a single “r”) is an HTTP request header that lets the site know which page you are arriving from.
- If you want the request to look as though it arrives from Google, pass the referring URL in the Referer header when sending the GET request.
Syntax:
requests.get(url, headers={"Referer": referrer_url})
Headless Browser
- Using Selenium/Puppeteer in headless mode makes the traffic look more like a real visitor, since the website is scraped by automating an actual browser.
- It is mostly used to scrape dynamic websites, and features such as pagination and authentication can be automated.
- JavaScript commands can also be executed in the page.
Example:
Python3
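# Using the Selenium Chrome WebDriver to create a headless browser
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.geeksforgeeks.org/"

options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Selenium 4 style: the driver path is passed through a Service object.
# Older releases accepted executable_path and chrome_options directly.
driver = webdriver.Chrome(service=Service(executable_path="C:/chromedriver.exe"), options=options)
driver.get(url)
driver.quit()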
Time Intervals
- It is better to add random time intervals between requests using the time module. This slows the scraper down, which reduces the chance of being blocked.
Example:
time.sleep(1) # Sleeps for 1 second
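The fixed one-second pause above can be turned into a random interval. The sketch below assumes a delay of 1 to 3 seconds is acceptable for the target site; `random_delay` and `polite_sleep` are hypothetical helper names.

```python
import random
import time


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Return a random delay in seconds between the two bounds."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    return random.uniform(min_seconds, max_seconds)


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval and return how long the pause was."""
    delay = random_delay(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` between consecutive requests spaces them out unevenly, which looks less like a fixed-rate bot than a constant `time.sleep(1)`.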
Captcha Solving
- Some websites detect web crawlers with a CAPTCHA. You can get past it by using a CAPTCHA solving service, by waiting a few hours, or by changing the IP address before resuming scraping.
- These services cost extra and may increase the time needed to scrape a website, so consider the extra time and expense before choosing a CAPTCHA solving service.
There are a few CAPTCHA solving services like:
- Anti Captcha
- DeathByCaptcha
Avoid Honeypot Traps
- A lot of sites will try to detect web crawlers by putting in invisible links that only a crawler would follow.
- Check whether a link has the “display: none” or “visibility: hidden” CSS property set and avoid following it; otherwise the site will identify you as a scraper. Honeypots are one of the easiest ways for smart webmasters to detect crawlers and block all further requests from them.
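A simplified sketch of this check, using only the standard library, is shown below. It looks only at the inline `style` attribute of each link, so links hidden through an external stylesheet or a hidden parent element would need a fuller check. The `split_links` helper and the sample HTML are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Matches inline styles that hide an element from human visitors.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE
)


class LinkCollector(HTMLParser):
    """Sort the links on a page into visible ones and likely honeypots."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        style = attributes.get("style") or ""
        if HIDDEN_STYLE.search(style):
            self.hidden_links.append(href)   # do not follow these
        else:
            self.visible_links.append(href)


def split_links(html):
    """Return (visible_links, hidden_links) found in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.visible_links, collector.hidden_links


sample = (
    '<a href="/articles">Articles</a>'
    '<a href="/trap" style="display: none">hidden</a>'
    '<a href="/trap2" style="VISIBILITY:hidden">hidden</a>'
)
visible, hidden = split_links(sample)
print(visible)
print(hidden)
```

The crawler would then queue only the URLs in `visible` and skip everything in `hidden`.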
Detect Website Changes
- Many websites change their layouts from time to time, which often causes scrapers to fail.
- In addition, some websites use different layouts in unexpected places.
- A monitoring system should therefore be in place to detect layout changes and raise an alert so that the code can be fixed.
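One simple form of such monitoring is to verify that the CSS classes the scraper depends on are still present in the page before parsing it. The sketch below assumes the scraper relies on class names; `missing_classes` and the class names in the sample are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name used in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, expected_classes):
    """Return the expected class names that no longer appear in the page."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(expected_classes) - collector.classes)


# Sample page and expected class names, made up for the example.
page = '<div class="article-title big"><span class="author">A</span></div>'
missing = missing_classes(page, ["article-title", "author", "price"])
if missing:
    print("Layout may have changed, missing classes:", missing)
```

Running this check on a schedule, and sending the printed message to an alerting channel instead, gives an early warning before the scraper starts returning empty data.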
These are some of the ways to avoid getting caught while web scraping.