Web scraping refers to the process of extracting data from a website using the HTTP protocol or a web browser. The process can be manual, or it can be automated using a bot or a web crawler. There is also a misconception that web scraping is illegal; in truth, it is perfectly legal unless you try to access non-public data (data that is not reachable by the public, such as login credentials).
When you scrape small websites, you might not face any issues. But when you try web scraping on large websites, or even on Google, you may find your requests getting ignored or even your IP getting blocked.
In this article, we will suggest some best practices you can follow while scraping data off the web without getting yourself (your IP) blocked.
Method 1: Using Rotating Proxies
If you send repetitive requests from the same IP, website owners can detect your footprint in their server log files and may block your web scraper. To avoid this, you can use rotating proxies.
A rotating proxy is a proxy server that allocates a new IP address from a set of proxies stored in a proxy pool. We need to use proxies and rotate our IP addresses to avoid being detected by website owners. All we need to do is write a script that picks an IP address from the pool and sends our requests through it. The purpose of rotating IPs is to make it look as though the requests are coming not from a bot but from humans accessing the data from different parts of the world.
The first step is to find a proxy. There are many websites that provide free proxies over the internet; one of them is https://free-proxy-list.net/.
Proxy used here:
- IP: 180.179.98.22
- Port: 3128
Similarly, we can get a list of proxies from https://free-proxy-list.net/ either manually or by automating the process with a scraper.
Program:
```python
import requests
from lxml.html import fromstring
from itertools import cycle


def to_get_proxies():
    # website to get free proxies
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)

    # using a set to avoid duplicate IP entries.
    proxies = set()

    for i in parser.xpath('//tbody/tr')[:10]:
        # to check if the corresponding IP is of type HTTPS
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0],
                              i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies
```
Output:
proxies = {'160.16.77.108:3128', '20.195.17.90:3128', '14.225.5.68:80', '158.46.127.222:52574', '159.192.130.233:8080', '124.106.224.5:8080', '51.79.157.202:443', '161.202.226.194:80'}
Now we have the list of proxy IP addresses available in a set. We'll rotate through the IPs using the round-robin method.
Program:
```python
proxies = to_get_proxies()

# to rotate through the list of IPs
proxyPool = cycle(proxies)

# insert the url of the website you want to scrape.
url = ''

for i in range(1, 11):
    # Get a proxy from the pool
    proxy = next(proxyPool)
    print("Request #%d" % i)
    try:
        response = requests.get(url,
                                proxies={"http": proxy, "https": proxy})
        print(response.json())
    except Exception:
        # Most free proxies will get connection errors,
        # so we just skip to the next proxy instead of retrying.
        print("Skipping. Connection error")
```
Output:
Request #1
Skipping. Connection error
Request #2
Skipping. Connection error
Request #3
Skipping. Connection error
Request #4
Skipping. Connection error
Request #5
Skipping. Connection error
Request #6
Skipping. Connection error
Request #7
Skipping. Connection error
Request #8
Skipping. Connection error
Request #9
Skipping. Connection error
Request #10
Skipping. Connection error
Things to keep in mind while using proxy rotation:
- Avoid using proxy IP addresses that follow a sequence:
Any anti-scraping plugin will detect a scraper if the requests come from the same subnet or arrive in continuous succession. For example, 132.12.12.1, 132.12.12.2, 132.12.12.3, and 132.12.12.4 are in the same sequence. You should avoid using IPs like these, as most anti-scraping plugins are designed to block IPs that fall in a particular range or sequence.
- Use paid proxies:
Free proxies tend to die out quickly. Moreover, free proxies are heavily overused on the internet and are already blacklisted by most anti-scraping tools. If you do use free proxies, automate refreshing the proxy pool so that dead proxies don't disrupt the scraping process.
Method 2: Use IPs of Google Cloud Platform
It may be helpful to use Google Cloud Functions as the hosting platform for your web scraper, combined with changing the user agent to Googlebot. It will appear to the website that the requests come from Googlebot and not from a scraper. Googlebot is the web crawler designed by Google; it visits sites regularly and collects documents to build a searchable index for the Google Search engine. Since most websites do not block Googlebot, there is a higher chance of your crawler not getting blocked if you use Google Cloud Functions as the hosting platform.
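As a minimal sketch (using the requests library and a hypothetical target URL), the user agent can be set like this; note that the request itself would still originate from your Cloud Function's IP:
```python
import requests

# Hypothetical target URL; replace with the site you want to scrape.
url = 'https://example.com'

# Googlebot's publicly documented user-agent string.
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; '
                  '+http://www.google.com/bot.html)'
}

response = requests.get(url, headers=headers)
print(response.status_code)
```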
Method 3: Web Scrape Slowly
When we scrape data using an automated scraper, it works at an inhuman speed that is easily detected by anti-scraping plugins. By adding random delays and actions to our scraper, we can make it resemble a human, so the website owners don't detect it. Sending requests too fast can also crash the website for all its users. Keep the number of requests under a limit so that you don't overload the website's server and get your IP blocked.
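A minimal sketch of adding random delays between requests (the URLs and the 2-6 second delay range here are hypothetical):
```python
import random
import time

import requests

# Hypothetical list of pages to scrape.
urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # Wait a random 2-6 seconds so the request pattern
    # looks less like an automated crawler.
    time.sleep(random.uniform(2, 6))
```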
Also, you can check what the delay between two requests should be by looking at the site's robots.txt. Often you will find a Crawl-delay field there, which tells you exactly how long to wait between requests to avoid being recognized as a crawler.
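Python's standard library can read this value for you; a small sketch, assuming a hypothetical site whose robots.txt publishes a Crawl-delay:
```python
from urllib.robotparser import RobotFileParser

# Hypothetical site; point this at the robots.txt you care about.
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Returns the Crawl-delay for the given user agent, or None if not set.
print(rp.crawl_delay('*'))
```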
Method 4: Web Scrape at Different Times of the Day
Logging into the same website at different times of the day also reduces your footprint. For example, if you start scraping at 8:00 AM every day, then start scraping at around 8:20 or 8:25 AM for the next few days instead. Adding a few minutes to your start time each day can prove tremendously helpful in escaping the site's detection algorithm.
Method 5: Use a CAPTCHA Solving Service
Most websites use CAPTCHA to detect bot traffic. We can use a CAPTCHA solving service to easily bypass this extra layer of security. There are a few CAPTCHA solving services like:
- Anti-Captcha
- DeathByCaptcha
The point to remember is that these services cost extra and may increase the time it takes to scrape data from the websites. So you should weigh the extra time and expense you may have to bear if you choose to use a CAPTCHA solving service.
Method 6: Scraping from Google Cache
To scrape data from websites whose content changes infrequently, we can use Google's cache. Google keeps a cached copy of some websites, and rather than sending a request to the original site, you can request its cached version. To access the cached copy of any web page, append the URL of the website to the prefix below.
Syntax:
http://webcache.googleusercontent.com/search?q=cache:URL (where URL is the address of the website you want to scrape)
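A quick sketch of requesting the cached copy with the requests library (the target page is hypothetical):
```python
import requests

# Hypothetical page whose cached copy we want.
target = 'https://example.com/page'
cache_url = 'http://webcache.googleusercontent.com/search?q=cache:' + target

response = requests.get(cache_url)
print(response.status_code)
```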
Method 7: User-Agent
A user agent is a character string that lets servers and network peers identify the requesting application and the user's operating system and version. Some sites block user agents that don't belong to a major browser, and if a user agent is not set, many websites won't allow you to access their content. You can find your user agent in two ways:
- Typing "what is my user agent" into Google
- You can find the user agent string on this website – http://www.whatsmyuseragent.com/.
The solution to this problem is to either create a list of user agents and rotate through them or use a library like fake-useragent (Python), as sketched below.
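A minimal sketch of rotating user agents, using a small hypothetical pool of browser strings (the fake-useragent alternative is shown in the comments):
```python
import random

import requests

# A small, hypothetical pool of desktop browser user-agent strings.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
]

url = 'https://example.com'  # hypothetical target
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)
print(response.status_code)

# Alternatively, with the fake-useragent library:
# from fake_useragent import UserAgent
# headers = {'User-Agent': UserAgent().random}
```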
Method 8: Headless Browser
Websites change their content according to the browser you're requesting the data from. The issue with some websites is that the content is rendered by JavaScript rather than served as static HTML. To scrape such websites, you may need to deploy your own custom headless browser. Browser-automation tools like Selenium and Puppeteer can also be used to control and scrape such dynamic websites. It is a lot of effort, but this is the most efficient way.
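A minimal sketch using Selenium with headless Chrome (the target URL is hypothetical, and a working chromedriver setup is assumed):
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical JavaScript-rendered page.
    driver.get('https://example.com')

    # page_source contains the HTML after JavaScript has executed.
    print(driver.page_source[:500])
finally:
    driver.quit()
```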