Design a Web Crawler – System Design Interview

By Algomaster
28 June 2025

    Ashish Pratap Singh
    Jun 08, 2025

    A web crawler (also known as a spider) is an automated bot that systematically browses the internet, following links from page to page to discover and collect web content.

    Traditionally, web crawlers have been used by search engines to discover and index web pages. In recent years, they’ve also become essential for training large language models (LLMs) by collecting massive amounts of publicly available text data from across the internet.

    At its core, crawling seems simple:

    1. Start with a list of known URLs (called seed URLs)

    2. Fetch each page

    3. Extract hyperlinks

    4. Add new URLs to the list

    5. Repeat
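The five steps above can be sketched as a breadth-first loop. This is a minimal illustration, not a production design: `FAKE_WEB`, `fetch`, `LinkExtractor`, and `crawl` are hypothetical names, and the in-memory dictionary stands in for real HTTP requests so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" used in place of real HTTP fetches.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": "<p>no links here</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for an HTTP GET; returns None for unknown URLs."""
    return FAKE_WEB.get(url)


def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)          # step 1: start from the seed URLs
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:   # step 5: repeat
        url = queue.popleft()
        html = fetch(url)             # step 2: fetch the page
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)             # step 3: extract hyperlinks
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:  # step 4: add only new URLs
                seen.add(absolute)
                queue.append(absolute)
    return visited


print(crawl(["https://example.com/"]))
```

A single process with an in-memory queue is enough to show the control flow, but it has no prioritization, politeness, or fault tolerance, which is where the challenges below come from.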

    However, designing a crawler that can operate at internet scale, processing billions or even trillions of pages, is anything but simple. It introduces several complex engineering challenges, such as:

    • How do we prioritize which pages to crawl first?

    • How do we ensure we don’t overload the target servers?

    • How do we avoid redundant crawling of the same URL or content?

    • How do we split the work across hundreds or thousands of crawler nodes?

    In this article, we’ll walk through the end-to-end design of a scalable, distributed web crawler. We’ll start with the requirements, map out the high-level architecture, explore database and storage options, and dive deep into the core components.


    1. Requirements

    Before we start drawing boxes and arrows, let’s define what our crawler needs to do.

    1.1 Functional Requirements

    1. Fetch Web Pages: Given a URL, the crawler should be able to download the corresponding content.

    2. Store Content: Save the fetched content for downstream use.

    3. Extract Links: Parse the HTML to discover hyperlinks and identify new URLs to crawl.

    4. Avoid Duplicates: Prevent redundant crawling and storage of the same URL or content. Both URL-level and content-level deduplication should be supported.

    5. Respect robots.txt: Follow site-specific crawling rules defined in robots.txt files, including disallowed paths and crawl delays.

    6. Handle Diverse Content Types: Support HTML as a primary format, but also be capable of recognizing and handling other formats such as PDFs, XML, images, and scripts.

    7. Freshness: Support recrawling of pages based on content volatility. Frequently updated pages should be revisited more often than static ones.
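As one illustration of the robots.txt requirement, Python's standard library ships `urllib.robotparser`, which can answer both "may I fetch this path?" and "what crawl delay is requested?". The sketch below parses an inline robots.txt body so it runs offline; the `ROBOTS_TXT` content, the `ExampleCrawler` user agent, and the `build_rules` helper are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it
# from https://<host>/robots.txt and cache it per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "ExampleCrawler"  # assumed crawler name


def build_rules(robots_txt):
    """Parse a robots.txt body into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://example.com/blog/post-1")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = rules.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

In a distributed crawler the parsed rules would typically be cached per host and refreshed periodically, so that workers do not re-download robots.txt before every request.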

    1.2 Non-Functional Requirements

    1. Scalability: The system should scale horizontally to crawl billions of pages across a large number of domains.

    2. Politeness: The crawler should avoid overwhelming target servers by limiting the rate of requests to each domain.

    3. Extensibility: The architecture should allow for easy integration of new modules, such as custom parsers, content filters, storage backends, or processing pipelines.

    4. Robustness & Fault Tolerance: The crawler should gracefully handle failures, whether a bad URL, a timeout, or a crashing worker node, without disrupting the overall system.

    5. Performance: The crawler should maintain high throughput (pages per second), while also minimizing fetch latency.
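The politeness requirement is often implemented as a per-host minimum delay between requests. The following is a rough sketch under that assumption; `DomainRateLimiter` and its methods are hypothetical names, and the current time is passed in explicitly (a caller would normally supply `time.monotonic()`) so the behavior is easy to test.

```python
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed request time

    @staticmethod
    def _host(url):
        return urlsplit(url).netloc.lower()

    def wait_time(self, url, now):
        """Seconds the caller should still wait before fetching url."""
        return max(0.0, self.next_allowed.get(self._host(url), 0.0) - now)

    def record_fetch(self, url, now):
        """Mark that a request to url's host was sent at time now."""
        self.next_allowed[self._host(url)] = now + self.min_delay


limiter = DomainRateLimiter(min_delay_seconds=2.0)
limiter.record_fetch("https://example.com/a", now=10.0)
print(limiter.wait_time("https://example.com/b", now=10.5))  # same host
print(limiter.wait_time("https://other.org/", now=10.5))     # different host
```

Keeping the limiter keyed by host means a slow or strict site only delays its own URLs; requests to other hosts proceed immediately, which helps preserve overall throughput.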

    Note: In a real system design interview, you may only be expected to address a subset of these requirements. Focus on what’s relevant to the problem you’re asked to solve, and clarify assumptions early in the discussion.


    2. Scale Estimation

    2.1 Number of Pages to Crawl

    Assume we aim to crawl a meaningful slice of the web rather than the entire internet. This includes pages across blogs, news sites, e-commerce platforms, documentation pages, and forums.

    Target: 1 billion pages

    2.2 Data Volume

    • HTML content (average per page): ~100 KB

    • Additional Metadata (headers, timestamps, etc.): ~10 KB

    • Total per page: ~110 KB

    Total Data Volume = 1 billion pages × 110 KB = ~110 TB

    This estimate covers only the raw HTML and metadata. If we store additional data like structured metadata, embedded files, or full-text search indexes, the storage requirements could grow substantially.

    2.3 Bandwidth

    Let’s assume we want to complete the crawl in 10 days.

    • Pages per day = 1 billion / 10 = 100 million pages/day

    • Pages per second = 100 million / 86,400 ≈ 1,157, rounded to ~1150 pages/sec

    Bandwidth requirements = 110 KB/page × 1150 pages/sec = ~126 MB/sec

    This means our system must be capable of:

    • Making over 1150 HTTP requests per second

    • Parsing and storing content at the same rate
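The storage and bandwidth figures above can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 1,000 KB, 1 TB = 1,000,000,000 KB); the constant names are illustrative. Without intermediate rounding, the rate comes out near 1,157 pages/sec and 127 MB/sec, which is consistent with the rounded values used in the text.

```python
PAGES = 1_000_000_000      # target number of pages
KB_PER_PAGE = 110          # ~100 KB HTML + ~10 KB metadata
CRAWL_DAYS = 10
SECONDS_PER_DAY = 86_400

# Total storage in terabytes (decimal: 1 TB = 1e9 KB).
total_tb = PAGES * KB_PER_PAGE / 1e9

# Sustained fetch rate needed to finish within the crawl window.
pages_per_sec = PAGES / (CRAWL_DAYS * SECONDS_PER_DAY)

# Download bandwidth in MB/sec and, for network sizing, Gbit/sec.
mb_per_sec = pages_per_sec * KB_PER_PAGE / 1e3
gbit_per_sec = mb_per_sec * 8 / 1e3

print(f"storage:   {total_tb:.0f} TB")
print(f"rate:      {pages_per_sec:.0f} pages/sec")
print(f"bandwidth: {mb_per_sec:.0f} MB/sec ({gbit_per_sec:.2f} Gbit/sec)")
```

Expressed as a network rate, this is roughly 1 Gbit/sec of sustained download traffic, before any protocol overhead or retries.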

    2.4 URL Frontier Size

    Every page typically contains several outbound links, many of which are unique. This causes the URL frontier (queue of URLs to visit) to grow rapidly.

    Let’s assume:

    • Average outbound links per page: 5

    • New links discovered per second = 1150 (pages per second) × 5 = 5750

    The URL frontier needs to handle thousands of new URL submissions per second. We’ll need efficient URL deduplication, prioritization, and persistence to handle this at scale.
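To make the deduplication and prioritization ideas concrete, here is a small in-memory sketch of a frontier. It is an illustration under simplifying assumptions, not the article's design: `URLFrontier` and `normalize` are hypothetical names, the normalization rules are deliberately minimal, and persistence is left out entirely.

```python
import heapq
import itertools
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default empty path to '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class URLFrontier:
    """Priority queue of URLs that silently rejects already-seen URLs."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def add(self, url, priority=10):
        """Queue url (lower priority value = crawled sooner). Returns False if seen."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        heapq.heappush(self._heap, (priority, next(self._counter), key))
        return True

    def pop(self):
        """Return the highest-priority URL still waiting to be crawled."""
        _priority, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = URLFrontier()
frontier.add("https://Example.com/news#top", priority=5)
frontier.add("https://example.com/news")          # duplicate after normalization
frontier.add("https://example.com/", priority=1)  # crawled first
print(len(frontier), frontier.pop())
```

At the scale estimated above, a single in-memory set and heap would not be enough; a common approach is to replace the seen-set with a space-efficient structure such as a Bloom filter or a persistent key-value store, and to partition the queue across nodes.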


    3. High-Level Architecture

    This post is for paid subscribers
