System Design Case Study 6 - Design A Web Crawler

  1. A crawler is used for many purposes:
    • Search engine indexing: collect web pages to build a search engine's index.
    • Web archiving: collect information from the web to preserve data for future use.
    • Web mining: discover useful knowledge from the internet.
    • Web monitoring: monitor copyright and trademark infringements over the internet.

1. Understand the requirements/problem and establish design scope

  1. What is the crawler used for (what content is it crawling from the internet)? -> search engine indexing
  2. How many pages are crawled? -> 1 billion per month
  3. What content types are included? -> HTML only
  4. Single server or distributed?
  5. How long will the service run (how long is the data kept)? -> 5 years
  6. How do we handle pages with duplicate content? -> ignore them
  7. What is the data volume being scraped? This determines the required storage size.
  8. Public service or private/internal use? Can we tolerate reduced availability?
  9. How do we handle anti-crawler measures on target sites?

 10. The basic function should be simple (a minimal sketch of this loop follows below):

    1. Given a set of URLs, download all the web pages addressed by those URLs.

    2. Extract URLs from these web pages.

    3. Add new URLs to the list of URLs to be downloaded; repeat steps 1-3.
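
A minimal sketch of this loop in Python, assuming hypothetical `download(url)` and `extract_urls(html)` helpers supplied by the downloader and parser components:

```python
from collections import deque

def crawl(seed_urls, download, extract_urls, max_pages=1000):
    """BFS-style crawl loop: download, extract, enqueue new URLs, repeat."""
    to_download = deque(seed_urls)   # URLs waiting to be downloaded
    seen = set(seed_urls)            # "URL seen" check so each URL is enqueued once
    pages = {}

    while to_download and len(pages) < max_pages:
        url = to_download.popleft()
        html = download(url)                  # step 1: download the page
        if html is None:
            continue
        pages[url] = html
        for link in extract_urls(html):       # step 2: extract URLs
            if link not in seen:              # step 3: enqueue new URLs
                seen.add(link)
                to_download.append(link)
    return pages
```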

11. Besides functionality, these characteristics also need to be considered:

Scalability: the web is enormous, so crawling should be parallelized.

Robustness: handle edge cases such as bad HTML, unresponsive servers, crashes, malicious links, etc.

Politeness: the crawler should not make too many requests to the same site within a short period.

Extensibility: if we want to support new content types, we should not need to redesign the whole system; changes should be minimal.

Back of the envelope estimation
  1. 1 billion pages are downloaded every month.
    1. So the QPS would be 1 billion / (30 days * 24 h * 3600 s) ~= 386 queries/sec.
  2. Assume each page is 500 KB on average.
    1. The monthly storage would be 1 billion * 500 KB = 5*10^11 KB = 5*10^8 MB = 5*10^5 GB = 500 TB.
  3. The data is stored for 5 years, so it would require 500 TB/month * 12 months * 5 years = 30 PB (see the quick calculation below).
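
The same arithmetic as a quick sanity check in Python (numbers taken from the estimates above):

```python
PAGES_PER_MONTH = 1_000_000_000          # 1 billion pages downloaded per month
SECONDS_PER_MONTH = 30 * 24 * 3600       # ~30-day month

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH
print(f"QPS ~= {qps:.0f}")               # ~386 queries/sec

PAGE_SIZE_KB = 500
monthly_storage_tb = PAGES_PER_MONTH * PAGE_SIZE_KB / 1e9   # KB -> TB (decimal units)
print(f"Monthly storage ~= {monthly_storage_tb:.0f} TB")    # 500 TB

five_year_storage_pb = monthly_storage_tb * 12 * 5 / 1000   # TB -> PB
print(f"5-year storage ~= {five_year_storage_pb:.0f} PB")   # 30 PB
```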

2. Propose high-level design and get buy-in

  1. We have an initial list of seed URLs to crawl.
  2. The server sends requests to those URLs' sites.
  3. The server validates that the response is valid HTML content, parses it, and adds newly extracted URLs to the list to be requested; a saver stores the page contents in a database.
    1. The URL list can be kept in an in-memory store such as Redis as a list used as a queue (see the sketch below).
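
A minimal sketch of using a Redis list as the URL queue, assuming the redis-py client and a local Redis instance; the key name `url_queue` is arbitrary:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def enqueue_url(url: str) -> None:
    # Push new URLs onto the left of the list.
    r.lpush("url_queue", url)

def dequeue_url(timeout: int = 5):
    # Pop from the right so the list behaves as a FIFO queue;
    # BRPOP blocks until a URL is available or the timeout expires.
    item = r.brpop("url_queue", timeout=timeout)
    return item[1] if item else None

enqueue_url("https://example.com")
print(dequeue_url())   # -> https://example.com
```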
[Images: high-level design diagrams]
  1. Seed URLs
    1. How to choose them? For a single site (e.g., a school website) -> start from its index page.
    2. For the entire web -> partition seeds by country or by topic.
  2. URL Frontier: the component that stores the URLs to be downloaded.
  3. HTML Downloader
  4. DNS Resolver: needed to translate a URL's hostname into an IP address before downloading.
  5. Parser
  6. Content Seen?: to eliminate duplicate content.
  7. Content Storage
  8. URL Extractor (could be combined with the parser).
  9. URL Filter: filters out certain content types, error links, etc.
  10. URL Seen?
  11. URL Storage

3. Design deep dive

We will discuss the most important building components and techniques in depth:

  • Depth-first search (DFS) vs Breadth-first search (BFS)
  • URL frontier
  • HTML Downloader
  • Robustness
  • Extensibility
  • Detect and avoid problematic content

DFS vs BFS

  1. DFS is usually not a good choice because the link graph can be very deep.

  2. BFS is commonly used and is implemented with a FIFO queue.

  3. One problem is that most of the URLs extracted from a page link to the same host, which causes "impoliteness" (requesting the same host many times within a short period).

    Figure 5
  4. Another problem is that standard BFS does not take priority into account when traversing (some pages are more important than others).

URL frontier

The URL frontier (a data structure that stores the URLs to be downloaded) helps address the politeness problem, URL prioritization, and freshness.

Politeness
  1. By adding a delay between two download tasks.
  2. By implementing a mapping from hostname to a download thread. Each thread has a separate FIFO queue and only downloads URLs from that queue (a minimal sketch follows this list).
Figure 6
  • Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host.
  • Mapping table: It maps each host to a queue.
Host           Queue
wikipedia.com  b1
apple.com      b2
...            ...
nike.com       bn
  • FIFO queues b1, b2 to bn: Each queue contains URLs from the same host.
  • Queue selector: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue. The queue selection logic is done by the Queue selector.
  • Worker thread 1 to N. A worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks.
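
A minimal single-process sketch of the queue router + queue selector idea, with one FIFO queue per host and a configurable per-host delay (the 1-second default is arbitrary):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteScheduler:
    """Route each URL to a per-host FIFO queue and space out requests per host."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)   # hostname -> FIFO queue (b1, b2, ... bn)
        self.last_fetch = {}               # hostname -> time of the last download

    def add_url(self, url: str) -> None:
        host = urlparse(url).netloc        # queue router: same host -> same queue
        self.queues[host].append(url)

    def next_url(self):
        # Queue selector: pick a host whose politeness delay has already elapsed.
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now - self.last_fetch.get(host, 0.0) >= self.delay:
                self.last_fetch[host] = now
                return queue.popleft()
        return None                        # nothing is ready yet; caller can sleep
```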
Priority
  1. Web pages can be prioritized by traffic, update frequency, etc. (a minimal prioritizer sketch follows below).
Figure 8
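
A minimal sketch of a prioritizer built on Python's heapq; the priority score (e.g., derived from traffic or update frequency) is assumed to be computed elsewhere:

```python
import heapq
import itertools

class Prioritizer:
    """Min-heap keyed on negated priority so higher-priority URLs come out first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order stable

    def add(self, url: str, priority: float) -> None:
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

p = Prioritizer()
p.add("https://apple.com", priority=0.9)
p.add("https://example.com/blog", priority=0.2)
print(p.pop())   # -> https://apple.com (higher priority first)
```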
Freshness
  1. Web pages are constantly being added, updated, and deleted. Re-crawl periodically to keep our data fresh.
  2. Re-crawling can be guided by a strategy:
    1. Re-crawl based on pages' update frequency.
    2. Re-crawl important pages first and more often.
Storage for the URL frontier
  1. In the real world the frontier can hold many millions of URLs, so keeping everything in memory is not feasible.
  2. The majority of URLs are stored on disk; to reduce I/O cost, maintain a buffer in memory and periodically write it to disk (see the sketch below).
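
A minimal sketch of an in-memory buffer that is periodically appended to a file on disk; the file path and flush threshold are arbitrary choices:

```python
class DiskBackedFrontierBuffer:
    """Keep recent URLs in memory and append them to disk once the buffer fills up."""

    def __init__(self, path: str = "frontier.log", flush_threshold: int = 10_000):
        self.path = path
        self.flush_threshold = flush_threshold
        self.buffer = []

    def add(self, url: str) -> None:
        self.buffer.append(url)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Append buffered URLs to disk in one write to amortize I/O cost.
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.write("\n".join(self.buffer) + "\n")
        self.buffer.clear()
```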

HTML downloader

  1. Robots.txt, the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies which pages crawlers are allowed to download. Before attempting to crawl a site, a crawler should check that site's robots.txt first and follow its rules.
  2. Cache each host's robots.txt after the first fetch to avoid downloading it repeatedly (see the sketch below).
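
A minimal sketch using Python's standard urllib.robotparser to check robots.txt (cached per host) before downloading with a short timeout; the user-agent string is illustrative:

```python
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "MyCrawler/1.0"   # illustrative user-agent string
_robots_cache = {}             # host -> parsed robots.txt, cached to avoid repeat downloads

def allowed_by_robots(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.netloc not in _robots_cache:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        parser.read()          # downloads and parses robots.txt once per host
        _robots_cache[parsed.netloc] = parser
    return _robots_cache[parsed.netloc].can_fetch(USER_AGENT, url)

def download(url: str, timeout: float = 3.0):
    # A short timeout prevents slow or unresponsive servers from blocking a worker.
    if not allowed_by_robots(url):
        return None
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```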
Performance optimization
  1. Distributed crawl

    Figure 9
  2. Cache DNS resolver

    • DNS resolution may become a bottleneck due to the synchronous nature of many DNS interfaces.
    • We keep a hostname-to-IP mapping cache and update it periodically with a cron job (a minimal caching sketch follows after this list).
  3. Locality

    Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time.

  4. Short timeout

    Some web servers respond slowly or may not respond at all. To avoid long wait times, a maximum wait time is specified.
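
A minimal sketch of the DNS mapping cache, using socket.gethostbyname for resolution and a simple TTL-based expiry in place of a real cron job:

```python
import socket
import time

class DNSCache:
    """Cache hostname -> IP mappings and refresh entries after a TTL."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._cache = {}   # hostname -> (ip, resolved_at)

    def resolve(self, hostname: str) -> str:
        entry = self._cache.get(hostname)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]                      # fresh cached answer
        ip = socket.gethostbyname(hostname)      # synchronous DNS lookup
        self._cache[hostname] = (ip, time.monotonic())
        return ip

dns = DNSCache()
print(dns.resolve("example.com"))   # cached for subsequent calls within the TTL
```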

Robustness

  1. Use consistent hashing to distribute the load across different downloaders; servers can be added or removed with minimal re-mapping (see the sketch below).
  2. Save crawl states and data: to guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading the saved states and data.
  3. Exception handling: handle errors gracefully so that one failure does not crash the whole system.
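
A minimal consistent-hashing sketch for assigning hosts to downloader servers; MD5 placement and 100 virtual nodes per server are arbitrary choices:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map each key (e.g., a hostname) to a downloader; adding or removing a
    downloader only remaps the keys that fall in its ring segments."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []                     # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["downloader-1", "downloader-2", "downloader-3"])
print(ring.get_node("wikipedia.com"))   # the same host always maps to the same downloader
```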

Extensibility

[Image: extensibility design diagram]

Detect and avoid problematic content

  1. Redundant content: use hashes or checksums to detect and skip duplicate pages (see the sketch below).
  2. Spider traps: a spider trap is a web page that causes a crawler to get stuck in an infinite loop.
    1. Define a maximal URL length -> not a great solution.
    2. Manually identify unusually long URLs and exclude them.
  3. Data noise: content with little or no value (e.g., advertisements, spam URLs) should be excluded.
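
A minimal sketch of the "Content Seen?" check using SHA-256 checksums of the raw page bytes (detects exact duplicates only):

```python
import hashlib

_seen_checksums = set()

def is_duplicate(content: bytes) -> bool:
    """Return True if an identical page body has already been crawled."""
    checksum = hashlib.sha256(content).hexdigest()
    if checksum in _seen_checksums:
        return True
    _seen_checksums.add(checksum)
    return False

print(is_duplicate(b"<html>same page</html>"))   # False (first time seen)
print(is_duplicate(b"<html>same page</html>"))   # True  (exact duplicate)
```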

4. Wrap up

  1. Server-side rendering: numerous websites use scripts such as JavaScript, AJAX, etc. to generate links on the fly. If we download and parse web pages directly, we will not be able to retrieve dynamically generated links. To solve this problem, we perform server-side rendering (also called dynamic rendering) before parsing a page.
  2. Filter out unwanted pages
  3. Database replication and sharding
  4. ...