Fetch

The fetch stage downloads pages from the selected URLs. It runs requests concurrently while staying polite – rotating user agents, respecting rate limits, and adapting timeouts based on what it observes.

Concurrent HTTP with stealth headers

Forager sends up to concurrency requests in parallel using async HTTP, where concurrency is a config field. Each request carries browser-like headers to avoid being blocked by bot detection:

  • User-Agent: rotated across 6 realistic browser UA strings (Chrome, Firefox, Safari on various OSes).
  • Accept, Accept-Language, Accept-Encoding: set to match what a real browser would send.
  • Connection: keep-alive: connections are pooled per host for efficiency.

This is not about deception: it is about not getting blocked by overzealous WAFs (web application firewalls) that reject anything that does not look like a browser.
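
The header set above can be sketched as a small helper. This is a hedged illustration: the UA strings and exact header values here are placeholders, not forager's actual pool.

```python
import random

# Hypothetical UA pool; forager rotates across 6 strings, the real list
# lives in its source. Three are shown here for brevity.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def stealth_headers() -> dict:
    """Build browser-like headers with a randomly rotated User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
```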

Timeouts and jitter

Two timeouts control how long forager waits:

  • timeout_ms: total request timeout (default: 8000ms, adaptive). Covers DNS, connection, TLS, and response body.
  • connect_timeout_ms: just the TCP + TLS handshake (default: 3000ms, adaptive).

Both are adaptive parameters. The crawler adjusts them based on observed response times – specifically, it targets the P95 latency plus a safety margin. If a domain is consistently fast, timeouts tighten. If responses are slow, they relax.

Jitter adds a small random delay (up to jitter_max_ms) between requests to the same domain. This starts low and increases automatically if the crawler detects rate-limiting responses (429s or Cloudflare challenges). The goal is to look less like a bot and more like a human clicking through pages.
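
A minimal sketch of that escalation, assuming a doubling policy and a ceiling (both are illustrative choices, not forager's documented behaviour):

```python
import random

class DomainJitter:
    """Per-domain jitter window that grows when rate-limiting is observed."""

    def __init__(self, base_max_ms=250.0, ceiling_ms=5000.0):
        self.max_ms = base_max_ms        # current jitter_max_ms for this domain
        self.ceiling_ms = ceiling_ms

    def delay_ms(self) -> float:
        # Uniform random delay up to the current window, applied between
        # requests to the same domain.
        return random.uniform(0.0, self.max_ms)

    def record_rate_limited(self):
        # A 429 or Cloudflare challenge widens the window (here: doubling).
        self.max_ms = min(self.ceiling_ms, self.max_ms * 2.0)
```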

Cloudflare detection

When a response comes back with a Cloudflare challenge page (identified by specific status codes and response patterns), forager logs it and adjusts the domain’s profile. Repeated challenges increase jitter for that domain and may eventually mark it as unreliable.
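
A heuristic check along these lines is sketched below. The specific status codes, header values, and body markers are assumptions; the chapter only says forager identifies challenges by status codes and response patterns.

```python
def looks_like_cloudflare_challenge(status: int, headers: dict, body: str) -> bool:
    """Heuristic: status codes Cloudflare commonly uses for challenges,
    combined with a Cloudflare Server header or challenge-page markers."""
    if status not in (403, 429, 503):
        return False
    server = headers.get("server", "").lower()
    markers = ("cf-chl", "just a moment", "challenge-platform")
    return server.startswith("cloudflare") or any(m in body.lower() for m in markers)
```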

Domain weight adjustments

Each domain tracks a running profile:

  • Median response time: used to detect slow domains.
  • Failure rate: fraction of requests that returned errors, timeouts, or challenges.
  • Pages fetched: compared against domain_max_pages to enforce caps.

These profiles persist across rounds. A domain that was fast and reliable in round 1 keeps its good reputation in round 5, unless things change.
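
The three tracked quantities could be modelled roughly like this (field and method names are illustrative, not forager's actual API):

```python
from dataclasses import dataclass, field
import statistics

@dataclass
class DomainProfile:
    """Running per-domain statistics, persisted across crawl rounds."""
    response_times_ms: list = field(default_factory=list)
    requests: int = 0
    failures: int = 0       # errors, timeouts, and challenges
    pages_fetched: int = 0

    def record(self, elapsed_ms: float, ok: bool):
        self.requests += 1
        self.response_times_ms.append(elapsed_ms)
        if ok:
            self.pages_fetched += 1
        else:
            self.failures += 1

    @property
    def median_ms(self) -> float:
        return statistics.median(self.response_times_ms) if self.response_times_ms else 0.0

    @property
    def failure_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0

    def at_cap(self, domain_max_pages: int) -> bool:
        return self.pages_fetched >= domain_max_pages
```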

Users can also set explicit domain weights in config (fetch.domain_weights) to boost or suppress specific domains regardless of observed behaviour.
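
As a hedged illustration, explicit weights might look like this in the config. The fetch.domain_weights key comes from this chapter; the domain names and the exact value semantics (a multiplier, with values above 1.0 boosting and below 1.0 suppressing) are assumptions.

```toml
[fetch.domain_weights]
# Hypothetical entries: boost one domain, suppress another.
"docs.example.org" = 2.0
"ads.example.net" = 0.1
```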

What happens on failure

Failed fetches are not just discarded – they become zero-reward training examples. The DQN learns that URLs with similar features to the failed ones are less worth fetching. This is one of the ways the crawler improves over time: it stops wasting requests on domains and URL patterns that do not work.
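
The failure-to-training-example path can be sketched as a replay-buffer append. The feature vector, reward scale, and buffer shape are hypothetical; only the idea that failures enter the DQN's training data with zero reward comes from this chapter.

```python
from collections import deque

class ReplayBuffer:
    """Minimal sketch: fetch outcomes become (features, reward) pairs."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def record_fetch(self, url_features, success: bool, page_value: float = 0.0):
        # A failed fetch contributes reward 0.0, teaching the DQN that
        # similar URL patterns are less worth fetching; a success
        # contributes whatever value the fetched page earned.
        reward = page_value if success else 0.0
        self.buffer.append((url_features, reward))
```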

Configuration

See the [fetch] config reference for all tuneable fields including seed URLs, concurrency, delay, filters, and adaptive timeout parameters.