[fetch] – Fetch Configuration

The [fetch] section controls how the crawler retrieves pages: seed URLs, concurrency, rate limiting, and URL filters. See Fetch for how these parameters work in practice.

Fields

[fetch]
seed_urls = [
    "https://www.mastersportal.com/search/master/philosophy/europe",
    "https://www.findamasters.com/masters-degrees/philosophy/",
]
pages_per_round  = 50
concurrency      = 32
delay_ms         = { min = 200, max = 500 }
languages        = ["en", "de"]
timeout_ms       = { value = 8000, mode = "auto" }
connect_timeout_ms = { value = 3000, mode = "auto" }
max_redirects    = { value = 3, mode = "fixed" }
pool_idle_per_host = { value = 8, mode = "fixed" }
jitter_max_ms    = { value = 50, mode = "auto" }
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| seed_urls | [str] | — | Starting URLs for the crawl (required) |
| pages_per_round | usize | 50 | Pages fetched per crawl round |
| concurrency | usize | 32 | Max simultaneous HTTP requests |
| delay_ms.min | u64 | 200 | Minimum delay between requests (ms) |
| delay_ms.max | u64 | 500 | Maximum delay between requests (ms) |
| languages | [str] | [] | Accepted languages (empty = all) |
| timeout_ms | Param\<u64\> | auto(8000) | Total request timeout (adaptive to P95 latency) |
| connect_timeout_ms | Param\<u64\> | auto(3000) | TCP + TLS handshake timeout |
| max_redirects | Param\<usize\> | fixed(3) | Maximum HTTP redirects to follow |
| pool_idle_per_host | Param\<usize\> | fixed(8) | Idle connections to keep per host |
| jitter_max_ms | Param\<u64\> | auto(50) | Max random jitter added between requests |

Adaptive timeouts

timeout_ms and connect_timeout_ms are adaptive by default. The crawler tracks response times per domain and adjusts toward the P95 latency plus a safety margin. Fast domains get tight timeouts; slow domains get more room.
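The adjustment can be sketched roughly as follows. This is an illustrative model, not the crawler's actual code; the function name, the 50% safety margin, and the 500 ms floor are assumptions:

```rust
// Sketch of the adaptive-timeout idea: derive a per-domain timeout from
// the observed P95 response time plus a safety margin, clamped between a
// floor and the configured base timeout. Names and constants are
// illustrative, not the crawler's real API.
fn adaptive_timeout_ms(mut samples: Vec<u64>, base_ms: u64) -> u64 {
    if samples.is_empty() {
        return base_ms; // no history yet: fall back to the configured value
    }
    samples.sort_unstable();
    // index of the 95th percentile (nearest-rank method)
    let idx = ((samples.len() as f64) * 0.95).ceil() as usize - 1;
    let p95 = samples[idx];
    // P95 plus a 50% margin, never below 500 ms and never above the base
    (p95 + p95 / 2).clamp(500, base_ms)
}

fn main() {
    // a fast domain (responses around 100-200 ms) gets a tight timeout
    let fast: Vec<u64> = (0..100).map(|i| 100 + i).collect();
    println!("{}", adaptive_timeout_ms(fast, 8000));
    // no samples yet: the configured default applies
    println!("{}", adaptive_timeout_ms(vec![], 8000));
}
```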

Jitter

jitter_max_ms starts low and increases if the crawler encounters rate-limiting responses (HTTP 429 or Cloudflare challenges). This helps avoid triggering anti-bot protections.
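One way to picture this behaviour, as a hedged sketch (the struct, the doubling factor, and the decay rate are assumptions, not the crawler's internals):

```rust
// Illustrative jitter escalation: the jitter window doubles on each
// rate-limiting response, up to a hard cap, and decays slowly back toward
// the configured starting value on successes.
struct Jitter {
    max_ms: u64, // current jitter window
    cap_ms: u64, // hard upper bound
}

impl Jitter {
    // HTTP 429 or a bot challenge: back off by doubling the window
    fn on_rate_limited(&mut self) {
        self.max_ms = (self.max_ms * 2).min(self.cap_ms);
    }
    // successful response: decay 10% toward the configured floor
    fn on_success(&mut self, floor_ms: u64) {
        self.max_ms = (self.max_ms * 9 / 10).max(floor_ms);
    }
}

fn main() {
    let mut j = Jitter { max_ms: 50, cap_ms: 5000 };
    j.on_rate_limited(); // two 429s in a row widen the window
    j.on_rate_limited();
    println!("{}", j.max_ms);
    j.on_success(50); // a success shrinks it again
    println!("{}", j.max_ms);
}
```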

[[fetch.domain_weights]]

Boost or suppress specific domains regardless of their observed behaviour.

[[fetch.domain_weights]]
domain = "mastersportal.com"
weight = 2.0

[[fetch.domain_weights]]
domain = "reddit.com"
weight = 0.1
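A weight can be thought of as a multiplier on the scheduler's behaviour-derived priority; a hypothetical sketch (the function and the neutral default of 1.0 are assumptions about the semantics, not the crawler's actual code):

```rust
// Hypothetical: the configured weight multiplies whatever priority the
// scheduler derived from the domain's observed behaviour. Domains without
// an entry keep a neutral weight of 1.0.
use std::collections::HashMap;

fn effective_priority(weights: &HashMap<&str, f64>, domain: &str, observed: f64) -> f64 {
    // weight > 1.0 boosts the domain, < 1.0 suppresses it
    observed * weights.get(domain).copied().unwrap_or(1.0)
}

fn main() {
    let weights = HashMap::from([("mastersportal.com", 2.0), ("reddit.com", 0.1)]);
    println!("{}", effective_priority(&weights, "mastersportal.com", 1.0)); // boosted
    println!("{}", effective_priority(&weights, "reddit.com", 1.0));        // suppressed
    println!("{}", effective_priority(&weights, "example.org", 1.0));       // neutral
}
```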

[[fetch.filters]]

URL filters are applied before fetching. Each filter is a table whose type field selects the variant; five types are available:

substring

Reject URLs containing any of the listed substrings.

[[fetch.filters]]
type = "substring"
patterns = [".pdf", ".jpg", "login", "action=edit"]

domain

Block entire domains.

[[fetch.filters]]
type = "domain"
block = ["facebook.com", "twitter.com", "youtube.com"]

regex

Reject URLs matching a regex pattern.

[[fetch.filters]]
type = "regex"
pattern = "\\d{4}/\\d{2}/\\d{2}"

allow

Only allow URLs matching a regex pattern. All non-matching URLs are rejected.

[[fetch.filters]]
type = "allow"
pattern = "\\.edu|university|master"

max_depth

Reject URLs beyond a given link depth from seeds.

[[fetch.filters]]
type = "max_depth"
max = 8
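The five variants above compose as a chain: a URL is fetched only if every configured filter accepts it. A sketch of that dispatch follows; the enum and function names are illustrative, and for brevity the regex variants here match plain substrings, whereas the real filters compile genuine regular expressions:

```rust
// Illustrative filter chain (not the crawler's real internals). A URL is
// fetched only if every filter accepts it. Regex matching is simplified
// to substring containment for this sketch.
enum Filter {
    Substring { patterns: Vec<String> }, // reject if any substring occurs
    Domain { block: Vec<String> },       // reject blocked domains
    Regex { pattern: String },           // reject URLs matching the pattern
    Allow { pattern: String },           // reject URLs NOT matching it
    MaxDepth { max: usize },             // reject URLs beyond this depth
}

fn accepts(filter: &Filter, url: &str, depth: usize) -> bool {
    match filter {
        Filter::Substring { patterns } => !patterns.iter().any(|p| url.contains(p.as_str())),
        Filter::Domain { block } => !block.iter().any(|d| url.contains(d.as_str())),
        Filter::Regex { pattern } => !url.contains(pattern.as_str()),
        Filter::Allow { pattern } => url.contains(pattern.as_str()),
        Filter::MaxDepth { max } => depth <= *max,
    }
}

fn should_fetch(filters: &[Filter], url: &str, depth: usize) -> bool {
    filters.iter().all(|f| accepts(f, url, depth))
}

fn main() {
    let filters = vec![
        Filter::Substring { patterns: vec![".pdf".into(), "login".into()] },
        Filter::Domain { block: vec!["facebook.com".into()] },
        Filter::MaxDepth { max: 8 },
    ];
    println!("{}", should_fetch(&filters, "https://example.edu/masters", 2));  // passes all
    println!("{}", should_fetch(&filters, "https://example.edu/file.pdf", 2)); // rejected
}
```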