# [fetch] – Fetch Configuration
The [fetch] section controls how the crawler retrieves pages: seed URLs, concurrency, rate limiting, and URL filters. See Fetch for how these parameters work in practice.
## Fields

```toml
[fetch]
seed_urls = [
  "https://www.mastersportal.com/search/master/philosophy/europe",
  "https://www.findamasters.com/masters-degrees/philosophy/",
]
pages_per_round = 50
concurrency = 32
delay_ms = { min = 200, max = 500 }
languages = ["en", "de"]
timeout_ms = { value = 8000, mode = "auto" }
connect_timeout_ms = { value = 3000, mode = "auto" }
max_redirects = { value = 3, mode = "fixed" }
pool_idle_per_host = { value = 8, mode = "fixed" }
jitter_max_ms = { value = 50, mode = "auto" }
```
| Field | Type | Default | Description |
|---|---|---|---|
| `seed_urls` | `[str]` | – | Starting URLs for the crawl (required) |
| `pages_per_round` | `usize` | `50` | Pages fetched per crawl round |
| `concurrency` | `usize` | `32` | Maximum simultaneous HTTP requests |
| `delay_ms.min` | `u64` | `200` | Minimum delay between requests (ms) |
| `delay_ms.max` | `u64` | `500` | Maximum delay between requests (ms) |
| `languages` | `[str]` | `[]` | Accepted languages (empty = all) |
| `timeout_ms` | `Param<u64>` | `auto(8000)` | Total request timeout (adapts to P95 latency) |
| `connect_timeout_ms` | `Param<u64>` | `auto(3000)` | TCP + TLS handshake timeout |
| `max_redirects` | `Param<usize>` | `fixed(3)` | Maximum HTTP redirects to follow |
| `pool_idle_per_host` | `Param<usize>` | `fixed(8)` | Idle connections kept per host |
| `jitter_max_ms` | `Param<u64>` | `auto(50)` | Maximum random jitter added between requests (ms) |
## Adaptive timeouts

`timeout_ms` and `connect_timeout_ms` are adaptive by default. The crawler tracks response times per domain and adjusts each timeout toward that domain's P95 latency plus a safety margin. Fast domains get tight timeouts; slow domains get more room.
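The rule can be sketched as follows. This is an illustration only: the sample-window size, margin factor, and floor below are assumptions, not the crawler's actual constants.

```python
from statistics import quantiles

def adaptive_timeout(samples_ms, base_ms=8000, margin=1.5, floor_ms=1000):
    """Sketch of a P95-based per-domain timeout (constants are hypothetical)."""
    if len(samples_ms) < 20:
        # Too few observations for this domain: keep the configured base value.
        return base_ms
    # 95th percentile of recent response times for this domain.
    p95 = quantiles(samples_ms, n=20)[-1]
    # P95 plus a safety margin, clamped to a sane floor.
    return max(floor_ms, int(p95 * margin))
```

A consistently fast domain (say, ~10 ms responses) would end up pinned at the floor, while a slow one (~4 s responses) would get roughly 6 s before the crawler gives up.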
## Jitter

`jitter_max_ms` starts low and increases if the crawler encounters rate-limiting responses (HTTP 429 or Cloudflare challenges). This helps avoid triggering anti-bot protections.
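One plausible shape for that ramp-up, sketched below; the doubling-on-429, slow decay, and 2 s cap are assumptions for illustration, not documented behaviour.

```python
import random

class JitterController:
    """Grow the jitter ceiling on rate-limiting, decay it on healthy responses."""

    def __init__(self, start_ms=50, cap_ms=2000):
        self.max_ms = start_ms
        self.cap_ms = cap_ms

    def record_response(self, status: int) -> None:
        if status == 429:
            # Rate limited: double the ceiling, up to the cap.
            self.max_ms = min(self.cap_ms, self.max_ms * 2)
        else:
            # Healthy response: decay slowly back toward the starting value.
            self.max_ms = max(50, int(self.max_ms * 0.95))

    def next_jitter(self) -> float:
        # Extra delay added on top of the delay_ms range.
        return random.uniform(0, self.max_ms)
```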
## [[fetch.domain_weights]]

Boost or suppress specific domains regardless of their observed behaviour.

```toml
[[fetch.domain_weights]]
domain = "mastersportal.com"
weight = 2.0

[[fetch.domain_weights]]
domain = "reddit.com"
weight = 0.1
```
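How a weight enters scheduling is not specified here. One plausible reading (an assumption, including the subdomain-matching rule) is that a URL's scheduling priority is multiplied by its domain's weight, defaulting to 1.0:

```python
from urllib.parse import urlparse

def domain_weight(url: str, weights: dict) -> float:
    """Hypothetical lookup: weight for a URL's host, covering subdomains."""
    host = urlparse(url).hostname or ""
    for domain, w in weights.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return w
    return 1.0  # unlisted domains are left untouched
```

Under this reading, `www.mastersportal.com` pages would be scheduled twice as eagerly, and anything under `reddit.com` would be heavily suppressed.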
## [[fetch.filters]]

URL filters are applied before fetching. Each filter has a `type` field that selects the variant. Five types are available:
### substring

Reject URLs containing any of the listed substrings.

```toml
[[fetch.filters]]
type = "substring"
patterns = [".pdf", ".jpg", "login", "action=edit"]
```
### domain

Block entire domains.

```toml
[[fetch.filters]]
type = "domain"
block = ["facebook.com", "twitter.com", "youtube.com"]
```
### regex

Reject URLs matching a regex pattern. The example below drops date-stamped URLs such as blog-archive paths containing `/2024/05/17/`.

```toml
[[fetch.filters]]
type = "regex"
pattern = "\\d{4}/\\d{2}/\\d{2}"
```
### allow

Only allow URLs matching a regex pattern. All non-matching URLs are rejected.

```toml
[[fetch.filters]]
type = "allow"
pattern = "\\.edu|university|master"
```
### max_depth

Reject URLs beyond a given link depth from the seeds.

```toml
[[fetch.filters]]
type = "max_depth"
max = 8
```
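Taken together, the five filter types behave as a predicate chain: a URL is fetched only if every configured filter accepts it. The sketch below illustrates those semantics (it is not the crawler's code; the filter objects are modeled as plain dicts mirroring the TOML above):

```python
import re
from urllib.parse import urlparse

def passes(url: str, depth: int, filters: list) -> bool:
    """Return True if the URL survives every filter in order."""
    host = urlparse(url).hostname or ""
    for f in filters:
        t = f["type"]
        if t == "substring" and any(p in url for p in f["patterns"]):
            return False  # contains a blocked substring
        if t == "domain" and any(host == d or host.endswith("." + d)
                                 for d in f["block"]):
            return False  # blocked domain (including subdomains)
        if t == "regex" and re.search(f["pattern"], url):
            return False  # matches a deny pattern
        if t == "allow" and not re.search(f["pattern"], url):
            return False  # fails to match the allow pattern
        if t == "max_depth" and depth > f["max"]:
            return False  # too many links away from the seeds
    return True
```

For example, with the filters shown above, a philosophy masters page at depth 2 passes, while a Facebook URL or a `.pdf` link is rejected immediately.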