# [fetch] – Fetch Configuration
The [fetch] section controls how the crawler retrieves pages: seed URLs, concurrency, rate limiting, and URL filters. See Fetch for how these parameters work in practice.
## Fields

```toml
[fetch]
seed_urls = [
  "https://www.mastersportal.com/search/master/philosophy/europe",
  "https://www.findamasters.com/masters-degrees/philosophy/",
]
pages_per_round = 50
concurrency = 32
delay_ms = { min = 200, max = 500 }
languages = ["en", "de"]
timeout_ms = { value = 8000, mode = "auto" }
connect_timeout_ms = { value = 3000, mode = "auto" }
max_redirects = { value = 3, mode = "fixed" }
pool_idle_per_host = { value = 8, mode = "fixed" }
jitter_max_ms = { value = 50, mode = "auto" }
```
| Field | Type | Default | Description |
|---|---|---|---|
| `seed_urls` | `[str]` | – | Starting URLs for the crawl (required) |
| `pages_per_round` | `usize` | `50` | Pages fetched per crawl round |
| `concurrency` | `usize` | `32` | Maximum simultaneous HTTP requests |
| `delay_ms.min` | `u64` | `200` | Minimum delay between requests (ms) |
| `delay_ms.max` | `u64` | `500` | Maximum delay between requests (ms) |
| `languages` | `[str]` | `[]` | Accepted languages (empty = all) |
| `timeout_ms` | `Param<u64>` | `auto(8000)` | Total request timeout (adapts to P95 latency) |
| `connect_timeout_ms` | `Param<u64>` | `auto(3000)` | TCP + TLS handshake timeout |
| `max_redirects` | `Param<usize>` | `fixed(3)` | Maximum HTTP redirects to follow |
| `pool_idle_per_host` | `Param<usize>` | `fixed(8)` | Idle connections kept per host |
| `jitter_max_ms` | `Param<u64>` | `auto(50)` | Maximum random jitter added between requests (ms) |
## Adaptive timeouts

`timeout_ms` and `connect_timeout_ms` are adaptive by default. The crawler tracks response times per domain and adjusts each timeout toward that domain's P95 latency plus a safety margin. Fast domains get tight timeouts; slow domains get more room.
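The rule can be sketched as follows. This is an illustration only: the sample-window size, margin factor, and floor below are assumptions, not the crawler's actual constants.

```python
from statistics import quantiles

def adaptive_timeout(samples_ms, base_ms=8000, margin=1.5, floor_ms=1000):
    """Sketch of a P95-based per-domain timeout (constants are hypothetical)."""
    if len(samples_ms) < 20:
        # Too few observations for this domain: keep the configured base value.
        return base_ms
    # 95th percentile of recent response times for this domain.
    p95 = quantiles(samples_ms, n=20)[-1]
    # P95 plus a safety margin, clamped to a sane floor.
    return max(floor_ms, int(p95 * margin))
```

A consistently fast domain (say, ~10 ms responses) would end up pinned at the floor, while a slow one (~4 s responses) would get roughly 6 s before the crawler gives up.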
## Jitter

`jitter_max_ms` starts low and increases if the crawler encounters rate-limiting responses (HTTP 429 or Cloudflare challenges). This helps avoid triggering anti-bot protections.
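One plausible shape for that ramp-up, sketched below; the doubling-on-429, slow decay, and 2 s cap are assumptions for illustration, not documented behaviour.

```python
import random

class JitterController:
    """Grow the jitter ceiling on rate-limiting, decay it on healthy responses."""

    def __init__(self, start_ms=50, cap_ms=2000):
        self.max_ms = start_ms
        self.cap_ms = cap_ms

    def record_response(self, status: int) -> None:
        if status == 429:
            # Rate limited: double the ceiling, up to the cap.
            self.max_ms = min(self.cap_ms, self.max_ms * 2)
        else:
            # Healthy response: decay slowly back toward the starting value.
            self.max_ms = max(50, int(self.max_ms * 0.95))

    def next_jitter(self) -> float:
        # Extra delay added on top of the delay_ms range.
        return random.uniform(0, self.max_ms)
```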
## [[fetch.domain_weights]]

Boost or suppress specific domains regardless of their observed behaviour.

```toml
[[fetch.domain_weights]]
domain = "mastersportal.com"
weight = 2.0

[[fetch.domain_weights]]
domain = "reddit.com"
weight = 0.1
```
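How a weight enters scheduling is not specified here. One plausible reading (an assumption, including the subdomain-matching rule) is that a URL's scheduling priority is multiplied by its domain's weight, defaulting to 1.0:

```python
from urllib.parse import urlparse

def domain_weight(url: str, weights: dict) -> float:
    """Hypothetical lookup: weight for a URL's host, covering subdomains."""
    host = urlparse(url).hostname or ""
    for domain, w in weights.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return w
    return 1.0  # unlisted domains are left untouched
```

Under this reading, `www.mastersportal.com` pages would be scheduled twice as eagerly, and anything under `reddit.com` would be heavily suppressed.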
## [[fetch.filters]]

URL filters are applied before fetching. Each filter has a `type` field that selects the variant. Five types are available:
### substring

Reject URLs containing any of the listed substrings.

```toml
[[fetch.filters]]
type = "substring"
patterns = [".pdf", ".jpg", "login", "action=edit"]
```
### domain

Block entire domains.

```toml
[[fetch.filters]]
type = "domain"
block = ["facebook.com", "twitter.com", "youtube.com"]
```
### regex

Reject URLs matching a regex pattern. The example below drops date-stamped URLs such as blog-archive paths containing `/2024/05/17/`.

```toml
[[fetch.filters]]
type = "regex"
pattern = "\\d{4}/\\d{2}/\\d{2}"
```
### allow

Only allow URLs matching a regex pattern. All non-matching URLs are rejected.

```toml
[[fetch.filters]]
type = "allow"
pattern = "\\.edu|university|master"
```
### max_depth

Reject URLs beyond a given link depth from the seeds.

```toml
[[fetch.filters]]
type = "max_depth"
max = 8
```
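Taken together, the five filter types behave as a predicate chain: a URL is fetched only if every configured filter accepts it. The sketch below illustrates those semantics (it is not the crawler's code; the filter objects are modeled as plain dicts mirroring the TOML above):

```python
import re
from urllib.parse import urlparse

def passes(url: str, depth: int, filters: list) -> bool:
    """Return True if the URL survives every filter in order."""
    host = urlparse(url).hostname or ""
    for f in filters:
        t = f["type"]
        if t == "substring" and any(p in url for p in f["patterns"]):
            return False  # contains a blocked substring
        if t == "domain" and any(host == d or host.endswith("." + d)
                                 for d in f["block"]):
            return False  # blocked domain (including subdomains)
        if t == "regex" and re.search(f["pattern"], url):
            return False  # matches a deny pattern
        if t == "allow" and not re.search(f["pattern"], url):
            return False  # fails to match the allow pattern
        if t == "max_depth" and depth > f["max"]:
            return False  # too many links away from the seeds
    return True
```

For example, with the filters shown above, a philosophy masters page at depth 2 passes, while a Facebook URL or a `.pdf` link is rejected immediately.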