Configuration Overview

Forager uses a single TOML file per crawl package, stored at data/{name}/{name}.toml. The config structure mirrors the pipeline stages:

Section      Purpose
[target]     Crawl identity (package name)
[score]      What to look for: keywords, groups, semantics
[fetch]      How to get pages: seeds, concurrency, filters
[extract]    What to pull from pages: CSS selectors
[frontier]   Frontier management: tree depth, domain caps
[dqn]        RL agent: replay buffer, epsilon, training

Param<T> syntax

All numeric parameters support three modes via Param<T>:

Fixed – plain scalar, never changes:

relevance_threshold = 0.5

Auto – starts at the given value, the learner adjusts it freely:

relevance_threshold = { value = 0.1, mode = "auto" }

Range – learner adjusts within clamped bounds:

weight = { value = 0.7, mode = "range", min = 0.3, max = 0.9 }

Fixed params ignore learn() calls entirely. Auto and range params update when the statistical learner or DQN agent provides new values.
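
The three modes can be pictured with a small Python model. This is only an illustrative sketch, not Forager's actual Param<T> implementation: the Param class, its from_toml helper, and the choice to treat an inline table without a mode key as fixed are all assumptions made for the example. Only the learn() name and the fixed/auto/range behaviour come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Param:
    """Hypothetical stand-in for Forager's Param<T>; not the real type."""
    value: float
    mode: str = "fixed"          # "fixed", "auto", or "range"
    min: Optional[float] = None  # only meaningful when mode == "range"
    max: Optional[float] = None

    @classmethod
    def from_toml(cls, raw):
        # A plain scalar is a fixed param; an inline table carries its mode.
        # Defaulting a table without "mode" to fixed is an assumption.
        if isinstance(raw, dict):
            return cls(raw["value"], raw.get("mode", "fixed"),
                       raw.get("min"), raw.get("max"))
        return cls(raw)

    def learn(self, proposed):
        if self.mode == "fixed":
            return  # fixed params ignore learn() entirely
        if self.mode == "range":
            # clamp the proposal into [min, max] before accepting it
            proposed = max(self.min, min(self.max, proposed))
        self.value = proposed


fixed = Param.from_toml(0.5)
auto = Param.from_toml({"value": 0.1, "mode": "auto"})
ranged = Param.from_toml({"value": 0.7, "mode": "range", "min": 0.3, "max": 0.9})

# The same proposal is sent to all three params.
for p in (fixed, auto, ranged):
    p.learn(0.95)

print(fixed.value, auto.value, ranged.value)  # 0.5 0.95 0.9
```

The fixed param keeps 0.5, the auto param adopts 0.95 as proposed, and the range param stops at its max of 0.9.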

Minimal config

Only [target], [score], and [fetch] are required. All other sections have defaults:

[target]
name = "my-crawl"

[score]
terms = [{ text = "ecology", weight = 2.0 }]

[fetch]
seed_urls = ["https://example.com"]

File layout

data/
  my-crawl/
    my-crawl.toml       # config
    my-crawl.grafeo     # database (auto-created)
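
Both file names follow from the package name. As a rough illustration, the helper below (package_paths, a hypothetical name that is not part of Forager) derives the two paths from the data/{name}/{name}.toml pattern, assuming "data" as the root directory shown in the layout.

```python
from pathlib import PurePosixPath


def package_paths(name, root="data"):
    """Return (config path, database path) for a crawl package name."""
    package_dir = PurePosixPath(root) / name
    return package_dir / f"{name}.toml", package_dir / f"{name}.grafeo"


config_path, db_path = package_paths("my-crawl")
print(config_path)  # data/my-crawl/my-crawl.toml
print(db_path)      # data/my-crawl/my-crawl.grafeo
```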