Introduction

Forager is a deep reinforcement learning web crawler designed for niche discovery. Given a description of what you are looking for, it crawls the web, scores what it finds, and learns to find more of it. It was built to solve a specific problem – finding European master’s programmes in process philosophy – but the architecture is general: describe a target, point it at some seed URLs, and let the system refine itself.

How it works

Forager operates as a stage-based pipeline. A DQN agent decides which URLs to fetch next. Each page is scored using a combination of keyword groups and semantic similarity (via a MiniLM-L6-v2 sentence embedding model). High-scoring pages reinforce the agent; low-scoring pages teach it to look elsewhere. The frontier, the scoring weights, and the reference embedding all adapt as the crawl progresses.

You do not need to know how any of this works to use it. The system is config-driven: you write a TOML file describing what you want (search terms, a reference paragraph, seed URLs), and forager learns the rest. Parameters marked mode = "auto" or mode = "range" are adjusted automatically during crawling based on observed results.

Stack

  • Rust – the entire pipeline, CLI, and RL agent
  • Burn (Wgpu/Vulkan backend) – tensor operations for the DQN and embedding inference
  • GrafeoDB – graph database for pages, links, domains, and frontier state
  • MiniLM-L6-v2 – sentence transformer for semantic scoring (auto-downloaded from HuggingFace on first run)

Workflow

Forager uses a package-based workflow. Each crawl lives in its own directory under data/, containing a config file, a database, and learned parameters. You create a package, optionally import terms, run the crawl, and check status through the CLI. All data lives in the graph database and can be queried directly – from the CLI, from Julia via the C FFI, or from Rust as a library.

forager new myproject --reference "..."
forager run myproject
forager status myproject
forager query myproject "MATCH (p:Page) WHERE p.score > 0.3 RETURN p.url, p.score"

The next pages walk through installation and running your first crawl.

Installation

Prerequisites

  • Rust toolchain (stable, 1.75+). Install via rustup if you do not have it.
  • Vulkan-capable GPU. Forager uses Burn with the Wgpu backend, which requires Vulkan drivers. Most discrete GPUs and recent integrated GPUs work. Verify with vulkaninfo or vkcube if unsure.
  • Internet connection for the first run (model download) and for crawling.

Install

Clone the repository and install the binary:

git clone https://github.com/your-org/forager.git
cd forager
cargo install --path crates/forager

This compiles the full workspace and places the forager binary in your Cargo bin directory (usually ~/.cargo/bin/).

Model download

On first run, forager automatically downloads the MiniLM-L6-v2 sentence transformer model from HuggingFace. This is a ~23 MB download. The model is cached locally and reused for all subsequent runs. No manual setup is needed.

Verify

forager --help

Expected output:

Deep RL web crawler

Usage: forager [OPTIONS] <COMMAND>

Commands:
  new     Create a new crawl package
  import  Import terms from a CSV file
  run     Run or resume a crawl
  status  Show crawl state, parameters, and statistics
  tune    Override a learned parameter
  log     Show crawl run history
  list    List all packages in data/
  query   Run a raw GQL query
  help    Print this message or the help of the given subcommand(s)

Options:
  -d, --db <DB>  Override the database path
  -h, --help     Print help

If you see this, the installation is complete. Proceed to Your First Crawl.

Your First Crawl

This walkthrough creates a crawl that searches for European master’s programmes in process philosophy. Adapt the reference text and terms to your own use case.

1. Create a package

forager new phil-ma --reference "European master's programme in process philosophy, \
continental philosophy, metaphysics of becoming. Thinkers: Whitehead, Deleuze, \
Simondon, Bergson. MA or MSc with faculty working on process thought."

This creates data/phil-ma/ with a generated config file. The --reference text becomes the semantic anchor that pages are scored against. You can also start from a hand-written config:

forager new phil-ma --config configs/uni.toml

2. Edit the config

Open data/phil-ma/phil-ma.toml and review the generated settings. A minimal config looks like this:

[target]
name = "phil-ma"

[score]
relevance_threshold = { value = 0.10, mode = "auto" }

[[score.groups]]
name = "philosophy"
required = true
weight = 1.0
terms = [
    { text = "process philosophy", weight = 3.0 },
    { text = "Whitehead", weight = 2.5 },
    { text = "continental philosophy", weight = 2.0 },
]

[[score.groups]]
name = "program"
required = true
weight = 2.0
terms = [
    { text = "master programme", weight = 3.0 },
    { text = "MA philosophy", weight = 2.5 },
    { text = "ECTS", weight = 2.0 },
]

[score.semantic]
reference = """
I am looking for a European master's programme in process philosophy
or continental philosophy, focused on becoming, process, and practice.
"""
weight = { value = 0.7, mode = "range", min = 0.3, max = 0.9 }

[fetch]
seed_urls = [
    "https://www.mastersportal.com/search/master/philosophy/europe",
    "https://www.findamasters.com/masters-degrees/philosophy/",
    "https://philosophicalgourmet.com/overall-rankings/",
]
concurrency = 64
pages_per_round = 50

Key points:

  • Marking both term groups required = true means a page must match both to score well. A Wikipedia article about Whitehead (philosophy only, no programme info) scores low. An MSc in data science (programme only, wrong topic) scores low. An MA in process philosophy matches both groups and scores high.
  • mode = "auto" and mode = "range" mark parameters that the system adjusts during crawling. Fixed values stay fixed.
  • reference is embedded with MiniLM-L6-v2 and compared against page content. The reference embedding itself adapts over time toward high-scoring pages.

3. Import additional terms (optional)

If you have a list of terms in a CSV file, import them:

forager import phil-ma -t terms.csv

The CSV can be simple (one term per line) or structured (group,text,weight with a header row). Example terms.csv:

group,text,weight
philosophy,speculative realism,1.5
philosophy,radical empiricism,1.5
philosophy,non-representational theory,1.0
program,postgraduate,2.0
program,admission,1.5

Imported terms merge with config terms. Config terms always take priority.

4. Run the crawl

forager run phil-ma

The crawler starts fetching seed URLs, scoring pages, training the DQN agent, and expanding the frontier. Output shows round-by-round progress:

[round 1]  fetched 50  scored 47  relevant 3  frontier 892  epsilon 0.98
[round 2]  fetched 50  scored 48  relevant 5  frontier 1304  epsilon 0.95
[round 3]  fetched 50  scored 46  relevant 8  frontier 1687  epsilon 0.91
...
  • fetched – pages downloaded this round
  • scored – pages successfully parsed and scored
  • relevant – pages above the relevance threshold
  • frontier – total URLs queued for future rounds
  • epsilon – DQN exploration rate (starts high, decays toward 0.1)

Stop the crawl with Ctrl-C. It saves state automatically. Resume later:

forager run phil-ma           # resumes the latest run
forager run phil-ma --new     # starts a fresh run (keeps data)

5. Check status

forager status phil-ma

Shows learned parameters, crawl statistics, top domains, and score distributions. Useful for deciding whether to keep running or adjust the config.

6. Query results

Use GQL queries to explore the database directly:

forager query phil-ma "MATCH (p:Page) WHERE p.score > 0.3 RETURN p.url, p.score ORDER BY p.score DESC LIMIT 20"
forager query phil-ma "MATCH (d:Domain) RETURN d.name, d.fetches, d.reward_sum ORDER BY d.fetches DESC"

For richer analysis from Julia (or any language with C FFI), forager-db builds as a shared library. See Database Model for details.

Other useful commands

forager list                        # show all packages
forager log phil-ma                 # run history for this package
forager tune phil-ma semantic_weight 0.8   # override a learned parameter

For a full command reference, see CLI Commands.

Pipeline

Forager’s crawl loop is a five-stage pipeline that runs in rounds. Each round selects URLs, fetches them, parses the HTML, scores relevance, and tunes the model. Every stage feeds back into the others through learning – the crawler gets smarter with each round.

Stage flow

flowchart TD
    init["init — load config, build components"]
    restore["restore — reload frontier + replay buffer"]
    select["select — sample URLs from frontier tree"]
    fetch["fetch — concurrent HTTP requests"]
    parse["parse — extract text + links"]
    score["score — relevance scoring"]
    tune["tune — DQN training + param update"]

    init --> restore --> select
    select --> fetch --> parse --> score --> tune
    tune -->|"next round"| select

    style init fill:#2d2d2d,stroke:#555,color:#eee
    style restore fill:#2d2d2d,stroke:#555,color:#eee
    style select fill:#1a3a4a,stroke:#4a9,color:#eee
    style fetch fill:#1a3a4a,stroke:#4a9,color:#eee
    style parse fill:#1a3a4a,stroke:#4a9,color:#eee
    style score fill:#1a3a4a,stroke:#4a9,color:#eee
    style tune fill:#1a3a4a,stroke:#4a9,color:#eee

Stage overview

Select

Picks the next batch of URLs from the frontier. The TRES tree partitions candidates by their feature vectors and samples one per leaf node, guaranteeing diversity. The DQN agent scores each candidate, and domain profiling filters out slow or unreliable sources.

Fetch

Downloads pages concurrently with stealth headers and adaptive timeouts. Failed fetches produce zero-reward transitions so the agent learns to avoid those URL patterns. Per-domain rate limiting and jitter keep the crawler polite.

Parse

Extracts structured text (title, headings, body) and outbound links from HTML. Anchor text is batch-embedded on GPU for keyword matching. Domain-adaptive truncation caps input size to avoid wasting compute on bloated pages.

Score

Determines how relevant each page is by blending semantic embedding similarity with keyword density. Three text signals (title, heading, body) are weighted independently, and all weights are learnable. Pages above the relevance threshold contribute positive reward.

Tune

Trains the Double DQN agent with prioritised experience replay. Updates all adaptive parameters (scoring weights, fetch timeouts, parse limits, selection thresholds) based on observed outcomes. Blends the reference embedding toward recently found relevant pages.

Startup stages

Before the main loop begins, two one-time stages run:

init reads the configuration file and constructs the scorer, DQN agent, frontier tree, fetcher, and filter chain. All learnable parameters (Param<T>) are initialised from their config values and modes.

restore reloads the frontier tree and candidate queue from the database on resume. Experience transitions with positive reward are loaded back into the replay buffer so the agent retains what it learned.

Feedback loops

The pipeline is not a simple conveyor belt. Each stage produces information that improves the others:

  • Score informs tune: page relevance becomes the DQN reward signal.
  • Tune informs select: updated Q-values change which URLs are prioritised.
  • Parse informs select: discovered child URLs expand the frontier.
  • Fetch informs fetch: domain profiles adjust timeouts and concurrency.
  • Tune informs score: learnable weights (signal blend, thresholds) shift based on what works.

Graceful shutdown

The pipeline honours Ctrl-C at every stage boundary and at every page within a round. State is always persisted before exit, so resuming loses at most the in-flight page.

Select

The select stage decides which URLs to fetch next. Out of potentially thousands of candidates sitting in the frontier, it needs to pick a small batch that balances exploration (trying new domains and paths) with exploitation (fetching URLs the model thinks will score well). It does this through tree-based partitioning, neural scoring, and domain-level filtering.

TRES tree partitioning

The frontier is organized as a CART regression tree (called TRES – Tree-based Reward Estimation for Sampling). Each URL in the frontier has an 11-dimensional feature vector. The tree splits on these features using reward as the regression target, creating leaf nodes that group similar URLs together.

At selection time, forager samples one candidate from each leaf. This guarantees diversity – even if 90% of the frontier is from a single domain, the tree forces the batch to include URLs from different regions of feature space.

DQN scoring

Once candidates are sampled from the tree, the DQN agent scores them all in a single batched forward pass. Each candidate’s feature vector goes through the network and comes out as a Q-value – the model’s estimate of how much reward fetching that URL will produce.

Candidates are ranked by Q-value. The top pages_per_round candidates survive to be fetched.

Domain profiling

Before a candidate makes it into the final batch, domain-level checks filter out bad sources:

  • Slow domains: if a domain’s median response time exceeds slow_domain_ms, it is deprioritised. Life is too short to wait for overloaded servers.
  • Unreliable domains: if a domain’s fetch failure rate exceeds unreliable_threshold, its URLs are skipped. No point hammering a host that keeps returning errors.
  • Page caps: each domain has a maximum page count (domain_max_pages). Once reached, no more URLs from that domain are selected. This prevents any single site from dominating the crawl.
  • Already fetched: URLs that have been fetched before are excluded. The frontier tracks visited URLs by hash.
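Taken together, the domain-level checks reduce to a simple per-candidate predicate. A minimal sketch, where `DomainProfile` and `passes_domain_filters` are illustrative names (not Forager's actual types), and the "already fetched" check is assumed to happen separately via the visited-URL hash:

```rust
/// Running per-domain statistics, as described above (illustrative struct).
struct DomainProfile {
    median_response_ms: f64,
    failure_rate: f64,
    pages_fetched: u32,
}

/// A candidate survives only if its domain is fast, reliable, and under the cap.
fn passes_domain_filters(
    p: &DomainProfile,
    slow_domain_ms: f64,
    unreliable_threshold: f64,
    domain_max_pages: u32,
) -> bool {
    p.median_response_ms <= slow_domain_ms        // not a slow domain
        && p.failure_rate < unreliable_threshold  // not unreliable
        && p.pages_fetched < domain_max_pages     // under the page cap
}

fn main() {
    let good = DomainProfile { median_response_ms: 400.0, failure_rate: 0.05, pages_fetched: 12 };
    let slow = DomainProfile { median_response_ms: 9000.0, failure_rate: 0.05, pages_fetched: 12 };
    println!("{} {}", passes_domain_filters(&good, 3000.0, 0.3, 100),
                      passes_domain_filters(&slow, 3000.0, 0.3, 100));
}
```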

Feature vector

Every URL in the frontier is described by an 11-dimensional feature vector. These features are the inputs to the DQN and the splitting criteria for the TRES tree.

| #  | Feature             | Description                                                        |
|----|---------------------|--------------------------------------------------------------------|
| 1  | depth               | Link depth from the nearest seed URL                               |
| 2  | domain_relevance    | Average score of pages already fetched from this domain            |
| 3  | domain_count        | Number of pages fetched from this domain so far                    |
| 4  | domain_failure_rate | Fraction of fetches from this domain that failed                   |
| 5  | parent_score        | Relevance score of the page that linked to this URL                |
| 6  | anchor_similarity   | Embedding similarity between anchor text and the reference         |
| 7  | sibling_score       | Average score of other URLs discovered on the same parent page     |
| 8  | path_depth          | Number of / segments in the URL path                               |
| 9  | has_query           | Whether the URL contains query parameters (0 or 1)                 |
| 10 | domain_weight       | User-configured weight for this domain (from fetch.domain_weights) |
| 11 | round_discovered    | The crawl round in which this URL was first seen                   |

Selection flow

flowchart TD
    frontier["Frontier tree<br/>(all pending URLs)"]
    sample["Sample 1 candidate<br/>per leaf node"]
    dqn["DQN forward pass<br/>(batched Q-values)"]
    rank["Rank by Q-value"]
    filter["Filter out:<br/>- capped domains<br/>- unreliable domains<br/>- slow domains<br/>- already fetched"]
    batch["Final batch<br/>(pages_per_round URLs)"]

    frontier --> sample --> dqn --> rank --> filter --> batch

    style frontier fill:#2d2d2d,stroke:#555,color:#eee
    style sample fill:#1a3a4a,stroke:#4a9,color:#eee
    style dqn fill:#1a3a4a,stroke:#4a9,color:#eee
    style rank fill:#1a3a4a,stroke:#4a9,color:#eee
    style filter fill:#3a1a1a,stroke:#a44,color:#eee
    style batch fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [select] config reference for all tuneable fields.

Fetch

The fetch stage downloads pages from the selected URLs. It runs requests concurrently while staying polite – rotating user agents, respecting rate limits, and adapting timeouts based on what it observes.

Concurrent HTTP with stealth headers

Forager sends up to concurrency requests in parallel using async HTTP. Each request carries browser-like headers to avoid being blocked by bot detection:

  • User-Agent: rotated across 6 realistic browser UA strings (Chrome, Firefox, Safari on various OSes).
  • Accept, Accept-Language, Accept-Encoding: set to match what a real browser would send.
  • Connection: keep-alive: connections are pooled per host for efficiency.

This is not about deception – it is about not getting blocked by overzealous WAFs that reject anything that does not look like a browser.

Timeouts and jitter

Two timeouts control how long forager waits:

  • timeout_ms: total request timeout (default: 8000ms, adaptive). Covers DNS, connection, TLS, and response body.
  • connect_timeout_ms: just the TCP + TLS handshake (default: 3000ms, adaptive).

Both are adaptive parameters. The crawler adjusts them based on observed response times – specifically, it targets the P95 latency plus a safety margin. If a domain is consistently fast, timeouts tighten. If responses are slow, they relax.
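The P95-plus-margin rule can be sketched as follows. The sort-based percentile and the multiplicative margin factor are assumptions for illustration, not Forager's exact update rule:

```rust
/// P95 of observed response times (simple sort-based sketch).
fn p95(samples: &mut Vec<f64>) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((samples.len() as f64) * 0.95).ceil() as usize - 1;
    samples[idx.min(samples.len() - 1)]
}

/// Target timeout: P95 latency plus a safety margin (margin factor assumed).
fn adapted_timeout_ms(observed_ms: &mut Vec<f64>, margin_factor: f64) -> f64 {
    p95(observed_ms) * margin_factor
}

fn main() {
    // 100 observed responses between 10 ms and 1000 ms.
    let mut observed: Vec<f64> = (1..=100).map(|ms| ms as f64 * 10.0).collect();
    println!("new timeout: {} ms", adapted_timeout_ms(&mut observed, 1.5));
}
```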

Jitter adds a small random delay (up to jitter_max_ms) between requests to the same domain. This starts low and increases automatically if the crawler detects rate-limiting responses (429s or Cloudflare challenges). The goal is to look less like a bot and more like a human clicking through pages.

Cloudflare detection

When a response comes back with a Cloudflare challenge page (identified by specific status codes and response patterns), forager logs it and adjusts the domain’s profile. Repeated challenges increase jitter for that domain and may eventually mark it as unreliable.

Domain weight adjustments

Each domain tracks a running profile:

  • Median response time: used to detect slow domains.
  • Failure rate: fraction of requests that returned errors, timeouts, or challenges.
  • Pages fetched: compared against domain_max_pages to enforce caps.

These profiles persist across rounds. A domain that was fast and reliable in round 1 keeps its good reputation in round 5, unless things change.

Users can also set explicit domain weights in config (fetch.domain_weights) to boost or suppress specific domains regardless of observed behaviour.

What happens on failure

Failed fetches are not just discarded – they become zero-reward training examples. The DQN learns that URLs with similar features to the failed ones are less worth fetching. This is one of the ways the crawler improves over time: it stops wasting requests on domains and URL patterns that do not work.

Configuration

See the [fetch] config reference for all tuneable fields including seed URLs, concurrency, delay, filters, and adaptive timeout parameters.

Parse

The parse stage extracts useful information from downloaded HTML. It pulls out text for scoring, discovers new URLs for the frontier, and optionally extracts structured fields using CSS selectors.

Text extraction

Each page produces three text signals, extracted separately because they carry different amounts of information:

  • Title: the content of the <title> tag. Usually the most concentrated signal about what a page is about.
  • Headings: text from <h1> through <h6> elements, concatenated. These capture the page’s structure and topic hierarchy.
  • Body: the visible text content after stripping tags, scripts, and styles. The broadest signal but also the noisiest.

These three signals flow into the score stage where they are weighted independently. The weights are learnable – the model figures out whether title or body matters more for your particular crawl.

Link discovery

Every <a href> in the page is a potential new URL for the frontier. The parser extracts all outbound links and normalises them (resolving relative URLs, stripping fragments). Links are filtered through the configured filter chain before being added to the frontier.

Anchor text – the clickable text of a link – is particularly valuable. If the anchor text says “PhD programme in philosophy”, that is a strong hint about what is on the other side of that link.

Anchor embedding

Links whose anchor text matches any of your configured keywords get their anchor text embedded using the same MiniLM model used for scoring. This is done in a batched GPU pass (anchor_batch_size controls the batch size) for efficiency.

The resulting embedding similarity becomes the anchor_similarity feature in the URL’s feature vector, giving the DQN a semantic signal about whether a link is worth following.

Field extraction

For structured crawling, you can define CSS selectors to extract specific fields from pages:

[[parse.extract_fields]]
name = "tuition"
selector = ".tuition-fee"
attribute = "text"

This is optional and primarily useful when you want to pull structured data (prices, dates, names) out of pages alongside the relevance scoring.

Adaptive HTML truncation

Not all pages are equal in size. Some domains serve lean HTML; others dump megabytes of JavaScript and boilerplate. The parser enforces a per-domain HTML size limit that adapts over time:

  • max_html_bytes: hard cap on how much HTML to process (default: 512KB).
  • html_limit_factor: a learnable multiplier applied per domain. Domains with useful content in large pages get a higher factor; domains where the first few KB contain everything useful get a lower one.

This prevents the crawler from wasting time parsing bloated pages that do not contribute useful text.

Propagation threshold

Not every discovered link deserves to enter the frontier. The propagation_threshold parameter sets a minimum parent score – if the page that contains the link scored below this threshold, its child links are not added to the frontier.

This is adaptive: early in the crawl when the model is still learning, the threshold is low (let everything through). As the crawler gets better at predicting relevance, the threshold rises, keeping the frontier focused on promising regions of the web.

Configuration

See the [parse] config reference for all tuneable fields.

Score

The score stage determines how relevant each fetched page is to your search. It combines two complementary approaches – semantic embedding similarity and keyword density – into a single score between 0 and 1.

Semantic similarity

Forager uses MiniLM-L6-v2 to compute sentence embeddings. Your reference description is embedded once at startup, and each page’s text is embedded at score time. Relevance is the cosine similarity between the two.

This captures meaning rather than exact words. A reference about “European master’s programme in philosophy” will match pages that say “MA in continental thought” even though there is no keyword overlap.
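In code, the comparison is a plain cosine over the two embedding vectors. A straightforward sketch (not Forager's exact implementation):

```rust
/// Cosine similarity between a page embedding and the reference embedding.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    let reference = [1.0, 0.0, 0.0];
    let page = [0.8, 0.6, 0.0];
    println!("similarity = {}", cosine_similarity(&reference, &page));
}
```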

Anti-reference

You can also provide an anti-reference – a description of what you do not want. Pages similar to the anti-reference get penalised. This is useful when your topic overlaps with something you want to exclude (e.g., you want process philosophy but not analytic philosophy).

The anti-reference penalty is subtracted from the similarity score, weighted by anti_weight.

Multi-signal blend

Page text is not treated as a single blob. The parser extracts three signals – title, heading, and body – and each gets its own embedding and similarity score. The final semantic affinity is a weighted sum:

affinity = w_title * sim_title + w_heading * sim_heading + w_body * sim_body

The weights (w_title, w_heading, w_body) are learnable parameters that sum to 1. The model discovers whether title or body is more informative for your particular crawl. For academic programme pages, title tends to matter most. For blog posts, body might dominate.

Keyword matching

Keyword scoring uses term groups. Each group contains related terms with individual weights. A page’s keyword density for a group is the weighted count of matching terms normalised by text length.

When multiple groups are marked required, their scores are combined with a geometric mean. This means a page must match all required groups to score well – matching just one group drives the score toward zero.

density = (group_1_score * group_2_score * ... * group_n_score) ^ (1/n)

This is powerful for intersection queries: “pages about philosophy AND master’s programmes” eliminates Wikipedia articles (philosophy only) and generic university listings (programmes only).
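The geometric-mean combination is easy to sketch; `required_group_density` is an illustrative name, not Forager's API:

```rust
/// Geometric mean of per-group keyword scores: any required group at zero
/// drives the whole density to zero.
fn required_group_density(group_scores: &[f64]) -> f64 {
    if group_scores.is_empty() {
        return 0.0;
    }
    let product: f64 = group_scores.iter().product();
    product.powf(1.0 / group_scores.len() as f64)
}

fn main() {
    // Philosophy-only page: strong on one group, zero on the other.
    println!("{}", required_group_density(&[0.9, 0.0]));
    // Page that matches both groups moderately.
    println!("{}", required_group_density(&[0.5, 0.5]));
}
```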

Final score formula

The two signals are blended with a learnable weight:

score = sem_w * affinity + (1 - sem_w) * density

Where sem_w is the semantic weight (default 0.7, learnable in range 0.3–0.9). When keywords are weak, the model can lean on embeddings, and vice versa.

Language penalty

If you configure target languages (e.g., languages = ["en", "de"]), pages detected in other languages receive a penalty. The lang_penalty parameter (default: 0.0, adaptive) controls how harshly off-language pages are penalised. The model learns the right penalty strength – in multilingual crawls it stays low; in monolingual crawls it ramps up.

Reference blending

The reference embedding is not static. After each round, it drifts slightly toward the embeddings of pages that scored well. The reference_blend parameter controls the drift rate:

reference = (1 - blend) * reference + blend * mean(relevant_embeddings)

This lets the crawler refine its understanding of what “relevant” means based on what it actually finds. If your initial reference was vague, the embedding sharpens over time as the crawler discovers concrete examples.
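The drift step above can be sketched as a component-wise blend followed by re-normalisation. The re-normalisation is an assumption (cosine similarity is scale-invariant, but unit-length vectors are the usual convention):

```rust
/// Drift the reference embedding toward the mean of relevant-page embeddings,
/// then re-normalise to unit length (re-normalisation assumed).
fn blend_reference(reference: &mut [f32], relevant_mean: &[f32], blend: f32) {
    for (r, m) in reference.iter_mut().zip(relevant_mean) {
        *r = (1.0 - blend) * *r + blend * *m;
    }
    let norm: f32 = reference.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for r in reference.iter_mut() {
            *r /= norm;
        }
    }
}

fn main() {
    let mut reference = vec![1.0, 0.0];
    blend_reference(&mut reference, &[0.0, 1.0], 0.5);
    println!("{reference:?}");
}
```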

Scoring pipeline

flowchart TD
    page["Fetched page"]
    extract["Extract title,<br/>headings, body"]
    embed["Embed each signal<br/>(MiniLM-L6-v2)"]
    sim["Cosine similarity<br/>vs reference"]
    anti["Anti-reference<br/>penalty"]
    blend["Signal blend<br/>(learnable weights)"]
    kw["Keyword density<br/>(term groups)"]
    final["Final score<br/>sem_w * affinity +<br/>(1 - sem_w) * density"]
    lang["Language<br/>penalty"]
    out["Page score<br/>(0 to 1)"]

    page --> extract --> embed --> sim
    sim --> blend
    anti --> blend
    blend --> final
    kw --> final
    final --> lang --> out

    style page fill:#2d2d2d,stroke:#555,color:#eee
    style extract fill:#1a3a4a,stroke:#4a9,color:#eee
    style embed fill:#1a3a4a,stroke:#4a9,color:#eee
    style sim fill:#1a3a4a,stroke:#4a9,color:#eee
    style anti fill:#3a1a1a,stroke:#a44,color:#eee
    style blend fill:#1a3a4a,stroke:#4a9,color:#eee
    style kw fill:#1a3a4a,stroke:#4a9,color:#eee
    style final fill:#1a3a1a,stroke:#4a4,color:#eee
    style lang fill:#3a1a1a,stroke:#a44,color:#eee
    style out fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [score] config reference for all tuneable fields including term groups, semantic settings, and signal weights.

Tune

The tune stage is where the crawler learns. After each round of fetching and scoring, it trains the DQN agent on the outcomes and updates all adaptive parameters. This is the feedback loop that makes forager improve over time rather than just repeating the same strategy.

Double DQN

Forager uses a Double DQN architecture to avoid the overestimation bias that plagues standard DQN:

  • Online network: selects the best action (which URL to fetch). This is the network that gets trained every round.
  • Target network: evaluates the value of that action. This is a frozen copy of the online network, updated less frequently (target_update_freq rounds).

The separation means the agent does not chase its own optimistic estimates. The online network picks what it thinks is best; the target network provides a more conservative value estimate.

Network architecture

The DQN is a small feedforward network:

Input (11) → Linear(30) → LeakyReLU → Linear(15) → LeakyReLU → Linear(1)

11 input features (the URL feature vector), two hidden layers of 30 and 15 units with LeakyReLU activations, and a single output: the estimated Q-value.

This is deliberately compact. The feature vector already encodes a lot of domain knowledge (depth, domain stats, parent score, anchor similarity), so the network’s job is to learn a relatively simple value function over those features.
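For illustration, the forward pass can be sketched with plain vectors. The real model is a Burn module running on GPU; the zero weights below are placeholders that only demonstrate the 11 → 30 → 15 → 1 shapes:

```rust
/// A dense layer: weight matrix (one row per output unit) plus bias.
struct Layer {
    w: Vec<Vec<f32>>,
    b: Vec<f32>,
}

impl Layer {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        self.w.iter().zip(&self.b)
            .map(|(row, b)| row.iter().zip(input).map(|(w, x)| w * x).sum::<f32>() + b)
            .collect()
    }
}

fn leaky_relu(x: f32) -> f32 {
    if x > 0.0 { x } else { 0.01 * x }
}

/// 11 features in, one Q-value out, LeakyReLU between the hidden layers.
fn q_value(features: &[f32; 11], l1: &Layer, l2: &Layer, l3: &Layer) -> f32 {
    let h1: Vec<f32> = l1.forward(features).into_iter().map(leaky_relu).collect();
    let h2: Vec<f32> = l2.forward(&h1).into_iter().map(leaky_relu).collect();
    l3.forward(&h2)[0]
}

fn main() {
    let l1 = Layer { w: vec![vec![0.0; 11]; 30], b: vec![0.0; 30] };
    let l2 = Layer { w: vec![vec![0.0; 30]; 15], b: vec![0.0; 15] };
    let l3 = Layer { w: vec![vec![0.0; 15]; 1], b: vec![0.0; 1] };
    println!("Q = {}", q_value(&[0.0; 11], &l1, &l2, &l3));
}
```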

Prioritised Experience Replay

Not all training examples are equally useful. A transition where the model was badly wrong (high TD error) teaches more than one where the prediction was already close.

Prioritised Experience Replay (PER) samples transitions with probability proportional to their priority:

priority = |TD_error| + epsilon

Where epsilon (per_epsilon, default: 1e-4) prevents any transition from having zero probability. The per_alpha parameter (default: 0.6) controls how strongly priority affects sampling – 0 gives uniform sampling, 1 gives fully prioritised sampling.

Importance sampling correction

Prioritised sampling introduces bias because high-priority transitions are over-represented. Importance sampling weights correct for this:

w_i = (N * P(i)) ^ (-beta)

The beta parameter anneals from a starting value toward 1.0 over the course of training. Early on, the correction is weak (the agent learns fast but with some bias). Later, it becomes exact (unbiased gradient updates for fine-tuning).
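Both formulas can be sketched together. Normalising the importance weights by their maximum is a common PER convention and an assumption here, as are the function names:

```rust
/// Sampling probability per transition: (|TD error| + eps)^alpha, normalised.
fn sampling_probs(td_errors: &[f64], per_alpha: f64, per_epsilon: f64) -> Vec<f64> {
    let priorities: Vec<f64> = td_errors.iter()
        .map(|e| (e.abs() + per_epsilon).powf(per_alpha))
        .collect();
    let total: f64 = priorities.iter().sum();
    priorities.iter().map(|p| p / total).collect()
}

/// Importance-sampling weights (N * P(i))^(-beta), normalised by the max.
fn is_weights(probs: &[f64], beta: f64) -> Vec<f64> {
    let n = probs.len() as f64;
    let raw: Vec<f64> = probs.iter().map(|p| (n * p).powf(-beta)).collect();
    let max = raw.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    raw.iter().map(|w| w / max).collect()
}

fn main() {
    let probs = sampling_probs(&[2.0, 0.5, 0.1], 0.6, 1e-4);
    let weights = is_weights(&probs, 0.4);
    println!("probs = {probs:?}, weights = {weights:?}");
}
```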

Epsilon schedule

The agent balances exploration and exploitation using an epsilon-greedy strategy:

  • With probability epsilon, pick a random URL (exploration).
  • With probability 1 - epsilon, pick the URL with the highest Q-value (exploitation).

Epsilon decays linearly from epsilon.start to epsilon.end over epsilon.decay_steps rounds. Early rounds explore broadly; later rounds exploit what the model has learned.
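The linear schedule is a one-liner; the argument names mirror the epsilon.start / epsilon.end / epsilon.decay_steps config keys:

```rust
/// Linear epsilon decay: start at `start`, reach `end` after `decay_steps` rounds.
fn epsilon_at(round: u32, start: f64, end: f64, decay_steps: u32) -> f64 {
    if round >= decay_steps {
        end
    } else {
        start + (end - start) * (round as f64 / decay_steps as f64)
    }
}

fn main() {
    for round in [0, 25, 50, 100] {
        println!("round {round}: epsilon = {}", epsilon_at(round, 1.0, 0.1, 100));
    }
}
```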

Adaptive parameter updates

The DQN is not the only thing that learns. All Param<T> parameters across every stage update once per round during tune:

  • Scoring: signal weights, relevance threshold, semantic weight
  • Fetch: timeouts, jitter, redirect limits
  • Parse: link limits, propagation threshold, embed threshold
  • Select: domain caps, slow/unreliable thresholds
  • Tune: PER alpha, epsilon schedule

Parameters in auto mode are adjusted based on observed outcomes. Parameters in fixed mode stay constant. Parameters in range mode are bounded but learnable.

Training loop

flowchart TD
    results["Round results<br/>(URL, score, features)"]
    store["Store transitions<br/>in replay buffer"]
    sample["PER: sample batch<br/>by priority"]
    forward["Online net:<br/>select best action"]
    target["Target net:<br/>evaluate Q-value"]
    td["Compute TD error<br/>+ IS weights"]
    backprop["Backprop +<br/>update online net"]
    update_prio["Update priorities<br/>in replay buffer"]
    update_target{"Update target net?<br/>(every N rounds)"}
    params["Update all<br/>adaptive params"]
    blend["Blend reference<br/>embedding"]

    results --> store --> sample
    sample --> forward --> target --> td
    td --> backprop --> update_prio
    update_prio --> update_target
    update_target -->|"yes"| params
    update_target -->|"no"| params
    params --> blend

    style results fill:#2d2d2d,stroke:#555,color:#eee
    style store fill:#1a3a4a,stroke:#4a9,color:#eee
    style sample fill:#1a3a4a,stroke:#4a9,color:#eee
    style forward fill:#1a3a4a,stroke:#4a9,color:#eee
    style target fill:#1a3a4a,stroke:#4a9,color:#eee
    style td fill:#1a3a4a,stroke:#4a9,color:#eee
    style backprop fill:#1a3a1a,stroke:#4a4,color:#eee
    style update_prio fill:#1a3a4a,stroke:#4a9,color:#eee
    style update_target fill:#2d2d2d,stroke:#555,color:#eee
    style params fill:#1a3a1a,stroke:#4a4,color:#eee
    style blend fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [tune] config reference for all tuneable fields including replay buffer size, learning rate, epsilon schedule, and PER parameters.

Configuration Overview

Forager uses a single TOML file per crawl package, stored at data/{name}/{name}.toml. The config structure mirrors the pipeline stages:

| Section    | Purpose                                        |
|------------|------------------------------------------------|
| [target]   | Crawl identity (package name)                  |
| [score]    | What to look for: keywords, groups, semantics  |
| [fetch]    | How to get pages: seeds, concurrency, filters  |
| [extract]  | What to pull from pages: CSS selectors         |
| [frontier] | Frontier management: tree depth, domain caps   |
| [dqn]      | RL agent: replay buffer, epsilon, training     |

Param<T> syntax

All numeric parameters support three modes via Param<T>:

Fixed – plain scalar, never changes:

relevance_threshold = 0.5

Auto – starts at the given value, the learner adjusts it freely:

relevance_threshold = { value = 0.1, mode = "auto" }

Range – learner adjusts within clamped bounds:

weight = { value = 0.7, mode = "range", min = 0.3, max = 0.9 }

Fixed params ignore learn() calls entirely. Auto and range params update when the statistical learner or DQN agent provides new values.
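A minimal sketch of these three modes, simplified to f64 — the real Param&lt;T&gt; in forager-core is generic and its internals may differ; only the learn() behaviour described above is taken from the source:

```rust
/// Illustrative model of the three parameter modes (not forager's
/// actual type): Fixed ignores the learner, Auto adopts proposals
/// freely, Range clamps proposals to [min, max].
#[derive(Debug, Clone, Copy)]
pub enum Param {
    Fixed(f64),
    Auto(f64),
    Range { value: f64, min: f64, max: f64 },
}

impl Param {
    pub fn value(&self) -> f64 {
        match *self {
            Param::Fixed(v) | Param::Auto(v) => v,
            Param::Range { value, .. } => value,
        }
    }

    /// Accept a proposed value from the statistical learner or DQN.
    pub fn learn(&mut self, proposed: f64) {
        match self {
            Param::Fixed(_) => {}            // learn() is a no-op
            Param::Auto(v) => *v = proposed, // adopted freely
            Param::Range { value, min, max } => {
                *value = proposed.clamp(*min, *max); // bounded
            }
        }
    }
}
```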

Minimal config

Only [target], [score], and [fetch] are required. All other sections have defaults:

[target]
name = "my-crawl"

[score]
terms = [{ text = "ecology", weight = 2.0 }]

[fetch]
seed_urls = ["https://example.com"]

File layout

data/
  my-crawl/
    my-crawl.toml      # config
    my-crawl.grafeo     # database (auto-created)

[target] – Target Configuration

The [target] section identifies your crawl. It has a single field.

Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | str | — | A short name for this crawl target |

The name is used for the database file, log prefixes, and output directories. Keep it short and filesystem-friendly.

Example

[target]
name = "process-philosophy-masters"

[select] – Select Configuration

The [select] section controls how URLs are sampled from the frontier. It governs the TRES tree structure and domain-level filtering. See Select for how these parameters fit into the selection process.

Fields

[select]
max_depth             = 18
min_samples_per_split = 5
domain_max_pages      = { value = 100, mode = "auto" }
slow_domain_ms        = { value = 3000.0, mode = "auto" }
unreliable_threshold  = { value = 0.3, mode = "auto" }
html_limit_factor     = { value = 2.0, mode = "auto" }
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_depth | usize | 18 | Maximum depth of the TRES regression tree |
| min_samples_per_split | usize | 5 | Minimum samples in a leaf before it can split |
| domain_max_pages | Param<usize> | auto(100) | Maximum pages to fetch from any single domain |
| slow_domain_ms | Param<f64> | auto(3000) | Response time threshold (ms) above which a domain is “slow” |
| unreliable_threshold | Param<f64> | auto(0.3) | Failure rate above which a domain is considered unreliable |
| html_limit_factor | Param<f64> | auto(2.0) | Per-domain multiplier for the HTML size cap |

Tree parameters

max_depth and min_samples_per_split control the TRES tree shape. A deeper tree creates more leaf nodes, which means more diverse sampling but fewer candidates per leaf. If your frontier is small (under a few hundred URLs), lower max_depth to avoid over-partitioning.

Domain filtering

domain_max_pages prevents any single site from dominating the crawl. In auto mode, the crawler adjusts this cap based on how productive each domain is – domains that consistently produce relevant pages get a higher cap.

slow_domain_ms and unreliable_threshold define what counts as a bad domain. Both are adaptive: if the crawler finds that being stricter improves results, it tightens these thresholds.

HTML limit factor

html_limit_factor scales the global max_html_bytes on a per-domain basis. A factor of 2.0 means domains are allowed up to twice the base limit. The crawler adjusts this per domain based on whether larger pages from that domain tend to contain useful content.

[fetch] – Fetch Configuration

The [fetch] section controls how the crawler retrieves pages: seed URLs, concurrency, rate limiting, and URL filters. See Fetch for how these parameters work in practice.

Fields

[fetch]
seed_urls = [
    "https://www.mastersportal.com/search/master/philosophy/europe",
    "https://www.findamasters.com/masters-degrees/philosophy/",
]
pages_per_round  = 50
concurrency      = 32
delay_ms         = { min = 200, max = 500 }
languages        = ["en", "de"]
timeout_ms       = { value = 8000, mode = "auto" }
connect_timeout_ms = { value = 3000, mode = "auto" }
max_redirects    = { value = 3, mode = "fixed" }
pool_idle_per_host = { value = 8, mode = "fixed" }
jitter_max_ms    = { value = 50, mode = "auto" }
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| seed_urls | [str] | — | Starting URLs for the crawl (required) |
| pages_per_round | usize | 50 | Pages fetched per crawl round |
| concurrency | usize | 32 | Max simultaneous HTTP requests |
| delay_ms.min | u64 | 200 | Minimum delay between requests (ms) |
| delay_ms.max | u64 | 500 | Maximum delay between requests (ms) |
| languages | [str] | [] | Accepted languages (empty = all) |
| timeout_ms | Param<u64> | auto(8000) | Total request timeout (adaptive to P95 latency) |
| connect_timeout_ms | Param<u64> | auto(3000) | TCP + TLS handshake timeout |
| max_redirects | Param<usize> | fixed(3) | Maximum HTTP redirects to follow |
| pool_idle_per_host | Param<usize> | fixed(8) | Idle connections to keep per host |
| jitter_max_ms | Param<u64> | auto(50) | Max random jitter added between requests |

Adaptive timeouts

timeout_ms and connect_timeout_ms are adaptive by default. The crawler tracks response times per domain and adjusts toward the P95 latency plus a safety margin. Fast domains get tight timeouts; slow domains get more room.

Jitter

jitter_max_ms starts low and increases if the crawler encounters rate-limiting responses (HTTP 429 or Cloudflare challenges). This helps avoid triggering anti-bot protections.

[[fetch.domain_weights]]

Boost or suppress specific domains regardless of their observed behaviour.

[[fetch.domain_weights]]
domain = "mastersportal.com"
weight = 2.0

[[fetch.domain_weights]]
domain = "reddit.com"
weight = 0.1

[[fetch.filters]]

URL filters are applied before fetching. Each filter has a type field that selects the variant. Five types are available:

substring

Reject URLs containing any of the listed substrings.

[[fetch.filters]]
type = "substring"
patterns = [".pdf", ".jpg", "login", "action=edit"]

domain

Block entire domains.

[[fetch.filters]]
type = "domain"
block = ["facebook.com", "twitter.com", "youtube.com"]

regex

Reject URLs matching a regex pattern.

[[fetch.filters]]
type = "regex"
pattern = "\\d{4}/\\d{2}/\\d{2}"

allow

Only allow URLs matching a regex pattern. All non-matching URLs are rejected.

[[fetch.filters]]
type = "allow"
pattern = "\\.edu|university|master"

max_depth

Reject URLs beyond a given link depth from seeds.

[[fetch.filters]]
type = "max_depth"
max = 8
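The filter pass can be sketched as one check per configured filter, with a URL fetched only if every filter allows it. This is an illustrative model, not forager's actual API — the regex and allow variants are omitted here to keep the sketch dependency-free (they would use the regex crate):

```rust
/// Illustrative URL filter covering three of the five filter types.
enum UrlFilter {
    Substring { patterns: Vec<String> },
    Domain { block: Vec<String> },
    MaxDepth { max: usize },
}

impl UrlFilter {
    /// Returns true if the URL passes this filter.
    fn allows(&self, url: &str, domain: &str, depth: usize) -> bool {
        match self {
            UrlFilter::Substring { patterns } => {
                !patterns.iter().any(|p| url.contains(p.as_str()))
            }
            UrlFilter::Domain { block } => {
                // Block the listed domain and its subdomains.
                !block
                    .iter()
                    .any(|d| domain == d || domain.ends_with(&format!(".{d}")))
            }
            UrlFilter::MaxDepth { max } => depth <= *max,
        }
    }
}

/// A URL survives only if every configured filter allows it.
fn passes_all(filters: &[UrlFilter], url: &str, domain: &str, depth: usize) -> bool {
    filters.iter().all(|f| f.allows(url, domain, depth))
}
```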

[parse] – Parse Configuration

The [parse] section controls how HTML is processed after fetching: text extraction limits, link discovery, anchor embedding, and optional field extraction. See Parse for how these parameters fit into the parsing pipeline.

Fields

[parse]
parse_timeout_ms      = 500
max_html_bytes        = 512000
max_links_per_page    = { value = 200, mode = "auto" }
anchor_batch_size     = { value = 512, mode = "fixed" }
embed_threshold_factor = { value = 0.5, mode = "auto" }
propagation_threshold = { value = 0.0, mode = "auto" }
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| parse_timeout_ms | u64 | 500 | Timeout for HTML parsing per page (ms) |
| max_html_bytes | usize | 512000 | Max HTML body size to process (~500 KB) |
| max_links_per_page | Param<usize> | auto(200) | Max outbound links to extract per page |
| anchor_batch_size | Param<usize> | fixed(512) | Batch size for GPU anchor text embedding |
| embed_threshold_factor | Param<f64> | auto(0.5) | Minimum keyword match score for an anchor to be embedded |
| propagation_threshold | Param<f64> | auto(0.0) | Minimum parent score for child links to enter the frontier |

max_links_per_page caps how many outbound links are extracted from a single page. In auto mode, the crawler adjusts this based on whether pages with many links tend to produce useful frontier candidates. Pages with navigation menus listing hundreds of links often produce diminishing returns.

Anchor embedding

anchor_batch_size controls how many anchor texts are embedded in a single GPU pass. Larger batches are more efficient but use more memory. embed_threshold_factor sets how closely an anchor’s text must match your keywords before it is worth embedding – anchors below this threshold get a zero similarity score without the GPU cost.

Propagation threshold

propagation_threshold gates whether a page’s child links are added to the frontier. At 0.0 (the default), all links propagate. As the model learns, it raises this threshold to keep the frontier focused on promising regions. Low-scoring pages stop contributing new URLs.

Optional: extract_url_pattern

A regex pattern for extracting URLs from page text (not just <a> tags). Useful for pages that embed URLs in JavaScript or plain text.

[parse]
extract_url_pattern = "https://example\\.com/programs/\\d+"

[[parse.extract_fields]]

Define CSS selectors to extract structured fields from pages.

[[parse.extract_fields]]
name = "tuition"
selector = ".tuition-fee"
attribute = "text"

[[parse.extract_fields]]
name = "deadline"
selector = "span.deadline"
attribute = "text"

[[parse.extract_fields]]
name = "apply_link"
selector = "a.apply-button"
attribute = "href"
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | str | — | Field name in the output |
| selector | str | — | CSS selector to match elements |
| attribute | str | text | Which attribute to extract (text, href, etc.) |

[score] – Scoring Configuration

The [score] section defines what the crawler considers relevant. Scoring combines keyword matching (term groups) with semantic similarity (embeddings). See Score for a detailed explanation of how scoring works.

Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| terms | [KeywordTerm] | [] | Flat term list (used when no groups) |
| groups | [[TermGroup]] | [] | Multiplicative term groups with weights |
| relevance_threshold | Param<f64> | auto(0.1) | Minimum score to count a page as relevant |
| lang_penalty | Param<f64> | auto(0.0) | Score penalty for pages in non-target languages |

When groups is non-empty, flat terms are ignored. A page must match all required groups to score well – groups are multiplicative, not additive.

Language penalty

lang_penalty is applied when a page is detected in a language not listed in fetch.languages. In auto mode, the model learns the right penalty strength. For monolingual crawls it tends to ramp up; for multilingual crawls it stays low. A value of 0.0 means no penalty.

Term import

The terms_file field has been removed. Use forager import to load terms from CSV into the database. Config terms always take priority over DB-imported terms during merge.

KeywordTerm

{ text = "process philosophy", weight = 3.0 }
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| text | str | — | Search phrase |
| weight | f64 | 1.0 | Multiplier for this term |

TermGroup

[[score.groups]]
name = "philosophy"
required = true
weight = 1.0
terms = [
    { text = "process philosophy", weight = 3.0 },
    { text = "Whitehead", weight = 2.5 },
]
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | str | — | Group identifier |
| required | bool | true | Page must match this group to score well |
| weight | f64 | 1.0 | Group-level multiplier |
| terms | [KeywordTerm] | — | Terms belonging to this group |

Group example

Two required groups ensure pages must match both topic and program type:

[[score.groups]]
name = "philosophy"
required = true
weight = 1.0
terms = [
    { text = "process philosophy", weight = 3.0 },
    { text = "Whitehead", weight = 2.5 },
    { text = "continental philosophy", weight = 2.0 },
]

[[score.groups]]
name = "program"
required = true
weight = 2.0
terms = [
    { text = "master programme", weight = 3.0 },
    { text = "ECTS", weight = 2.0 },
    { text = "postgraduate", weight = 1.5 },
]

A Wikipedia article on Whitehead (philosophy only, no program terms) scores near zero. An MA in Computer Science (program only) scores near zero. An MA in Process Philosophy (both groups) scores high.
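This multiplicative behaviour follows from the grouped-density formula in the technical reference (a weighted geometric mean over required groups, plus a small bonus from optional groups). A sketch, with illustrative function names:

```rust
/// Grouped keyword score: weighted geometric mean of required group
/// densities, plus 0.1 x the optional-group sum, capped at 1.0.
/// Any required group with zero density zeroes the whole score --
/// which is why a philosophy-only or program-only page scores ~0.
fn grouped_score(required: &[(f64, f64)], optional_sum: f64) -> f64 {
    // required: (group_weight, group_density) pairs
    if required.iter().any(|&(_, d)| d == 0.0) {
        return 0.0;
    }
    let wsum: f64 = required.iter().map(|&(w, _)| w).sum();
    let log_mean = required.iter().map(|&(w, d)| w * d.ln()).sum::<f64>() / wsum;
    (log_mean.exp() + optional_sum * 0.1).min(1.0)
}
```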

[score.semantic]

Embedding-based similarity using a reference description and optional anti-reference.

[score.semantic]
reference = """
I am looking for a European master's programme in process philosophy...
"""
anti_reference = """
Analytic philosophy focused on formal logic...
"""
weight = { value = 0.7, mode = "range", min = 0.3, max = 0.9 }
anti_weight = { value = 0.3, mode = "range", min = 0.1, max = 0.5 }
max_text_len = 2000
reference_blend = { value = 0.1, mode = "range", min = 0.0, max = 0.3 }
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| reference | str | — | Natural language description of what you want |
| anti_reference | str? | None | Description of what you do not want |
| weight | Param<f64> | range(0.7, 0.3, 0.9) | How much semantic similarity counts in final score |
| anti_weight | Param<f64> | range(0.3, 0.1, 0.5) | Penalty weight for anti-reference similarity |
| max_text_len | usize | 2000 | Max characters of page text sent to embedder |
| reference_blend | Param<f64> | range(0.1, 0.0, 0.3) | Blend rate: how fast the reference adapts toward relevant pages (0.0 = static, 1.0 = fully replace each round) |

[score.semantic.signals]

Controls how much weight each page region gets in the semantic score.

[score.semantic.signals]
title   = { value = 0.4, mode = "auto" }
heading = { value = 0.3, mode = "auto" }
body    = { value = 0.3, mode = "auto" }
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| title | Param<f64> | auto(0.4) | Weight for <title> content |
| heading | Param<f64> | auto(0.3) | Weight for heading elements |
| body | Param<f64> | auto(0.3) | Weight for body text |

[tune] – Tune Configuration

The [tune] section controls the DQN training loop and experience replay. See Tune for a detailed explanation of the learning process.

Fields

[tune]
replay_capacity    = 50000
batch_size         = 64
learning_rate      = 0.001
gamma              = 0.99
lr_decay           = 0.995
replay_period      = 4
target_update_freq = 100
min_replay_size    = 500
per_alpha          = { value = 0.6, mode = "auto" }
per_epsilon        = { value = 0.0001, mode = "fixed" }

[tune.epsilon]
start       = 1.0
end         = 0.05
decay_steps = 500
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| replay_capacity | usize | 50000 | Maximum transitions stored in the replay buffer |
| batch_size | usize | 64 | Transitions sampled per training step |
| learning_rate | f64 | 0.001 | Adam optimiser learning rate |
| gamma | f64 | 0.99 | Discount factor for future rewards |
| lr_decay | f64 | 0.995 | Learning rate decay multiplier applied each round |
| replay_period | usize | 4 | Train every N transitions (not every round) |
| target_update_freq | usize | 100 | Rounds between target network updates |
| min_replay_size | usize | 500 | Minimum buffer size before training begins |
| per_alpha | Param<f64> | auto(0.6) | PER prioritisation exponent (0 = uniform, 1 = full) |
| per_epsilon | Param<f64> | fixed(1e-4) | Small constant added to TD error to prevent zero priority |

Replay buffer

The replay buffer stores (state, action, reward, next_state) transitions from every page fetched. Once it reaches replay_capacity, old transitions are evicted. Training does not start until the buffer has at least min_replay_size entries, ensuring the agent has enough experience for meaningful gradient updates.

Learning rate

learning_rate is the initial rate for the Adam optimiser. lr_decay multiplies it after each round, producing exponential decay. This lets the agent make large updates early (when it knows little) and fine-tune later.

Discount factor

gamma controls how much the agent values future rewards versus immediate ones. At 0.99, a reward 100 steps in the future is worth about 37% of an immediate reward. Lower values make the agent more short-sighted.
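The 37% figure is just the discount applied k times, i.e. γ^k — a one-line worked check:

```rust
/// Present value of a reward arriving `steps` rounds in the future
/// under discount factor `gamma`: reward x gamma^steps.
fn discounted(reward: f64, gamma: f64, steps: u32) -> f64 {
    reward * gamma.powi(steps as i32)
}
```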

[tune.epsilon]

The epsilon-greedy exploration schedule.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| start | f64 | 1.0 | Initial exploration rate (100% random) |
| end | f64 | 0.05 | Final exploration rate |
| decay_steps | usize | 500 | Rounds over which epsilon decays |

Epsilon decays linearly from start to end over decay_steps rounds. At start = 1.0, every action in round 1 is random. By the time decay_steps rounds have passed, only 5% of actions are random.
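The linear schedule can be sketched in a few lines (illustrative function, matching the decay described above):

```rust
/// Linear epsilon decay: interpolate from `start` to `end` over
/// `decay_steps` rounds, then hold at `end`.
fn epsilon(start: f64, end: f64, decay_steps: u64, step: u64) -> f64 {
    if step < decay_steps {
        start + (end - start) * step as f64 / decay_steps as f64
    } else {
        end
    }
}
```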

PER parameters

per_alpha controls how aggressively the replay buffer favours high-error transitions. At 0.0, all transitions are sampled uniformly. At 1.0, sampling is fully proportional to TD error. The default of 0.6 is a common sweet spot. In auto mode, the crawler adjusts this based on training stability.

per_epsilon is a small constant added to every transition’s priority so that no transition ever has exactly zero probability of being sampled. This is fixed at 1e-4 and rarely needs changing.

CLI Commands

All commands accept a package name as a positional argument. The package name resolves to data/{name}/{name}.toml. Use --db <path> globally to override the database path.

forager [--db <path>] <command> [args]

new

Create a new crawl package.

forager new <name> [--config <path>] [--reference <text>] [--from <pkg>]
  • --config copies an existing TOML as the starting config.
  • --reference generates a config from a natural language description (no --config).
  • --from clones an existing package (config + DB state).
forager new ecology-masters --reference "European masters in ecology and conservation"
forager new phil-v2 --from process-philosophy-masters

import

Import terms from a CSV file into the package database.

forager import <pkg> --terms <csv> [--replace]
  • CSV format (full): group,text,weight with header row.
  • CSV format (simple): one term per line, assigned to group “imported” with weight 1.0.
  • --replace clears existing DB terms before importing.
  • Config terms always take priority over imported terms at merge time.
forager import my-crawl --terms keywords.csv
forager import my-crawl --terms extra.csv --replace

run

Run or resume a crawl.

forager run <pkg> [--resume <run-id>] [--new] [--verbose]
  • Default behavior: resumes the latest run if one exists, otherwise starts a new run.
  • --resume resumes a specific run by its UUID.
  • --new forces a fresh run even if a previous one exists.
  • --verbose prints per-round timing breakdowns.
forager run my-crawl
forager run my-crawl --new --verbose

status

Show crawl state, learned parameters, and statistics.

forager status <pkg>
forager status my-crawl

tune

Override a learned parameter value.

forager tune <pkg> <param> <value>
  • <value> can be a number, "auto", or "fixed".
forager tune my-crawl relevance_threshold 0.15
forager tune my-crawl semantic_weight auto

log

Show crawl run history.

forager log [<pkg>]
  • Without a package name, shows all runs across all packages.
forager log my-crawl
forager log

list

List all packages in data/.

forager list

query

Run a raw GQL query against the package database.

forager query <pkg> <gql>
forager query my-crawl "MATCH (p:Page) WHERE p.score > 0.5 RETURN p.url, p.score ORDER BY p.score DESC LIMIT 10"

Crate Architecture

Forager is organized into six workspace crates. Each crate owns a distinct slice of functionality; dependencies flow downward.

forager            (binary)
  |-- forager-core   (shared types, config, params)
  |-- forager-web    (HTTP fetching, HTML parsing, link discovery)
  |-- forager-ml     (scoring, DQN agent, embeddings)
  |-- forager-frontier (URL selection tree)
  |-- forager-db     (graph database layer + C FFI)

forager

The top-level binary crate. Owns the CLI, command dispatch, pipeline orchestration, and stage execution.

  • cli – clap-based argument parsing, 8 subcommands.
  • commands – one module per command (new, import, run, status, tune, log, list, query).
  • pipeline – round-based crawl loop: select, fetch, parse, score, tune, persist.
  • stages – individual pipeline stage implementations wiring the other crates together.
  • util – config resolution, path helpers.

forager-core

Shared types used across the workspace. No heavy dependencies.

  • config – TOML deserialization structs mirroring the pipeline stages (Config, SelectSettings, FetchSettings, ParseSettings, ScoreSettings, TuneSettings).
  • param – Param<T> with Fixed, Auto, and Range modes. Supports learn(), merge_from(), and clamping.
  • param_group – ParamGroup trait: lifecycle for stage-aligned adaptive parameter groups (persist, restore, update).
  • groups – five param groups, one per configurable stage: ScoreParams, FetchParams, ParseParams, SelectParams, TuneParams.
  • import – CSV term loading and DB term merge logic.
  • url – URL normalization and domain extraction.
  • types – PageScore, FrontierEntry, DomainProfile, PageObservation.

forager-web

Everything related to HTTP and HTML.

  • fetch – async HTTP client with concurrency control, stealth headers, Cloudflare detection, per-request timeout.
  • parse – HTML parsing with timeout, text extraction by region (title, headings, body).
  • discover – link extraction from parsed HTML, URL filter application (substring, domain, regex, allow, max_depth).
  • extract – CSS selector-based field extraction for structured data.

forager-ml

Machine learning: scoring and the RL agent.

  • score – keyword matching (flat terms, grouped terms with geometric mean) and semantic similarity scoring (MiniLM-L6-v2, 384-dim).
  • dqn – Double DQN with Prioritised Experience Replay:
    • network – Q-network (11 → 30 → 15 → 1, LeakyReLU).
    • buffer – PER buffer with stratified sampling and importance weights.
    • agent – batched candidate scoring, DDQN training, target network sync, epsilon schedule.
  • embedder – sentence embedding via MiniLM on Burn/Wgpu (Vulkan).

forager-frontier

URL selection using a TRES (Tree-based Region Exploration Strategy) tree.

  • FrontierManager – maintains the tree, samples one candidate per leaf, tracks domain page counts.
  • tree – CART regression tree partitioning URLs by feature vectors:
    • node – tree nodes (leaf/internal), variance reduction split criterion.
    • experience – stored observations per leaf (feature vectors + rewards).

forager-db

Persistence layer wrapping GrafeoDB (embedded graph database).

  • Node types: CrawlRun, Page, Term, Transition, Domain, Model, ParamGroup, Frontier.
  • HNSW vector index on Page.embedding (384-dim, cosine similarity).
  • GQL query interface for raw queries via forager query.
  • Schema initialization, CRUD for each node type, stats helpers.
  • C FFI (ffi module) – builds as a shared library (libforager_db.so/.dylib/.dll) for direct database access from other languages. Exposes forager_db_open, forager_db_query (returns JSON), forager_string_free, and forager_db_close. Used by the Julia analysis scripts to query the database without going through the CLI.

Graph Database Model

Forager uses GrafeoDB, an embedded graph database. All state is stored in data/{name}/{name}.grafeo. The schema consists of eight node types and one edge type.

The database can be queried in three ways:

  • From the CLI — forager query <pkg> "<GQL>" for ad-hoc queries from the terminal.
  • From Julia/C/etc. — forager-db builds as a shared library (libforager_db.so) with a C FFI. See Crates for the API.
  • From Rust — use the Db struct directly via forager-db as a library dependency.

Node types

:CrawlRun

One node per crawl execution.

| Property | Type | Description |
|----------|------|-------------|
| uid | string | UUID identifying this run |
| config_name | string | Package name from [target].name |
| started_at | string | ISO 8601 timestamp |
| finished_at | string | ISO 8601 timestamp (set on finish) |
| status | string | "running", "finished", "failed" |
| pages_crawled | int | Total pages fetched in this run |

:Page

One node per fetched page. Linked to its CrawlRun via [:BELONGS_TO].

| Property | Type | Description |
|----------|------|-------------|
| uid | string | UUID |
| crawl_run_uid | string | Foreign key to CrawlRun |
| url | string | Normalized URL |
| status_code | int | HTTP status code |
| html | string | Raw HTML body |
| fetched_at | string | ISO 8601 timestamp |
| score | float | Relevance score (set after scoring) |
| term_hits | string | JSON array of matched term texts |
| scored_at | string | ISO 8601 timestamp |
| embedding | vector | 384-dim float vector (sentence embedding) |
| extract_{name} | string | Extracted field values (one per extract field) |

:Term

Imported or config-defined search terms, stored per package.

| Property | Type | Description |
|----------|------|-------------|
| config_name | string | Package name |
| group | string | Term group name |
| text | string | The search phrase |
| weight | float | Term weight |
| embedding | vector | 384-dim embedding of the term text |

:Transition

DQN training data. Each transition records a state, reward, and available next actions.

| Property | Type | Description |
|----------|------|-------------|
| config_name | string | Package name |
| features | string | Comma-separated float vector (11-dim) |
| reward | float | Observed reward for this transition |
| next_actions | string | JSON array of available next action features |

:Domain

Per-domain crawl statistics, saved after each run.

| Property | Type | Description |
|----------|------|-------------|
| config_name | string | Package name |
| name | string | Domain name (e.g., example.com) |
| fetches | int | Total fetch attempts |
| successes | int | Successful fetches (HTTP 2xx) |
| reward_sum | float | Cumulative reward from this domain |
| avg_fetch_ms | float | Average fetch latency |
| avg_parse_ms | float | Average parse time |
| avg_html_bytes | float | Average response body size |

:Model

Persisted DQN network weights and training state. One node per package (replaced on save).

| Property | Type | Description |
|----------|------|-------------|
| config_name | string | Package name |
| dqn_weights | string | Serialized network weights |
| epsilon | float | Current epsilon value |
| steps | int | Total training steps completed |

:ParamGroup

Persisted adaptive parameter groups. One node per (package, stage) pair.

| Property | Type | Description |
|----------|------|-------------|
| config_name | string | Package name |
| group_key | string | Stage key (score, fetch, parse, select, tune) |
| json | string | JSON-serialized param group + learner state |

:Frontier

Persisted frontier tree state. One node per package (replaced on save).

| Property | Type | Description |
|----------|------|-------------|
| config_name | string | Package name |
| tree_json | string | JSON-serialized frontier tree |

Edge types

[:BELONGS_TO]

Connects :Page to :CrawlRun. Direction: (Page)-[:BELONGS_TO]->(CrawlRun).

Indexes

  • HNSW vector index on Page.embedding: 384 dimensions, cosine similarity. Created by Db::init_schema(). Enables approximate nearest-neighbor queries over page embeddings.

Technical Reference

Complete specification of every computation, formula, and parameter in the forager pipeline.

Pipeline Overview

flowchart TB
    subgraph ROUND["Each Round"]
        direction TB
        SELECT["<b>Select</b><br/>DQN ranks frontier candidates<br/>Filter: already fetched, domain capped, unreliable"]
        FETCH["<b>Fetch</b><br/>Concurrent HTTP with stealth headers<br/>Per-request timeout + jitter from params"]
        PARSE["<b>Parse</b><br/>Parallel HTML → title, headings, body<br/>Adaptive truncation per domain"]
        SCORE["<b>Score</b><br/>Semantic + keyword blend<br/>Language + domain weight adjustments"]
        DISCOVER["<b>Discover</b><br/>Extract links, batch anchor embeddings<br/>Skip GPU for low-scoring pages"]
        PROCESS["<b>Process</b><br/>Compute 11-dim feature vectors<br/>Build children if score ≥ propagation_threshold"]
        INTEGRATE["<b>Integrate</b><br/>DB write, learner observations<br/>Domain profiles, DQN training"]
        LEARN["<b>Learn</b><br/>All param groups update<br/>Reference embedding refines"]
        PERSIST["<b>Persist</b><br/>Param groups, frontier tree<br/>DQN model, domain profiles"]

        SELECT --> FETCH --> PARSE --> SCORE --> DISCOVER --> PROCESS --> INTEGRATE --> LEARN --> PERSIST
    end

    PERSIST -.->|"next round"| SELECT

Scoring System

Total Score

flowchart LR
    subgraph SEMANTIC["Semantic Similarity"]
        REF["Reference<br/>embedding<br/>(384-dim)"]
        ANTI["Anti-reference<br/>embedding"]
        TITLE["title embedding"]
        HEAD["heading embedding"]
        BODY["body embedding"]

        REF --> |"cosine sim"| TA["title_aff"]
        REF --> |"cosine sim"| HA["heading_aff"]
        REF --> |"cosine sim"| BA["body_aff"]
        ANTI --> |"penalty"| TA & HA & BA
    end

    subgraph KEYWORD["Keyword Matching"]
        TERMS["term groups<br/>(required + optional)"]
        TERMS --> DENSITY["keyword_density"]
    end

    TA & HA & BA --> BLEND["signal blend"]
    BLEND --> AFFINITY["affinity"]
    AFFINITY & DENSITY --> TOTAL["total_score"]

Formulas

Per-signal affinity (with anti-reference penalty):

affinity(signal) = max(0, cos(reference, signal_emb) - anti_w × cos(anti_ref, signal_emb))

Multi-signal blend (learnable weights, sum to 1.0):

semantic_affinity = tw × title_aff + hw × heading_aff + bw × body_aff

Total score (semantic + keyword blend):

total = clamp(sem_w × semantic_affinity + (1 - sem_w) × keyword_density, 0, 1)

Keyword density — flat terms:

density = min(1.0, Σ(count_i × weight_i) / word_count × 100)

Keyword density — with term groups (required groups use geometric mean):

required_score = exp(Σ(group_weight × ln(group_density)) / Σ(group_weight))
total = min(1.0, required_score + optional_sum × 0.1)

If any required group has zero density → entire score = 0.

Score Adjustments (pipeline)

adjusted_score = raw_score × lang_factor × domain_factor

lang_factor = 1 - lang_penalty  (if page language ∉ accepted languages)
            = 1.0               (otherwise)

domain_factor = domain_weights[domain]  (if configured)
              = 1.0                     (otherwise)

Reference Blending (end of round)

centroid = mean(relevant_page_embeddings)
reference = (1 - blend) × reference + blend × centroid
reference = reference / ‖reference‖
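The blend step can be sketched directly from the formulas (an illustrative function over plain slices; the real implementation operates on 384-dim embedding tensors):

```rust
/// End-of-round reference blending: move the reference embedding a
/// fraction `blend` toward the centroid of relevant-page embeddings,
/// then re-normalise to unit length.
fn blend_reference(reference: &mut [f64], centroid: &[f64], blend: f64) {
    for (r, c) in reference.iter_mut().zip(centroid) {
        *r = (1.0 - blend) * *r + blend * c;
    }
    let norm = reference.iter().map(|v| v * v).sum::<f64>().sqrt();
    if norm > 0.0 {
        for r in reference.iter_mut() {
            *r /= norm;
        }
    }
}
```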

Feature Vector

11-dimensional feature vector computed per discovered link. This is the DQN’s input.

flowchart LR
    subgraph FEATURES["Feature Vector [0..10]"]
        direction TB
        F0["<b>[0]</b> parent_relevant<br/>1.0 if parent scored above threshold, else 0.0"]
        F1["<b>[1]</b> inverse_distance_to_relevant<br/>1 / (hops_since_relevant_ancestor + 1)"]
        F2["<b>[2]</b> path_relevance_ratio<br/>relevant_ancestors / total_ancestors"]
        F3["<b>[3]</b> keyword_in_url<br/>1.0 if any term appears in URL, else 0.0"]
        F4["<b>[4]</b> keyword_in_anchor<br/>1.0 if any term appears in anchor text, else 0.0"]
        F5["<b>[5]</b> anchor_relevance<br/>cosine sim of anchor embedding vs reference"]
        F6["<b>[6]</b> domain_reward<br/>avg reward for this domain (0.0 if unknown)"]
        F7["<b>[7]</b> domain_novelty<br/>1.0 if domain never seen, else 0.0"]
        F8["<b>[8]</b> ancestor_depth<br/>min(total_ancestors + 1, 20) / 20"]
        F9["<b>[9]</b> required_signal_proximity<br/>1 / (min(hops_since_required_match, 20) + 1)"]
        F10["<b>[10]</b> parent_semantic_distance<br/>cosine sim of parent body embedding vs reference"]
    end

All features are pre-scaled to approximately [0, 1]. No additional normalization.

DQN Agent

Network Architecture

flowchart LR
    IN["features<br/>[batch, 11]"] --> L1["Linear(11 → 30)"] --> A1["LeakyReLU<br/>α = 0.1"] --> L2["Linear(30 → 15)"] --> A2["LeakyReLU<br/>α = 0.1"] --> L3["Linear(15 → 1)"] --> Q["Q-value"]

LeakyReLU: f(x) = max(x, 0.1 × x)

Double DQN Training

flowchart TB
    SAMPLE["Sample batch from PER buffer<br/>(stratified by priority)"]
    SAMPLE --> ARGMAX["<b>Online net:</b> find best next action<br/>a* = argmax_a Q_online(s', a)<br/>(batched forward pass)"]
    ARGMAX --> TARGET["<b>Target net:</b> evaluate best action<br/>Q_target(s', a*)<br/>(batched forward pass)"]
    TARGET --> TD["<b>TD target:</b><br/>y = r + γ × Q_target(s', a*)"]
    TD --> LOSS["<b>Weighted MSE loss:</b><br/>L = mean(w_i × (Q_online(s) - y)²)<br/>w_i = importance sampling weights"]
    LOSS --> BACKWARD["Backward + Adam optimizer step"]
    BACKWARD --> PRIORITY["Update priorities:<br/>priority_i = |td_error_i| + ε"]

TD target formula:

y_i = r_i + γ × Q_target(s', argmax_a Q_online(s', a))
y_i = r_i                                                  (if no next actions)
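A sketch of the target computation, with Q-values precomputed in place of the batched network forward passes the agent actually runs:

```rust
/// Double-DQN TD target: the online net picks the best next action
/// (argmax over its Q-values), the target net evaluates that action.
/// With no next actions (terminal), the target is just the reward.
fn td_target(reward: f64, gamma: f64, q_online_next: &[f64], q_target_next: &[f64]) -> f64 {
    let Some((best, _)) = q_online_next
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
    else {
        return reward;
    };
    reward + gamma * q_target_next[best]
}
```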

Training schedule:

train every:      replay_period steps (default: 4)
min buffer size:  min_replay_size (default: 500)
batch size:       batch_size (default: 64)
target sync:      every target_update_freq steps (default: 100)
LR decay:         current_lr *= lr_decay at each target sync

Epsilon schedule:

if steps < decay_steps:
    ε = ε_start + (ε_end - ε_start) × steps / decay_steps
else:
    ε = ε_end

Prioritised Experience Replay

Sampling probability:

P(i) = priority_i^α / Σ priority_j^α

Stratified sampling: divide the cumulative distribution into batch_size equal segments, sample one from each.

Importance sampling weights (bias correction):

w_i = (N × P(i))^(-β) / max_j(w_j)

β = min(1.0, 0.4 + 0.6 × steps / decay_steps)

β anneals from 0.4 → 1.0 over training. At β=1.0, full bias correction.
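The probability and weight formulas can be sketched as two small functions (illustrative names; the real buffer also does the stratified segment sampling described above):

```rust
/// PER sampling probabilities: P(i) = p_i^alpha / sum_j p_j^alpha.
fn per_probs(priorities: &[f64], alpha: f64) -> Vec<f64> {
    let powered: Vec<f64> = priorities.iter().map(|p| p.powf(alpha)).collect();
    let total: f64 = powered.iter().sum();
    powered.into_iter().map(|p| p / total).collect()
}

/// Importance-sampling weights w_i = (N * P(i))^(-beta), scaled so the
/// largest weight is 1.0 (the max_j(w_j) division in the formula).
fn is_weights(probs: &[f64], beta: f64) -> Vec<f64> {
    let n = probs.len() as f64;
    let raw: Vec<f64> = probs.iter().map(|p| (n * p).powf(-beta)).collect();
    let max = raw.iter().cloned().fold(f64::MIN, f64::max);
    raw.into_iter().map(|w| w / max).collect()
}
```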

TRES Tree Frontier

flowchart TB
    ROOT["Root leaf<br/>(all URLs)"]
    ROOT -->|"split on feature[k] at threshold t"| LEFT["Left leaf<br/>feature[k] ≤ t"]
    ROOT --> RIGHT["Right leaf<br/>feature[k] > t"]
    LEFT -->|"further split"| LL["..."] & LR["..."]

    SAMPLE["<b>Candidate selection:</b><br/>1 random URL per leaf<br/>O(leaves) instead of O(frontier)"]

Split Criterion (CART regression on reward)

variance(values) = Σ(v - mean)² / n

variance_reduction = var(parent) - (n_L/n) × var(left) - (n_R/n) × var(right)

Split on the (feature, threshold) pair that maximizes variance reduction.

Constraints:

min child size = max(min_samples_per_split, ⌊0.15 × n_parent⌋)
max tree depth = max_depth (from config)
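The CART criterion above is just population variance plus a weighted difference; a direct transcription:

```rust
/// Population variance: Σ(v − mean)² / n.
fn variance(v: &[f32]) -> f32 {
    if v.is_empty() {
        return 0.0;
    }
    let mean = v.iter().sum::<f32>() / v.len() as f32;
    v.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / v.len() as f32
}

/// var(parent) − (n_L/n) × var(left) − (n_R/n) × var(right).
fn variance_reduction(parent: &[f32], left: &[f32], right: &[f32]) -> f32 {
    let n = parent.len() as f32;
    variance(parent)
        - (left.len() as f32 / n) * variance(left)
        - (right.len() as f32 / n) * variance(right)
}

fn main() {
    // A perfect split on reward removes all variance from both children.
    let parent = [0.0f32, 0.0, 1.0, 1.0];
    println!("reduction = {}", variance_reduction(&parent, &parent[..2], &parent[2..]));
}
```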

Domain Capping

domain_fetch_count[d] += 1 per fetch
capped = domain_fetch_count[d] ≥ domain_max_pages

Capped domains are excluded from candidate selection.
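The cap is a plain per-domain counter; a minimal sketch (the struct and field names here are illustrative, not the real ones):

```rust
use std::collections::HashMap;

/// Illustrative per-domain fetch counter with a learned cap.
struct DomainCaps {
    counts: HashMap<String, u32>,
    max_pages: u32, // the learned domain_max_pages value
}

impl DomainCaps {
    fn record_fetch(&mut self, domain: &str) {
        *self.counts.entry(domain.to_string()).or_insert(0) += 1;
    }

    /// capped = domain_fetch_count[d] ≥ domain_max_pages
    fn is_capped(&self, domain: &str) -> bool {
        self.counts.get(domain).copied().unwrap_or(0) >= self.max_pages
    }
}

fn main() {
    let mut caps = DomainCaps { counts: HashMap::new(), max_pages: 100 };
    caps.record_fetch("example.org");
    println!("capped: {}", caps.is_capped("example.org"));
}
```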

Adaptive Parameter Learning

Every learned parameter requires a minimum number of observations before it updates; each group runs update() once per round.

[score] Parameters

flowchart LR
    subgraph SCORE_LEARN["Score Learning (per round)"]
        OBS["Page observations<br/>(reservoir sampled, max 2000)"]
        OBS --> RT["<b>relevance_threshold</b><br/>P25 of nonzero scores<br/>(needs ≥ 20 obs, ≥ 10 nonzero)"]
        OBS --> SW["<b>signal weights</b><br/>proportional to which signal<br/>best predicts relevance<br/>(needs ≥ 10 signal hits)"]
        OBS --> SEM["<b>semantic_weight</b><br/>avg_semantic / (avg_semantic + avg_keyword)<br/>among relevant pages<br/>(needs ≥ 5 relevant)"]
    end

Signal weight normalization:

tw = title_hits / total_signal_hits
hw = heading_hits / total_signal_hits
bw = body_hits / total_signal_hits
normalize: tw, hw, bw = tw/sum, hw/sum, bw/sum
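When total_signal_hits is simply the sum of the three counts, the divide-then-renormalise above collapses into a single normalisation; a sketch under that assumption:

```rust
/// Normalise the three signal-hit counts into weights summing to 1.
/// Returns None when there are no hits at all (the real rule also
/// requires ≥ 10 signal hits before updating).
fn signal_weights(title: u32, heading: u32, body: u32) -> Option<(f32, f32, f32)> {
    let total = (title + heading + body) as f32;
    if total == 0.0 {
        return None;
    }
    Some((title as f32 / total, heading as f32 / total, body as f32 / total))
}

fn main() {
    println!("{:?}", signal_weights(8, 3, 5));
}
```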

[fetch] Parameters

| Parameter | Formula | Conditions |
|---|---|---|
| timeout_ms | clamp(P95(durations) × 1.5, 1000, 30000) | ≥ 20 fetches, ≥ 10 durations |
| connect_timeout_ms | clamp(P50(durations) × 0.75, 500, 10000) | ≥ 20 fetches, ≥ 10 durations |
| jitter_max_ms | current × 1.5 if >5% rate-limited, × 1.2 if >1%, × 0.9 if safe | ≥ 20 fetches |
| max_links_per_page | clamp(avg_links × 2, 50, 500) | ≥ 10 pages, >10% change |
| embed_threshold_factor | linear map: kw_ratio 0.05→0.3, 0.30→0.7 | ≥ 10 pages |
| propagation_threshold | 0.0 if >30% yield, 0.02 if >10%, 0.05 otherwise | ≥ 10 low-score pages |
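Several of these rules share a percentile-plus-clamp shape, shown here for timeout_ms. Nearest-rank percentile is an assumption; the real code's interpolation may differ.

```rust
/// Nearest-rank percentile over a sorted slice.
fn percentile(sorted: &[f32], p: f32) -> f32 {
    let idx = ((p / 100.0) * (sorted.len() - 1) as f32).round() as usize;
    sorted[idx]
}

/// timeout_ms rule: clamp(P95(durations) × 1.5, 1000, 30000),
/// applied only once ≥ 10 durations have been observed.
fn learned_timeout_ms(mut durations_ms: Vec<f32>) -> Option<f32> {
    if durations_ms.len() < 10 {
        return None;
    }
    durations_ms.sort_by(|a, b| a.partial_cmp(b).unwrap());
    Some((percentile(&durations_ms, 95.0) * 1.5).clamp(1000.0, 30000.0))
}

fn main() {
    let durations: Vec<f32> = (1..=20).map(|i| i as f32 * 100.0).collect();
    println!("timeout_ms = {:?}", learned_timeout_ms(durations));
}
```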

[select] Parameters

| Parameter | Formula | Conditions |
|---|---|---|
| domain_max_pages | median exhaustion point: rolling-5 avg drops below 50% of mean | ≥ 3 domains with ≥ 5 pages |
| slow_domain_ms | clamp(P90(latencies), 500, 15000) | ≥ 10 observations, ≥ 10 latencies |
| unreliable_threshold | clamp(P10(success_rates), 0.1, 0.6) | ≥ 10 observations, ≥ 5 rates |
| html_limit_factor | not yet learned (config knob) | — |

[tune] Parameters

| Parameter | Formula | Conditions |
|---|---|---|
| per_alpha | clamp(0.3 + (CV - 0.5) / 1.5 × 0.5, 0.3, 0.8) where CV = σ/μ of TD errors | ≥ 20 steps, ≥ 10 errors |
| per_epsilon | not learned (fixed priority floor) | — |
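The per_alpha rule as a sketch, assuming CV is taken over the magnitudes of recent TD errors:

```rust
/// per_alpha = clamp(0.3 + (CV − 0.5) / 1.5 × 0.5, 0.3, 0.8),
/// where CV = σ/μ of TD-error magnitudes (assumed absolute values).
fn per_alpha(td_errors: &[f32]) -> Option<f32> {
    if td_errors.len() < 10 {
        return None; // condition: ≥ 10 errors observed
    }
    let abs: Vec<f32> = td_errors.iter().map(|e| e.abs()).collect();
    let mean = abs.iter().sum::<f32>() / abs.len() as f32;
    if mean == 0.0 {
        return None; // CV undefined
    }
    let var = abs.iter().map(|e| (e - mean).powi(2)).sum::<f32>() / abs.len() as f32;
    let cv = var.sqrt() / mean;
    Some((0.3 + (cv - 0.5) / 1.5 * 0.5).clamp(0.3, 0.8))
}

fn main() {
    let errors = [0.1f32, 0.9, 0.2, 1.5, 0.05, 0.7, 0.3, 2.0, 0.4, 0.6];
    println!("per_alpha = {:?}", per_alpha(&errors));
}
```

Low spread in TD errors (CV near 0) clamps to 0.3, keeping sampling close to uniform; highly uneven errors push α toward 0.8 and sharpen prioritisation.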

Domain Profiling

flowchart LR
    FETCH_EVENT["fetch event"] --> PROFILE["DomainProfile"]
    PROFILE --> AVG_F["avg_fetch_ms<br/>running mean"]
    PROFILE --> AVG_P["avg_parse_ms<br/>running mean"]
    PROFILE --> AVG_H["avg_html_bytes<br/>running mean"]
    PROFILE --> SR["success_rate<br/>successes / fetches"]
    PROFILE --> AR["avg_reward<br/>reward_sum / fetches"]

    AVG_F & AVG_P --> SLOW{"is_slow?<br/>fetch+parse > slow_domain_ms<br/>(after ≥ 3 fetches)"}
    SR --> UNRELIABLE{"is_unreliable?<br/>success_rate < threshold<br/>(after ≥ 5 fetches)"}
    AVG_H --> HTML{"html_limit<br/>avg_bytes × factor<br/>clamped [50KB, 2MB]<br/>(default 512KB until ≥ 3 fetches)"}
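The profile above reduces to running means plus two derived flags; a sketch whose field names and thresholds mirror the diagram (the real struct may differ):

```rust
/// Per-domain running statistics.
#[derive(Default)]
struct DomainProfile {
    fetches: u32,
    successes: u32,
    avg_fetch_ms: f32,
    avg_parse_ms: f32,
}

impl DomainProfile {
    fn observe(&mut self, fetch_ms: f32, parse_ms: f32, success: bool) {
        self.fetches += 1;
        let n = self.fetches as f32;
        // incremental running mean: m += (x − m) / n
        self.avg_fetch_ms += (fetch_ms - self.avg_fetch_ms) / n;
        self.avg_parse_ms += (parse_ms - self.avg_parse_ms) / n;
        if success {
            self.successes += 1;
        }
    }

    /// fetch + parse above slow_domain_ms, after ≥ 3 fetches.
    fn is_slow(&self, slow_domain_ms: f32) -> bool {
        self.fetches >= 3 && self.avg_fetch_ms + self.avg_parse_ms > slow_domain_ms
    }

    /// success rate below threshold, after ≥ 5 fetches.
    fn is_unreliable(&self, threshold: f32) -> bool {
        self.fetches >= 5 && (self.successes as f32 / self.fetches as f32) < threshold
    }
}

fn main() {
    let mut p = DomainProfile::default();
    p.observe(1200.0, 150.0, true);
    p.observe(900.0, 100.0, false);
    p.observe(2800.0, 300.0, true);
    println!("slow: {}, unreliable: {}", p.is_slow(1500.0), p.is_unreliable(0.5));
}
```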

Data Flow

flowchart TB
    CONFIG["TOML Config<br/>user intent + defaults"] --> INIT["init()"]
    DB_RESTORE["DB: ParamGroup nodes<br/>DQN model, frontier tree"] --> INIT

    INIT --> STATE["PipelineState"]

    STATE --> |"each round"| PIPELINE["Pipeline Loop"]

    PIPELINE --> |"observations"| SCORE_G["ScoreParams.observe()"]
    PIPELINE --> |"observations"| FETCH_G["FetchParams.observe()"]
    PIPELINE --> |"observations"| FRONT_G["SelectParams.observe()"]
    PIPELINE --> |"TD errors"| DQN_G["TuneParams.observe()"]

    SCORE_G & FETCH_G & FRONT_G & DQN_G --> |"group.update()"| LEARN_STEP["End-of-round learning"]

    LEARN_STEP --> |"group.to_json()"| DB_SAVE["DB: save_param_group()"]

    DB_SAVE --> |"next run"| DB_RESTORE

All 22 Adaptive Parameters

| Section | Parameter | Default | Mode | What it controls |
|---|---|---|---|---|
| select | domain_max_pages | 100 | auto | Per-domain page cap |
| | slow_domain_ms | 3000 | auto | Latency threshold for “slow” |
| | unreliable_threshold | 0.3 | auto | Success rate floor for “unreliable” |
| | html_limit_factor | 2.0 | auto | Multiplier on avg HTML for truncation |
| fetch | timeout_ms | 8000 | auto | HTTP request timeout |
| | connect_timeout_ms | 3000 | auto | TCP connect timeout |
| | max_redirects | 3 | fixed | Redirect limit |
| | pool_idle_per_host | 8 | fixed | Connection pool size |
| | jitter_max_ms | 50 | auto | Random delay before requests |
| parse | max_links_per_page | 200 | auto | Link extraction cap |
| | anchor_batch_size | 512 | fixed | GPU batch cap for anchor embeddings |
| | embed_threshold_factor | 0.5 | auto | Min page score for anchor GPU work |
| | propagation_threshold | 0.0 | auto | Min score to propagate children |
| score | relevance_threshold | 0.1 | auto | Score cutoff for “relevant” |
| | semantic_weight | 0.7 | range [0.3, 0.9] | Semantic vs keyword blend |
| | anti_weight | 0.3 | range [0.1, 0.5] | Anti-reference penalty strength |
| | title_weight | 0.4 | auto | Title signal importance |
| | heading_weight | 0.3 | auto | Heading signal importance |
| | body_weight | 0.3 | auto | Body signal importance |
| | reference_blend | 0.1 | range [0.0, 0.3] | Reference adaptation speed |
| | lang_penalty | 0.0 | auto | Score multiplier for wrong language |
| tune | per_alpha | 0.6 | auto | PER prioritisation exponent |
| | per_epsilon | 1e-4 | fixed | PER priority floor |