Crate Architecture

Forager is organized into six workspace crates: one binary and five libraries. Each crate owns a distinct slice of functionality, and dependencies flow downward from the binary to the library crates.

forager                (binary)
  |-- forager-core     (shared types, config, params)
  |-- forager-web      (HTTP fetching, HTML parsing, link discovery)
  |-- forager-ml       (scoring, DQN agent, embeddings)
  |-- forager-frontier (URL selection tree)
  |-- forager-db       (graph database layer + C FFI)

forager

The top-level binary crate. Owns the CLI, command dispatch, pipeline orchestration, and stage execution.

cli – clap-based argument parsing, 8 subcommands.
commands – one module per command (new, import, run, status, tune, log, list, query).
pipeline – round-based crawl loop: select, fetch, parse, score, tune, persist.
stages – individual pipeline stage implementations wiring the other crates together.
util – config resolution, path helpers.
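The round-based loop described under pipeline can be pictured roughly as follows. This is a minimal sketch, not the crate's actual code: the `Stage` enum, the `ROUND` constant, and `run_rounds` are hypothetical names, and the only detail taken from the text is the stage order (select, fetch, parse, score, tune, persist). Stopping the crawl on the first stage error is an assumption.

```rust
/// The six stages named in the text, in the order a round runs them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Select,
    Fetch,
    Parse,
    Score,
    Tune,
    Persist,
}

/// Stage order for a single round (hypothetical constant).
pub const ROUND: [Stage; 6] = [
    Stage::Select,
    Stage::Fetch,
    Stage::Parse,
    Stage::Score,
    Stage::Tune,
    Stage::Persist,
];

/// Runs `rounds` rounds, calling `run_stage` once per stage in order.
/// Assumption: the first stage error aborts the whole crawl.
pub fn run_rounds(
    rounds: usize,
    mut run_stage: impl FnMut(usize, Stage) -> Result<(), String>,
) -> Result<(), String> {
    for round in 0..rounds {
        for stage in ROUND {
            run_stage(round, stage)?;
        }
    }
    Ok(())
}
```

Passing the stage body in as a closure keeps the loop independent of the other crates, which matches the split between pipeline (ordering) and stages (implementations) described above.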

forager-core

Shared types used across the workspace. No heavy dependencies.

config – TOML deserialization structs mirroring the pipeline stages (Config, SelectSettings, FetchSettings, ParseSettings, ScoreSettings, TuneSettings).
param – Param<T> with Fixed, Auto, and Range modes. Supports learn(), merge_from(), and clamping.
param_group – ParamGroup trait: lifecycle for stage-aligned adaptive parameter groups (persist, restore, update).
groups – five param groups, one per configurable stage: ScoreParams, FetchParams, ParseParams, SelectParams, TuneParams.
import – CSV term loading and DB term merge logic.
url – URL normalization and domain extraction.
types – PageScore, FrontierEntry, DomainProfile, PageObservation.
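To make the three Param<T> modes concrete, here is a minimal sketch specialised to f64. It is an illustration under stated assumptions, not the forager-core definition: `ParamSketch` is a hypothetical name, and the meaning given to each mode (Fixed never moves, Auto moves freely, Range moves but is clamped to its bounds) is inferred from the mode names. The `learn` signature and its update rule are also assumptions, and `merge_from()` is left out because the text does not describe its behavior.

```rust
/// Hypothetical stand-in for `Param<T>`, specialised to f64.
#[derive(Debug, Clone, PartialEq)]
pub enum ParamSketch {
    /// Pinned value; learning never changes it (assumed semantics).
    Fixed(f64),
    /// Free to move wherever learning pushes it (assumed semantics).
    Auto(f64),
    /// Learnable, but always kept inside [min, max] (assumed semantics).
    Range { value: f64, min: f64, max: f64 },
}

impl ParamSketch {
    /// Current value regardless of mode.
    pub fn value(&self) -> f64 {
        match self {
            ParamSketch::Fixed(v) | ParamSketch::Auto(v) => *v,
            ParamSketch::Range { value, .. } => *value,
        }
    }

    /// Moves the value a fraction `rate` of the way toward `target`.
    /// The update rule is an assumption made for this sketch.
    pub fn learn(&mut self, target: f64, rate: f64) {
        match self {
            ParamSketch::Fixed(_) => {}
            ParamSketch::Auto(v) => *v += rate * (target - *v),
            ParamSketch::Range { value, min, max } => {
                let moved = *value + rate * (target - *value);
                // Clamping keeps a learned value inside the configured bounds.
                *value = moved.clamp(*min, *max);
            }
        }
    }
}
```

Encoding the mode in the enum means a stage can call `learn` on every parameter uniformly and let the parameter decide whether it is allowed to move.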

forager-web

Everything related to HTTP and HTML.

fetch – async HTTP client with concurrency control, stealth headers, Cloudflare detection, per-request timeout.
parse – HTML parsing with timeout, text extraction by region (title, headings, body).
discover – link extraction from parsed HTML, URL filter application (substring, domain, regex, allow, max_depth).
extract – CSS selector-based field extraction for structured data.

forager-ml

Machine learning: scoring and the RL agent.

score – keyword matching (flat terms, grouped terms with geometric mean) and semantic similarity scoring (MiniLM-L6-v2, 384-dim).
dqn – Double DQN with Prioritised Experience Replay:
- network – Q-network (11 → 30 → 15 → 1, LeakyReLU).
- buffer – PER buffer with stratified sampling and importance weights.
- agent – batched candidate scoring, DDQN training, target network sync, epsilon schedule.
embedder – sentence embedding via MiniLM on Burn/Wgpu (Vulkan).
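The Q-network shape listed under dqn (11 → 30 → 15 → 1, LeakyReLU) can be illustrated with a plain-Rust forward pass. This is a sketch only: it uses nested `Vec`s rather than whatever tensor library the crate uses, and `Dense`, `build_network`, and `q_value` are hypothetical names. The 0.01 negative slope and the choice to leave the single output unactivated are assumptions.

```rust
/// Layer widths from the text: 11 input features, two hidden layers, one Q-value.
pub const LAYER_SIZES: [usize; 4] = [11, 30, 15, 1];

/// LeakyReLU; the 0.01 negative slope is an assumed default.
pub fn leaky_relu(x: f32) -> f32 {
    if x > 0.0 { x } else { 0.01 * x }
}

/// One dense layer: `weights[o][i]` connects input `i` to output `o`.
pub struct Dense {
    pub weights: Vec<Vec<f32>>,
    pub bias: Vec<f32>,
}

impl Dense {
    /// Layer with every weight set to `w` and zero bias (for demonstration only).
    pub fn constant(inputs: usize, outputs: usize, w: f32) -> Self {
        Dense {
            weights: vec![vec![w; inputs]; outputs],
            bias: vec![0.0; outputs],
        }
    }

    /// Computes `weights * input + bias`.
    pub fn forward(&self, input: &[f32]) -> Vec<f32> {
        self.weights
            .iter()
            .zip(&self.bias)
            .map(|(row, b)| row.iter().zip(input).map(|(w, x)| w * x).sum::<f32>() + b)
            .collect()
    }
}

/// Builds the three dense layers implied by `LAYER_SIZES`.
pub fn build_network(w: f32) -> Vec<Dense> {
    LAYER_SIZES
        .windows(2)
        .map(|pair| Dense::constant(pair[0], pair[1], w))
        .collect()
}

/// Forward pass: LeakyReLU after each hidden layer, raw output as the Q-value.
pub fn q_value(layers: &[Dense], features: &[f32]) -> f32 {
    let mut activations = features.to_vec();
    for (i, layer) in layers.iter().enumerate() {
        activations = layer.forward(&activations);
        if i + 1 < layers.len() {
            for a in activations.iter_mut() {
                *a = leaky_relu(*a);
            }
        }
    }
    activations[0]
}
```

With biases included, this shape has 841 trainable parameters (795 weights and 46 biases), so scoring a batch of candidates is cheap compared with computing the embeddings.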

forager-frontier

URL selection using a TRES (Tree-based Region Exploration Strategy) tree.

FrontierManager – maintains the tree, samples one candidate per leaf, tracks domain page counts.
tree – CART regression tree partitioning URLs by feature vectors:
- node – tree nodes (leaf/internal), variance reduction split criterion.
- experience – stored observations per leaf (feature vectors + rewards).
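The variance reduction criterion named under node can be sketched for a single feature as below. This is the textbook CART regression split, written as an assumption about what the crate does rather than a copy of it: `best_split` and `sum_sq_dev` are hypothetical names, and the reduction is measured as the drop in the sum of squared deviations of the rewards, with thresholds placed midway between adjacent feature values.

```rust
/// Sum of squared deviations from the mean (variance times sample count).
fn sum_sq_dev(values: &[f64]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    let mean = values.iter().sum::<f64>() / values.len() as f64;
    values.iter().map(|v| (v - mean).powi(2)).sum()
}

/// Finds the threshold on one feature that maximizes variance reduction.
/// Each sample is `(feature_value, reward)`.
/// Returns `(threshold, reduction)`, or `None` if no split separates the samples.
pub fn best_split(samples: &[(f64, f64)]) -> Option<(f64, f64)> {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));
    let rewards: Vec<f64> = sorted.iter().map(|s| s.1).collect();
    let parent = sum_sq_dev(&rewards);

    let mut best: Option<(f64, f64)> = None;
    for i in 1..sorted.len() {
        // Equal feature values cannot be separated by a threshold.
        if sorted[i].0 == sorted[i - 1].0 {
            continue;
        }
        let reduction = parent - sum_sq_dev(&rewards[..i]) - sum_sq_dev(&rewards[i..]);
        if best.map_or(true, |(_, r)| reduction > r) {
            let threshold = (sorted[i - 1].0 + sorted[i].0) / 2.0;
            best = Some((threshold, reduction));
        }
    }
    best
}
```

A full tree would repeat this search over every feature in the vector and split a leaf on the feature and threshold with the largest reduction.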

forager-db

Persistence layer wrapping GrafeoDB (embedded graph database).

Node types: CrawlRun, Page, Term, Transition, Domain, Model, ParamGroup, Frontier.
HNSW vector index on Page.embedding (384-dim, cosine similarity).
GQL query interface for raw queries via forager query.
Schema initialization, CRUD for each node type, stats helpers.
C FFI (ffi module) – builds as a shared library (libforager_db.so/.dylib/.dll) for direct database access from other languages. Exposes forager_db_open, forager_db_query (returns JSON), forager_string_free, and forager_db_close. Used by the Julia analysis scripts to query the database without going through the CLI.
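The presence of forager_string_free next to forager_db_query suggests the usual ownership hand-off for strings returned over a C ABI: the library allocates the JSON string, the caller copies it, and the caller hands the pointer back to be freed. The sketch below shows that pattern in isolation. `demo_query` and `demo_string_free` are hypothetical stand-ins, not the real exports; the actual signatures and the JSON shape are not given in this document, so the placeholder payload here is an assumption.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical stand-in for a query export: returns a heap-allocated,
/// NUL-terminated JSON string whose ownership passes to the caller.
pub extern "C" fn demo_query() -> *mut c_char {
    // Placeholder payload; the real result format is not specified here.
    let json = String::from(r#"{"rows":[]}"#);
    match CString::new(json) {
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical stand-in for a string-free export.
///
/// # Safety
/// `ptr` must be null or a pointer returned by `demo_query` that has not
/// already been freed.
pub unsafe extern "C" fn demo_string_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        // Rebuild the CString so Rust's allocator releases the buffer.
        unsafe { drop(CString::from_raw(ptr)) };
    }
}

/// Caller side: copy the result out, then return the buffer to the library.
pub fn query_to_string() -> Option<String> {
    let ptr = demo_query();
    if ptr.is_null() {
        return None;
    }
    let copied = unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned();
    unsafe { demo_string_free(ptr) };
    Some(copied)
}
```

Freeing through the library matters because the string was allocated by Rust's allocator; a caller in another language that released it with its own allocator would risk heap corruption.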
