Crate Architecture

Forager is organized into six workspace crates. Each crate owns a distinct slice of functionality; dependencies flow downward.

forager            (binary)
  |-- forager-core   (shared types, config, params)
  |-- forager-web    (HTTP fetching, HTML parsing, link discovery)
  |-- forager-ml     (scoring, DQN agent, embeddings)
  |-- forager-frontier (URL selection tree)
  |-- forager-db     (graph database layer + C FFI)

forager

The top-level binary crate. Owns the CLI, command dispatch, pipeline orchestration, and stage execution.

  • cli – clap-based argument parsing, 8 subcommands.
  • commands – one module per command (new, import, run, status, tune, log, list, query).
  • pipeline – round-based crawl loop: select, fetch, parse, score, tune, persist.
  • stages – individual pipeline stage implementations wiring the other crates together.
  • util – config resolution, path helpers.
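
The round-based loop described under pipeline can be pictured roughly as follows. This is a minimal sketch, not Forager's actual code: the `Stage` enum, `ROUND_ORDER`, and `run_rounds` are hypothetical names, and the only thing taken from the text above is the order of the six steps.

```rust
/// The six steps of one crawl round, in the order listed above.
/// Hypothetical enum; the real crate may model stages differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Select,
    Fetch,
    Parse,
    Score,
    Tune,
    Persist,
}

/// Fixed execution order within a single round.
pub const ROUND_ORDER: [Stage; 6] = [
    Stage::Select,
    Stage::Fetch,
    Stage::Parse,
    Stage::Score,
    Stage::Tune,
    Stage::Persist,
];

/// Runs `rounds` rounds, calling `execute` once per stage in order.
/// Stops at the first stage that reports an error and returns it.
pub fn run_rounds<F>(rounds: usize, mut execute: F) -> Result<(), String>
where
    F: FnMut(usize, Stage) -> Result<(), String>,
{
    for round in 0..rounds {
        for stage in ROUND_ORDER.iter().copied() {
            execute(round, stage)?;
        }
    }
    Ok(())
}
```

Passing the stage executor as a closure keeps the loop itself free of any dependency on the other crates, which is one plausible way for stages to do the wiring.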

forager-core

Shared types used across the workspace. No heavy dependencies.

  • config – TOML deserialization structs mirroring the pipeline stages (Config, SelectSettings, FetchSettings, ParseSettings, ScoreSettings, TuneSettings).
  • param – Param<T> with Fixed, Auto, and Range modes. Supports learn(), merge_from(), and clamping.
  • param_group – ParamGroup trait: lifecycle for stage-aligned adaptive parameter groups (persist, restore, update).
  • groups – five param groups, one per configurable stage: ScoreParams, FetchParams, ParseParams, SelectParams, TuneParams.
  • import – CSV term loading and DB term merge logic.
  • url – URL normalization and domain extraction.
  • types – PageScore, FrontierEntry, DomainProfile, PageObservation.
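
To make the three parameter modes concrete, here is a minimal sketch specialised to `f64`. The type name `ParamSketch`, the `value()` accessor, and the update rule inside `learn()` (a step toward a target at a given rate) are assumptions for illustration; the source only states that the real `Param<T>` has Fixed, Auto, and Range modes and supports learning and clamping.

```rust
/// Hypothetical, non-generic stand-in for a tunable parameter.
#[derive(Debug, Clone, PartialEq)]
pub enum ParamSketch {
    /// Never changes, regardless of what is learned.
    Fixed(f64),
    /// Free to move to any learned value.
    Auto(f64),
    /// Free to move, but always clamped to `[min, max]`.
    Range { value: f64, min: f64, max: f64 },
}

impl ParamSketch {
    /// Current value, independent of mode.
    pub fn value(&self) -> f64 {
        match self {
            ParamSketch::Fixed(v) | ParamSketch::Auto(v) => *v,
            ParamSketch::Range { value, .. } => *value,
        }
    }

    /// Assumed semantics: move the value toward `target` by fraction `rate`.
    /// Fixed parameters ignore the update; Range parameters are clamped.
    pub fn learn(&mut self, target: f64, rate: f64) {
        match self {
            ParamSketch::Fixed(_) => {}
            ParamSketch::Auto(v) => *v += rate * (target - *v),
            ParamSketch::Range { value, min, max } => {
                *value = (*value + rate * (target - *value)).clamp(*min, *max);
            }
        }
    }
}
```

Modelling the mode as an enum means a fixed setting and an adaptive one can share the same config field while the update code stays in one place.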

forager-web

Everything related to HTTP and HTML.

  • fetch – async HTTP client with concurrency control, stealth headers, Cloudflare detection, per-request timeout.
  • parse – HTML parsing with timeout, text extraction by region (title, headings, body).
  • discover – link extraction from parsed HTML, URL filter application (substring, domain, regex, allow, max_depth).
  • extract – CSS selector-based field extraction for structured data.

forager-ml

Machine learning: scoring and the RL agent.

  • score – keyword matching (flat terms, grouped terms with geometric mean) and semantic similarity scoring (MiniLM-L6-v2, 384-dim).
  • dqn – Double DQN with Prioritised Experience Replay:
    • network – Q-network (11 → 30 → 15 → 1, LeakyReLU).
    • buffer – PER buffer with stratified sampling and importance weights.
    • agent – batched candidate scoring, DDQN training, target network sync, epsilon schedule.
  • embedder – sentence embedding via MiniLM on Burn/Wgpu (Vulkan).
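
For a sense of scale, the 11 → 30 → 15 → 1 Q-network is small enough to write out as a plain forward pass. The sketch below uses only the standard library, whereas the embedder above is described as running on Burn. `Dense`, `QNetwork`, and `with_constant_weights` are hypothetical names; the LeakyReLU slope of 0.01, the linear output layer, and the constant weights are assumptions made so the example is deterministic.

```rust
/// One fully connected layer: `weights[o][i]` maps input `i` to output `o`.
pub struct Dense {
    pub weights: Vec<Vec<f32>>,
    pub biases: Vec<f32>,
}

impl Dense {
    /// Every weight set to `w`, every bias to `b` (placeholder initialisation).
    pub fn constant(inputs: usize, outputs: usize, w: f32, b: f32) -> Self {
        Dense {
            weights: vec![vec![w; inputs]; outputs],
            biases: vec![b; outputs],
        }
    }

    /// Computes `W * x + b`.
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        self.weights
            .iter()
            .zip(&self.biases)
            .map(|(row, b)| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + b)
            .collect()
    }
}

/// LeakyReLU: identity for non-negative inputs, `slope * v` otherwise.
pub fn leaky_relu(v: f32, slope: f32) -> f32 {
    if v >= 0.0 {
        v
    } else {
        slope * v
    }
}

/// Three dense layers sized 11 -> 30 -> 15 -> 1.
pub struct QNetwork {
    layers: Vec<Dense>,
    slope: f32,
}

impl QNetwork {
    pub fn with_constant_weights(w: f32) -> Self {
        QNetwork {
            layers: vec![
                Dense::constant(11, 30, w, 0.0),
                Dense::constant(30, 15, w, 0.0),
                Dense::constant(15, 1, w, 0.0),
            ],
            slope: 0.01, // assumed slope; not stated in the text above
        }
    }

    /// Scores one candidate: LeakyReLU on hidden layers, linear output.
    pub fn q_value(&self, features: &[f32; 11]) -> f32 {
        let mut x: Vec<f32> = features.to_vec();
        let last = self.layers.len() - 1;
        for (i, layer) in self.layers.iter().enumerate() {
            x = layer.forward(&x);
            if i < last {
                for v in x.iter_mut() {
                    *v = leaky_relu(*v, self.slope);
                }
            }
        }
        x[0]
    }
}
```

Because the network emits a single Q-value per 11-feature input, scoring a batch of candidates amounts to calling `q_value` once per candidate feature vector.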

forager-frontier

URL selection using a TRES (Tree-based Region Exploration Strategy) tree.

  • FrontierManager – maintains the tree, samples one candidate per leaf, tracks domain page counts.
  • tree – CART regression tree partitioning URLs by feature vectors:
    • node – tree nodes (leaf/internal), variance reduction split criterion.
    • experience – stored observations per leaf (feature vectors + rewards).
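
The variance reduction criterion used by node is the standard CART rule for regression: pick the split that most reduces the weighted variance of the rewards. A minimal sketch follows, assuming each stored observation is a `(feature_vector, reward)` pair. The function names and the choice of candidate thresholds (each observed feature value) are illustrative, not taken from the crate.

```rust
/// Population variance of a slice; 0.0 for an empty slice.
pub fn variance(values: &[f64]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n
}

/// Variance reduction obtained by splitting on `x[feature] <= threshold`:
/// Var(all) - (n_left / n) * Var(left) - (n_right / n) * Var(right).
pub fn variance_reduction(
    samples: &[(Vec<f64>, f64)],
    feature: usize,
    threshold: f64,
) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let all: Vec<f64> = samples.iter().map(|(_, r)| *r).collect();
    let mut left = Vec::new();
    let mut right = Vec::new();
    for (x, r) in samples {
        if x[feature] <= threshold {
            left.push(*r);
        } else {
            right.push(*r);
        }
    }
    let n = all.len() as f64;
    variance(&all)
        - (left.len() as f64 / n) * variance(&left)
        - (right.len() as f64 / n) * variance(&right)
}

/// Tries every observed value of `feature` as a threshold and returns
/// the `(threshold, gain)` pair with the largest variance reduction.
pub fn best_split(samples: &[(Vec<f64>, f64)], feature: usize) -> Option<(f64, f64)> {
    let mut best: Option<(f64, f64)> = None;
    for (x, _) in samples {
        let threshold = x[feature];
        let gain = variance_reduction(samples, feature, threshold);
        if best.map_or(true, |(_, g)| gain > g) {
            best = Some((threshold, gain));
        }
    }
    best
}
```

With rewards `[0, 0, 10, 10]` on feature values `[1, 2, 3, 4]`, the threshold 2.0 separates the two reward levels perfectly, so both child variances drop to zero and the gain equals the parent variance.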

forager-db

Persistence layer wrapping GrafeoDB (embedded graph database).

  • Node types: CrawlRun, Page, Term, Transition, Domain, Model, ParamGroup, Frontier.
  • HNSW vector index on Page.embedding (384-dim, cosine similarity).
  • GQL query interface for raw queries via forager query.
  • Schema initialization, CRUD for each node type, stats helpers.
  • C FFI (ffi module) – builds as a shared library (libforager_db.so/.dylib/.dll) for direct database access from other languages. Exposes forager_db_open, forager_db_query (returns JSON), forager_string_free, and forager_db_close. Used by the Julia analysis scripts to query the database without going through the CLI.
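
The presence of a dedicated forager_string_free suggests the usual C FFI ownership rule: a string returned across the boundary is allocated by Rust and must be handed back to Rust to be freed. The sketch below illustrates that pattern only. `sketch_query` and `sketch_string_free` are hypothetical stand-ins with assumed signatures, not the real forager_db_* functions, and the returned JSON is a placeholder.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical stand-in for a query call that returns JSON.
/// Returns null on invalid input. Otherwise the caller owns the
/// returned string and must release it with `sketch_string_free`.
///
/// # Safety
/// `query` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn sketch_query(query: *const c_char) -> *mut c_char {
    if query.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: non-null, and the caller guarantees NUL termination.
    let text = match unsafe { CStr::from_ptr(query) }.to_str() {
        Ok(t) => t,
        Err(_) => return std::ptr::null_mut(),
    };
    // Placeholder payload; a real implementation would run the query.
    let json = format!("{{\"query_len\":{},\"rows\":[]}}", text.len());
    match CString::new(json) {
        // `into_raw` transfers ownership of the buffer to the caller.
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Reclaims a string previously returned by `sketch_query`.
///
/// # Safety
/// `s` must be null or a pointer obtained from `sketch_query` that
/// has not been freed already.
pub unsafe extern "C" fn sketch_string_free(s: *mut c_char) {
    if s.is_null() {
        return;
    }
    // SAFETY: the pointer came from `CString::into_raw` above.
    unsafe { drop(CString::from_raw(s)) };
}
```

Freeing through the library rather than the caller's own `free` matters because the two sides may use different allocators. A caller in another language would copy the JSON out and then call the free function exactly once.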