Crate Architecture
Forager is organized into six workspace crates. Each crate owns a distinct slice of functionality; dependencies flow downward.
forager (binary)
|-- forager-core (shared types, config, params)
|-- forager-web (HTTP fetching, HTML parsing, link discovery)
|-- forager-ml (scoring, DQN agent, embeddings)
|-- forager-frontier (URL selection tree)
|-- forager-db (graph database layer + C FFI)
forager
The top-level binary crate. Owns the CLI, command dispatch, pipeline orchestration, and stage execution.
- cli – clap-based argument parsing, 8 subcommands.
- commands – one module per command (new, import, run, status, tune, log, list, query).
- pipeline – round-based crawl loop: select, fetch, parse, score, tune, persist.
- stages – individual pipeline stage implementations wiring the other crates together.
- util – config resolution, path helpers.
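The round-based loop described under pipeline can be sketched as a fixed stage order driven by a callback. This is only an illustration: the `Stage` enum, `run_rounds`, and the stop-on-first-error policy are assumptions, not the crate's actual types or behavior.

```rust
/// Stages of one crawl round, in the order the pipeline module lists them.
/// (Hypothetical enum; the real crate may model stages differently.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Select,
    Fetch,
    Parse,
    Score,
    Tune,
    Persist,
}

const ROUND: [Stage; 6] = [
    Stage::Select,
    Stage::Fetch,
    Stage::Parse,
    Stage::Score,
    Stage::Tune,
    Stage::Persist,
];

/// Runs `rounds` rounds, invoking `run_stage` for every stage in order.
/// Returns the number of stage executions, or the first error encountered
/// (assumed policy: a failing stage aborts the whole run).
fn run_rounds<F>(rounds: usize, mut run_stage: F) -> Result<usize, String>
where
    F: FnMut(usize, Stage) -> Result<(), String>,
{
    let mut executed = 0;
    for round in 0..rounds {
        for stage in ROUND {
            run_stage(round, stage)?;
            executed += 1;
        }
    }
    Ok(executed)
}

fn main() {
    let mut trace = Vec::new();
    let executed = run_rounds(2, |round, stage| {
        trace.push((round, stage));
        Ok(())
    })
    .unwrap();
    println!("{executed} stage executions, first = {:?}", trace[0]);
}
```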
forager-core
Shared types used across the workspace. No heavy dependencies.
- config – TOML deserialization structs mirroring the pipeline stages (Config, SelectSettings, FetchSettings, ParseSettings, ScoreSettings, TuneSettings).
- param – Param<T> with Fixed, Auto, and Range modes. Supports learn(), merge_from(), and clamping.
- param_group – ParamGroup trait: lifecycle for stage-aligned adaptive parameter groups (persist, restore, update).
- groups – five param groups, one per configurable stage: ScoreParams, FetchParams, ParseParams, SelectParams, TuneParams.
- import – CSV term loading and DB term merge logic.
- url – URL normalization and domain extraction.
- types – PageScore, FrontierEntry, DomainProfile, PageObservation.
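To make the three parameter modes concrete, here is a minimal sketch of how a type like Param<T> might behave. The variant payloads, the `get` accessor, and the exact semantics of `learn` (Fixed ignores updates, Auto accepts them, Range clamps them) are assumptions for illustration; the real type also supports merge_from(), which is not modeled here.

```rust
/// Hypothetical three-mode parameter; the real `Param<T>` may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Param<T> {
    /// Pinned by the user; learning never changes it (assumed).
    Fixed(T),
    /// Free to take any learned value (assumed).
    Auto(T),
    /// Learned, but clamped to the inclusive bounds `[min, max]` (assumed).
    Range { value: T, min: T, max: T },
}

impl<T: Copy + PartialOrd> Param<T> {
    /// Returns the current value regardless of mode.
    fn get(&self) -> T {
        match *self {
            Param::Fixed(v) | Param::Auto(v) => v,
            Param::Range { value, .. } => value,
        }
    }

    /// Applies a newly learned value according to the mode.
    fn learn(&mut self, new: T) {
        match self {
            Param::Fixed(_) => {}
            Param::Auto(v) => *v = new,
            Param::Range { value, min, max } => {
                *value = if new < *min {
                    *min
                } else if new > *max {
                    *max
                } else {
                    new
                };
            }
        }
    }
}

fn main() {
    // A timeout allowed to adapt between 1 and 30 seconds.
    let mut timeout = Param::Range { value: 10.0, min: 1.0, max: 30.0 };
    timeout.learn(45.0); // out of range, so it is clamped to the upper bound
    println!("timeout = {:?}", timeout.get());
}
```

Encoding the mode in the type keeps the "may this value change?" decision next to the value itself, so stage code can call `learn` unconditionally.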
forager-web
Everything related to HTTP and HTML.
- fetch – async HTTP client with concurrency control, stealth headers, Cloudflare detection, per-request timeout.
- parse – HTML parsing with timeout, text extraction by region (title, headings, body).
- discover – link extraction from parsed HTML, URL filter application (substring, domain, regex, allow, max_depth).
- extract – CSS selector-based field extraction for structured data.
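As an illustration of URL filter application in discover, the sketch below checks a candidate link against a depth limit, blocked domains, and blocked substrings using only the standard library. The `UrlFilter` struct, its field names, and the block-list semantics are assumptions; regex and allow rules are omitted, and the host parser ignores userinfo and letter case.

```rust
/// Hypothetical filter set; field names and semantics are assumptions.
struct UrlFilter {
    blocked_substrings: Vec<String>,
    blocked_domains: Vec<String>,
    max_depth: usize,
}

/// Extracts the host from an absolute http(s) URL without external crates.
/// Simplified: does not handle userinfo (`user@host`) or IPv6 literals.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    let host = authority.split(':').next()?; // drop an optional port
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

impl UrlFilter {
    /// Returns true if a link found at `depth` passes every rule.
    fn accepts(&self, url: &str, depth: usize) -> bool {
        if depth > self.max_depth {
            return false;
        }
        let host = match host_of(url) {
            Some(h) => h,
            None => return false, // not an absolute http(s) URL
        };
        // A blocked domain also blocks its subdomains.
        let domain_blocked = self
            .blocked_domains
            .iter()
            .any(|d| host == d.as_str() || host.ends_with(&format!(".{}", d)));
        if domain_blocked {
            return false;
        }
        !self
            .blocked_substrings
            .iter()
            .any(|s| url.contains(s.as_str()))
    }
}

fn main() {
    let filter = UrlFilter {
        blocked_substrings: vec!["/login".to_string()],
        blocked_domains: vec!["ads.example".to_string()],
        max_depth: 2,
    };
    for url in ["https://example.org/docs", "https://cdn.ads.example/a.js"] {
        println!("{url}: accepted = {}", filter.accepts(url, 1));
    }
}
```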
forager-ml
Machine learning: scoring and the RL agent.
- score – keyword matching (flat terms, grouped terms with geometric mean) and semantic similarity scoring (MiniLM-L6-v2, 384-dim).
- dqn – Double DQN with Prioritised Experience Replay:
  - network – Q-network (11 → 30 → 15 → 1, LeakyReLU).
  - buffer – PER buffer with stratified sampling and importance weights.
  - agent – batched candidate scoring, DDQN training, target network sync, epsilon schedule.
- embedder – sentence embedding via MiniLM on Burn/Wgpu (Vulkan).
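For readers unfamiliar with Double DQN, the training target can be sketched as follows: the online network chooses the best next candidate, and the target network supplies that candidate's value. Because the Q-network has a single output, the sketch assumes each next-state candidate has been scored individually by both networks. The function name `ddqn_target` and the empty-slice-means-terminal convention are assumptions, not the agent's actual API.

```rust
/// Double DQN target for a single transition.
///
/// `online_next[i]` and `target_next[i]` are the Q-values that the online and
/// target networks assign to the i-th candidate available in the next state.
/// Empty slices are treated as a terminal state (assumed convention).
fn ddqn_target(reward: f32, gamma: f32, online_next: &[f32], target_next: &[f32]) -> f32 {
    assert_eq!(online_next.len(), target_next.len());

    // Action selection uses the online network...
    let best = online_next
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i);

    // ...while action evaluation uses the target network.
    match best {
        Some(i) => reward + gamma * target_next[i],
        None => reward,
    }
}

fn main() {
    let online = [0.2, 0.9, 0.4];
    let target = [1.0, 2.0, 3.0];
    // The online network prefers candidate 1, so the target network's value
    // for candidate 1 is used, not the target network's own maximum.
    let y = ddqn_target(1.0, 0.5, &online, &target);
    println!("target = {y}");
}
```

Decoupling selection from evaluation is what distinguishes Double DQN from plain DQN, which would take the maximum of the target network's values directly and tends to overestimate.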
forager-frontier
URL selection using a TRES (Tree-based Region Exploration Strategy) tree.
- FrontierManager – maintains the tree, samples one candidate per leaf, tracks domain page counts.
- tree – CART regression tree partitioning URLs by feature vectors:
  - node – tree nodes (leaf/internal), variance reduction split criterion.
  - experience – stored observations per leaf (feature vectors + rewards).
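The variance reduction split criterion used by CART regression trees can be sketched as below: a split is scored by how much it lowers the size-weighted variance of rewards in the two children. The function names, the `(feature vector, reward)` tuple layout, and the "feature <= threshold goes left" rule are assumptions for illustration.

```rust
/// Population variance of a slice (0.0 for an empty slice).
fn variance(xs: &[f64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n
}

/// Variance reduction obtained by splitting `samples` on
/// `features[feature] <= threshold`. Each sample is (feature vector, reward).
/// Returns 0.0 when the split leaves one side empty.
fn variance_reduction(samples: &[(Vec<f64>, f64)], feature: usize, threshold: f64) -> f64 {
    let mut left = Vec::new();
    let mut right = Vec::new();
    for (x, r) in samples {
        if x[feature] <= threshold {
            left.push(*r);
        } else {
            right.push(*r);
        }
    }
    if left.is_empty() || right.is_empty() {
        return 0.0;
    }
    let all: Vec<f64> = samples.iter().map(|(_, r)| *r).collect();
    let n = samples.len() as f64;
    variance(&all)
        - (left.len() as f64 / n) * variance(&left)
        - (right.len() as f64 / n) * variance(&right)
}

fn main() {
    // Four observations with one feature; rewards jump between x = 1 and x = 2.
    let samples = vec![
        (vec![0.0], 1.0),
        (vec![1.0], 1.0),
        (vec![2.0], 3.0),
        (vec![3.0], 3.0),
    ];
    for threshold in [0.0, 1.0, 2.0] {
        let gain = variance_reduction(&samples, 0, threshold);
        println!("threshold {threshold}: reduction = {gain}");
    }
}
```

A tree builder would evaluate candidate thresholds like this for every feature and split the leaf on the one with the largest reduction.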
forager-db
Persistence layer wrapping GrafeoDB (embedded graph database).
- Node types: CrawlRun, Page, Term, Transition, Domain, Model, ParamGroup, Frontier.
- HNSW vector index on Page.embedding (384-dim, cosine similarity).
- GQL query interface for raw queries via forager query.
- Schema initialization, CRUD for each node type, stats helpers.
- C FFI (ffi module) – builds as a shared library (libforager_db.so/.dylib/.dll) for direct database access from other languages. Exposes forager_db_open, forager_db_query (returns JSON), forager_string_free, and forager_db_close. Used by the Julia analysis scripts to query the database without going through the CLI.
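The pairing of a JSON-returning query function with a dedicated string-free function reflects a common FFI ownership pattern, sketched below. The real signatures of forager_db_query and forager_string_free are not shown in this document, so the sketch uses hypothetical stand-ins (`demo_query`, `demo_string_free`) with assumed signatures and a fake JSON payload; a real shared library would also export the symbols with `#[no_mangle]`.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical stand-in for a query function that returns JSON.
/// Returns a heap-allocated, NUL-terminated string that the caller owns,
/// or null if `input` is null.
pub extern "C" fn demo_query(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: the caller promises `input` points to a valid NUL-terminated string.
    let query = unsafe { CStr::from_ptr(input) }.to_string_lossy();
    // Minimal escaping for the demo payload (backslashes and quotes only).
    let escaped = query.replace('\\', "\\\\").replace('"', "\\\"");
    let json = format!("{{\"query\":\"{}\",\"rows\":[]}}", escaped);
    match CString::new(json) {
        // Ownership of the buffer moves to the caller here.
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical stand-in for the string-free function.
/// Must be called exactly once for every non-null pointer from `demo_query`.
pub extern "C" fn demo_string_free(s: *mut c_char) {
    if s.is_null() {
        return;
    }
    // SAFETY: `s` came from `CString::into_raw` and has not been freed yet.
    unsafe { drop(CString::from_raw(s)) };
}

fn main() {
    // The query text is only an illustrative string.
    let query = CString::new("MATCH (p:Page) RETURN p").unwrap();
    let out = demo_query(query.as_ptr());
    if !out.is_null() {
        // SAFETY: `out` is a valid NUL-terminated string until it is freed.
        let text = unsafe { CStr::from_ptr(out) }.to_string_lossy().into_owned();
        println!("{text}");
        demo_string_free(out);
    }
}
```

Returning the string to the library for deallocation matters because the caller (for example a Julia script) uses a different allocator and cannot safely free Rust-owned memory itself.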