Introduction
Forager is a deep reinforcement learning web crawler designed for niche discovery. Given a description of what you are looking for, it crawls the web, scores what it finds, and learns to find more of it. It was built to solve a specific problem – finding European master’s programmes in process philosophy – but the architecture is general: describe a target, point it at some seed URLs, and let the system refine itself.
How it works
Forager operates as a stage-based pipeline. A DQN agent decides which URLs to fetch next. Each page is scored using a combination of keyword groups and semantic similarity (via a MiniLM-L6-v2 sentence embedding model). High-scoring pages reinforce the agent; low-scoring pages teach it to look elsewhere. The frontier, the scoring weights, and the reference embedding all adapt as the crawl progresses.
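The blend of keyword and semantic signals can be sketched in plain Rust. This is an illustrative sketch, not Forager's actual scoring code: the 0.4/0.6 weights, the group-fraction keyword score, and the function names are assumptions (in the real system the weights themselves adapt during the crawl).

```rust
/// Cosine similarity between two embedding vectors
/// (e.g. MiniLM-L6-v2 sentence embeddings).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Fraction of keyword groups with at least one hit in the page text.
/// One possible reading of "keyword groups"; the real scheme may differ.
fn keyword_score(text: &str, groups: &[Vec<&str>]) -> f32 {
    let text = text.to_lowercase();
    let hits = groups
        .iter()
        .filter(|g| g.iter().any(|kw| text.contains(&kw.to_lowercase())))
        .count();
    hits as f32 / groups.len() as f32
}

/// Blend the two signals; these weights are purely illustrative.
fn page_score(kw: f32, sem: f32) -> f32 {
    0.4 * kw + 0.6 * sem
}

fn main() {
    let groups = vec![
        vec!["process philosophy", "whitehead"],
        vec!["master", "msc"],
    ];
    let kw = keyword_score("MSc in Process Philosophy", &groups);
    // Toy 2-d embeddings in place of real 384-d MiniLM vectors.
    let sem = cosine_similarity(&[0.6, 0.8], &[0.0, 1.0]);
    println!("kw={kw:.2} sem={sem:.2} score={:.2}", page_score(kw, sem));
}
```

A high blended score would push the page's outlinks up the frontier; a low one would feed back as a negative signal to the agent.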
You do not need to know how any of this works to use it. The system is config-driven: you write a TOML file describing what you want (search terms, a reference paragraph, seed URLs), and Forager learns the rest. Parameters marked mode = "auto" or mode = "range" are adjusted automatically during crawling based on observed results.
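As a rough illustration, such a config might look like the following. Every key name here is hypothetical, invented for the sketch rather than taken from Forager's actual schema; only the mode = "auto" / mode = "range" convention comes from the text above.

```toml
# Hypothetical package config -- key names are illustrative,
# not Forager's real schema.
[target]
reference = "European master's programmes in process philosophy ..."
terms = [["process philosophy", "whitehead"], ["master", "msc"]]

[seeds]
urls = ["https://example.edu/philosophy"]

[scoring.semantic_weight]
mode = "auto"      # tuned during the crawl from observed results

[agent.epsilon]
mode = "range"     # explored within a bounded interval
min = 0.05
max = 0.30
```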
Stack
- Rust – the entire pipeline, CLI, and RL agent
- Burn (Wgpu/Vulkan backend) – tensor operations for the DQN and embedding inference
- GrafeoDB – graph database for pages, links, domains, and frontier state
- MiniLM-L6-v2 – sentence transformer for semantic scoring (auto-downloaded from HuggingFace on first run)
Workflow
Forager uses a package-based workflow. Each crawl lives in its own directory under data/, containing a config file, a database, and learned parameters. You create a package, optionally import terms, run the crawl, and check status through the CLI. All data lives in the graph database and can be queried directly – from the CLI, from Julia via the C FFI, or from Rust as a library.
```
forager new myproject --reference "..."
forager run myproject
forager status myproject
forager query myproject "MATCH (p:Page) WHERE p.score > 0.3 RETURN p.url, p.score"
```
The next pages walk through installation and running your first crawl.