Pipeline
Forager’s crawl loop is a five-stage pipeline that runs in rounds. Each round selects URLs, fetches them, parses the HTML, scores relevance, and tunes the model. The stages feed information back into one another, so selection, fetching, and scoring adapt from round to round.
Stage flow
```mermaid
flowchart TD
    init["init: load config, build components"]
    restore["restore: reload frontier + replay buffer"]
    select["select: sample URLs from frontier tree"]
    fetch["fetch: concurrent HTTP requests"]
    parse["parse: extract text + links"]
    score["score: relevance scoring"]
    tune["tune: DQN training + param update"]

    init --> restore --> select
    select --> fetch --> parse --> score --> tune
    tune -->|"next round"| select

    style init fill:#2d2d2d,stroke:#555,color:#eee
    style restore fill:#2d2d2d,stroke:#555,color:#eee
    style select fill:#1a3a4a,stroke:#4a9,color:#eee
    style fetch fill:#1a3a4a,stroke:#4a9,color:#eee
    style parse fill:#1a3a4a,stroke:#4a9,color:#eee
    style score fill:#1a3a4a,stroke:#4a9,color:#eee
    style tune fill:#1a3a4a,stroke:#4a9,color:#eee
```
Stage overview
Select
Picks the next batch of URLs from the frontier. The TRES tree partitions candidates by their feature vectors and samples one per leaf node, guaranteeing diversity. The DQN agent scores each candidate, and domain profiling filters out slow or unreliable sources.
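The one-per-leaf sampling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Forager's implementation: `Candidate`, `sample_per_leaf`, the `leaf_id` field, and the `max_latency_s` cut-off are hypothetical names, the leaf assignment is taken as precomputed rather than derived from a tree, and the order in which filtering, sampling, and Q-value ranking happen is a guess.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    url: str
    leaf_id: int              # leaf of the partition tree (assumed precomputed)
    q_value: float            # agent's score for this candidate
    domain_latency_s: float   # hypothetical domain-profile statistic


def sample_per_leaf(candidates, max_latency_s=5.0, rng=None):
    """Pick one candidate per leaf, skip slow domains, rank by Q-value."""
    rng = rng or random.Random()
    by_leaf = defaultdict(list)
    for c in candidates:
        if c.domain_latency_s <= max_latency_s:  # domain-profile filter
            by_leaf[c.leaf_id].append(c)
    # One sample per non-empty leaf keeps the batch spread across the partition.
    picked = [rng.choice(group) for group in by_leaf.values()]
    return sorted(picked, key=lambda c: c.q_value, reverse=True)


frontier = [
    Candidate("https://a.example/1", 0, 0.9, 0.4),
    Candidate("https://a.example/2", 0, 0.7, 0.4),
    Candidate("https://b.example/1", 1, 0.5, 0.8),
    Candidate("https://slow.example/1", 2, 0.95, 12.0),
]
batch = sample_per_leaf(frontier, rng=random.Random(0))
```

In this toy frontier the batch holds two URLs, one from leaf 0 and one from leaf 1; the only candidate in leaf 2 is dropped because its domain is too slow.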
Fetch
Downloads pages concurrently with stealth headers and adaptive timeouts. Failed fetches produce zero-reward transitions so the agent learns to avoid those URL patterns. Per-domain rate limiting and jitter keep the crawler polite.
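A minimal sketch of this stage, assuming an `asyncio`-based design (the source does not say which runtime Forager uses). `fake_download` stands in for a real HTTP client so the example runs without network access; `fetch_one`, `fetch_batch`, and the delay parameters are hypothetical. Stealth headers and adaptive timeouts are left out.

```python
import asyncio
import random
from urllib.parse import urlparse


async def fake_download(url: str) -> str:
    """Stand-in for an HTTP request; fails for URLs containing 'bad'."""
    await asyncio.sleep(0.001)
    if "bad" in url:
        raise ConnectionError(url)
    return f"<html><title>{url}</title></html>"


async def fetch_one(url, domain_locks, timeout_s, min_delay_s, jitter_s):
    domain = urlparse(url).netloc
    lock = domain_locks.setdefault(domain, asyncio.Lock())
    async with lock:  # at most one in-flight request per domain
        # Fixed delay plus random jitter spaces out hits on the same host.
        await asyncio.sleep(min_delay_s + random.uniform(0.0, jitter_s))
        try:
            html = await asyncio.wait_for(fake_download(url), timeout=timeout_s)
            return {"url": url, "html": html, "reward": None}  # scored later
        except (asyncio.TimeoutError, ConnectionError):
            # A failed fetch becomes a zero-reward transition.
            return {"url": url, "html": None, "reward": 0.0}


async def fetch_batch(urls, timeout_s=1.0, min_delay_s=0.001, jitter_s=0.001):
    domain_locks = {}
    tasks = [
        fetch_one(u, domain_locks, timeout_s, min_delay_s, jitter_s) for u in urls
    ]
    return await asyncio.gather(*tasks)  # results keep the input order


results = asyncio.run(fetch_batch([
    "https://a.example/ok",
    "https://a.example/bad",
    "https://b.example/ok",
]))
```

Requests to different domains run concurrently, while the per-domain lock serialises requests to the same host.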
Parse
Extracts structured text (title, headings, body) and outbound links from HTML. Anchor text is batch-embedded on GPU for keyword matching. Domain-adaptive truncation caps input size to avoid wasting compute on bloated pages.
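The extraction step might look something like the sketch below, which uses only the Python standard library. `PageExtractor` and `extract` are hypothetical names, and the fixed `max_body_chars` cap is a simplification of the domain-adaptive truncation described above. The GPU embedding of anchor text is omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP = {"script", "style"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title, self.headings, self.body, self.links = [], [], [], []
        self._stack = []    # open title/heading/skip tags
        self._href = None   # absolute URL of the <a> currently open
        self._anchor = []   # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS or tag in self.SKIP:
            self._stack.append(tag)
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
        elif tag == "a" and self._href:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        current = self._stack[-1] if self._stack else None
        if not text or current in self.SKIP:
            return
        if self._href:
            self._anchor.append(text)
        if current == "title":
            self.title.append(text)
        elif current in self.HEADINGS:
            self.headings.append(text)
        else:
            self.body.append(text)


def extract(html, base_url, max_body_chars=10_000):
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "headings": parser.headings,
        "body": " ".join(parser.body)[:max_body_chars],  # truncation cap
        "links": parser.links,  # (absolute URL, anchor text) pairs
    }


html = """<html><head><title>Crawling 101</title><style>p{color:red}</style></head>
<body><h1>Frontier</h1><p>A frontier holds URLs. <a href="/docs/dqn">DQN agent</a></p></body></html>"""
page = extract(html, "https://example.org/start", max_body_chars=20)
```

Relative links are resolved against the page URL, so the frontier only ever sees absolute URLs.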
Score
Determines how relevant each page is by blending semantic embedding similarity with keyword density. Three text signals (title, heading, body) are weighted independently, and all weights are learnable. Pages above the relevance threshold contribute positive reward.
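One plausible reading of the blend is sketched below; the exact formula Forager uses is not given here, so treat every detail as an assumption. `relevance`, `semantic_weight`, `signal_weights`, and the toy two-dimensional embeddings are hypothetical, and the weights are plain floats rather than learnable parameters.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_density(text, keywords):
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)


def relevance(signals, reference, keywords, signal_weights, semantic_weight):
    """Weighted mean over signals of (semantic similarity, keyword density) blends."""
    total = 0.0
    for name, (text, embedding) in signals.items():
        blended = (
            semantic_weight * cosine(embedding, reference)
            + (1.0 - semantic_weight) * keyword_density(text, keywords)
        )
        total += signal_weights[name] * blended
    return total / sum(signal_weights.values())


signals = {
    "title": ("web crawler design", [1.0, 0.0]),
    "heading": ("Frontier", [0.0, 1.0]),
    "body": ("the crawler fetches pages", [1.0, 1.0]),
}
score = relevance(
    signals,
    reference=[1.0, 0.0],
    keywords={"crawler"},
    signal_weights={"title": 0.5, "heading": 0.2, "body": 0.3},
    semantic_weight=0.7,
)
threshold = 0.5  # hypothetical relevance threshold
reward = score if score > threshold else 0.0
```

With these toy inputs the score comes out at roughly 0.57, which clears the assumed threshold and therefore yields a positive reward.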
Tune
Trains the Double DQN agent with prioritised experience replay. Updates all adaptive parameters (scoring weights, fetch timeouts, parse limits, selection thresholds) based on observed outcomes. Blends the reference embedding toward recently found relevant pages.
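Two of the updates named above can be illustrated in a few lines. The first function is the standard Double DQN target, where the online network chooses the next action and the target network evaluates it. The second is a simple exponential blend of the reference embedding, which is an assumption about how the blending works. Function names, `gamma`, and `rate` are hypothetical; prioritised replay and the network update itself are omitted.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Return r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done or not next_q_online:
        return reward
    # Action selection uses the online network's estimates...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ...while the value of that action comes from the target network.
    return reward + gamma * next_q_target[best_action]


def blend_reference(reference, page_embedding, rate=0.1):
    """Move the reference embedding a small step toward a relevant page."""
    return [(1.0 - rate) * r + rate * p for r, p in zip(reference, page_embedding)]


target = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.8, 0.5],   # online net prefers action 1
    next_q_target=[0.9, 0.4, 0.6],   # target net values action 1 at 0.4
    gamma=0.9,
)
new_reference = blend_reference([1.0, 0.0], [0.0, 1.0], rate=0.1)
```

Here the target is 1.0 + 0.9 × 0.4 = 1.36, not 1.0 + 0.9 × 0.9; decoupling selection from evaluation in this way is what distinguishes Double DQN from plain DQN and reduces overestimation of Q-values.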
Startup stages
Before the main loop begins, two one-time stages run:
- init reads the configuration file and constructs the scorer, DQN agent, frontier tree, fetcher, and filter chain. All learnable parameters (Param<T>) are initialised from their config values and modes.
- restore reloads the frontier tree and candidate queue from the database on resume. Experience transitions with positive reward are loaded back into the replay buffer so the agent retains what it learned.
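The replay-buffer part of restore amounts to a filter over stored transitions. A minimal sketch, assuming transitions are available as simple records; `Transition`, `restore_replay_buffer`, and `capacity` are hypothetical names, and the database read is replaced by an in-memory list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple


def restore_replay_buffer(stored, capacity):
    """Reload only positive-reward transitions into a bounded buffer."""
    buffer = deque(maxlen=capacity)  # oldest entries drop out when full
    for transition in stored:
        if transition.reward > 0:
            buffer.append(transition)
    return buffer


stored = [
    Transition((0.1,), 0, 0.0, (0.2,)),   # failed fetch, skipped
    Transition((0.2,), 1, 0.8, (0.3,)),
    Transition((0.3,), 0, 0.6, (0.4,)),
]
replay = restore_replay_buffer(stored, capacity=1000)
```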
Feedback loops
The pipeline is not a simple conveyor belt. Each stage produces information that improves the others:
- Score informs tune: page relevance becomes the DQN reward signal.
- Tune informs select: updated Q-values change which URLs are prioritised.
- Parse informs select: discovered child URLs expand the frontier.
- Fetch informs fetch: domain profiles adjust timeouts and concurrency.
- Tune informs score: learnable weights (signal blend, thresholds) shift based on what works.
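As one concrete illustration of these loops, the "fetch informs fetch" case could be modelled as a per-domain latency average that drives the next timeout. This is a sketch under assumptions: `DomainProfile`, the smoothing factor `alpha`, and the multiplier and bounds in `timeout_s` are all hypothetical, and concurrency adjustment is not shown.

```python
from dataclasses import dataclass


@dataclass
class DomainProfile:
    ema_latency_s: float = 1.0   # running latency estimate for one domain
    alpha: float = 0.2           # smoothing factor (assumed)

    def observe(self, latency_s: float) -> None:
        """Fold a newly observed fetch latency into the running average."""
        self.ema_latency_s = (
            (1.0 - self.alpha) * self.ema_latency_s + self.alpha * latency_s
        )

    def timeout_s(self, multiplier=3.0, floor_s=1.0, ceiling_s=30.0) -> float:
        """Timeout for the next fetch: a multiple of typical latency, clamped."""
        return min(ceiling_s, max(floor_s, multiplier * self.ema_latency_s))


profile = DomainProfile()
profile.observe(2.0)            # a slower-than-expected response
timeout = profile.timeout_s()   # the next fetch allows more time
```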
Graceful shutdown
The pipeline honours Ctrl-C at every stage boundary and at every page within a round. State is always persisted before exit, so resuming loses at most the in-flight page.
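A common way to get this behaviour is a signal handler that only sets a flag, which the loop checks at each boundary. The sketch below assumes that pattern and is not taken from Forager's code; `stop_requested`, `run_round`, `run`, and the `persist` callback are hypothetical. The demo simulates Ctrl-C by setting the flag from inside a page handler instead of sending a real signal.

```python
import signal
import threading

stop_requested = threading.Event()


def install_sigint_handler():
    """Call once from the main thread at startup; the handler only sets a flag."""
    signal.signal(signal.SIGINT, lambda signum, frame: stop_requested.set())


def run_round(urls, process_page, persist):
    """Process pages until finished or stopped; always persist before returning."""
    done = []
    try:
        for url in urls:
            if stop_requested.is_set():  # checked before every page
                break
            process_page(url)
            done.append(url)
    finally:
        persist(done)  # state is saved even on early exit
    return done


def run(rounds, process_page, persist):
    completed = []
    for urls in rounds:
        if stop_requested.is_set():      # checked at every round boundary
            break
        completed.extend(run_round(urls, process_page, persist))
    return completed


saved = []


def process_page(url):
    if url == "u2":
        stop_requested.set()  # simulate Ctrl-C arriving while u2 is in flight


completed = run([["u1", "u2", "u3"], ["u4"]], process_page, saved.append)
```

In the simulated run the loop finishes the page that was in flight, persists `["u1", "u2"]`, and skips both `u3` and the second round.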