Pipeline

Forager’s crawl loop is a five-stage pipeline that runs in rounds. Each round selects URLs, fetches them, parses the HTML, scores relevance, and tunes the model. The stages also feed information back to one another (see Feedback loops), so URL selection and scoring improve from round to round.
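The round structure can be pictured as a loop that threads shared state through the five stages in order. The sketch below is illustrative only: `run_rounds`, `make_stage`, and the dict-based state are assumptions made for the example, not Forager's actual interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical stage signature: takes the shared state and returns it updated.
Stage = Callable[[Dict], Dict]

STAGE_NAMES = ["select", "fetch", "parse", "score", "tune"]


def run_rounds(stages: List[Stage], state: Dict, rounds: int) -> Dict:
    """Run every stage in order, once per round, passing state along."""
    for round_no in range(rounds):
        state["round"] = round_no
        for stage in stages:
            state = stage(state)
    return state


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(state: Dict) -> Dict:
        state.setdefault("trace", []).append(name)
        return state
    return stage


final = run_rounds([make_stage(n) for n in STAGE_NAMES], {}, rounds=2)
```

After two rounds, `final["trace"]` lists the five stage names twice, in pipeline order.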

Stage flow

flowchart TD
    init["init — load config, build components"]
    restore["restore — reload frontier + replay buffer"]
    select["select — sample URLs from frontier tree"]
    fetch["fetch — concurrent HTTP requests"]
    parse["parse — extract text + links"]
    score["score — relevance scoring"]
    tune["tune — DQN training + param update"]

    init --> restore --> select
    select --> fetch --> parse --> score --> tune
    tune -->|"next round"| select

    style init fill:#2d2d2d,stroke:#555,color:#eee
    style restore fill:#2d2d2d,stroke:#555,color:#eee
    style select fill:#1a3a4a,stroke:#4a9,color:#eee
    style fetch fill:#1a3a4a,stroke:#4a9,color:#eee
    style parse fill:#1a3a4a,stroke:#4a9,color:#eee
    style score fill:#1a3a4a,stroke:#4a9,color:#eee
    style tune fill:#1a3a4a,stroke:#4a9,color:#eee

Stage overview

Select

Picks the next batch of URLs from the frontier. The TRES tree partitions candidates by their feature vectors and samples one per leaf node, so each batch spans distinct regions of the feature space. The DQN agent scores each candidate, and domain profiling filters out slow or unreliable sources.
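A minimal sketch of one-per-leaf selection follows. It assumes the leaves are already available as lists of candidates, that Q-values are precomputed, and that sampling within a leaf is uniform with the batch then ordered by Q-value; the source does not specify how the two interact, so treat `Candidate` and `select_batch` as hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    url: str
    domain: str
    q_value: float  # assumed to be precomputed by the agent


def select_batch(
    leaves: List[List[Candidate]],
    unreliable_domains: Set[str],
    rng: random.Random,
) -> List[Candidate]:
    """Pick at most one candidate per leaf, skipping filtered domains."""
    batch = []
    for leaf in leaves:
        eligible = [c for c in leaf if c.domain not in unreliable_domains]
        if not eligible:
            continue  # the whole leaf was filtered out by domain profiling
        batch.append(rng.choice(eligible))
    # Assumption: highest Q-value first within the batch.
    batch.sort(key=lambda c: c.q_value, reverse=True)
    return batch


leaves = [
    [Candidate("a1", "a.example", 0.2), Candidate("b1", "slow.example", 0.9)],
    [Candidate("c1", "c.example", 0.7)],
    [Candidate("d1", "slow.example", 0.5)],
]
batch = select_batch(leaves, {"slow.example"}, random.Random(0))
```

Because each leaf contributes at most one URL, no single cluster of similar pages can dominate a batch.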

Fetch

Downloads pages concurrently with stealth headers and adaptive timeouts. Failed fetches produce zero-reward transitions so the agent learns to avoid those URL patterns. Per-domain rate limiting and jitter keep the crawler polite.
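The two ideas in this stage, zero reward on failure and jittered per-domain delays, can be sketched without any network code. The `Transition` fields, the `None`-means-failure convention, and the symmetric jitter formula are assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    url: str
    reward: float


def transition_for(url: str, body: Optional[str], relevance: float) -> Transition:
    """A failed fetch (body is None) always yields a zero-reward transition."""
    if body is None:
        return Transition(url, 0.0)
    return Transition(url, relevance)


def polite_delay(base_delay_s: float, jitter_fraction: float, rng: random.Random) -> float:
    """Base per-domain delay plus or minus a random fraction of itself."""
    spread = base_delay_s * jitter_fraction
    return base_delay_s + rng.uniform(-spread, spread)


failed = transition_for("https://example.com/x", None, relevance=0.8)
delay = polite_delay(1.0, 0.25, random.Random(42))
```

Here `failed.reward` is `0.0` regardless of the relevance passed in, and `delay` falls between 0.75 and 1.25 seconds.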

Parse

Extracts structured text (title, headings, body) and outbound links from HTML. Anchor text is batch-embedded on GPU for keyword matching. Domain-adaptive truncation caps input size to avoid wasting compute on bloated pages.
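As a rough illustration of the extraction step, the sketch below uses Python's standard `html.parser` to pull out the title, headings, body text, and links with their anchor text. The `PageExtractor` class, the `extract` helper, and the `max_chars` cap (standing in for domain-adaptive truncation) are assumptions; embedding of anchor text is left out.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.headings: List[str] = []
        self.body: List[str] = []
        self.links: List[Tuple[str, str]] = []  # (href, anchor text)
        self._ctx = ""  # "title", "heading", or "" for body text
        self._href: Optional[str] = None
        self._anchor: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._ctx = "title"
        elif tag in HEADING_TAGS:
            self._ctx = "heading"
            self.headings.append("")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title" or tag in HEADING_TAGS:
            self._ctx = ""
        elif tag == "a":
            if self._href:  # ignore anchors without a usable href
                self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._anchor.append(data)
        if self._ctx == "title":
            self.title += data
        elif self._ctx == "heading":
            self.headings[-1] += data
        elif data.strip():
            self.body.append(data.strip())


def extract(html: str, max_chars: int) -> PageExtractor:
    """Parse at most max_chars of input (a stand-in for adaptive truncation)."""
    parser = PageExtractor()
    parser.feed(html[:max_chars])
    parser.close()
    return parser


SAMPLE = (
    "<html><head><title>Crawl notes</title></head><body><h1>Intro</h1>"
    '<p>Some body text. <a href="/next">next page</a></p></body></html>'
)
page = extract(SAMPLE, max_chars=10_000)
```

Keeping the three text fields separate at parse time is what lets the scoring stage weight them independently.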

Score

Determines how relevant each page is by blending semantic embedding similarity with keyword density. Three text signals (title, heading, body) are weighted independently, and all weights are learnable. Pages above the relevance threshold contribute positive reward.
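One way to read this blend, as a sketch under stated assumptions: each signal gets a score that mixes cosine similarity with keyword density, and the page score is a weighted average over the three signals. The linear blend, the weight normalisation, and all function names are assumptions; in Forager the weights would be learned rather than passed in by hand.

```python
import math
from typing import Dict, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_density(text: str, keywords: Set[str]) -> float:
    """Fraction of whitespace-separated tokens that are keywords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in keywords for t in tokens) / len(tokens)


def blend(semantic: float, density: float, semantic_weight: float) -> float:
    """Linear mix of the two signals (assumed form of the blend)."""
    return semantic_weight * semantic + (1.0 - semantic_weight) * density


def page_score(signal_scores: Dict[str, float], signal_weights: Dict[str, float]) -> float:
    """Weighted average over the title, heading, and body signals."""
    total = sum(signal_weights.values())
    return sum(signal_weights[name] * s for name, s in signal_scores.items()) / total


score = page_score(
    {"title": 1.0, "heading": 0.5, "body": 0.0},
    {"title": 2.0, "heading": 1.0, "body": 1.0},
)
```

With these example weights the title counts twice as much as the other signals, so `score` comes out at 0.625.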

Tune

Trains the Double DQN agent with prioritised experience replay. Updates all adaptive parameters (scoring weights, fetch timeouts, parse limits, selection thresholds) based on observed outcomes. Blends the reference embedding toward recently found relevant pages.
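Two of the updates named here have compact standard forms. The sketch below shows the usual Double DQN target, where the online network chooses the next action and the target network evaluates it, and a simple exponential blend of the reference embedding. The function names and the `rate` parameter are assumptions, and the blend formula is one plausible reading of "blends toward".

```python
from typing import List, Sequence


def double_dqn_target(
    reward: float,
    gamma: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    terminal: bool,
) -> float:
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if terminal or not next_q_online:
        return reward
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]


def blend_reference(reference: Sequence[float], page: Sequence[float], rate: float) -> List[float]:
    """Move the reference embedding a fraction `rate` toward a relevant page."""
    return [(1.0 - rate) * r + rate * p for r, p in zip(reference, page)]


target = double_dqn_target(1.0, 0.5, [0.1, 0.9], [4.0, 2.0], terminal=False)
new_reference = blend_reference([1.0, 0.0], [0.0, 1.0], rate=0.25)
```

In this example the online network prefers action 1, so the target is 1.0 + 0.5 × 2.0 = 2.0; taking the maximum of the target network directly, as plain DQN does, would give 3.0. Decoupling selection from evaluation in this way is what reduces overestimation.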

Startup stages

Before the main loop begins, two one-time stages run:

init reads the configuration file and constructs the scorer, DQN agent, frontier tree, fetcher, and filter chain. All learnable parameters (Param<T>) are initialised from their config values and modes.
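To make the idea of a learnable parameter with a config value and a mode concrete, here is a hypothetical Python analogue of `Param<T>`. The mode names `"fixed"` and `"adaptive"`, the config layout, and the `update` method are all assumptions for illustration; the source does not describe the real type.

```python
from dataclasses import dataclass
from typing import Dict, Generic, TypeVar

T = TypeVar("T", int, float)


@dataclass
class Param(Generic[T]):
    value: T
    mode: str  # hypothetical modes: "fixed" or "adaptive"

    def update(self, proposed: T) -> T:
        """Accept a tuned value only when the parameter is adaptive."""
        if self.mode == "adaptive":
            self.value = proposed
        return self.value


def params_from_config(config: Dict[str, dict]) -> Dict[str, Param]:
    """Build one Param per config entry (assumed layout: value + mode)."""
    return {name: Param(entry["value"], entry["mode"]) for name, entry in config.items()}


params = params_from_config({
    "fetch_timeout_s": {"value": 10.0, "mode": "adaptive"},
    "relevance_threshold": {"value": 0.5, "mode": "fixed"},
})
params["fetch_timeout_s"].update(8.0)       # accepted
params["relevance_threshold"].update(0.9)   # ignored, stays at 0.5
```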

restore reloads the frontier tree and candidate queue from the database on resume. Experience transitions with positive reward are loaded back into the replay buffer so the agent retains what it learned.
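The replay-buffer part of restore amounts to a filter over stored transitions. A minimal sketch, assuming transitions are stored as dicts with a `reward` field and that the buffer is a bounded deque (both assumptions):

```python
from collections import deque
from typing import Deque, Dict, Iterable


def restore_replay(stored: Iterable[Dict], capacity: int) -> Deque[Dict]:
    """Reload only positive-reward transitions into a bounded buffer."""
    buffer: Deque[Dict] = deque(maxlen=capacity)
    for transition in stored:
        if transition["reward"] > 0:
            buffer.append(transition)
    return buffer


stored_rows = [
    {"url": "a", "reward": 0.0},
    {"url": "b", "reward": 0.8},
    {"url": "c", "reward": 0.3},
]
replay = restore_replay(stored_rows, capacity=100)
```

Only the transitions for `b` and `c` are restored here; the zero-reward transition for `a` is dropped.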

Feedback loops

The pipeline is not a simple conveyor belt. Each stage produces information that improves the others:

Score informs tune: page relevance becomes the DQN reward signal.
Tune informs select: updated Q-values change which URLs are prioritised.
Parse informs select: discovered child URLs expand the frontier.
Fetch informs fetch: domain profiles adjust timeouts and concurrency.
Tune informs score: learnable weights (signal blend, thresholds) shift based on what works.
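As one concrete example of these loops, the "fetch informs fetch" path could look like the sketch below: a per-domain profile tracks a moving average of latency and derives the next timeout from it. `DomainProfile`, the smoothing factor, the multiplier, and the clamping bounds are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DomainProfile:
    avg_latency_s: float

    def observe(self, latency_s: float, alpha: float = 0.5) -> None:
        """Exponential moving average of observed fetch latency."""
        self.avg_latency_s = (1.0 - alpha) * self.avg_latency_s + alpha * latency_s

    def timeout_s(self, multiplier: float = 4.0, floor: float = 1.0, ceiling: float = 30.0) -> float:
        """Timeout scales with typical latency, clamped to sane bounds."""
        return min(ceiling, max(floor, multiplier * self.avg_latency_s))


profile = DomainProfile(avg_latency_s=1.0)
profile.observe(3.0)               # average moves to 2.0 seconds
timeout = profile.timeout_s()      # 4 x 2.0 = 8.0 seconds
```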

Graceful shutdown

The pipeline honours Ctrl-C at every stage boundary and at every page within a round. State is always persisted before exit, so resuming loses at most the in-flight page.
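A common way to get this behaviour is to have the signal handler set a flag and to check that flag between pages, persisting state on the way out. The sketch below assumes that pattern; `stop_requested`, `install_handler`, and `run_round` are illustrative names, not Forager's API.

```python
import signal
import threading
from typing import Callable, Iterable

stop_requested = threading.Event()


def install_handler() -> None:
    """Make Ctrl-C set a flag instead of raising mid-page (main thread only)."""
    signal.signal(signal.SIGINT, lambda signum, frame: stop_requested.set())


def run_round(
    pages: Iterable[str],
    process: Callable[[str], None],
    persist: Callable[[], None],
) -> int:
    """Process pages until done or a stop is requested; always persist."""
    done = 0
    try:
        for page in pages:
            if stop_requested.is_set():
                break  # checked at every page boundary
            process(page)
            done += 1
    finally:
        persist()  # state is saved before the round returns
    return done
```

A caller would invoke `install_handler()` once at startup and then check `stop_requested` again at each stage boundary before starting the next stage.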
