Tune

The tune stage is where the crawler learns. After each round of fetching and scoring, it trains the DQN agent on the outcomes and updates all adaptive parameters. This is the feedback loop that makes forager improve over time rather than just repeating the same strategy.

Double DQN

Forager uses a Double DQN architecture to avoid the overestimation bias that plagues standard DQN:

  • Online network: selects the best action (which URL to fetch). This is the network that gets trained every round.
  • Target network: evaluates the value of that action. This is a frozen copy of the online network, updated less frequently (target_update_freq rounds).

The separation means the agent does not chase its own optimistic estimates. The online network picks what it thinks is best; the target network provides a more conservative value estimate.
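A minimal sketch of how this separation plays out when computing a training target (gamma and the reward value here are illustrative, not forager defaults):

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, reward, gamma=0.99):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    a_star = int(np.argmax(q_online_next))         # online net picks the best next action
    return reward + gamma * q_target_next[a_star]  # target net supplies its value estimate

# The online net overrates action 1; the target net's more conservative
# estimate of that same action is what enters the training target.
q_online = np.array([1.0, 5.0, 2.0])
q_target = np.array([1.1, 2.0, 1.9])
y = double_dqn_target(q_online, q_target, reward=1.0, gamma=0.9)
```

Note that the target uses the target network's value (2.0) for the action the online network chose, not the online network's own optimistic 5.0.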

Network architecture

The DQN is a small feedforward network:

Input (11) → Linear(30) → LeakyReLU → Linear(15) → LeakyReLU → Linear(1)

11 input features (the URL feature vector), two hidden layers of 30 and 15 units with LeakyReLU activations, and a single output: the estimated Q-value.

This is deliberately compact. The feature vector already encodes a lot of domain knowledge (depth, domain stats, parent score, anchor similarity), so the network’s job is to learn a relatively simple value function over those features.
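The forward pass can be sketched in a few lines of NumPy; the weight initialisation here is arbitrary, only the shapes mirror the documented architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

# Layer shapes follow the documented 11 -> 30 -> 15 -> 1 architecture.
W1, b1 = rng.normal(size=(30, 11)) * 0.1, np.zeros(30)
W2, b2 = rng.normal(size=(15, 30)) * 0.1, np.zeros(15)
W3, b3 = rng.normal(size=(1, 15)) * 0.1, np.zeros(1)

def q_value(features):
    """Map an 11-dim URL feature vector to a scalar Q-value estimate."""
    h1 = leaky_relu(W1 @ features + b1)
    h2 = leaky_relu(W2 @ h1 + b2)
    return float(W3 @ h2 + b3)

q = q_value(rng.normal(size=11))  # scalar Q-value for one candidate URL
```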

Prioritised Experience Replay

Not all training examples are equally useful. A transition where the model was badly wrong (high TD error) teaches more than one where the prediction was already close.

Prioritised Experience Replay (PER) samples transitions with probability proportional to their priority:

priority = |TD_error| + epsilon

where epsilon (per_epsilon, default: 1e-4) prevents any transition from having zero sampling probability. The per_alpha parameter (default: 0.6) controls how strongly priority affects sampling: 0 gives uniform sampling, 1 gives fully prioritised sampling.
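A sketch of the sampling scheme, using the documented per_alpha and per_epsilon defaults:

```python
import numpy as np

rng = np.random.default_rng(42)

def per_probabilities(td_errors, per_alpha=0.6, per_epsilon=1e-4):
    """priority = |TD_error| + epsilon, then raised to alpha and normalised."""
    priority = np.abs(td_errors) + per_epsilon
    scaled = priority ** per_alpha
    return scaled / scaled.sum()

td_errors = np.array([0.01, 0.5, 2.0, 0.0])
p = per_probabilities(td_errors)
# Sample a (toy) batch of 2 transitions, weighted by priority.
batch = rng.choice(len(td_errors), size=2, p=p, replace=False)
```

The transition with the largest TD error gets the highest sampling probability, while the zero-error transition still has a small nonzero chance thanks to per_epsilon.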

Importance sampling correction

Prioritised sampling introduces bias because high-priority transitions are over-represented. Importance sampling weights correct for this:

w_i = (N * P(i)) ^ (-beta)

The beta parameter anneals from a starting value toward 1.0 over the course of training. Early on, the correction is weak (the agent learns fast but with some bias). Later, it becomes exact (unbiased gradient updates for fine-tuning).
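A sketch of the correction, with a linear beta anneal (the starting value and schedule length are illustrative assumptions, as is normalising weights by their maximum, a common stabilising trick in PER implementations):

```python
import numpy as np

def anneal_beta(round_idx, beta_start=0.4, total_rounds=1000):
    """Linearly anneal beta from beta_start toward 1.0 over training."""
    return min(1.0, beta_start + (1.0 - beta_start) * round_idx / total_rounds)

def is_weights(probs, beta):
    """w_i = (N * P(i)) ^ (-beta), normalised so the largest weight is 1."""
    N = len(probs)
    w = (N * probs) ** (-beta)
    return w / w.max()

probs = np.array([0.1, 0.2, 0.3, 0.4])
w = is_weights(probs, anneal_beta(1000))  # beta has reached 1.0: full correction
```

At beta = 1 the product w_i * P(i) is the same for every transition, so the expected gradient matches what uniform sampling would give, i.e. the updates are unbiased.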

Epsilon schedule

The agent balances exploration and exploitation using an epsilon-greedy strategy:

  • With probability epsilon, pick a random URL (exploration).
  • With probability 1 - epsilon, pick the URL with the highest Q-value (exploitation).

Epsilon decays linearly from epsilon.start to epsilon.end over epsilon.decay_steps rounds. Early rounds explore broadly; later rounds exploit what the model has learned.
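The schedule can be sketched as follows; the start, end, and decay_steps values here are placeholders, not forager defaults:

```python
def epsilon_at(round_idx, start=1.0, end=0.05, decay_steps=500):
    """Linear decay from `start` to `end` over `decay_steps` rounds, then flat."""
    if round_idx >= decay_steps:
        return end
    frac = round_idx / decay_steps
    return start + frac * (end - start)

# Early rounds explore almost everything; later rounds mostly exploit.
eps_early, eps_mid, eps_late = epsilon_at(0), epsilon_at(250), epsilon_at(1000)
```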

Adaptive parameter updates

The DQN is not the only thing that learns. All Param<T> parameters across every stage update once per round during tune:

  • Scoring: signal weights, relevance threshold, semantic weight
  • Fetch: timeouts, jitter, redirect limits
  • Parse: link limits, propagation threshold, embed threshold
  • Select: domain caps, slow/unreliable thresholds
  • Tune: PER alpha, epsilon schedule

Parameters in auto mode are adjusted based on observed outcomes. Parameters in fixed mode stay constant. Parameters in range mode are bounded but learnable.
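A hypothetical sketch of the three modes; this is not forager's actual Param<T> API, just an illustration of the semantics described above:

```python
from dataclasses import dataclass

@dataclass
class Param:
    """Illustrative stand-in for Param<T>: auto, fixed, or range mode."""
    value: float
    mode: str = "auto"          # "auto" | "fixed" | "range"
    lo: float = float("-inf")   # bounds only used in range mode
    hi: float = float("inf")

    def update(self, delta):
        if self.mode == "fixed":
            return self.value           # fixed: never changes
        new = self.value + delta        # auto: freely adjusted by outcomes
        if self.mode == "range":
            new = min(max(new, self.lo), self.hi)  # range: learnable but clamped
        self.value = new
        return self.value

fixed = Param(1.0, mode="fixed")
auto = Param(1.0, mode="auto")
ranged = Param(1.0, mode="range", lo=0.0, hi=2.0)
```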

Training loop

flowchart TD
    results["Round results<br/>(URL, score, features)"]
    store["Store transitions<br/>in replay buffer"]
    sample["PER: sample batch<br/>by priority"]
    forward["Online net:<br/>select best action"]
    target["Target net:<br/>evaluate Q-value"]
    td["Compute TD error<br/>+ IS weights"]
    backprop["Backprop +<br/>update online net"]
    update_prio["Update priorities<br/>in replay buffer"]
    update_target{"Update target net?<br/>(every N rounds)"}
    params["Update all<br/>adaptive params"]
    blend["Blend reference<br/>embedding"]

    results --> store --> sample
    sample --> forward --> target --> td
    td --> backprop --> update_prio
    update_prio --> update_target
    update_target -->|"yes"| params
    update_target -->|"no"| params
    params --> blend

    style results fill:#2d2d2d,stroke:#555,color:#eee
    style store fill:#1a3a4a,stroke:#4a9,color:#eee
    style sample fill:#1a3a4a,stroke:#4a9,color:#eee
    style forward fill:#1a3a4a,stroke:#4a9,color:#eee
    style target fill:#1a3a4a,stroke:#4a9,color:#eee
    style td fill:#1a3a4a,stroke:#4a9,color:#eee
    style backprop fill:#1a3a1a,stroke:#4a4,color:#eee
    style update_prio fill:#1a3a4a,stroke:#4a9,color:#eee
    style update_target fill:#2d2d2d,stroke:#555,color:#eee
    style params fill:#1a3a1a,stroke:#4a4,color:#eee
    style blend fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [tune] config reference for all tuneable fields including replay buffer size, learning rate, epsilon schedule, and PER parameters.