Tune
The tune stage is where the crawler learns. After each round of fetching and scoring, it trains the DQN agent on the outcomes and updates all adaptive parameters. This is the feedback loop that makes forager improve over time rather than just repeating the same strategy.
Double DQN
Forager uses a Double DQN architecture to avoid the overestimation bias that plagues standard DQN:
- Online network: selects the best action (which URL to fetch). This is the network that gets trained every round.
- Target network: evaluates the value of that action. This is a frozen copy of the online network, updated less frequently (every target_update_freq rounds).
The separation means the agent does not chase its own optimistic estimates. The online network picks what it thinks is best; the target network provides a more conservative value estimate.
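This decoupling can be sketched in a few lines of numpy. The gamma value and the candidate-URL framing here are illustrative assumptions; the key point is that the online network does the argmax while the target network supplies the value:

```python
import numpy as np

def double_dqn_target(q_online, q_target, next_candidates, reward, gamma=0.99):
    """Double DQN target: the online net selects, the target net evaluates.

    q_online / q_target map a feature vector to a scalar Q-value.
    next_candidates is the list of candidate URL feature vectors.
    """
    # Online network picks the best next action (the best candidate URL)...
    best = int(np.argmax([q_online(f) for f in next_candidates]))
    # ...but the frozen target network provides its (more conservative) value.
    return reward + gamma * q_target(next_candidates[best])
```

Note that standard DQN would use `q_target` for both the argmax and the evaluation, which is what lets optimistic errors compound.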
Network architecture
The DQN is a small feedforward network:
Input (11) → Linear(30) → LeakyReLU → Linear(15) → LeakyReLU → Linear(1)
11 input features (the URL feature vector), two hidden layers of 30 and 15 units with LeakyReLU activations, and a single output: the estimated Q-value.
This is deliberately compact. The feature vector already encodes a lot of domain knowledge (depth, domain stats, parent score, anchor similarity), so the network’s job is to learn a relatively simple value function over those features.
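The forward pass is small enough to sketch directly in numpy (the weights below are random placeholders, not trained values):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU: identity for positive inputs, small slope for negative."""
    return np.where(x > 0, x, slope * x)

def q_value(features, params):
    """Forward pass: 11 -> 30 -> 15 -> 1, LeakyReLU between layers."""
    w1, b1, w2, b2, w3, b3 = params
    h = leaky_relu(features @ w1 + b1)   # (11,) -> (30,)
    h = leaky_relu(h @ w2 + b2)          # (30,) -> (15,)
    return float(h @ w3 + b3)            # (15,) -> scalar Q-value

# Placeholder parameters with the shapes described above.
rng = np.random.default_rng(0)
params = (rng.normal(0, 0.1, (11, 30)), np.zeros(30),
          rng.normal(0, 0.1, (30, 15)), np.zeros(15),
          rng.normal(0, 0.1, (15, 1)), np.zeros(1))
```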
Prioritised Experience Replay
Not all training examples are equally useful. A transition where the model was badly wrong (high TD error) teaches more than one where the prediction was already close.
Prioritised Experience Replay (PER) samples transitions with probability proportional to their priority:
priority = |TD_error| + epsilon
Where epsilon (per_epsilon, default: 1e-4) prevents any transition from having zero probability. The per_alpha parameter (default: 0.6) controls how strongly priority affects sampling – 0 gives uniform sampling, 1 gives fully prioritised sampling.
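The resulting sampling distribution can be sketched as follows (the function name is illustrative; the formula matches the priority definition above):

```python
import numpy as np

def per_probabilities(td_errors, per_alpha=0.6, per_epsilon=1e-4):
    """Sampling distribution over the replay buffer under PER."""
    priority = np.abs(td_errors) + per_epsilon  # epsilon keeps every p > 0
    scaled = priority ** per_alpha              # alpha=0 -> uniform, 1 -> fully prioritised
    return scaled / scaled.sum()
```

A transition would then be drawn with e.g. `np.random.default_rng().choice(len(buffer), p=per_probabilities(errors))`.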
Importance sampling correction
Prioritised sampling introduces bias because high-priority transitions are over-represented. Importance sampling weights correct for this:
w_i = (N * P(i)) ^ (-beta)
The beta parameter anneals from a starting value toward 1.0 over the course of training. Early on, the correction is weak (the agent learns fast but with some bias). Later, it becomes exact (unbiased gradient updates for fine-tuning).
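A sketch of the weight computation and the anneal. The starting beta, step count, and max-normalisation (a common stabilisation that keeps update magnitudes bounded) are illustrative assumptions, not values from this doc:

```python
import numpy as np

def anneal_beta(step, beta_start=0.4, anneal_steps=1000):
    """Linearly anneal beta from beta_start toward 1.0 (exact correction)."""
    frac = min(step / anneal_steps, 1.0)
    return beta_start + (1.0 - beta_start) * frac

def is_weights(probs, beta):
    """Importance-sampling weights w_i = (N * P(i))^(-beta).

    Normalised by the max weight so the largest correction is 1.0
    (an assumed convention, common in PER implementations).
    """
    w = (len(probs) * np.asarray(probs)) ** (-beta)
    return w / w.max()
```

Note how over-sampled (high-probability) transitions receive the smallest weights, cancelling their over-representation in the gradient.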
Epsilon schedule
The agent balances exploration and exploitation using an epsilon-greedy strategy:
- With probability epsilon, pick a random URL (exploration).
- With probability 1 - epsilon, pick the URL with the highest Q-value (exploitation).
Epsilon decays linearly from epsilon.start to epsilon.end over epsilon.decay_steps rounds. Early rounds explore broadly; later rounds exploit what the model has learned.
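The linear decay can be sketched as follows (the start, end, and decay_steps values below are placeholder assumptions, not forager's defaults):

```python
def epsilon_at(round_idx, start=1.0, end=0.05, decay_steps=500):
    """Linear decay from epsilon.start to epsilon.end over decay_steps rounds,
    then held constant at epsilon.end."""
    if round_idx >= decay_steps:
        return end
    frac = round_idx / decay_steps
    return start + (end - start) * frac
```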
Adaptive parameter updates
The DQN is not the only thing that learns. All Param<T> parameters across every stage update once per round during tune:
- Scoring: signal weights, relevance threshold, semantic weight
- Fetch: timeouts, jitter, redirect limits
- Parse: link limits, propagation threshold, embed threshold
- Select: domain caps, slow/unreliable thresholds
- Tune: PER alpha, epsilon schedule
Parameters in auto mode are adjusted based on observed outcomes. Parameters in fixed mode stay constant. Parameters in range mode are bounded but learnable.
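An illustrative sketch of how the three modes might gate a per-round update (the mode names come from this doc; the update rule itself is an assumption, not forager's implementation):

```python
def update_param(value, delta, mode, bounds=None):
    """Apply a proposed adjustment according to the parameter's mode.

    fixed  -> never changes
    auto   -> adjusted freely from observed outcomes
    range  -> adjusted, then clamped to the configured bounds
    """
    if mode == "fixed":
        return value
    new = value + delta
    if mode == "range" and bounds is not None:
        lo, hi = bounds
        new = min(max(new, lo), hi)
    return new
```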
Training loop
flowchart TD
results["Round results<br/>(URL, score, features)"]
store["Store transitions<br/>in replay buffer"]
sample["PER: sample batch<br/>by priority"]
forward["Online net:<br/>select best action"]
target["Target net:<br/>evaluate Q-value"]
td["Compute TD error<br/>+ IS weights"]
backprop["Backprop +<br/>update online net"]
update_prio["Update priorities<br/>in replay buffer"]
update_target{"Update target net?<br/>(every N rounds)"}
params["Update all<br/>adaptive params"]
blend["Blend reference<br/>embedding"]
results --> store --> sample
sample --> forward --> target --> td
td --> backprop --> update_prio
update_prio --> update_target
update_target -->|"yes"| params
update_target -->|"no"| params
params --> blend
style results fill:#2d2d2d,stroke:#555,color:#eee
style store fill:#1a3a4a,stroke:#4a9,color:#eee
style sample fill:#1a3a4a,stroke:#4a9,color:#eee
style forward fill:#1a3a4a,stroke:#4a9,color:#eee
style target fill:#1a3a4a,stroke:#4a9,color:#eee
style td fill:#1a3a4a,stroke:#4a9,color:#eee
style backprop fill:#1a3a1a,stroke:#4a4,color:#eee
style update_prio fill:#1a3a4a,stroke:#4a9,color:#eee
style update_target fill:#2d2d2d,stroke:#555,color:#eee
style params fill:#1a3a1a,stroke:#4a4,color:#eee
style blend fill:#1a3a1a,stroke:#4a4,color:#eee
Configuration
See the [tune] config reference for all tuneable fields including replay buffer size, learning rate, epsilon schedule, and PER parameters.