[tune] – Tune Configuration

The [tune] section controls the DQN training loop and experience replay. See Tune for a detailed explanation of the learning process.

Fields

```toml
[tune]
replay_capacity    = 50000
batch_size         = 64
learning_rate      = 0.001
gamma              = 0.99
lr_decay           = 0.995
replay_period      = 4
target_update_freq = 100
min_replay_size    = 500
per_alpha          = { value = 0.6, mode = "auto" }
per_epsilon        = { value = 0.0001, mode = "fixed" }

[tune.epsilon]
start       = 1.0
end         = 0.05
decay_steps = 500
```
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `replay_capacity` | `usize` | `50000` | Maximum transitions stored in the replay buffer |
| `batch_size` | `usize` | `64` | Transitions sampled per training step |
| `learning_rate` | `f64` | `0.001` | Adam optimiser learning rate |
| `gamma` | `f64` | `0.99` | Discount factor for future rewards |
| `lr_decay` | `f64` | `0.995` | Learning rate decay multiplier applied each round |
| `replay_period` | `usize` | `4` | Train every N transitions (not every round) |
| `target_update_freq` | `usize` | `100` | Rounds between target network updates |
| `min_replay_size` | `usize` | `500` | Minimum buffer size before training begins |
| `per_alpha` | `Param<f64>` | `auto(0.6)` | PER prioritisation exponent (0 = uniform, 1 = full) |
| `per_epsilon` | `Param<f64>` | `fixed(1e-4)` | Small constant added to TD error to prevent zero priority |

Replay buffer

The replay buffer stores (state, action, reward, next_state) transitions from every page fetched. Once it reaches replay_capacity, old transitions are evicted. Training does not start until the buffer has at least min_replay_size entries, ensuring the agent has enough experience for meaningful gradient updates.
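The eviction and readiness logic described above can be sketched as a simple double-ended queue. This is an illustrative sketch only; the `Transition` and `ReplayBuffer` names and fields are assumptions, not the crawler's actual types:

```rust
// Minimal replay-buffer sketch: fixed capacity, oldest-first eviction,
// and a readiness gate corresponding to `min_replay_size`.
use std::collections::VecDeque;

struct Transition {
    state: Vec<f64>,
    action: usize,
    reward: f64,
    next_state: Vec<f64>,
}

struct ReplayBuffer {
    buf: VecDeque<Transition>,
    capacity: usize, // replay_capacity
    min_size: usize, // min_replay_size
}

impl ReplayBuffer {
    fn new(capacity: usize, min_size: usize) -> Self {
        Self { buf: VecDeque::with_capacity(capacity), capacity, min_size }
    }

    fn push(&mut self, t: Transition) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front(); // evict the oldest transition
        }
        self.buf.push_back(t);
    }

    /// Training runs only once at least `min_replay_size` entries exist.
    fn ready(&self) -> bool {
        self.buf.len() >= self.min_size
    }
}
```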

Learning rate

learning_rate is the initial rate for the Adam optimiser. lr_decay multiplies it after each round, producing exponential decay. This lets the agent make large updates early (when it knows little) and fine-tune later.
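This schedule is plain exponential decay. As a sketch (a hypothetical helper, not the crawler's code), the effective rate at a given round is:

```rust
// Effective learning rate after `round` applications of `lr_decay`.
// Hypothetical helper mirroring the config defaults.
fn lr_at_round(initial_lr: f64, lr_decay: f64, round: u32) -> f64 {
    initial_lr * lr_decay.powi(round as i32)
}
```

With the defaults (`learning_rate = 0.001`, `lr_decay = 0.995`), the rate after 100 rounds is about 0.0006.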

Discount factor

gamma controls how much the agent values future rewards versus immediate ones. At 0.99, a reward 100 steps in the future is worth about 37% of an immediate reward. Lower values make the agent more short-sighted.
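The 37% figure is just `gamma` raised to the number of steps; a one-line sketch (hypothetical helper):

```rust
// Present value of a reward arriving `steps` rounds in the future.
fn discounted(reward: f64, gamma: f64, steps: u32) -> f64 {
    gamma.powi(steps as i32) * reward
}
```

At `gamma = 0.99` and 100 steps, `0.99^100 ≈ 0.366`, i.e. roughly 37% of an immediate reward.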

[tune.epsilon]

The epsilon-greedy exploration schedule.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `start` | `f64` | `1.0` | Initial exploration rate (100% random) |
| `end` | `f64` | `0.05` | Final exploration rate |
| `decay_steps` | `usize` | `500` | Rounds over which epsilon decays |

Epsilon decays linearly from start to end over decay_steps rounds. At start = 1.0, every action in round 1 is random; once decay_steps rounds have passed, epsilon holds at end, so only 5% of actions are random.
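The linear schedule can be sketched as follows (a hypothetical helper, not the crawler's implementation):

```rust
// Linear epsilon interpolation from `start` to `end` over `decay_steps`
// rounds, clamped at `end` afterwards. Hypothetical helper.
fn epsilon_at(start: f64, end: f64, decay_steps: usize, round: usize) -> f64 {
    if round >= decay_steps {
        return end;
    }
    start + (end - start) * (round as f64 / decay_steps as f64)
}
```

With the defaults, epsilon is 1.0 at round 0, about 0.525 halfway through, and 0.05 from round 500 onwards.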

PER parameters

per_alpha controls how aggressively the replay buffer favours high-error transitions. At 0.0, all transitions are sampled uniformly. At 1.0, sampling is fully proportional to TD error. The default of 0.6 is a common sweet spot. In auto mode, the crawler adjusts this based on training stability.

per_epsilon is a small constant added to every transition’s priority so that no transition ever has exactly zero probability of being sampled. This is fixed at 1e-4 and rarely needs changing.
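The interaction of these two parameters follows the standard PER priority formula, `p = (|td_error| + per_epsilon)^per_alpha`, with sampling probability proportional to `p`. A sketch under that assumption (hypothetical helper, not the crawler's sampler):

```rust
// Per-transition sampling probabilities for prioritised replay:
// priority = (|td_error| + eps)^alpha, normalised over the batch.
// At alpha = 0 this degenerates to uniform sampling.
fn sample_probs(td_errors: &[f64], alpha: f64, eps: f64) -> Vec<f64> {
    let priorities: Vec<f64> = td_errors
        .iter()
        .map(|e| (e.abs() + eps).powf(alpha))
        .collect();
    let total: f64 = priorities.iter().sum();
    priorities.iter().map(|p| p / total).collect()
}
```

Note how `eps` guarantees a nonzero priority even when the TD error is exactly zero, so every transition keeps some chance of being replayed.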