# [tune] – Tune Configuration
The [tune] section controls the DQN training loop and experience replay. See Tune for a detailed explanation of the learning process.
## Fields
```toml
[tune]
replay_capacity = 50000
batch_size = 64
learning_rate = 0.001
gamma = 0.99
lr_decay = 0.995
replay_period = 4
target_update_freq = 100
min_replay_size = 500
per_alpha = { value = 0.6, mode = "auto" }
per_epsilon = { value = 0.0001, mode = "fixed" }

[tune.epsilon]
start = 1.0
end = 0.05
decay_steps = 500
```
| Field | Type | Default | Description |
|---|---|---|---|
| replay_capacity | usize | 50000 | Maximum transitions stored in the replay buffer |
| batch_size | usize | 64 | Transitions sampled per training step |
| learning_rate | f64 | 0.001 | Adam optimiser learning rate |
| gamma | f64 | 0.99 | Discount factor for future rewards |
| lr_decay | f64 | 0.995 | Learning-rate decay multiplier applied each round |
| replay_period | usize | 4 | Train every N transitions (not every round) |
| target_update_freq | usize | 100 | Rounds between target-network updates |
| min_replay_size | usize | 500 | Minimum buffer size before training begins |
| per_alpha | Param<f64> | auto(0.6) | PER prioritisation exponent (0 = uniform, 1 = fully proportional) |
| per_epsilon | Param<f64> | fixed(1e-4) | Small constant added to the TD error so no priority is zero |
## Replay buffer
The replay buffer stores (state, action, reward, next_state) transitions from every page fetched. Once the buffer reaches replay_capacity, the oldest transitions are evicted to make room. Training does not start until the buffer holds at least min_replay_size entries, so the agent has enough varied experience for meaningful gradient updates.
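The eviction and warm-up behaviour described above can be sketched as follows. This is a minimal illustration, not the crawler's actual implementation; the class and field names simply mirror the config keys.

```python
from collections import deque

class ReplayBuffer:
    """Illustrative sketch: bounded FIFO storage with a warm-up threshold."""

    def __init__(self, replay_capacity, min_replay_size):
        # deque with maxlen evicts the oldest entry automatically when full
        self.buf = deque(maxlen=replay_capacity)
        self.min_replay_size = min_replay_size

    def push(self, transition):
        # transition = (state, action, reward, next_state)
        self.buf.append(transition)

    def ready(self):
        # Training only begins once the warm-up threshold is met.
        return len(self.buf) >= self.min_replay_size

rb = ReplayBuffer(replay_capacity=3, min_replay_size=2)
rb.push(("s0", 0, 1.0, "s1"))
assert not rb.ready()            # below min_replay_size: no training yet
rb.push(("s1", 1, 0.5, "s2"))
assert rb.ready()
rb.push(("s2", 0, 0.0, "s3"))
rb.push(("s3", 1, 1.0, "s4"))    # exceeds capacity: oldest transition dropped
assert len(rb.buf) == 3 and rb.buf[0][0] == "s1"
```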
## Learning rate
learning_rate is the initial rate for the Adam optimiser. lr_decay multiplies it after each round, producing exponential decay. This lets the agent make large updates early (when it knows little) and fine-tune later.
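The decay schedule is easy to sketch. The helper below is illustrative (not part of the crawler's API); with the defaults, the rate falls to roughly 61% of its initial value after 100 rounds.

```python
def lr_at_round(learning_rate, lr_decay, round_idx):
    # Exponential decay: the rate is multiplied by lr_decay once per round.
    return learning_rate * lr_decay ** round_idx

assert lr_at_round(0.001, 0.995, 0) == 0.001
# After 100 rounds: 0.001 * 0.995**100 ≈ 0.000606
assert abs(lr_at_round(0.001, 0.995, 100) - 0.000606) < 1e-5
```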
## Discount factor
gamma controls how much the agent values future rewards versus immediate ones. At 0.99, a reward 100 steps in the future is worth about 37% of an immediate reward. Lower values make the agent more short-sighted.
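The "37% at 100 steps" figure comes directly from the geometric discount gamma**k. A quick check (the helper function is just for illustration):

```python
def discounted(reward, gamma, k):
    # Present value of a reward received k steps in the future.
    return reward * gamma ** k

# gamma = 0.99: a reward 100 steps away keeps about 37% of its value.
assert abs(discounted(1.0, 0.99, 100) - 0.366) < 0.001
# A lower gamma is far more short-sighted: at 0.9 the same reward is
# worth essentially nothing.
assert discounted(1.0, 0.9, 100) < 1e-4
```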
## [tune.epsilon]
The epsilon-greedy exploration schedule.
| Field | Type | Default | Description |
|---|---|---|---|
| start | f64 | 1.0 | Initial exploration rate (100% random) |
| end | f64 | 0.05 | Final exploration rate |
| decay_steps | usize | 500 | Rounds over which epsilon decays |
Epsilon decays linearly from start to end over decay_steps rounds. At start = 1.0, every action in the first round is random; once decay_steps rounds have passed, epsilon stays at end, so with the defaults only 5% of actions remain random.
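The linear schedule above can be written in a few lines. The function name and defaults here are illustrative; only the interpolation matches the documented behaviour.

```python
def epsilon_at(round_idx, start=1.0, end=0.05, decay_steps=500):
    # Linear interpolation from start to end, clamped once decay_steps is reached.
    if round_idx >= decay_steps:
        return end
    return start + (end - start) * round_idx / decay_steps

assert epsilon_at(0) == 1.0                    # fully random at the start
assert abs(epsilon_at(250) - 0.525) < 1e-9     # halfway through the decay
assert epsilon_at(500) == 0.05                 # floor reached
assert epsilon_at(10_000) == 0.05              # stays at the floor afterwards
```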
## PER parameters
per_alpha controls how aggressively the replay buffer favours high-error transitions. At 0.0, all transitions are sampled uniformly. At 1.0, sampling is fully proportional to TD error. The default of 0.6 is a common sweet spot. In auto mode, the crawler adjusts this based on training stability.
per_epsilon is a small constant added to each transition’s absolute TD error when computing its priority, so that no transition ever has exactly zero probability of being sampled. It defaults to fixed(1e-4) and rarely needs changing.
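Assuming the standard PER priority form priority = (|td_error| + per_epsilon) ** per_alpha (the usual formulation; the crawler's exact implementation may differ), the interaction of the two parameters looks like this:

```python
def per_probabilities(td_errors, per_alpha=0.6, per_epsilon=1e-4):
    # priority_i = (|delta_i| + epsilon) ** alpha, normalised to probabilities.
    priorities = [(abs(d) + per_epsilon) ** per_alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

# alpha = 0 gives uniform sampling regardless of TD error:
uniform = per_probabilities([0.1, 5.0, 0.0], per_alpha=0.0)
assert all(abs(p - 1 / 3) < 1e-9 for p in uniform)

# With the default alpha, high-error transitions are favoured, but the
# zero-error transition still has nonzero probability thanks to per_epsilon:
probs = per_probabilities([0.1, 5.0, 0.0])
assert probs[1] > probs[0] > probs[2] > 0
```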