Technical Reference
Complete specification of every computation, formula, and parameter in the forager pipeline.
Pipeline Overview
flowchart TB
subgraph ROUND["Each Round"]
direction TB
SELECT["<b>Select</b><br/>DQN ranks frontier candidates<br/>Filter: already fetched, domain capped, unreliable"]
FETCH["<b>Fetch</b><br/>Concurrent HTTP with stealth headers<br/>Per-request timeout + jitter from params"]
PARSE["<b>Parse</b><br/>Parallel HTML → title, headings, body<br/>Adaptive truncation per domain"]
SCORE["<b>Score</b><br/>Semantic + keyword blend<br/>Language + domain weight adjustments"]
DISCOVER["<b>Discover</b><br/>Extract links, batch anchor embeddings<br/>Skip GPU for low-scoring pages"]
PROCESS["<b>Process</b><br/>Compute 11-dim feature vectors<br/>Build children if score ≥ propagation_threshold"]
INTEGRATE["<b>Integrate</b><br/>DB write, learner observations<br/>Domain profiles, DQN training"]
LEARN["<b>Learn</b><br/>All param groups update<br/>Reference embedding refines"]
PERSIST["<b>Persist</b><br/>Param groups, frontier tree<br/>DQN model, domain profiles"]
SELECT --> FETCH --> PARSE --> SCORE --> DISCOVER --> PROCESS --> INTEGRATE --> LEARN --> PERSIST
end
PERSIST -.->|"next round"| SELECT
Scoring System
Total Score
flowchart LR
subgraph SEMANTIC["Semantic Similarity"]
REF["Reference<br/>embedding<br/>(384-dim)"]
ANTI["Anti-reference<br/>embedding"]
TITLE["title embedding"]
HEAD["heading embedding"]
BODY["body embedding"]
REF --> |"cosine sim"| TA["title_aff"]
REF --> |"cosine sim"| HA["heading_aff"]
REF --> |"cosine sim"| BA["body_aff"]
ANTI --> |"penalty"| TA & HA & BA
end
subgraph KEYWORD["Keyword Matching"]
TERMS["term groups<br/>(required + optional)"]
TERMS --> DENSITY["keyword_density"]
end
TA & HA & BA --> BLEND["signal blend"]
BLEND --> AFFINITY["affinity"]
AFFINITY & DENSITY --> TOTAL["total_score"]
Formulas
Per-signal affinity (with anti-reference penalty):
affinity(signal) = max(0, cos(reference, signal_emb) - anti_w × cos(anti_ref, signal_emb))
Multi-signal blend (learnable weights, sum to 1.0):
semantic_affinity = tw × title_aff + hw × heading_aff + bw × body_aff
Total score (semantic + keyword blend):
total = clamp(sem_w × semantic_affinity + (1 - sem_w) × keyword_density, 0, 1)
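The three formulas above can be sketched in Python. This is a minimal illustration, not the pipeline's API: the function names are assumed, and the default weights (tw=0.4, hw=0.3, bw=0.3, sem_w=0.7) come from the parameter table below.

```python
import math

def cos(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def affinity(reference, anti_ref, signal_emb, anti_w=0.3):
    """Per-signal affinity with the anti-reference penalty, floored at 0."""
    return max(0.0, cos(reference, signal_emb) - anti_w * cos(anti_ref, signal_emb))

def total_score(title_aff, heading_aff, body_aff, keyword_density,
                tw=0.4, hw=0.3, bw=0.3, sem_w=0.7):
    """Blend the three signal affinities, then blend semantic vs keyword."""
    semantic_affinity = tw * title_aff + hw * heading_aff + bw * body_aff
    blended = sem_w * semantic_affinity + (1 - sem_w) * keyword_density
    return min(1.0, max(0.0, blended))  # clamp to [0, 1]
```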
Keyword density — flat terms:
density = min(1.0, (Σ(count_i × weight_i) / word_count) × 100)
Keyword density — with term groups (required groups use geometric mean):
required_score = exp(Σ(group_weight × ln(group_density)) / Σ(group_weight))
total = min(1.0, required_score + optional_sum × 0.1)
If any required group has zero density → entire score = 0.
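A sketch of the term-group density under assumed data shapes: `required_groups` pairs each group weight with its per-term `(count, term_weight)` list. The names are illustrative only; the geometric-mean and zero-gate behavior follow the formulas above.

```python
import math

def group_density(counts_weights, word_count):
    """Flat density for one group: capped weighted count per 100 words."""
    raw = (sum(c * w for c, w in counts_weights) / word_count) * 100
    return min(1.0, raw)

def keyword_density(required_groups, optional_densities, word_count):
    """Weighted geometric mean over required groups plus optional bonus."""
    total_w = sum(gw for gw, _ in required_groups)
    log_sum = 0.0
    for gw, counts in required_groups:
        d = group_density(counts, word_count)
        if d == 0.0:
            return 0.0  # any empty required group zeroes the whole score
        log_sum += gw * math.log(d)
    required_score = math.exp(log_sum / total_w)
    return min(1.0, required_score + sum(optional_densities) * 0.1)
```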
Score Adjustments (pipeline)
adjusted_score = raw_score × lang_factor × domain_factor
lang_factor   = lang_penalty           (if page language ∉ accepted languages), else 1.0
domain_factor = domain_weights[domain] (if configured for the domain),          else 1.0
Reference Blending (end of round)
centroid = mean(relevant_page_embeddings)
reference = (1 - blend) × reference + blend × centroid
reference = reference / ‖reference‖
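The blend-and-renormalize step can be sketched in a few lines of Python (function and argument names are assumed; the update itself follows the two formulas above):

```python
import math

def blend_reference(reference, centroid, blend=0.1):
    """EMA-style pull toward the round's relevant-page centroid, then renormalize."""
    mixed = [(1 - blend) * r + blend * c for r, c in zip(reference, centroid)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed] if norm else reference
```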
Feature Vector
An 11-dimensional feature vector is computed for each discovered link; it is the DQN’s input.
flowchart LR
subgraph FEATURES["Feature Vector [0..10]"]
direction TB
F0["<b>[0]</b> parent_relevant<br/>1.0 if parent scored above threshold, else 0.0"]
F1["<b>[1]</b> inverse_distance_to_relevant<br/>1 / (hops_since_relevant_ancestor + 1)"]
F2["<b>[2]</b> path_relevance_ratio<br/>relevant_ancestors / total_ancestors"]
F3["<b>[3]</b> keyword_in_url<br/>1.0 if any term appears in URL, else 0.0"]
F4["<b>[4]</b> keyword_in_anchor<br/>1.0 if any term appears in anchor text, else 0.0"]
F5["<b>[5]</b> anchor_relevance<br/>cosine sim of anchor embedding vs reference"]
F6["<b>[6]</b> domain_reward<br/>avg reward for this domain (0.0 if unknown)"]
F7["<b>[7]</b> domain_novelty<br/>1.0 if domain never seen, else 0.0"]
F8["<b>[8]</b> ancestor_depth<br/>min(total_ancestors + 1, 20) / 20"]
F9["<b>[9]</b> required_signal_proximity<br/>1 / (min(hops_since_required_match, 20) + 1)"]
F10["<b>[10]</b> parent_semantic_distance<br/>cosine sim of parent body embedding vs reference"]
end
All features are pre-scaled to approximately [0, 1]. No additional normalization.
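The eleven slots above can be assembled in one small helper. This is a Python sketch with assumed argument names, mirroring the per-slot definitions in the diagram:

```python
def link_features(parent_relevant, hops_since_relevant, relevant_ancestors,
                  total_ancestors, kw_in_url, kw_in_anchor, anchor_sim,
                  domain_reward, domain_seen, hops_since_required, parent_sim):
    """Assemble the 11-dim DQN input; every slot lands in roughly [0, 1]."""
    return [
        1.0 if parent_relevant else 0.0,                                   # [0]
        1.0 / (hops_since_relevant + 1),                                   # [1]
        relevant_ancestors / total_ancestors if total_ancestors else 0.0,  # [2]
        1.0 if kw_in_url else 0.0,                                         # [3]
        1.0 if kw_in_anchor else 0.0,                                      # [4]
        anchor_sim,                                                        # [5]
        domain_reward,                # 0.0 for unknown domains            # [6]
        0.0 if domain_seen else 1.0,                                       # [7]
        min(total_ancestors + 1, 20) / 20,                                 # [8]
        1.0 / (min(hops_since_required, 20) + 1),                          # [9]
        parent_sim,                                                        # [10]
    ]
```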
DQN Agent
Network Architecture
flowchart LR
IN["features<br/>[batch, 11]"] --> L1["Linear(11 → 30)"] --> A1["LeakyReLU<br/>α = 0.1"] --> L2["Linear(30 → 15)"] --> A2["LeakyReLU<br/>α = 0.1"] --> L3["Linear(15 → 1)"] --> Q["Q-value"]
LeakyReLU: f(x) = max(x, 0.1 × x)
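The forward pass of this 11 → 30 → 15 → 1 network can be sketched with NumPy. The weights below are random placeholders, not a trained model; only the shapes and the LeakyReLU follow the architecture above.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """f(x) = max(x, alpha * x), elementwise."""
    return np.maximum(x, alpha * x)

def q_forward(features, w1, b1, w2, b2, w3, b3):
    """Q-network forward pass: [batch, 11] -> [batch, 1]."""
    h1 = leaky_relu(features @ w1 + b1)   # [batch, 30]
    h2 = leaky_relu(h1 @ w2 + b2)         # [batch, 15]
    return h2 @ w3 + b3                   # [batch, 1] Q-values

# Shape check with random (untrained) weights:
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 11))
q = q_forward(x,
              rng.normal(size=(11, 30)), np.zeros(30),
              rng.normal(size=(30, 15)), np.zeros(15),
              rng.normal(size=(15, 1)), np.zeros(1))
```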
Double DQN Training
flowchart TB
SAMPLE["Sample batch from PER buffer<br/>(stratified by priority)"]
SAMPLE --> ARGMAX["<b>Online net:</b> find best next action<br/>a* = argmax_a Q_online(s', a)<br/>(batched forward pass)"]
ARGMAX --> TARGET["<b>Target net:</b> evaluate best action<br/>Q_target(s', a*)<br/>(batched forward pass)"]
TARGET --> TD["<b>TD target:</b><br/>y = r + γ × Q_target(s', a*)"]
TD --> LOSS["<b>Weighted MSE loss:</b><br/>L = mean(w_i × (Q_online(s) - y)²)<br/>w_i = importance sampling weights"]
LOSS --> BACKWARD["Backward + Adam optimizer step"]
BACKWARD --> PRIORITY["Update priorities:<br/>priority_i = |td_error_i| + ε"]
TD target formula:
y_i = r_i + γ × Q_target(s', argmax_a Q_online(s', a))
y_i = r_i (if no next actions)
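The TD-target computation can be sketched as follows (a minimal illustration with assumed names; `None` stands in for a transition with no next actions):

```python
import numpy as np

def td_targets(rewards, next_q_online, next_q_target, gamma=0.99):
    """Double DQN targets: the online net selects a*, the target net evaluates it.

    next_q_online / next_q_target: per-transition arrays of Q-values for s',
    or None when the transition has no next actions.
    """
    ys = []
    for r, qo, qt in zip(rewards, next_q_online, next_q_target):
        if qo is None:
            ys.append(r)                       # y_i = r_i
        else:
            a_star = int(np.argmax(qo))        # selection by the online net
            ys.append(r + gamma * qt[a_star])  # evaluation by the target net
    return np.array(ys)
```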
Training schedule:
train every: replay_period steps (default: 3)
min buffer size: min_replay_size (default: 64)
batch size: batch_size (default: 60)
target sync: every target_update_freq steps (default: 500)
LR decay: current_lr *= lr_decay at each target sync
Epsilon schedule:
if steps < decay_steps:
ε = ε_start + (ε_end - ε_start) × steps / decay_steps
else:
ε = ε_end
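The schedule is a plain linear anneal; in Python (the defaults here are placeholders, not the pipeline's configured values):

```python
def epsilon(steps, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linear anneal from eps_start to eps_end over decay_steps, then flat."""
    if steps < decay_steps:
        return eps_start + (eps_end - eps_start) * steps / decay_steps
    return eps_end
```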
Prioritised Experience Replay
Sampling probability:
P(i) = priority_i^α / Σ priority_j^α
Stratified sampling: divide the cumulative distribution into batch_size equal segments, sample one from each.
Importance sampling weights (bias correction):
w_i = (N × P(i))^(-β) / max_j(w_j)
β = min(1.0, 0.4 + 0.6 × steps / decay_steps)
β anneals from 0.4 → 1.0 over training. At β=1.0, full bias correction.
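Stratified sampling and the importance-sampling weights can be sketched together (a hedged illustration; the real buffer presumably uses a sum-tree, while this linear scan keeps the math visible):

```python
import random

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4):
    """One draw from each equal slice of the cumulative priority mass."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    cum, acc = [], 0.0
    for s in scaled:                      # cumulative distribution
        acc += s
        cum.append(acc)
    n = len(priorities)
    idxs, weights = [], []
    for k in range(batch_size):
        lo = total * k / batch_size       # segment boundaries
        hi = total * (k + 1) / batch_size
        target = random.uniform(lo, hi)
        i = next((j for j, c in enumerate(cum) if c >= target), n - 1)
        idxs.append(i)
        prob = scaled[i] / total          # P(i)
        weights.append((n * prob) ** (-beta))
    w_max = max(weights)
    return idxs, [w / w_max for w in weights]  # normalized by max weight
```

With uniform priorities every P(i) = 1/N, so all importance weights collapse to 1.0, which is the expected no-correction case.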
TRES Tree Frontier
flowchart TB
ROOT["Root leaf<br/>(all URLs)"]
ROOT -->|"split on feature[k] at threshold t"| LEFT["Left leaf<br/>feature[k] ≤ t"]
ROOT --> RIGHT["Right leaf<br/>feature[k] > t"]
LEFT -->|"further split"| LL["..."] & LR["..."]
SAMPLE["<b>Candidate selection:</b><br/>1 random URL per leaf<br/>O(leaves) instead of O(frontier)"]
Split Criterion (CART regression on reward)
variance(values) = Σ(v - mean)² / n
variance_reduction = var(parent) - (n_L/n) × var(left) - (n_R/n) × var(right)
Split on the (feature, threshold) pair that maximizes variance reduction.
Constraints:
min child size = max(min_samples_per_split, ⌊0.15 × n_parent⌋)
max tree depth = max_depth (from config)
Domain Capping
domain_fetch_count[d] += 1 per fetch
capped = domain_fetch_count[d] ≥ domain_max_pages
Capped domains are excluded from candidate selection.
Adaptive Parameter Learning
All learning requires a minimum number of observations before updating. Each group runs update() once per round.
[score] Parameters
flowchart LR
subgraph SCORE_LEARN["Score Learning (per round)"]
OBS["Page observations<br/>(reservoir sampled, max 2000)"]
OBS --> RT["<b>relevance_threshold</b><br/>P25 of nonzero scores<br/>(needs ≥ 20 obs, ≥ 10 nonzero)"]
OBS --> SW["<b>signal weights</b><br/>proportional to which signal<br/>best predicts relevance<br/>(needs ≥ 10 signal hits)"]
OBS --> SEM["<b>semantic_weight</b><br/>avg_semantic / (avg_semantic + avg_keyword)<br/>among relevant pages<br/>(needs ≥ 5 relevant)"]
end
Signal weight normalization:
tw = title_hits / total_signal_hits
hw = heading_hits / total_signal_hits
bw = body_hits / total_signal_hits
normalize: tw, hw, bw = tw/sum, hw/sum, bw/sum
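In Python, with a fallback to the defaults from the parameter table when no signal has hit yet (the fallback behavior is an assumption, not documented above):

```python
def signal_weights(title_hits, heading_hits, body_hits):
    """Hit counts -> normalized (tw, hw, bw) summing to 1.0."""
    total = title_hits + heading_hits + body_hits
    if total == 0:
        return 0.4, 0.3, 0.3  # assumed fallback: the configured defaults
    return title_hits / total, heading_hits / total, body_hits / total
```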
[fetch] Parameters
| Parameter | Formula | Conditions |
|---|---|---|
| timeout_ms | clamp(P95(durations) × 1.5, 1000, 30000) | ≥ 20 fetches, ≥ 10 durations |
| connect_timeout_ms | clamp(P50(durations) × 0.75, 500, 10000) | ≥ 20 fetches, ≥ 10 durations |
| jitter_max_ms | current × 1.5 if >5% rate-limited, × 1.2 if >1%, × 0.9 if safe | ≥ 20 fetches |
| max_links_per_page | clamp(avg_links × 2, 50, 500) | ≥ 10 pages, >10% change |
| embed_threshold_factor | linear map: kw_ratio 0.05 → 0.3, 0.30 → 0.7 | ≥ 10 pages |
| propagation_threshold | 0.0 if >30% yield, 0.02 if >10%, 0.05 otherwise | ≥ 10 low-score pages |
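The two timeout rules can be sketched with a simple nearest-rank percentile (the percentile method is an assumption; only the multipliers and clamp bounds come from the table):

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy (simple approximation)."""
    s = sorted(values)
    idx = min(len(s) - 1, int(p / 100 * len(s)))
    return s[idx]

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def learn_timeouts(durations_ms):
    """[fetch] timeout rules: P95 × 1.5 and P50 × 0.75, clamped."""
    timeout = clamp(percentile(durations_ms, 95) * 1.5, 1000, 30000)
    connect = clamp(percentile(durations_ms, 50) * 0.75, 500, 10000)
    return timeout, connect
```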
[select] Parameters
| Parameter | Formula | Conditions |
|---|---|---|
| domain_max_pages | median exhaustion point: rolling-5 avg drops below 50% of mean | ≥ 3 domains with ≥ 5 pages |
| slow_domain_ms | clamp(P90(latencies), 500, 15000) | ≥ 10 observations, ≥ 10 latencies |
| unreliable_threshold | clamp(P10(success_rates), 0.1, 0.6) | ≥ 10 observations, ≥ 5 rates |
| html_limit_factor | not yet learned (config knob) | — |
[tune] Parameters
| Parameter | Formula | Conditions |
|---|---|---|
| per_alpha | clamp(0.3 + (CV - 0.5) / 1.5 × 0.5, 0.3, 0.8), where CV = σ/μ of TD errors | ≥ 20 steps, ≥ 10 errors |
| per_epsilon | not learned (fixed priority floor) | — |
Domain Profiling
flowchart LR
FETCH_EVENT["fetch event"] --> PROFILE["DomainProfile"]
PROFILE --> AVG_F["avg_fetch_ms<br/>running mean"]
PROFILE --> AVG_P["avg_parse_ms<br/>running mean"]
PROFILE --> AVG_H["avg_html_bytes<br/>running mean"]
PROFILE --> SR["success_rate<br/>successes / fetches"]
PROFILE --> AR["avg_reward<br/>reward_sum / fetches"]
AVG_F & AVG_P --> SLOW{"is_slow?<br/>fetch+parse > slow_domain_ms<br/>(after ≥ 3 fetches)"}
SR --> UNRELIABLE{"is_unreliable?<br/>success_rate < threshold<br/>(after ≥ 5 fetches)"}
AVG_H --> HTML{"html_limit<br/>avg_bytes × factor<br/>clamped [50KB, 2MB]<br/>(default 512KB until ≥ 3 fetches)"}
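The html_limit rule in the diagram can be sketched directly (a hedged illustration; the 512 KB warm-up default and the [50 KB, 2 MB] clamp come from the diagram, the function name is assumed):

```python
def html_limit(avg_html_bytes, factor=2.0, fetches=0):
    """Adaptive truncation limit: avg size × factor, clamped to [50 KB, 2 MB]."""
    if fetches < 3:
        return 512 * 1024  # default until the profile has >= 3 fetches
    limit = int(avg_html_bytes * factor)
    return max(50 * 1024, min(2 * 1024 * 1024, limit))
```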
Data Flow
flowchart TB
CONFIG["TOML Config<br/>user intent + defaults"] --> INIT["init()"]
DB_RESTORE["DB: ParamGroup nodes<br/>DQN model, frontier tree"] --> INIT
INIT --> STATE["PipelineState"]
STATE --> |"each round"| PIPELINE["Pipeline Loop"]
PIPELINE --> |"observations"| SCORE_G["ScoreParams.observe()"]
PIPELINE --> |"observations"| FETCH_G["FetchParams.observe()"]
PIPELINE --> |"observations"| FRONT_G["SelectParams.observe()"]
PIPELINE --> |"TD errors"| DQN_G["TuneParams.observe()"]
SCORE_G & FETCH_G & FRONT_G & DQN_G --> |"group.update()"| LEARN_STEP["End-of-round learning"]
LEARN_STEP --> |"group.to_json()"| DB_SAVE["DB: save_param_group()"]
DB_SAVE --> |"next run"| DB_RESTORE
All 23 Adaptive Parameters
| Section | Parameter | Default | Mode | What it controls |
|---|---|---|---|---|
| select | domain_max_pages | 100 | auto | Per-domain page cap |
| select | slow_domain_ms | 3000 | auto | Latency threshold for “slow” |
| select | unreliable_threshold | 0.3 | auto | Success-rate floor for “unreliable” |
| select | html_limit_factor | 2.0 | auto | Multiplier on avg HTML for truncation |
| fetch | timeout_ms | 8000 | auto | HTTP request timeout |
| fetch | connect_timeout_ms | 3000 | auto | TCP connect timeout |
| fetch | max_redirects | 3 | fixed | Redirect limit |
| fetch | pool_idle_per_host | 8 | fixed | Connection pool size |
| fetch | jitter_max_ms | 50 | auto | Random delay before requests |
| parse | max_links_per_page | 200 | auto | Link extraction cap |
| parse | anchor_batch_size | 512 | fixed | GPU batch cap for anchor embeddings |
| parse | embed_threshold_factor | 0.5 | auto | Min page score for anchor GPU work |
| parse | propagation_threshold | 0.0 | auto | Min score to propagate children |
| score | relevance_threshold | 0.1 | auto | Score cutoff for “relevant” |
| score | semantic_weight | 0.7 | range [0.3, 0.9] | Semantic vs keyword blend |
| score | anti_weight | 0.3 | range [0.1, 0.5] | Anti-reference penalty strength |
| score | title_weight | 0.4 | auto | Title signal importance |
| score | heading_weight | 0.3 | auto | Heading signal importance |
| score | body_weight | 0.3 | auto | Body signal importance |
| score | reference_blend | 0.1 | range [0.0, 0.3] | Reference adaptation speed |
| score | lang_penalty | 0.0 | auto | Score multiplier for wrong language |
| tune | per_alpha | 0.6 | auto | PER prioritisation exponent |
| tune | per_epsilon | 1e-4 | fixed | PER priority floor |