Select
The select stage decides which URLs to fetch next. Out of potentially thousands of candidates sitting in the frontier, it needs to pick a small batch that balances exploration (trying new domains and paths) with exploitation (fetching URLs the model thinks will score well). It does this through tree-based partitioning, neural scoring, and domain-level filtering.
TRES tree partitioning
The frontier is organized as a CART regression tree (called TRES – Tree-based Reward Estimation for Sampling). Each URL in the frontier has an 11-dimensional feature vector. The tree splits on these features using reward as the regression target, creating leaf nodes that group similar URLs together.
At selection time, forager samples one candidate from each leaf. This guarantees diversity – even if 90% of the frontier is from a single domain, the tree forces the batch to include URLs from different regions of feature space.
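The partition-then-sample idea can be sketched in a few lines. This is a minimal illustration, not forager's implementation: `build_leaves` and `sample_per_leaf` are hypothetical names, the splitter is a bare-bones CART-style variance-reduction search, and the toy frontier uses 2 features instead of 11 to keep it readable.

```python
import random
from statistics import fmean

def sse(values):
    """Sum of squared errors around the mean (the CART regression impurity)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)

def build_leaves(rows, min_leaf=2, depth=0, max_depth=4):
    """Recursively split (url, features, reward) rows and return the leaves."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return [rows]
    best = None  # (impurity after split, feature index, threshold)
    for f in range(len(rows[0][1])):
        for threshold in sorted({r[1][f] for r in rows})[:-1]:
            left = [r[2] for r in rows if r[1][f] <= threshold]
            right = [r[2] for r in rows if r[1][f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    # Stop when no valid split exists or the split does not reduce impurity.
    if best is None or best[0] >= sse([r[2] for r in rows]):
        return [rows]
    _, f, threshold = best
    left = [r for r in rows if r[1][f] <= threshold]
    right = [r for r in rows if r[1][f] > threshold]
    return (build_leaves(left, min_leaf, depth + 1, max_depth)
            + build_leaves(right, min_leaf, depth + 1, max_depth))

def sample_per_leaf(leaves, rng):
    """Pick one URL uniformly at random from every leaf."""
    return [rng.choice(leaf)[0] for leaf in leaves]

# Toy frontier: one domain dominates, but it occupies a single leaf.
frontier = (
    [(f"https://big.example/p{i}", [1.0, 0.1], 0.1) for i in range(6)]
    + [(f"https://small.example/p{i}", [3.0, 0.9], 0.9) for i in range(2)]
)
leaves = build_leaves(frontier, min_leaf=2)
batch = sample_per_leaf(leaves, random.Random(0))
print(len(leaves), batch)
```

Although six of the eight URLs come from big.example, the sampled batch contains one URL from each domain, because each leaf contributes exactly one candidate.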
DQN scoring
Once candidates are sampled from the tree, the DQN agent scores them all in a single batched forward pass. Each candidate’s feature vector goes through the network and comes out as a Q-value – the model’s estimate of how much reward fetching that URL will produce.
Candidates are ranked by Q-value, highest first. The top pages_per_round candidates that also pass the domain checks described under Domain profiling make up the batch to be fetched.
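As a rough sketch of scoring and ranking, the snippet below runs a tiny one-hidden-layer network over a whole batch and keeps the top k. The network shape, the weights, and the names `q_values` and `top_k` are assumptions made for illustration; the actual DQN architecture is not described here, and the toy vectors have 3 features rather than 11.

```python
def q_values(batch, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer ReLU network over a batch of feature vectors."""
    scores = []
    for x in batch:
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(w1, b1)]
        scores.append(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return scores

def top_k(urls, scores, k):
    """Rank URLs by Q-value, highest first, and keep the first k."""
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:k]]

# Toy weights and candidates (hypothetical values).
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [0.5, 1.0]
b2 = 0.0

urls = ["https://a.example/", "https://b.example/", "https://c.example/"]
features = [[0.2, 0.9, 0.1], [0.8, 0.1, 0.5], [0.4, 0.5, 0.0]]

scores = q_values(features, w1, b1, w2, b2)
selected = top_k(urls, scores, k=2)
print(selected)  # a.example first, then c.example
```

Here `k` plays the role of pages_per_round.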
Domain profiling
Before a candidate makes it into the final batch, domain-level checks filter out bad sources:
- Slow domains: if a domain’s median response time exceeds slow_domain_ms, it is deprioritised. Life is too short to wait for overloaded servers.
- Unreliable domains: if a domain’s fetch failure rate exceeds unreliable_threshold, its URLs are skipped. No point hammering a host that keeps returning errors.
- Page caps: each domain has a maximum page count (domain_max_pages). Once reached, no more URLs from that domain are selected. This prevents any single site from dominating the crawl.
- Already fetched: URLs that have been fetched before are excluded. The frontier tracks visited URLs by hash.
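The four checks above might look roughly like the following. This is a sketch under stated assumptions: `DomainStats`, `url_hash`, and `filter_candidates` are hypothetical names, SHA-256 stands in for whatever hash the frontier actually uses, and "deprioritised" is interpreted here as "moved to the back of the ranking", which may differ from the real behaviour.

```python
import hashlib
from dataclasses import dataclass
from statistics import median
from urllib.parse import urlsplit

@dataclass
class DomainStats:
    response_ms: list      # observed response times in milliseconds
    fetches: int = 0       # total fetch attempts
    failures: int = 0      # failed fetch attempts
    pages: int = 0         # pages fetched from this domain so far

def url_hash(url):
    """Hash used for the visited set (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def filter_candidates(ranked_urls, stats, visited, *,
                      slow_domain_ms, unreliable_threshold, domain_max_pages):
    keep, slow = [], []
    for url in ranked_urls:
        if url_hash(url) in visited:
            continue                      # already fetched
        s = stats.get(urlsplit(url).hostname)
        if s is None:
            keep.append(url)              # unseen domain: no evidence against it
            continue
        if s.pages >= domain_max_pages:
            continue                      # page cap reached
        if s.fetches and s.failures / s.fetches > unreliable_threshold:
            continue                      # unreliable domain
        if s.response_ms and median(s.response_ms) > slow_domain_ms:
            slow.append(url)              # slow domain: deprioritised, not dropped
            continue
        keep.append(url)
    return keep + slow

stats = {
    "slow.example": DomainStats([900, 1200, 1500], fetches=3, failures=0, pages=3),
    "flaky.example": DomainStats([100], fetches=10, failures=6, pages=4),
    "full.example": DomainStats([80], fetches=50, failures=0, pages=50),
}
visited = {url_hash("https://ok.example/seen")}
ranked = [
    "https://slow.example/a",
    "https://flaky.example/b",
    "https://full.example/c",
    "https://ok.example/seen",
    "https://ok.example/new",
]
result = filter_candidates(ranked, stats, visited, slow_domain_ms=1000,
                           unreliable_threshold=0.5, domain_max_pages=50)
print(result)  # ['https://ok.example/new', 'https://slow.example/a']
```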
Feature vector
Every URL in the frontier is described by an 11-dimensional feature vector. These features are the inputs to the DQN and the splitting criteria for the TRES tree.
| # | Feature | Description |
|---|---|---|
| 1 | depth | Link depth from the nearest seed URL |
| 2 | domain_relevance | Average score of pages already fetched from this domain |
| 3 | domain_count | Number of pages fetched from this domain so far |
| 4 | domain_failure_rate | Fraction of fetches from this domain that failed |
| 5 | parent_score | Relevance score of the page that linked to this URL |
| 6 | anchor_similarity | Embedding similarity between anchor text and the reference |
| 7 | sibling_score | Average score of other URLs discovered on the same parent page |
| 8 | path_depth | Number of / segments in the URL path |
| 9 | has_query | Whether the URL contains query parameters (0 or 1) |
| 10 | domain_weight | User-configured weight for this domain (from fetch.domain_weights) |
| 11 | round_discovered | The crawl round in which this URL was first seen |
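One way to picture the vector is as a named tuple whose field order matches the table. The class name `UrlFeatures` and the helpers below are hypothetical, and `path_depth` is assumed to count non-empty path segments; only the two URL-derived features are computed here, while the rest are filled with made-up values.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

class UrlFeatures(NamedTuple):
    depth: float
    domain_relevance: float
    domain_count: float
    domain_failure_rate: float
    parent_score: float
    anchor_similarity: float
    sibling_score: float
    path_depth: float
    has_query: float
    domain_weight: float
    round_discovered: float

def path_depth(url):
    """Count non-empty path segments, e.g. /a/b/c gives 3 (assumed interpretation)."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])

def has_query(url):
    """1.0 if the URL carries query parameters, else 0.0."""
    return 1.0 if urlsplit(url).query else 0.0

url = "https://docs.example/guide/setup/index.html?lang=en"
features = UrlFeatures(
    depth=2, domain_relevance=0.6, domain_count=14, domain_failure_rate=0.05,
    parent_score=0.7, anchor_similarity=0.4, sibling_score=0.5,
    path_depth=path_depth(url), has_query=has_query(url),
    domain_weight=1.0, round_discovered=3,
)
vector = [float(v) for v in features]  # flat list in table order
print(len(vector), features.path_depth, features.has_query)  # 11 3 1.0
```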
Selection flow
flowchart TD
frontier["Frontier tree<br/>(all pending URLs)"]
sample["Sample 1 candidate<br/>per leaf node"]
dqn["DQN forward pass<br/>(batched Q-values)"]
rank["Rank by Q-value"]
filter["Filter out:<br/>- capped domains<br/>- unreliable domains<br/>- slow domains<br/>- already fetched"]
batch["Final batch<br/>(pages_per_round URLs)"]
frontier --> sample --> dqn --> rank --> filter --> batch
style frontier fill:#2d2d2d,stroke:#555,color:#eee
style sample fill:#1a3a4a,stroke:#4a9,color:#eee
style dqn fill:#1a3a4a,stroke:#4a9,color:#eee
style rank fill:#1a3a4a,stroke:#4a9,color:#eee
style filter fill:#3a1a1a,stroke:#a44,color:#eee
style batch fill:#1a3a1a,stroke:#4a4,color:#eee
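The order of the stages in the flowchart can be expressed as a single function. This is only a sketch of the ordering: `select_batch`, `score_fn`, and `keep_fn` are hypothetical names, the scoring and filtering logic is passed in as plain callables, and the filter here simply drops candidates rather than deprioritising slow ones.

```python
import random

def select_batch(leaves, score_fn, keep_fn, pages_per_round, rng):
    """Sample one candidate per leaf, score, rank, filter, then truncate."""
    sampled = [rng.choice(leaf) for leaf in leaves]      # one candidate per leaf
    scores = score_fn(sampled)                           # one batched scoring call
    ranked = [c for c, _ in sorted(zip(sampled, scores),
                                   key=lambda pair: pair[1], reverse=True)]
    passed = [c for c in ranked if keep_fn(c)]           # domain-level checks
    return passed[:pages_per_round]                      # final batch

# Toy inputs: leaves hold candidate names, scores come from a lookup table.
leaves = [["a1", "a2"], ["b1"], ["c1"], ["d1"]]
toy_scores = {"a1": 0.9, "a2": 0.9, "b1": 0.5, "c1": 0.7, "d1": 0.1}

final = select_batch(
    leaves,
    score_fn=lambda cands: [toy_scores[c] for c in cands],
    keep_fn=lambda c: c != "c1",      # pretend c1's domain is capped
    pages_per_round=2,
    rng=random.Random(0),
)
print(final)
```

Filtering after ranking means a rejected high scorer (c1 here) is replaced by the next-best candidate, so the batch still fills up to pages_per_round when enough candidates remain.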
Configuration
See the [select] config reference for all tuneable fields.