Select

The select stage decides which URLs to fetch next. Out of potentially thousands of candidates sitting in the frontier, it needs to pick a small batch that balances exploration (trying new domains and paths) with exploitation (fetching URLs the model thinks will score well). It does this through tree-based partitioning, neural scoring, and domain-level filtering.

TRES tree partitioning

The frontier is organized as a CART regression tree (called TRES – Tree-based Reward Estimation for Sampling). Each URL in the frontier has an 11-dimensional feature vector. The tree splits on these features using reward as the regression target, creating leaf nodes that group similar URLs together.

At selection time, forager samples one candidate from each leaf. This guarantees diversity – even if 90% of the frontier is from a single domain, the tree forces the batch to include URLs from different regions of feature space.
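A minimal sketch of the per-leaf sampling step, assuming each URL has already been assigned to a leaf. The `sample_per_leaf` helper, the `leaf_of` mapping, and the uniform draw within a leaf are illustrative assumptions, not forager's actual API.

```python
import random
from collections import defaultdict

def sample_per_leaf(leaf_of, rng=None):
    """Pick one URL from each leaf.

    leaf_of maps URL -> leaf id (hypothetical; a fitted tree would supply it).
    The draw within a leaf is uniform here, which is an assumption.
    """
    rng = rng or random.Random()
    leaves = defaultdict(list)
    for url, leaf in leaf_of.items():
        leaves[leaf].append(url)
    # One draw per leaf, so the result has one entry per non-empty leaf.
    return {leaf: rng.choice(urls) for leaf, urls in leaves.items()}

# Three of five URLs share a domain, but each leaf still contributes exactly one.
leaf_of = {
    "https://a.example/1": 0,
    "https://a.example/2": 0,
    "https://a.example/3": 0,
    "https://b.example/x": 1,
    "https://c.example/y": 2,
}
picked = sample_per_leaf(leaf_of, random.Random(0))
```

Because the sample size is tied to the number of leaves rather than to how many URLs each leaf holds, a crowded leaf cannot push the others out of the candidate set.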

DQN scoring

Once candidates are sampled from the tree, the DQN agent scores them all in a single batched forward pass. Each candidate’s feature vector goes through the network and comes out as a Q-value – the model’s estimate of how much reward fetching that URL will produce.

Candidates are ranked by Q-value. The top pages_per_round candidates that also pass the domain checks below go on to be fetched.
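The ranking step can be sketched as follows, assuming the Q-values have already been produced by the network in one batched pass. The `top_k_by_q` helper and the sample values are hypothetical, and the domain checks from the next section are not applied here.

```python
def top_k_by_q(candidates, q_values, pages_per_round):
    """Return the pages_per_round candidates with the highest Q-values."""
    ranked = sorted(zip(candidates, q_values), key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:pages_per_round]]

urls = ["u1", "u2", "u3", "u4"]
q = [0.2, 0.9, -0.1, 0.5]  # made-up Q-values, one per candidate
selected = top_k_by_q(urls, q, pages_per_round=2)
```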

Domain profiling

Before a candidate makes it into the final batch, domain-level checks filter out bad sources:

Slow domains: if a domain’s median response time exceeds slow_domain_ms, it is deprioritised. Life is too short to wait for overloaded servers.
Unreliable domains: if a domain’s fetch failure rate exceeds unreliable_threshold, its URLs are skipped. No point hammering a host that keeps returning errors.
Page caps: each domain has a maximum page count (domain_max_pages). Once reached, no more URLs from that domain are selected. This prevents any single site from dominating the crawl.
Already fetched: URLs that have been fetched before are excluded. The frontier tracks visited URLs by hash.
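The checks above might look roughly like this. It is a sketch under stated assumptions: `DomainStats` and `passes_domain_checks` are hypothetical names, the threshold values are invented for the example, and slow domains are treated as a hard skip for simplicity even though the text describes them as deprioritised.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class DomainStats:
    response_ms: list   # observed response times in milliseconds
    fetches: int = 0    # attempted fetches
    failures: int = 0   # failed fetches
    pages: int = 0      # pages fetched so far

def passes_domain_checks(url_hash, domain, stats, visited, *,
                         slow_domain_ms, unreliable_threshold, domain_max_pages):
    if url_hash in visited:
        return False                      # already fetched
    s = stats.get(domain)
    if s is None:
        return True                       # no history for this domain yet
    if s.pages >= domain_max_pages:
        return False                      # page cap reached
    if s.fetches and s.failures / s.fetches > unreliable_threshold:
        return False                      # unreliable domain
    if s.response_ms and median(s.response_ms) > slow_domain_ms:
        return False                      # slow domain (simplified to a skip)
    return True

# Illustrative values only; see the config reference for real defaults.
cfg = dict(slow_domain_ms=1000, unreliable_threshold=0.5, domain_max_pages=50)
stats = {
    "ok.example": DomainStats([100, 120], fetches=2, pages=2),
    "slow.example": DomainStats([900, 1200, 1500], fetches=3, pages=3),
    "flaky.example": DomainStats([100], fetches=10, failures=6, pages=4),
}
visited = {"hash-seen"}
ok = passes_domain_checks("hash-new", "ok.example", stats, visited, **cfg)
```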

Feature vector

Every URL in the frontier is described by an 11-dimensional feature vector. These features are the inputs to the DQN and the splitting criteria for the TRES tree.

#	Feature	Description
1	depth	Link depth from the nearest seed URL
2	domain_relevance	Average score of pages already fetched from this domain
3	domain_count	Number of pages fetched from this domain so far
4	domain_failure_rate	Fraction of fetches from this domain that failed
5	parent_score	Relevance score of the page that linked to this URL
6	anchor_similarity	Embedding similarity between anchor text and the reference
7	sibling_score	Average score of other URLs discovered on the same parent page
8	path_depth	Number of `/`-separated segments in the URL path
9	has_query	Whether the URL contains query parameters (0 or 1)
10	domain_weight	User-configured weight for this domain (from `fetch.domain_weights`)
11	round_discovered	The crawl round in which this URL was first seen
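Two of these features, path_depth and has_query, depend only on the URL itself. The sketch below shows one way to derive them with the standard library; the `url_shape_features` helper is hypothetical, and counting only non-empty path segments is an assumption about how path_depth is defined.

```python
from urllib.parse import urlsplit

def url_shape_features(url):
    """Derive the URL-only features; the rest need crawl history."""
    parts = urlsplit(url)
    # Ignore empty segments so "/docs/" and "/docs" both count as depth 1.
    segments = [s for s in parts.path.split("/") if s]
    return {
        "path_depth": len(segments),
        "has_query": 1 if parts.query else 0,
    }

deep = url_shape_features("https://example.com/docs/api/v2?page=3")
root = url_shape_features("https://example.com/")
```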

Selection flow

flowchart TD
    frontier["Frontier tree<br/>(all pending URLs)"]
    sample["Sample 1 candidate<br/>per leaf node"]
    dqn["DQN forward pass<br/>(batched Q-values)"]
    rank["Rank by Q-value"]
    filter["Filter out:<br/>- capped domains<br/>- unreliable domains<br/>- slow domains<br/>- already fetched"]
    batch["Final batch<br/>(pages_per_round URLs)"]

    frontier --> sample --> dqn --> rank --> filter --> batch

    style frontier fill:#2d2d2d,stroke:#555,color:#eee
    style sample fill:#1a3a4a,stroke:#4a9,color:#eee
    style dqn fill:#1a3a4a,stroke:#4a9,color:#eee
    style rank fill:#1a3a4a,stroke:#4a9,color:#eee
    style filter fill:#3a1a1a,stroke:#a44,color:#eee
    style batch fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [select] config reference for all tuneable fields.
