Select

The select stage decides which URLs to fetch next. Out of potentially thousands of candidates sitting in the frontier, it needs to pick a small batch that balances exploration (trying new domains and paths) with exploitation (fetching URLs the model thinks will score well). It does this through tree-based partitioning, neural scoring, and domain-level filtering.

TRES tree partitioning

The frontier is organized as a CART regression tree (called TRES – Tree-based Reward Estimation for Sampling). Each URL in the frontier has an 11-dimensional feature vector. The tree splits on these features using reward as the regression target, creating leaf nodes that group similar URLs together.

At selection time, forager samples one candidate from each leaf. This guarantees diversity – even if 90% of the frontier is from a single domain, the tree forces the batch to include URLs from different regions of feature space.
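
The per-leaf sampling step can be sketched as follows. This is a minimal illustration, not forager's implementation: it assumes the fitted tree has already assigned each frontier URL to a leaf, that the draw within a leaf is uniform, and the function name, URLs, and leaf ids are all hypothetical.

```python
import random
from collections import defaultdict


def sample_one_per_leaf(leaf_of, rng):
    """Pick one URL from each leaf. `leaf_of` maps URL -> leaf id."""
    by_leaf = defaultdict(list)
    for url, leaf in leaf_of.items():
        by_leaf[leaf].append(url)
    # One draw per leaf, regardless of how many URLs the leaf holds.
    # Uniform choice is an assumption; the real sampler may weight candidates.
    return {leaf: rng.choice(urls) for leaf, urls in by_leaf.items()}


# Hypothetical frontier: nine URLs from one domain share leaf 0,
# while a single URL from another domain sits alone in leaf 1.
leaf_of = {f"https://big.example/page/{i}": 0 for i in range(9)}
leaf_of["https://small.example/about"] = 1

picked = sample_one_per_leaf(leaf_of, random.Random(0))
```

Because the draw is per leaf rather than per URL, the lone URL in leaf 1 is always picked even though it makes up only 10% of this frontier.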

DQN scoring

Once candidates are sampled from the tree, the DQN agent scores them all in a single batched forward pass. Each candidate’s feature vector goes through the network and comes out as a Q-value – the model’s estimate of how much reward fetching that URL will produce.

Candidates are ranked by Q-value. The top pages_per_round candidates survive to be fetched.
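
A rough sketch of scoring and ranking is shown below. The network here is a stand-in (one hidden ReLU layer written in plain Python with hand-picked weights), not the actual DQN architecture, and it uses 2 features instead of 11 to keep the arithmetic readable. The function names and all values are assumptions made for illustration.

```python
def q_values(batch, w1, b1, w2, b2):
    """Score every feature vector in `batch` with a tiny ReLU network."""
    scores = []
    for x in batch:
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)
        ]
        scores.append(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return scores


def top_k(urls, scores, pages_per_round):
    """Rank URLs by Q-value, highest first, and keep the first pages_per_round."""
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:pages_per_round]]


# Hypothetical candidates, one per tree leaf, with toy 2-dimensional features.
urls = ["https://a.example/", "https://b.example/", "https://c.example/"]
batch = [[0.9, 0.1], [0.2, 0.5], [0.6, 0.0]]

# Hand-picked weights so that Q = relu(x0) - relu(x1).
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [1.0, -1.0], 0.0

scores = q_values(batch, w1, b1, w2, b2)
selected = top_k(urls, scores, pages_per_round=2)
```

With these weights the first and third candidates score highest, so they are the two that survive.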

Domain profiling

Before a candidate makes it into the final batch, domain-level checks filter out bad sources:

  • Slow domains: if a domain’s median response time exceeds slow_domain_ms, it is deprioritised. Life is too short to wait for overloaded servers.
  • Unreliable domains: if a domain’s fetch failure rate exceeds unreliable_threshold, its URLs are skipped. No point hammering a host that keeps returning errors.
  • Page caps: each domain has a maximum page count (domain_max_pages). Once reached, no more URLs from that domain are selected. This prevents any single site from dominating the crawl.
  • Already fetched: URLs that have been fetched before are excluded. The frontier tracks visited URLs by hash.
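
The four checks above could be combined roughly like this. It is a sketch under several assumptions: `DomainStats`, `url_hash`, and `apply_domain_checks` are hypothetical names, SHA-256 stands in for whatever hash the frontier actually uses, "deprioritised" is read as "moved behind candidates from faster domains", and the page cap counts only pages already fetched. Only `slow_domain_ms`, `unreliable_threshold`, and `domain_max_pages` come from the text; their values here are made up.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class DomainStats:
    median_ms: float = 0.0
    failure_rate: float = 0.0
    pages_fetched: int = 0


def url_hash(url):
    # Assumed hash function; the source only says URLs are tracked "by hash".
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def apply_domain_checks(ranked, stats, visited,
                        slow_domain_ms, unreliable_threshold, domain_max_pages):
    """Filter a Q-ranked URL list, preserving rank order within each group."""
    normal, slow = [], []
    for url in ranked:
        if url_hash(url) in visited:
            continue  # already fetched
        d = stats.get(urlsplit(url).hostname, DomainStats())
        if d.failure_rate > unreliable_threshold:
            continue  # unreliable domain: skip
        if d.pages_fetched >= domain_max_pages:
            continue  # page cap reached
        # Slow domains are kept but pushed behind everything else.
        (slow if d.median_ms > slow_domain_ms else normal).append(url)
    return normal + slow


stats = {
    "fast.example": DomainStats(120, 0.0, 3),
    "slow.example": DomainStats(5000, 0.0, 1),
    "flaky.example": DomainStats(100, 0.8, 4),
    "full.example": DomainStats(100, 0.0, 50),
}
ranked = [
    "https://slow.example/a",
    "https://flaky.example/b",
    "https://full.example/c",
    "https://fast.example/d",
    "https://fast.example/seen",
]
visited = {url_hash("https://fast.example/seen")}

batch = apply_domain_checks(ranked, stats, visited,
                            slow_domain_ms=2000,
                            unreliable_threshold=0.5,
                            domain_max_pages=50)
```

Here the flaky, capped, and already-visited URLs are dropped, and the slow-domain URL ends up last even though it ranked first by Q-value.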

Feature vector

Every URL in the frontier is described by an 11-dimensional feature vector. These features are the inputs to the DQN and the splitting criteria for the TRES tree.

| #  | Feature             | Description                                                        |
|----|---------------------|--------------------------------------------------------------------|
| 1  | depth               | Link depth from the nearest seed URL                               |
| 2  | domain_relevance    | Average score of pages already fetched from this domain            |
| 3  | domain_count        | Number of pages fetched from this domain so far                    |
| 4  | domain_failure_rate | Fraction of fetches from this domain that failed                   |
| 5  | parent_score        | Relevance score of the page that linked to this URL                |
| 6  | anchor_similarity   | Embedding similarity between anchor text and the reference         |
| 7  | sibling_score       | Average score of other URLs discovered on the same parent page     |
| 8  | path_depth          | Number of / segments in the URL path                               |
| 9  | has_query           | Whether the URL contains query parameters (0 or 1)                 |
| 10 | domain_weight       | User-configured weight for this domain (from fetch.domain_weights) |
| 11 | round_discovered    | The crawl round in which this URL was first seen                   |
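
One way to picture the vector is as a record with eleven fields in the order listed above. The sketch below is illustrative only: `UrlFeatures` and `url_shape` are hypothetical names, the field types are guesses, and `path_depth` is assumed to count non-empty path segments. The two URL-derived features are computed with Python's standard `urllib.parse`; the rest are filled with made-up values.

```python
from dataclasses import dataclass, astuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class UrlFeatures:
    depth: int
    domain_relevance: float
    domain_count: int
    domain_failure_rate: float
    parent_score: float
    anchor_similarity: float
    sibling_score: float
    path_depth: int
    has_query: int
    domain_weight: float
    round_discovered: int

    def to_vector(self):
        """Flatten to a list of floats in the documented feature order."""
        return [float(v) for v in astuple(self)]


def url_shape(url):
    """Return (path_depth, has_query) for a URL."""
    parts = urlsplit(url)
    # Assumption: count non-empty segments between slashes.
    path_depth = len([seg for seg in parts.path.split("/") if seg])
    has_query = 1 if parts.query else 0
    return path_depth, has_query


path_depth, has_query = url_shape("https://example.com/docs/api/index.html?page=2")

features = UrlFeatures(
    depth=2, domain_relevance=0.7, domain_count=14, domain_failure_rate=0.05,
    parent_score=0.8, anchor_similarity=0.6, sibling_score=0.5,
    path_depth=path_depth, has_query=has_query,
    domain_weight=1.0, round_discovered=3,
)
vector = features.to_vector()
```

Keeping the field order fixed matters because both the tree splits and the network inputs refer to features by position.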

Selection flow

flowchart TD
    frontier["Frontier tree<br/>(all pending URLs)"]
    sample["Sample 1 candidate<br/>per leaf node"]
    dqn["DQN forward pass<br/>(batched Q-values)"]
    rank["Rank by Q-value"]
    filter["Filter out:<br/>- capped domains<br/>- unreliable domains<br/>- slow domains<br/>- already fetched"]
    batch["Final batch<br/>(pages_per_round URLs)"]

    frontier --> sample --> dqn --> rank --> filter --> batch

    style frontier fill:#2d2d2d,stroke:#555,color:#eee
    style sample fill:#1a3a4a,stroke:#4a9,color:#eee
    style dqn fill:#1a3a4a,stroke:#4a9,color:#eee
    style rank fill:#1a3a4a,stroke:#4a9,color:#eee
    style filter fill:#3a1a1a,stroke:#a44,color:#eee
    style batch fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [select] config reference for all tuneable fields.