[select] – Select Configuration
The [select] section controls how URLs are sampled from the frontier. It governs the TRES tree structure and domain-level filtering. See Select for how these parameters fit into the selection process.
Fields
[select]
max_depth = 18
min_samples_per_split = 5
domain_max_pages = { value = 100, mode = "auto" }
slow_domain_ms = { value = 3000.0, mode = "auto" }
unreliable_threshold = { value = 0.3, mode = "auto" }
html_limit_factor = { value = 2.0, mode = "auto" }
| Field | Type | Default | Description |
|---|---|---|---|
| max_depth | usize | 18 | Maximum depth of the TRES regression tree |
| min_samples_per_split | usize | 5 | Minimum samples in a leaf before it can split |
| domain_max_pages | Param<usize> | auto(100) | Maximum pages to fetch from any single domain |
| slow_domain_ms | Param<f64> | auto(3000) | Response time threshold (ms) above which a domain is “slow” |
| unreliable_threshold | Param<f64> | auto(0.3) | Failure rate above which a domain is considered unreliable |
| html_limit_factor | Param<f64> | auto(2.0) | Per-domain multiplier for the HTML size cap |
Tree parameters
max_depth and min_samples_per_split control the TRES tree shape. A deeper tree creates more leaf nodes, which means more diverse sampling but fewer candidates per leaf. If your frontier is small (under a few hundred URLs), lower max_depth to avoid over-partitioning.
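To make the over-partitioning concern concrete, here is a rough sketch of how one might estimate a sensible max_depth for a given frontier size. It assumes the TRES tree uses binary splits, so a tree of depth d has at most 2^d leaves; that is an assumption, not something this page states. The helper names `max_leaves` and `suggested_max_depth` are hypothetical, and the rule of thumb (keep at least min_samples_per_split URLs per potential leaf) is an illustration rather than the crawler's own logic.

```rust
/// Upper bound on the number of leaves in a tree of the given depth.
/// Assumes binary splits (an assumption made for this sketch).
fn max_leaves(max_depth: u32) -> u64 {
    // 2^max_depth, saturating instead of overflowing for very large depths.
    1u64.checked_shl(max_depth).unwrap_or(u64::MAX)
}

/// Hypothetical rule of thumb: the deepest tree (up to the configured
/// max_depth) whose leaves could each still hold at least
/// `min_samples_per_split` URLs if the frontier were spread evenly.
fn suggested_max_depth(
    frontier_size: u64,
    min_samples_per_split: u64,
    configured_max_depth: u32,
) -> u32 {
    let mut depth = 0;
    while depth < configured_max_depth {
        let leaves_at_next_depth = max_leaves(depth + 1);
        // Stop once one more level would need more URLs than the frontier holds.
        if leaves_at_next_depth.saturating_mul(min_samples_per_split) > frontier_size {
            break;
        }
        depth += 1;
    }
    depth
}
```

Under these assumptions, a frontier of 300 URLs with min_samples_per_split = 5 suggests a depth of about 5, far below the default of 18, whose 262,144 potential leaves would leave almost every leaf empty.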
Domain filtering
domain_max_pages prevents any single site from dominating the crawl. In auto mode, the crawler adjusts this cap based on how productive each domain is – domains that consistently produce relevant pages get a higher cap.
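As an illustration of the cap itself, the following minimal sketch counts fetched pages per domain and refuses further URLs once a domain reaches its limit. `DomainBudget` and `try_admit` are hypothetical names, not the crawler's API, and the sketch uses a single fixed cap; the per-domain adjustment performed in auto mode is not modeled here.

```rust
use std::collections::HashMap;

/// Hypothetical per-domain page counter (not the crawler's actual type).
struct DomainBudget {
    /// Maximum pages per domain, playing the role of domain_max_pages.
    cap: usize,
    /// Pages admitted so far, keyed by domain name.
    fetched: HashMap<String, usize>,
}

impl DomainBudget {
    fn new(cap: usize) -> Self {
        Self { cap, fetched: HashMap::new() }
    }

    /// Returns true and records the fetch if the domain is still under its
    /// cap; returns false without recording anything once the cap is reached.
    fn try_admit(&mut self, domain: &str) -> bool {
        let count = self.fetched.entry(domain.to_string()).or_insert(0);
        if *count >= self.cap {
            return false;
        }
        *count += 1;
        true
    }
}
```

With a cap of 100, the 101st URL from the same domain would be rejected while URLs from other domains continue to be admitted.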
slow_domain_ms and unreliable_threshold define when a domain counts as slow or unreliable. Both are adaptive: if the crawler finds that being stricter improves results, it tightens these thresholds.
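The two checks can be sketched as simple comparisons against per-domain statistics. The `DomainStats` struct and its fields are hypothetical, and using the mean response time (rather than, say, a percentile) is an assumption made for this sketch; the page does not say which statistic the crawler uses.

```rust
/// Hypothetical per-domain statistics (not the crawler's actual type).
struct DomainStats {
    requests: u64,
    failures: u64,
    /// Sum of response times over all requests, in milliseconds.
    total_response_ms: f64,
}

impl DomainStats {
    /// Fraction of requests that failed; 0.0 when nothing has been requested.
    fn failure_rate(&self) -> f64 {
        if self.requests == 0 {
            0.0
        } else {
            self.failures as f64 / self.requests as f64
        }
    }

    /// Mean response time in ms; 0.0 when nothing has been requested.
    fn mean_response_ms(&self) -> f64 {
        if self.requests == 0 {
            0.0
        } else {
            self.total_response_ms / self.requests as f64
        }
    }
}

/// A domain is "slow" when its mean response time exceeds slow_domain_ms.
fn is_slow(stats: &DomainStats, slow_domain_ms: f64) -> bool {
    stats.mean_response_ms() > slow_domain_ms
}

/// A domain is "unreliable" when its failure rate exceeds unreliable_threshold.
fn is_unreliable(stats: &DomainStats, unreliable_threshold: f64) -> bool {
    stats.failure_rate() > unreliable_threshold
}
```

With the defaults, a domain averaging 3500 ms per response would be flagged as slow, and one failing 4 of 10 requests would be flagged as unreliable.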
HTML limit factor
html_limit_factor scales the global max_html_bytes on a per-domain basis. A factor of 2.0 means domains are allowed up to twice the base limit. The crawler adjusts this per domain based on whether larger pages from that domain tend to contain useful content.
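The scaling described above amounts to a single multiplication. In this sketch the function name `effective_html_limit` is hypothetical, and rounding to the nearest byte is an assumption; the base value passed in stands in for whatever max_html_bytes is configured.

```rust
/// Effective HTML size cap for a domain: the global max_html_bytes scaled by
/// that domain's html_limit_factor. Rounding to the nearest byte is an
/// assumption of this sketch.
fn effective_html_limit(max_html_bytes: usize, html_limit_factor: f64) -> usize {
    (max_html_bytes as f64 * html_limit_factor).round() as usize
}
```

For example, with an illustrative base limit of 1,000,000 bytes, a factor of 2.0 allows pages up to 2,000,000 bytes, and a factor of 0.5 would restrict the domain to 500,000 bytes.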