# [select] – Select Configuration

The [select] section controls how URLs are sampled from the frontier. It governs the TRES tree structure and domain-level filtering. See Select for how these parameters fit into the selection process.

## Fields

```toml
[select]
max_depth             = 18
min_samples_per_split = 5
domain_max_pages      = { value = 100, mode = "auto" }
slow_domain_ms        = { value = 3000.0, mode = "auto" }
unreliable_threshold  = { value = 0.3, mode = "auto" }
html_limit_factor     = { value = 2.0, mode = "auto" }
```
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_depth` | `usize` | `18` | Maximum depth of the TRES regression tree |
| `min_samples_per_split` | `usize` | `5` | Minimum samples in a leaf before it can split |
| `domain_max_pages` | `Param<usize>` | `auto(100)` | Maximum pages to fetch from any single domain |
| `slow_domain_ms` | `Param<f64>` | `auto(3000)` | Response time threshold (ms) above which a domain is "slow" |
| `unreliable_threshold` | `Param<f64>` | `auto(0.3)` | Failure rate above which a domain is considered unreliable |
| `html_limit_factor` | `Param<f64>` | `auto(2.0)` | Per-domain multiplier for the HTML size cap |

## Tree parameters

`max_depth` and `min_samples_per_split` control the TRES tree shape. A deeper tree creates more leaf nodes, which means more diverse sampling but fewer candidates per leaf. If your frontier is small (under a few hundred URLs), lower `max_depth` to avoid over-partitioning.
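For example, a small-frontier crawl might use a shallower, more conservative tree. The values below are illustrative starting points, not recommendations from the crawler's authors:

```toml
[select]
max_depth             = 8   # fewer leaves; avoids over-partitioning a small frontier
min_samples_per_split = 10  # require more samples before a leaf may split
```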

## Domain filtering

`domain_max_pages` prevents any single site from dominating the crawl. In auto mode, the crawler adjusts this cap based on how productive each domain is – domains that consistently produce relevant pages get a higher cap.
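For a crawl that should stay broad, one option is to start the adaptive cap lower than the default; this is an illustrative value, and `"auto"` still lets productive domains earn a higher cap over time:

```toml
[select]
domain_max_pages = { value = 50, mode = "auto" }  # start at 50 pages/domain instead of 100
```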

`slow_domain_ms` and `unreliable_threshold` define what counts as a bad domain. Both are adaptive: if the crawler finds that being stricter improves results, it tightens these thresholds.
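As a sketch, a latency-sensitive crawl could start with tighter thresholds and let adaptation take over from there (the specific numbers here are assumptions for illustration):

```toml
[select]
slow_domain_ms       = { value = 1500.0, mode = "auto" }  # treat responses over 1.5 s as slow
unreliable_threshold = { value = 0.2, mode = "auto" }     # flag domains failing more than 20% of fetches
```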

## HTML limit factor

`html_limit_factor` scales the global `max_html_bytes` on a per-domain basis. A factor of 2.0 means domains are allowed up to twice the base limit. The crawler adjusts this per domain based on whether larger pages from that domain tend to contain useful content.
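To make the arithmetic concrete: with a base limit of 2,000,000 bytes, a factor of 2.0 yields an initial per-domain cap of roughly 4 MB. The section name `[fetch]` and the base value below are assumptions for illustration; only `max_html_bytes` itself is named in this documentation:

```toml
# Assumed location of the global limit, for illustration only:
[fetch]
max_html_bytes = 2_000_000   # 2 MB base cap

[select]
html_limit_factor = { value = 2.0, mode = "auto" }  # per-domain cap starts at ~4 MB
```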