[parse] – Parse Configuration
The [parse] section controls how HTML is processed after fetching: text extraction limits, link discovery, anchor embedding, and optional field extraction. See Parse for how these parameters fit into the parsing pipeline.
Fields
[parse]
parse_timeout_ms = 500
max_html_bytes = 512000
max_links_per_page = { value = 200, mode = "auto" }
anchor_batch_size = { value = 512, mode = "fixed" }
embed_threshold_factor = { value = 0.5, mode = "auto" }
propagation_threshold = { value = 0.0, mode = "auto" }
| Field | Type | Default | Description |
|---|---|---|---|
| parse_timeout_ms | u64 | 500 | Timeout for HTML parsing per page (ms) |
| max_html_bytes | usize | 512000 | Max HTML body size to process (~500 KB) |
| max_links_per_page | Param<usize> | auto(200) | Max outbound links to extract per page |
| anchor_batch_size | Param<usize> | fixed(512) | Batch size for GPU anchor text embedding |
| embed_threshold_factor | Param<f64> | auto(0.5) | Minimum keyword match score for an anchor to be embedded |
| propagation_threshold | Param<f64> | auto(0.0) | Minimum parent score for child links to enter the frontier |
Link limits
max_links_per_page caps how many outbound links are extracted from a single page. In auto mode, the crawler adjusts this cap based on whether link-heavy pages tend to produce useful frontier candidates. Pages whose navigation menus list hundreds of links often yield diminishing returns.
Anchor embedding
anchor_batch_size controls how many anchor texts are embedded in a single GPU pass. Larger batches are more efficient but use more memory. embed_threshold_factor sets how closely an anchor’s text must match your keywords before it is worth embedding: anchors below this threshold receive a similarity score of zero and skip the GPU pass entirely.
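The gate-then-batch behaviour described above can be sketched roughly as follows. This is an illustration, not the crawler's implementation: the function name plan_embedding and the precomputed keyword_scores input are hypothetical, and the sketch assumes the keyword match score is compared directly against embed_threshold_factor (the name suggests the real threshold may be scaled by another quantity).

```python
def plan_embedding(anchors, keyword_scores,
                   embed_threshold_factor=0.5, anchor_batch_size=512):
    """Split anchors into GPU batches and a set that is scored zero.

    Assumption: an anchor is embedded when its keyword match score is at
    least embed_threshold_factor; otherwise it gets 0.0 with no GPU work.
    """
    to_embed = []
    zero_scored = {}
    for anchor, score in zip(anchors, keyword_scores):
        if score >= embed_threshold_factor:
            to_embed.append(anchor)
        else:
            zero_scored[anchor] = 0.0

    # Chunk the surviving anchors so each GPU pass sees at most
    # anchor_batch_size texts.
    batches = [
        to_embed[i:i + anchor_batch_size]
        for i in range(0, len(to_embed), anchor_batch_size)
    ]
    return batches, zero_scored


batches, zero_scored = plan_embedding(
    ["apply now", "privacy policy", "tuition fees", "deadlines", "sitemap"],
    [0.9, 0.1, 0.5, 0.7, 0.2],
    embed_threshold_factor=0.5,
    anchor_batch_size=2,
)
```

With a batch size of 2, the three anchors that pass the gate are split into two GPU passes, while the two low-scoring anchors never reach the GPU.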
Propagation threshold
propagation_threshold gates whether a page’s child links are added to the frontier. At 0.0 (the default), all links propagate. As the model learns, it raises this threshold to keep the frontier focused on promising regions, so pages that score below it stop contributing new URLs.
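As a minimal sketch of this gate, assuming page scores are non-negative and that "minimum parent score" means an inclusive comparison, the filtering might look like the code below. The function frontier_candidates and the (score, links) input shape are hypothetical and chosen only for illustration.

```python
def frontier_candidates(pages, propagation_threshold=0.0):
    """Collect child links from pages whose score meets the threshold.

    pages: iterable of (parent_score, child_links) pairs.
    Assumption: a parent at or above the threshold propagates all of
    its child links; a parent below it propagates none.
    """
    candidates = []
    for parent_score, child_links in pages:
        if parent_score >= propagation_threshold:
            candidates.extend(child_links)
    return candidates


pages = [
    (0.8, ["/programs/1", "/programs/2"]),
    (0.1, ["/press"]),
    (0.0, ["/legal"]),
]

everything = frontier_candidates(pages, propagation_threshold=0.0)
focused = frontier_candidates(pages, propagation_threshold=0.5)
```

At the default of 0.0 every page propagates its links; at 0.5 only the first page does.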
Optional: extract_url_pattern
A regex pattern for extracting URLs from page text (not just <a> tags). Useful for pages that embed URLs in JavaScript or plain text.
[parse]
extract_url_pattern = "https://example\\.com/programs/\\d+"
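To see what such a pattern matches, the sketch below applies the same regex to a sample string. It uses Python's re module purely for illustration; the crawler's own regex engine is not specified here, and the sample text and the extract_urls helper are hypothetical. Note that the doubled backslashes in the TOML string decode to single backslashes in the actual regex.

```python
import re

# The TOML basic string "https://example\\.com/programs/\\d+" decodes to
# the regex below: a literal dot, then one or more digits after /programs/.
EXTRACT_URL_PATTERN = re.compile(r"https://example\.com/programs/\d+")


def extract_urls(page_text):
    """Return every non-overlapping match of the pattern, in order."""
    return EXTRACT_URL_PATTERN.findall(page_text)


sample = (
    'var next = "https://example.com/programs/42"; '
    "see also https://example.com/about and https://example.com/programs/7."
)
found = extract_urls(sample)
```

Here the two /programs/ URLs are picked up even though neither appears in an <a> tag, and the /about URL is ignored.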
[[parse.extract_fields]]
Define CSS selectors to extract structured fields from pages.
[[parse.extract_fields]]
name = "tuition"
selector = ".tuition-fee"
attribute = "text"
[[parse.extract_fields]]
name = "deadline"
selector = "span.deadline"
attribute = "text"
[[parse.extract_fields]]
name = "apply_link"
selector = "a.apply-button"
attribute = "href"
| Field | Type | Default | Description |
|---|---|---|---|
| name | str | – | Field name in the output |
| selector | str | – | CSS selector to match elements |
| attribute | str | text | Which attribute to extract (text, href, etc.) |
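As a rough illustration of what the attribute field appears to select, the sketch below assumes that "text" means the element's text content and that any other value names an HTML attribute. That reading is inferred from the examples above and is not confirmed here. The sketch uses the standard library's ElementTree, with its limited XPath support standing in for a real CSS selector engine; the sample markup and the extract_value helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_value(element, attribute="text"):
    """Return the element's text content, or the named attribute's value.

    Assumption: "text" is a special value meaning text content; any
    other value is looked up as an attribute on the element.
    """
    if attribute == "text":
        return "".join(element.itertext()).strip()
    return element.get(attribute)


# Well-formed sample markup, so the XML parser can stand in for an HTML one.
page = ET.fromstring(
    "<div>"
    '<span class="deadline"> 1 March </span>'
    '<a class="apply-button" href="/apply">Apply now</a>'
    "</div>"
)

# XPath equivalents of the CSS selectors span.deadline and a.apply-button.
deadline = extract_value(page.find(".//span[@class='deadline']"), "text")
apply_link = extract_value(page.find(".//a[@class='apply-button']"), "href")
```

Under these assumptions, the deadline field yields the span's trimmed text and the apply_link field yields the link's href value.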