[parse] – Parse Configuration

The [parse] section controls how HTML is processed after fetching: text extraction limits, link discovery, anchor embedding, and optional field extraction. See Parse for how these parameters fit into the parsing pipeline.

Fields

[parse]
parse_timeout_ms      = 500
max_html_bytes        = 512000
max_links_per_page    = { value = 200, mode = "auto" }
anchor_batch_size     = { value = 512, mode = "fixed" }
embed_threshold_factor = { value = 0.5, mode = "auto" }
propagation_threshold = { value = 0.0, mode = "auto" }

| Field | Type | Default | Description |
|---|---|---|---|
| parse_timeout_ms | u64 | 500 | Timeout for HTML parsing per page (ms) |
| max_html_bytes | usize | 512000 | Max HTML body size to process (~500 KB) |
| max_links_per_page | Param<usize> | auto(200) | Max outbound links to extract per page |
| anchor_batch_size | Param<usize> | fixed(512) | Batch size for GPU anchor text embedding |
| embed_threshold_factor | Param<f64> | auto(0.5) | Minimum keyword match score for an anchor to be embedded |
| propagation_threshold | Param<f64> | auto(0.0) | Minimum parent score for child links to enter the frontier |

max_links_per_page caps how many outbound links are extracted from a single page. In auto mode, the crawler adjusts this based on whether pages with many links tend to produce useful frontier candidates. Pages with navigation menus listing hundreds of links often produce diminishing returns.
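
As a rough illustration of the cap, the sketch below collects href values from <a> tags and stops once the limit is reached. It is not the crawler's implementation: LinkCollector and extract_links are hypothetical names, and keeping the first links in document order is an assumption (the crawler may pick links differently, and auto mode may change the limit over time).

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, up to a fixed cap (hypothetical helper)."""

    def __init__(self, max_links_per_page: int = 200) -> None:
        super().__init__()
        self.max_links = max_links_per_page
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Ignore non-anchor tags, and ignore everything once the cap is reached.
        if tag != "a" or len(self.links) >= self.max_links:
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)


def extract_links(html: str, max_links_per_page: int = 200) -> List[str]:
    """Return at most max_links_per_page links, in document order (an assumption)."""
    collector = LinkCollector(max_links_per_page)
    collector.feed(html)
    collector.close()
    return collector.links
```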

Anchor embedding

anchor_batch_size controls how many anchor texts are embedded in a single GPU pass. Larger batches are more efficient but use more memory. embed_threshold_factor sets how closely an anchor’s text must match your keywords before it is worth embedding – anchors below this threshold get a zero similarity score without the GPU cost.
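
The gating and batching described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crawler's code: keyword_match_score is a hypothetical scorer (the fraction of keywords found in the anchor text), embed_batch stands in for the GPU embedding call, and using embed_threshold_factor directly as the cutoff is an assumption about how the factor is applied.

```python
from typing import Callable, List, Sequence


def keyword_match_score(anchor: str, keywords: Sequence[str]) -> float:
    """Hypothetical score: fraction of keywords that occur in the anchor text."""
    if not keywords:
        return 0.0
    text = anchor.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def score_anchors(
    anchors: Sequence[str],
    keywords: Sequence[str],
    embed_batch: Callable[[List[str]], List[float]],
    embed_threshold_factor: float = 0.5,
    anchor_batch_size: int = 512,
) -> List[float]:
    """Return one similarity score per anchor; anchors below the cutoff stay at 0.0."""
    scores = [0.0] * len(anchors)
    # Only anchors that match the keywords closely enough are sent for embedding.
    selected = [
        i for i, anchor in enumerate(anchors)
        if keyword_match_score(anchor, keywords) >= embed_threshold_factor
    ]
    # Embed the selected anchors in chunks of at most anchor_batch_size.
    for start in range(0, len(selected), anchor_batch_size):
        batch_idx = selected[start:start + anchor_batch_size]
        sims = embed_batch([anchors[i] for i in batch_idx])
        for i, sim in zip(batch_idx, sims):
            scores[i] = sim
    return scores
```

With a small anchor_batch_size the embedding call runs more often on shorter lists; a larger value means fewer calls, each holding more anchors in memory at once.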

Propagation threshold

propagation_threshold gates whether a page’s child links are added to the frontier. At 0.0 (the default), all links propagate. In auto mode, the crawler raises this threshold as the model learns, which keeps the frontier focused on promising regions: low-scoring pages stop contributing new URLs.
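
A minimal sketch of this gate, assuming page scores are non-negative and that a page passes when its score is at least the threshold. propagate_links is a hypothetical helper and the frontier is modeled as a plain list, which is a simplification of whatever the crawler actually uses.

```python
from typing import List


def propagate_links(
    parent_score: float,
    child_links: List[str],
    frontier: List[str],
    propagation_threshold: float = 0.0,
) -> int:
    """Add child links to the frontier only if the parent page scores high enough.

    Returns the number of links added.
    """
    if parent_score < propagation_threshold:
        # Low-scoring parent: none of its links enter the frontier.
        return 0
    frontier.extend(child_links)
    return len(child_links)
```

At the default of 0.0 every page with a non-negative score passes, which matches the "all links propagate" behavior described above.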

Optional: extract_url_pattern

A regex pattern for extracting URLs from page text (not just <a> tags). Useful for pages that embed URLs in JavaScript or plain text.

[parse]
extract_url_pattern = "https://example\\.com/programs/\\d+"
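
To show what such a pattern matches, here is a small Python sketch. Note that the doubled backslashes in the TOML string decode to single backslashes, so the regex itself is https://example\.com/programs/\d+. The sketch uses Python's re module, which may differ in details from the regex engine the crawler uses; extract_urls is a hypothetical helper, and dropping duplicates is an assumption.

```python
import re
from typing import List

# The regex that the TOML basic string above decodes to.
EXTRACT_URL_PATTERN = r"https://example\.com/programs/\d+"


def extract_urls(page_text: str, pattern: str = EXTRACT_URL_PATTERN) -> List[str]:
    """Find URLs anywhere in the raw page text, keeping first occurrences only."""
    seen = set()
    urls: List[str] = []
    for match in re.finditer(pattern, page_text):
        url = match.group(0)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


page = (
    '<script>var next = "https://example.com/programs/42";</script>'
    " See https://example.com/programs/7 or https://example.com/about."
)
found = extract_urls(page)
```

Here found contains the two /programs/ URLs, including the one embedded in JavaScript, while the /about URL is skipped because it does not match the pattern.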

[[parse.extract_fields]]

Define CSS selectors to extract structured fields from pages.

[[parse.extract_fields]]
name = "tuition"
selector = ".tuition-fee"
attribute = "text"

[[parse.extract_fields]]
name = "deadline"
selector = "span.deadline"
attribute = "text"

[[parse.extract_fields]]
name = "apply_link"
selector = "a.apply-button"
attribute = "href"

| Field | Type | Default | Description |
|---|---|---|---|
| name | str | | Field name in the output |
| selector | str | | CSS selector to match elements |
| attribute | str | text | Which attribute to extract (text, href, etc.) |
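
The sketch below illustrates how the three example fields might be applied to a page. It is a simplified stand-in, not the crawler's extractor: it uses only Python's standard library, supports just tag, .class, and tag.class selectors, assumes matched elements have closing tags, and keeps the first match per field (all of these are assumptions). FieldExtractor, parse_selector, and extract_fields are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


def parse_selector(selector: str) -> Tuple[Optional[str], Optional[str]]:
    """Split 'tag.class', '.class' or 'tag' into (tag, class). Simplified on purpose."""
    tag, _, cls = selector.partition(".")
    return (tag or None, cls or None)


class FieldExtractor(HTMLParser):
    def __init__(self, fields: List[Dict[str, str]]) -> None:
        super().__init__()
        self.fields = fields
        self.results: Dict[str, str] = {}
        self._open: Optional[Tuple[str, str]] = None  # (field name, tag) being captured
        self._depth = 0
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._open is not None:
            # Inside a text capture: only track nesting of the same tag name.
            if tag == self._open[1]:
                self._depth += 1
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        for field in self.fields:
            name = field["name"]
            if name in self.results:
                continue  # first match wins (assumption)
            want_tag, want_cls = parse_selector(field["selector"])
            if (want_tag and want_tag != tag) or (want_cls and want_cls not in classes):
                continue
            attribute = field.get("attribute", "text")  # "text" is the documented default
            if attribute == "text":
                if self._open is None:
                    self._open, self._depth, self._text = (name, tag), 1, []
            elif attr_map.get(attribute) is not None:
                self.results[name] = attr_map[attribute]

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[1]:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace in the captured text.
                self.results[self._open[0]] = " ".join("".join(self._text).split())
                self._open = None


def extract_fields(html: str, fields: List[Dict[str, str]]) -> Dict[str, str]:
    parser = FieldExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.results
```

Each entry in fields mirrors one [[parse.extract_fields]] table: a dict with name, selector, and optionally attribute.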