Parse

The parse stage extracts useful information from downloaded HTML. It pulls out text for scoring, discovers new URLs for the frontier, and optionally extracts structured fields using CSS selectors.

Text extraction

Each page produces three text signals, extracted separately because they carry different amounts of information:

  • Title: the content of the <title> tag. Usually the most concentrated signal about what a page is about.
  • Headings: text from <h1> through <h6> elements, concatenated. These capture the page’s structure and topic hierarchy.
  • Body: the visible text content after stripping tags, scripts, and styles. The broadest signal but also the noisiest.

These three signals flow into the score stage where they are weighted independently. The weights are learnable – the model figures out whether title or body matters more for your particular crawl.
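The three-signal split can be sketched with the standard library alone. This is an illustrative stub, not the crawler's actual parser; the class and function names here are hypothetical:

```python
from html.parser import HTMLParser

class SignalExtractor(HTMLParser):
    """Collects title, heading, and body text into separate buckets."""
    SKIP = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title, self.headings, self.body = [], [], []
        self._stack = []  # open-tag stack, used to classify text nodes

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIP for t in self._stack):
            return  # skip whitespace and script/style content
        if "title" in self._stack:
            self.title.append(text)
        elif any(t in self.HEADINGS for t in self._stack):
            self.headings.append(text)
        else:
            self.body.append(text)

def extract_signals(html: str) -> dict:
    p = SignalExtractor()
    p.feed(html)
    return {
        "title": " ".join(p.title),
        "headings": " ".join(p.headings),
        "body": " ".join(p.body),
    }
```

A real parser has to cope with malformed markup and nested edge cases, but the essential idea is the same: one pass, three buckets.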

Link extraction

Every <a href> in the page is a potential new URL for the frontier. The parser extracts all outbound links and normalises them (resolving relative URLs, stripping fragments). Links are filtered through the configured filter chain before being added to the frontier.
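The normalisation step maps cleanly onto the standard library. A minimal sketch (the function name is illustrative):

```python
from urllib.parse import urljoin, urldefrag

def normalise_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve relative hrefs against the page URL and strip fragments."""
    seen, out = set(), []
    for href in hrefs:
        absolute = urljoin(page_url, href)    # resolve relative URLs
        url, _fragment = urldefrag(absolute)  # drop the #fragment part
        if url not in seen:                   # deduplicate
            seen.add(url)
            out.append(url)
    return out
```

Stripping fragments matters for deduplication: `page.html#fees` and `page.html#apply` are the same document and should occupy one frontier slot, not two.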

Anchor text – the clickable text of a link – is particularly valuable. If the anchor text says “PhD programme in philosophy”, that is a strong hint about what is on the other side of that link.

Anchor embedding

Links whose anchor text matches any of your configured keywords get their anchor text embedded using the same MiniLM model used for scoring. This is done in a batched GPU pass (anchor_batch_size controls the batch size) for efficiency.

The resulting embedding similarity becomes the anchor_similarity feature in the URL’s feature vector, giving the DQN a semantic signal about whether a link is worth following.
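The similarity itself is plain cosine similarity between embedding vectors. In this sketch the embeddings are assumed to be precomputed (in the real pipeline they come from the batched MiniLM pass); the function names are illustrative:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anchor_similarities(anchor_vecs: list[list[float]],
                        query_vec: list[float]) -> list[float]:
    # One anchor_similarity value per matched link; each becomes a
    # feature in that URL's feature vector.
    return [cosine(v, query_vec) for v in anchor_vecs]
```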

Field extraction

For structured crawling, you can define CSS selectors to extract specific fields from pages:

[[parse.extract_fields]]
name = "tuition"
selector = ".tuition-fee"
attribute = "text"

This is optional and primarily useful when you want to pull structured data (prices, dates, names) out of pages alongside the relevance scoring.
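To make the mechanism concrete, here is a minimal stdlib-only sketch that handles just the case in the config above: a single `.class` selector with `attribute = "text"`. A real implementation would use a full CSS selector engine; everything here is illustrative:

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collects text inside elements carrying a given class."""
    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self._depth = 0   # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._depth or self.class_name in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.texts.append(data.strip())

def extract_field(html: str, selector: str) -> str:
    assert selector.startswith("."), "sketch handles only .class selectors"
    p = FieldExtractor(selector[1:])
    p.feed(html)
    return " ".join(p.texts)
```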

Adaptive HTML truncation

Not all pages are equal in size. Some domains serve lean HTML; others dump megabytes of JavaScript and boilerplate. The parser enforces a per-domain HTML size limit that adapts over time:

  • max_html_bytes: hard cap on how much HTML to process (default: 512KB).
  • html_limit_factor: a learnable multiplier applied per domain. Domains with useful content in large pages get a higher factor; domains where the first few KB contain everything useful get a lower one.

This prevents the crawler from wasting time parsing bloated pages that do not contribute useful text.
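The cap computation is simple: the global hard limit scaled by the domain's factor. A sketch using the config names above (the update rule for the factor itself is learned and not shown here):

```python
DEFAULT_MAX_HTML_BYTES = 512 * 1024  # config: max_html_bytes

def domain_limit(factors: dict, domain: str,
                 max_html_bytes: int = DEFAULT_MAX_HTML_BYTES) -> int:
    # html_limit_factor defaults to 1.0 for unseen domains.
    return int(max_html_bytes * factors.get(domain, 1.0))

def truncate_html(html: bytes, factors: dict, domain: str) -> bytes:
    """Cut the raw HTML at this domain's adaptive byte limit."""
    return html[: domain_limit(factors, domain)]
```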

Propagation threshold

Not every discovered link deserves to enter the frontier. The propagation_threshold parameter sets a minimum parent score – if the page that contains the link scored below this threshold, its child links are not added to the frontier.

This is adaptive: early in the crawl when the model is still learning, the threshold is low (let everything through). As the crawler gets better at predicting relevance, the threshold rises, keeping the frontier focused on promising regions of the web.
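One simple way to realise that schedule is a linear ramp from a permissive floor to a stricter ceiling over a warm-up period. The interpolation below is an assumption for illustration, not the crawler's exact rule, and all names are hypothetical:

```python
def propagation_threshold(pages_crawled: int,
                          warmup_pages: int = 1000,
                          floor: float = 0.05,
                          ceiling: float = 0.5) -> float:
    """Ramp the minimum parent score up as the crawl matures."""
    progress = min(pages_crawled / warmup_pages, 1.0)
    return floor + (ceiling - floor) * progress

def should_propagate(parent_score: float, pages_crawled: int) -> bool:
    # Child links enter the frontier only if the parent page
    # scored at or above the current threshold.
    return parent_score >= propagation_threshold(pages_crawled)
```

Early on a page scoring 0.3 propagates its links; once the model is trusted, the same score no longer clears the bar.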

Configuration

See the [parse] config reference for all tuneable fields.