Score

The score stage determines how relevant each fetched page is to your search. It combines two complementary approaches – semantic embedding similarity and keyword density – into a single score between 0 and 1.

Semantic similarity

Forager uses MiniLM-L6-v2 to compute sentence embeddings. Your reference description is embedded once at startup, and each page’s text is embedded at score time. Relevance is the cosine similarity between the two.

This captures meaning rather than exact words. A reference about “European master’s programme in philosophy” will match pages that say “MA in continental thought” even though there is no keyword overlap.
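The cosine step itself is simple. A minimal sketch in pure Python, using toy 3-dimensional vectors in place of the real MiniLM embeddings (which are 384-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for 384-d MiniLM embeddings.
reference = [0.2, 0.9, 0.1]
page = [0.3, 0.8, 0.2]
print(round(cosine_similarity(reference, page), 3))  # high similarity, ~0.983
```

In practice both vectors come from the same embedding model, so similar meanings land close together regardless of wording.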

Anti-reference

You can also provide an anti-reference – a description of what you do not want. Pages similar to the anti-reference get penalised. This is useful when your topic overlaps with something you want to exclude (e.g., you want process philosophy but not analytic philosophy).

The anti-reference penalty is subtracted from the similarity score, weighted by anti_weight.
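A sketch of that subtraction, assuming the result is clamped back into [0, 1] (the parameter name anti_weight comes from the text; the clamp is an assumption):

```python
def semantic_score(sim_ref, sim_anti, anti_weight):
    """Similarity to the reference, penalised by similarity to the anti-reference."""
    raw = sim_ref - anti_weight * sim_anti
    # Clamp to [0, 1] so the penalty cannot push the score negative (assumed).
    return max(0.0, min(1.0, raw))
```

For example, a page scoring 0.8 against the reference but 0.6 against the anti-reference, with anti_weight = 0.5, ends up at 0.5.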

Multi-signal blend

Page text is not treated as a single blob. The parser extracts three signals – title, heading, and body – and each gets its own embedding and similarity score. The final semantic affinity is a weighted sum:

affinity = w_title * sim_title + w_heading * sim_heading + w_body * sim_body

The weights (w_title, w_heading, w_body) are learnable parameters that sum to 1. The model discovers whether title or body is more informative for your particular crawl. For academic programme pages, title tends to matter most. For blog posts, body might dominate.
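The blend is a straightforward convex combination. A sketch, with illustrative weight values standing in for whatever the model has learned:

```python
def signal_affinity(sim_title, sim_heading, sim_body,
                    w_title, w_heading, w_body):
    """Weighted sum of per-signal similarities; weights must sum to 1."""
    assert abs(w_title + w_heading + w_body - 1.0) < 1e-6
    return w_title * sim_title + w_heading * sim_heading + w_body * sim_body

# Illustrative learned weights favouring the title signal.
affinity = signal_affinity(0.9, 0.6, 0.4, w_title=0.5, w_heading=0.2, w_body=0.3)
```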

Keyword matching

Keyword scoring uses term groups. Each group contains related terms with individual weights. A page’s keyword density for a group is the weighted count of matching terms normalised by text length.

When multiple groups are marked required, their scores are combined with a geometric mean. This means a page must match all required groups to score well – matching just one group drives the score toward zero.

density = (group_1_score * group_2_score * ... * group_n_score) ^ (1/n)

This is powerful for intersection queries: “pages about philosophy AND master’s programmes” eliminates Wikipedia articles (philosophy only) and generic university listings (programmes only).
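The mechanics above can be sketched as follows. Term groups are modelled here as dicts of term → weight, and the cap at 1.0 on per-group density is an assumption; the geometric mean over required groups follows the formula in the text:

```python
import math
import re

def group_density(text, terms):
    """Weighted count of whole-word term matches, normalised by word count."""
    words = len(text.split()) or 1
    hits = sum(weight * len(re.findall(r"\b" + re.escape(term) + r"\b",
                                       text.lower()))
               for term, weight in terms.items())
    return min(1.0, hits / words)  # cap at 1.0 is an assumption

def keyword_score(text, required_groups):
    """Geometric mean across required groups: any zero group zeroes the score."""
    scores = [group_density(text, group) for group in required_groups]
    if any(s == 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

groups = [{"philosophy": 1.0}, {"programme": 1.0, "program": 0.8}]
# Matches both groups -> nonzero; a philosophy-only page would score 0.0.
score = keyword_score("a master's programme in philosophy", groups)
```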

Final score formula

The two signals are blended with a learnable weight:

score = sem_w * affinity + (1 - sem_w) * density

Where sem_w is the semantic weight (default 0.7, learnable in range 0.3–0.9). When keywords are weak, the model can lean on embeddings, and vice versa.
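The final blend, as a sketch directly from the formula above (the 0.3–0.9 bound on sem_w is taken from the text):

```python
def final_score(affinity, density, sem_w=0.7):
    """Blend semantic affinity and keyword density with learnable weight sem_w."""
    assert 0.3 <= sem_w <= 0.9  # learnable range stated in the text
    return sem_w * affinity + (1 - sem_w) * density
```

With the default sem_w = 0.7, a page with affinity 0.8 and density 0.4 scores 0.7 * 0.8 + 0.3 * 0.4 = 0.68.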

Language penalty

If you configure target languages (e.g., languages = ["en", "de"]), pages detected in other languages receive a penalty. The lang_penalty parameter (default: 0.0, adaptive) controls how harshly off-language pages are penalised. The model learns the right penalty strength – in multilingual crawls it stays low; in monolingual crawls it ramps up.
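The text does not spell out the exact form of the penalty; one plausible sketch is a multiplicative discount (the function name and the multiplicative form are assumptions, not Forager's confirmed implementation):

```python
def apply_language_penalty(score, detected_lang, target_langs, lang_penalty=0.0):
    """Hypothetical form: discount off-language pages multiplicatively."""
    if target_langs and detected_lang not in target_langs:
        return score * (1.0 - lang_penalty)
    return score
```

With the default lang_penalty = 0.0 this is a no-op; as the model ramps the penalty up, off-language pages fade out of the frontier.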

Reference blending

The reference embedding is not static. After each round, it drifts slightly toward the embeddings of pages that scored well. The reference_blend parameter controls the drift rate:

reference = (1 - blend) * reference + blend * mean(relevant_embeddings)

This lets the crawler refine its understanding of what “relevant” means based on what it actually finds. If your initial reference was vague, the embedding sharpens over time as the crawler discovers concrete examples.
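The update is an exponential moving average over embedding vectors. A sketch, with plain Python lists standing in for the real embedding arrays:

```python
def blend_reference(reference, relevant_embeddings, blend):
    """Drift the reference toward the mean of relevant-page embeddings (EMA)."""
    n = len(relevant_embeddings)
    mean = [sum(vec[i] for vec in relevant_embeddings) / n
            for i in range(len(reference))]
    return [(1 - blend) * r + blend * m for r, m in zip(reference, mean)]
```

A real implementation would likely renormalise the result to unit length before the next round's cosine comparisons, but that detail is an assumption here.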

Scoring pipeline

flowchart TD
    page["Fetched page"]
    extract["Extract title,<br/>headings, body"]
    embed["Embed each signal<br/>(MiniLM-L6-v2)"]
    sim["Cosine similarity<br/>vs reference"]
    anti["Anti-reference<br/>penalty"]
    blend["Signal blend<br/>(learnable weights)"]
    kw["Keyword density<br/>(term groups)"]
    final["Final score<br/>sem_w * affinity +<br/>(1 - sem_w) * density"]
    lang["Language<br/>penalty"]
    out["Page score<br/>(0 to 1)"]

    page --> extract --> embed --> sim
    sim --> blend
    anti --> blend
    blend --> final
    kw --> final
    final --> lang --> out

    style page fill:#2d2d2d,stroke:#555,color:#eee
    style extract fill:#1a3a4a,stroke:#4a9,color:#eee
    style embed fill:#1a3a4a,stroke:#4a9,color:#eee
    style sim fill:#1a3a4a,stroke:#4a9,color:#eee
    style anti fill:#3a1a1a,stroke:#a44,color:#eee
    style blend fill:#1a3a4a,stroke:#4a9,color:#eee
    style kw fill:#1a3a4a,stroke:#4a9,color:#eee
    style final fill:#1a3a1a,stroke:#4a4,color:#eee
    style lang fill:#3a1a1a,stroke:#a44,color:#eee
    style out fill:#1a3a1a,stroke:#4a4,color:#eee

Configuration

See the [score] config reference for all tuneable fields including term groups, semantic settings, and signal weights.