Score
The score stage determines how relevant each fetched page is to your search. It combines two complementary approaches – semantic embedding similarity and keyword density – into a single score between 0 and 1.
Semantic similarity
Forager uses MiniLM-L6-v2 to compute sentence embeddings. Your reference description is embedded once at startup, and each page’s text is embedded at score time. Relevance is the cosine similarity between the two.
This captures meaning rather than exact words. A reference about “European master’s programme in philosophy” will match pages that say “MA in continental thought” even though there is no keyword overlap.
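As an illustration of the core computation (the embedding model itself is elided; the toy 4-dimensional vectors below stand in for MiniLM's 384-dimensional output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the reference embedding and a page embedding.
reference = np.array([0.2, 0.8, 0.1, 0.5])
page = np.array([0.25, 0.7, 0.0, 0.6])

sim = cosine_similarity(reference, page)
# Close to 1.0 here: the vectors point in nearly the same direction.
```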
Anti-reference
You can also provide an anti-reference – a description of what you do not want. Pages similar to the anti-reference get penalised. This is useful when your topic overlaps with something you want to exclude (e.g., you want process philosophy but not analytic philosophy).
The anti-reference penalty is subtracted from the similarity score, weighted by anti_weight.
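A minimal sketch of that subtraction (`penalised_similarity`, `sim_ref`, and `sim_anti` are illustrative names, not Forager's API):

```python
def penalised_similarity(sim_ref: float, sim_anti: float, anti_weight: float) -> float:
    """Subtract similarity to the anti-reference, scaled by anti_weight."""
    return sim_ref - anti_weight * sim_anti

# A page close to the anti-reference loses score even when it matches
# the reference well: 0.80 - 0.5 * 0.70 = 0.45.
score = penalised_similarity(0.80, 0.70, anti_weight=0.5)
```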
Multi-signal blend
Page text is not treated as a single blob. The parser extracts three signals – title, heading, and body – and each gets its own embedding and similarity score. The final semantic affinity is a weighted sum:
affinity = w_title * sim_title + w_heading * sim_heading + w_body * sim_body
The weights (w_title, w_heading, w_body) are learnable parameters that sum to 1. The model discovers whether title or body is more informative for your particular crawl. For academic programme pages, title tends to matter most. For blog posts, body might dominate.
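The blend above can be sketched as a plain weighted sum (the signal names mirror the formula; the weight values shown are only examples of what the model might learn):

```python
def affinity(sims: dict, weights: dict) -> float:
    """Weighted sum of per-signal similarities; weights must sum to 1."""
    return sum(weights[s] * sims[s] for s in sims)

sims = {"title": 0.9, "heading": 0.6, "body": 0.4}
weights = {"title": 0.5, "heading": 0.2, "body": 0.3}  # learned, sum to 1

# 0.5 * 0.9 + 0.2 * 0.6 + 0.3 * 0.4 = 0.69
a = affinity(sims, weights)
```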
Keyword matching
Keyword scoring uses term groups. Each group contains related terms with individual weights. A page’s keyword density for a group is the weighted count of matching terms normalised by text length.
When multiple groups are marked required, their scores are combined with a geometric mean. This means a page must match all required groups to score well – a zero score on any one required group drives the combined density toward zero, no matter how strong the others are.
density = (group_1_score * group_2_score * ... * group_n_score) ^ (1/n)
This is powerful for intersection queries: “pages about philosophy AND master’s programmes” eliminates Wikipedia articles (philosophy only) and generic university listings (programmes only).
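The geometric mean behaviour can be seen in a few lines (a sketch of the formula above, not Forager's implementation):

```python
import math

def required_density(group_scores: list) -> float:
    """Geometric mean of required term-group scores.

    Any group at zero zeroes the whole result, enforcing the AND semantics.
    """
    n = len(group_scores)
    return math.prod(group_scores) ** (1 / n)

matched_both = required_density([0.6, 0.5])   # sqrt(0.30), roughly 0.55
matched_one = required_density([0.6, 0.0])    # collapses to 0.0
```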
Final score formula
The two signals are blended with a learnable weight:
score = sem_w * affinity + (1 - sem_w) * density
Where sem_w is the semantic weight (default 0.7, learnable in range 0.3–0.9). When keywords are weak, the model can lean on embeddings, and vice versa.
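As a worked example of the blend (illustrative values; only the 0.7 default comes from this doc):

```python
def final_score(affinity: float, density: float, sem_w: float = 0.7) -> float:
    """Blend semantic affinity with keyword density; sem_w is learned in [0.3, 0.9]."""
    return sem_w * affinity + (1 - sem_w) * density

# Strong embeddings, weak keywords: 0.7 * 0.8 + 0.3 * 0.4 = 0.68.
score = final_score(affinity=0.8, density=0.4)
```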
Language penalty
If you configure target languages (e.g., languages = ["en", "de"]), pages detected in other languages receive a penalty. The lang_penalty parameter (default: 0.0, adaptive) controls how harshly off-language pages are penalised. The model learns the right penalty strength – in multilingual crawls it stays low; in monolingual crawls it ramps up.
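The doc does not specify whether the penalty is subtractive or multiplicative; the sketch below assumes a simple subtraction clamped at zero, with illustrative names:

```python
def apply_language_penalty(score: float, page_lang: str,
                           targets: list, lang_penalty: float) -> float:
    """Penalise pages whose detected language is outside the target set.

    Assumes a subtractive penalty clamped at zero; the real mechanism may differ.
    """
    if page_lang in targets:
        return score
    return max(0.0, score - lang_penalty)

# A French page in an en/de crawl with a learned penalty of 0.2: 0.7 -> 0.5.
penalised = apply_language_penalty(0.7, "fr", ["en", "de"], lang_penalty=0.2)
```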
Reference blending
The reference embedding is not static. After each round, it drifts slightly toward the embeddings of pages that scored well. The reference_blend parameter controls the drift rate:
reference = (1 - blend) * reference + blend * mean(relevant_embeddings)
This lets the crawler refine its understanding of what “relevant” means based on what it actually finds. If your initial reference was vague, the embedding sharpens over time as the crawler discovers concrete examples.
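The drift update maps directly onto the formula above (toy 2-dimensional vectors for clarity):

```python
import numpy as np

def blend_reference(reference: np.ndarray, relevant: np.ndarray,
                    blend: float) -> np.ndarray:
    """Drift the reference toward the mean embedding of high-scoring pages."""
    return (1 - blend) * reference + blend * relevant.mean(axis=0)

reference = np.array([1.0, 0.0])
relevant = np.array([[0.0, 1.0],
                     [0.0, 1.0]])  # embeddings of pages that scored well

# With blend = 0.1 the reference moves 10% toward the relevant mean.
updated = blend_reference(reference, relevant, blend=0.1)
```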
Scoring pipeline
flowchart TD
page["Fetched page"]
extract["Extract title,<br/>headings, body"]
embed["Embed each signal<br/>(MiniLM-L6-v2)"]
sim["Cosine similarity<br/>vs reference"]
anti["Anti-reference<br/>penalty"]
blend["Signal blend<br/>(learnable weights)"]
kw["Keyword density<br/>(term groups)"]
final["Final score<br/>sem_w * affinity +<br/>(1 - sem_w) * density"]
lang["Language<br/>penalty"]
out["Page score<br/>(0 to 1)"]
page --> extract --> embed --> sim
sim --> blend
anti --> blend
blend --> final
kw --> final
final --> lang --> out
style page fill:#2d2d2d,stroke:#555,color:#eee
style extract fill:#1a3a4a,stroke:#4a9,color:#eee
style embed fill:#1a3a4a,stroke:#4a9,color:#eee
style sim fill:#1a3a4a,stroke:#4a9,color:#eee
style anti fill:#3a1a1a,stroke:#a44,color:#eee
style blend fill:#1a3a4a,stroke:#4a9,color:#eee
style kw fill:#1a3a4a,stroke:#4a9,color:#eee
style final fill:#1a3a1a,stroke:#4a4,color:#eee
style lang fill:#3a1a1a,stroke:#a44,color:#eee
style out fill:#1a3a1a,stroke:#4a4,color:#eee
Configuration
See the [score] config reference for all tuneable fields including term groups, semantic settings, and signal weights.