Your First Crawl

This walkthrough creates a crawl that searches for European master’s programmes in process philosophy. Adapt the reference text and terms to your own use case.

1. Create a package

forager new phil-ma --reference "European master's programme in process philosophy, \
continental philosophy, metaphysics of becoming. Thinkers: Whitehead, Deleuze, \
Simondon, Bergson. MA or MSc with faculty working on process thought."

This creates data/phil-ma/ with a generated config file. The --reference text becomes the semantic anchor that pages are scored against. You can also start from a hand-written config:

forager new phil-ma --config configs/uni.toml

2. Edit the config

Open data/phil-ma/config.toml and review the generated settings. A minimal config looks like this:

[target]
name = "phil-ma"

[score]
relevance_threshold = { value = 0.10, mode = "auto" }

[[score.groups]]
name = "philosophy"
required = true
weight = 1.0
terms = [
    { text = "process philosophy", weight = 3.0 },
    { text = "Whitehead", weight = 2.5 },
    { text = "continental philosophy", weight = 2.0 },
]

[[score.groups]]
name = "program"
required = true
weight = 2.0
terms = [
    { text = "master programme", weight = 3.0 },
    { text = "MA philosophy", weight = 2.5 },
    { text = "ECTS", weight = 2.0 },
]

[score.semantic]
reference = """
I am looking for a European master's programme in process philosophy
or continental philosophy, focused on becoming, process, and practice.
"""
weight = { value = 0.7, mode = "range", min = 0.3, max = 0.9 }

[fetch]
seed_urls = [
    "https://www.mastersportal.com/search/master/philosophy/europe",
    "https://www.findamasters.com/masters-degrees/philosophy/",
    "https://philosophicalgourmet.com/overall-rankings/",
]
concurrency = 64
pages_per_round = 50

Key points:

  • Term groups marked required = true must all match for a page to score well. Here both groups are required, so a Wikipedia article about Whitehead (philosophy only, no programme info) scores low, and an MSc in data science (programme only, wrong topic) scores low. An MA in process philosophy matches both groups and scores high.
  • mode = "auto" and mode = "range" mark parameters that the system adjusts during crawling. Parameters given as a plain value stay fixed.
  • reference is embedded with MiniLM-L6-v2 and compared against page content. The reference embedding itself adapts over time toward high-scoring pages.
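
To make the required-group behaviour concrete, here is a minimal Python sketch. It is not forager's actual scoring code: the naive substring matching, the weighted-mean formula, and the hard zero for a missing required group are all assumptions chosen for illustration (the real scorer may only penalise rather than zero the score). The names Term, Group, group_score, and page_score are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Term:
    text: str
    weight: float


@dataclass
class Group:
    name: str
    required: bool
    weight: float
    terms: list[Term]


def group_score(group: Group, page_text: str) -> float:
    """Share of the group's total term weight found in the page (0.0 to 1.0)."""
    text = page_text.lower()
    total = sum(t.weight for t in group.terms)
    hit = sum(t.weight for t in group.terms if t.text.lower() in text)
    return hit / total if total else 0.0


def page_score(groups: list[Group], page_text: str) -> float:
    """Weighted mean of group scores; assumed to be zero if a required group is missing."""
    scores = [(g, group_score(g, page_text)) for g in groups]
    if any(g.required and s == 0.0 for g, s in scores):
        return 0.0
    total_weight = sum(g.weight for g, _ in scores)
    return sum(g.weight * s for g, s in scores) / total_weight


# Groups copied from the example config above.
groups = [
    Group("philosophy", True, 1.0, [
        Term("process philosophy", 3.0),
        Term("Whitehead", 2.5),
        Term("continental philosophy", 2.0),
    ]),
    Group("program", True, 2.0, [
        Term("master programme", 3.0),
        Term("MA philosophy", 2.5),
        Term("ECTS", 2.0),
    ]),
]

wiki = "Alfred North Whitehead was an English mathematician who developed process philosophy."
data_sci = "MSc in Data Science: a two-year master programme worth 120 ECTS."
ma = "MA Philosophy: a master programme in process philosophy (Whitehead, Bergson), 120 ECTS."

for label, page in [("wiki", wiki), ("data_sci", data_sci), ("ma", ma)]:
    print(label, round(page_score(groups, page), 3))
```

Under these assumptions the first two pages are gated out because one required group has no match, while the MA page scores highly on both groups.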

3. Import additional terms (optional)

If you have a list of terms in a CSV file, import them:

forager import phil-ma -t terms.csv

The CSV can be simple (one term per line) or structured (group,text,weight with a header row). Example terms.csv:

group,text,weight
philosophy,speculative realism,1.5
philosophy,radical empiricism,1.5
philosophy,non-representational theory,1.0
program,postgraduate,2.0
program,admission,1.5

Imported terms are merged with the terms in the config. Where the two conflict, the config terms take priority.
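
The two CSV layouts and the merge rule can be sketched in Python as follows. This is an illustration, not forager's import code. It assumes that "priority" means the config entry wins when the same term appears in the same group, that terms are compared case-insensitively, and that simple one-term-per-line files land in a default group with a default weight. The function names and the "imported" group name are hypothetical.

```python
import csv
import io


def parse_terms(csv_text, default_group="imported", default_weight=1.0):
    """Parse either CSV layout into (group, text, weight) tuples."""
    lines = [ln.strip() for ln in csv_text.splitlines() if ln.strip()]
    if lines and lines[0].lower() == "group,text,weight":
        # Structured layout: header row followed by group,text,weight rows.
        reader = csv.DictReader(io.StringIO("\n".join(lines)))
        return [(r["group"].strip(), r["text"].strip(), float(r["weight"])) for r in reader]
    # Simple layout: one term per line (defaults are assumptions).
    return [(default_group, ln, default_weight) for ln in lines]


def merge_terms(config_terms, imported_terms):
    """Merge two term lists keyed by (group, lowercased text); config entries win."""
    merged = {(g, t.lower()): (g, t, w) for g, t, w in imported_terms}
    merged.update({(g, t.lower()): (g, t, w) for g, t, w in config_terms})
    return sorted(merged.values())


structured = """group,text,weight
philosophy,speculative realism,1.5
philosophy,process philosophy,1.0
program,postgraduate,2.0
"""
config_terms = [("philosophy", "process philosophy", 3.0)]

merged = merge_terms(config_terms, parse_terms(structured))
for group, text, weight in merged:
    print(group, text, weight)
```

Here "process philosophy" appears in both sources, and the merged list keeps the config weight of 3.0 rather than the imported 1.0.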

4. Run the crawl

forager run phil-ma

The crawler starts fetching seed URLs, scoring pages, training the DQN agent, and expanding the frontier. Output shows round-by-round progress:

[round 1]  fetched 50  scored 47  relevant 3  frontier 892  epsilon 0.98
[round 2]  fetched 50  scored 48  relevant 5  frontier 1304  epsilon 0.95
[round 3]  fetched 50  scored 46  relevant 8  frontier 1687  epsilon 0.91
...

The fields are:

  • fetched – pages downloaded this round
  • scored – pages successfully parsed and scored
  • relevant – pages above the relevance threshold
  • frontier – total URLs queued for future rounds
  • epsilon – DQN exploration rate (starts high, decays toward 0.1)
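
If you capture this output to a file, a small script can track how the share of relevant pages changes across rounds. The Python sketch below parses lines in exactly the format shown above. It assumes the format stays as printed here, and the helper names parse_round and harvest_rate are hypothetical.

```python
import re

# Matches one progress line in the format shown above.
ROUND_RE = re.compile(
    r"\[round (?P<round>\d+)\]\s+fetched (?P<fetched>\d+)\s+scored (?P<scored>\d+)"
    r"\s+relevant (?P<relevant>\d+)\s+frontier (?P<frontier>\d+)\s+epsilon (?P<epsilon>[\d.]+)"
)


def parse_round(line):
    """Return the fields of a progress line as a dict, or None if it does not match."""
    m = ROUND_RE.search(line)
    if m is None:
        return None
    return {k: (float(v) if k == "epsilon" else int(v)) for k, v in m.groupdict().items()}


def harvest_rate(rounds):
    """Relevant pages as a fraction of scored pages over all given rounds."""
    scored = sum(r["scored"] for r in rounds)
    relevant = sum(r["relevant"] for r in rounds)
    return relevant / scored if scored else 0.0


log = """\
[round 1]  fetched 50  scored 47  relevant 3  frontier 892  epsilon 0.98
[round 2]  fetched 50  scored 48  relevant 5  frontier 1304  epsilon 0.95
[round 3]  fetched 50  scored 46  relevant 8  frontier 1687  epsilon 0.91
...
"""

# Lines that are not progress lines (such as "...") are skipped.
rounds = [r for r in map(parse_round, log.splitlines()) if r is not None]
for r in rounds:
    print(f"round {r['round']}: {r['relevant'] / r['scored']:.3f} relevant per scored page")
print(f"overall: {harvest_rate(rounds):.3f}")
```

A rising per-round ratio alongside a falling epsilon suggests the agent is learning which links lead to relevant pages.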

Stop the crawl with Ctrl-C; state is saved automatically. To resume later, run the same command again:

forager run phil-ma           # resumes the latest run
forager run phil-ma --new     # starts a fresh run (keeps data)

5. Check status

forager status phil-ma

This shows learned parameters, crawl statistics, top domains, and score distributions. It is useful for deciding whether to keep running or adjust the config.

6. Query results

Use GQL queries to explore the database directly:

forager query phil-ma "MATCH (p:Page) WHERE p.score > 0.3 RETURN p.url, p.score ORDER BY p.score DESC LIMIT 20"
forager query phil-ma "MATCH (d:Domain) RETURN d.name, d.fetches, d.reward_sum ORDER BY d.fetches DESC"
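
To run the same kind of query from a script, you can wrap the CLI call. The Python sketch below only builds and runs the command shown above. It assumes forager is on your PATH and prints results to standard output. The output format is not specified here, so the result is returned as raw text. The helper names are hypothetical.

```python
import subprocess


def build_query_command(package: str, min_score: float, limit: int) -> list[str]:
    """Build the argv for a top-pages query like the first example above."""
    # Coerce to numeric types so arbitrary strings cannot end up inside the query.
    query = (
        f"MATCH (p:Page) WHERE p.score > {float(min_score)} "
        f"RETURN p.url, p.score ORDER BY p.score DESC LIMIT {int(limit)}"
    )
    return ["forager", "query", package, query]


def run_query(package: str, min_score: float = 0.3, limit: int = 20) -> str:
    """Run the query and return whatever forager writes to stdout (format assumed to be text)."""
    cmd = build_query_command(package, min_score, limit)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(run_query("phil-ma"))
```

Passing the arguments as a list avoids shell quoting problems with the parentheses and quotes in the query string.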

For richer analysis from Julia (or any language with a C FFI), forager-db builds as a shared library. See Database Model for details.

Other useful commands

forager list                        # show all packages
forager log phil-ma                 # run history for this package
forager tune phil-ma semantic_weight 0.8   # override a learned parameter

For a full command reference, see CLI Commands.