Your First Crawl
This walkthrough creates a crawl that searches for European master’s programmes in process philosophy. Adapt the reference text and terms to your own use case.
1. Create a package
forager new phil-ma --reference "European master's programme in process philosophy, \
continental philosophy, metaphysics of becoming. Thinkers: Whitehead, Deleuze, \
Simondon, Bergson. MA or MSc with faculty working on process thought."
This creates data/phil-ma/ with a generated config file. The --reference text becomes the semantic anchor that pages are scored against. You can also start from a hand-written config:
forager new phil-ma --config configs/uni.toml
2. Edit the config
Open data/phil-ma/config.toml and review the generated settings. A minimal config looks like this:
[target]
name = "phil-ma"
[score]
relevance_threshold = { value = 0.10, mode = "auto" }
[[score.groups]]
name = "philosophy"
required = true
weight = 1.0
terms = [
{ text = "process philosophy", weight = 3.0 },
{ text = "Whitehead", weight = 2.5 },
{ text = "continental philosophy", weight = 2.0 },
]
[[score.groups]]
name = "program"
required = true
weight = 2.0
terms = [
{ text = "master programme", weight = 3.0 },
{ text = "MA philosophy", weight = 2.5 },
{ text = "ECTS", weight = 2.0 },
]
[score.semantic]
reference = """
I am looking for a European master's programme in process philosophy
or continental philosophy, focused on becoming, process, and practice.
"""
weight = { value = 0.7, mode = "range", min = 0.3, max = 0.9 }
[fetch]
seed_urls = [
"https://www.mastersportal.com/search/master/philosophy/europe",
"https://www.findamasters.com/masters-degrees/philosophy/",
"https://philosophicalgourmet.com/overall-rankings/",
]
concurrency = 64
pages_per_round = 50
Key points:
- Because both term groups set required = true, a page must match both groups to score well. A Wikipedia article about Whitehead (philosophy only, no programme info) scores low. An MSc in data science (programme only, wrong topic) scores low. An MA in process philosophy matches both groups and scores high.
- mode = "auto" and mode = "range" mark parameters that the system adjusts during crawling. Fixed values stay fixed.
- reference is embedded with MiniLM-L6-v2 and compared against page content. The reference embedding itself adapts over time toward high-scoring pages.
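The effect of required groups can be sketched in a few lines of Python. This is not forager's actual scoring code: the substring matching, the hard zero for a missing required group, and the names group_score and page_score are all assumptions made for illustration. The groups and weights are copied from the config above.

```python
# Term groups copied from the example config.
GROUPS = [
    {"name": "philosophy", "required": True, "weight": 1.0,
     "terms": [("process philosophy", 3.0), ("Whitehead", 2.5),
               ("continental philosophy", 2.0)]},
    {"name": "program", "required": True, "weight": 2.0,
     "terms": [("master programme", 3.0), ("MA philosophy", 2.5),
               ("ECTS", 2.0)]},
]


def group_score(text, terms):
    """Sum the weights of the terms found in the text (case-insensitive substring)."""
    lowered = text.lower()
    return sum(weight for term, weight in terms if term.lower() in lowered)


def page_score(text, groups):
    """Weighted sum of group scores; 0.0 if any required group has no match.

    Assumption: a hard gate. The real scorer may only penalise, not zero out.
    """
    total = 0.0
    for group in groups:
        score = group_score(text, group["terms"])
        if group["required"] and score == 0:
            return 0.0
        total += group["weight"] * score
    return total


wiki = "Alfred North Whitehead developed process philosophy in the 1920s."
data_msc = "MSc in Data Science: a two-year master programme, 120 ECTS."
phil_ma = ("MA Philosophy: a master programme in process philosophy "
           "(Whitehead, Deleuze), 120 ECTS.")

print(page_score(wiki, GROUPS))      # 0.0  (no programme terms)
print(page_score(data_msc, GROUPS))  # 0.0  (no philosophy terms)
print(page_score(phil_ma, GROUPS))   # 20.5 (both groups match)
```

Only the third page matches both groups, so it is the only one with a non-zero score in this simplified model.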
3. Import additional terms (optional)
If you have a list of terms in a CSV file, import them:
forager import phil-ma -t terms.csv
The CSV can be simple (one term per line) or structured (group,text,weight with a header row). Example terms.csv:
group,text,weight
philosophy,speculative realism,1.5
philosophy,radical empiricism,1.5
philosophy,non-representational theory,1.0
program,postgraduate,2.0
program,admission,1.5
Imported terms are merged with the terms already in the config. When the same term appears in both, the config entry takes priority.
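As a sketch of what such a merge could look like, the Python below reads either CSV form and lets config terms win on conflict. It is an illustration only: the functions read_terms and merge_terms, the default group name "imported", the default weight 1.0, and the choice to key terms by exact (group, text) are assumptions, not forager's documented behaviour.

```python
import csv

HEADER = "group,text,weight"


def read_terms(csv_text, default_group="imported", default_weight=1.0):
    """Parse simple (one term per line) or structured CSV into {(group, text): weight}."""
    lines = [line for line in csv_text.splitlines() if line.strip()]
    if lines and lines[0].strip().lower() == HEADER:
        reader = csv.DictReader(lines)
        return {(row["group"].strip(), row["text"].strip()): float(row["weight"])
                for row in reader}
    # Simple form: each non-empty line is one term with default group and weight.
    return {(default_group, line.strip()): default_weight for line in lines}


def merge_terms(config_terms, imported_terms):
    """Combine both mappings; on a duplicate key the config value is kept."""
    merged = dict(imported_terms)
    merged.update(config_terms)
    return merged


# Terms taken from the example config.
config_terms = {
    ("philosophy", "process philosophy"): 3.0,
    ("philosophy", "Whitehead"): 2.5,
}

# Illustrative CSV: the Whitehead row deliberately clashes with the config.
structured = """group,text,weight
philosophy,speculative realism,1.5
philosophy,Whitehead,1.0
program,postgraduate,2.0
"""

merged = merge_terms(config_terms, read_terms(structured))
for key, weight in sorted(merged.items()):
    print(key, weight)
```

In this sketch the imported Whitehead weight of 1.0 is discarded in favour of the config's 2.5, while the new terms are added unchanged.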
4. Run the crawl
forager run phil-ma
The crawler starts fetching seed URLs, scoring pages, training the DQN agent, and expanding the frontier. Output shows round-by-round progress:
[round 1] fetched 50 scored 47 relevant 3 frontier 892 epsilon 0.98
[round 2] fetched 50 scored 48 relevant 5 frontier 1304 epsilon 0.95
[round 3] fetched 50 scored 46 relevant 8 frontier 1687 epsilon 0.91
...
- fetched – pages downloaded this round
- scored – pages successfully parsed and scored
- relevant – pages above the relevance threshold
- frontier – total URLs queued for future rounds
- epsilon – DQN exploration rate (starts high, decays toward 0.1)
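To track these numbers over time, the progress lines can be parsed with a small script. The Python sketch below assumes the output looks exactly like the sample above (field names in that order, separated by whitespace); the real output may differ in spacing or fields, and parse_round is a hypothetical helper, not part of forager.

```python
import re

# Pattern for one progress line, modelled on the sample output above.
ROUND_LINE = re.compile(
    r"\[round (?P<round>\d+)\]\s+fetched (?P<fetched>\d+)\s+scored (?P<scored>\d+)"
    r"\s+relevant (?P<relevant>\d+)\s+frontier (?P<frontier>\d+)"
    r"\s+epsilon (?P<epsilon>\d+(?:\.\d+)?)"
)


def parse_round(line):
    """Return the fields of a progress line as a dict, or None if it does not match."""
    match = ROUND_LINE.search(line)
    if match is None:
        return None
    stats = {key: int(value) for key, value in match.groupdict().items()
             if key != "epsilon"}
    stats["epsilon"] = float(match["epsilon"])
    return stats


SAMPLE = """\
[round 1] fetched 50 scored 47 relevant 3 frontier 892 epsilon 0.98
[round 2] fetched 50 scored 48 relevant 5 frontier 1304 epsilon 0.95
[round 3] fetched 50 scored 46 relevant 8 frontier 1687 epsilon 0.91
"""

rounds = [parse_round(line) for line in SAMPLE.splitlines()]
for stats in rounds:
    # Share of scored pages that cleared the relevance threshold this round.
    hit_rate = stats["relevant"] / stats["scored"]
    print(stats["round"], f"{hit_rate:.1%}", stats["epsilon"])
```

A rising share of relevant pages alongside a falling epsilon is the pattern the sample output shows across its three rounds.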
Stop the crawl with Ctrl-C; the crawler saves its state automatically. Resume later, or start a fresh run:
forager run phil-ma # resumes the latest run
forager run phil-ma --new # starts a fresh run (keeps data)
5. Check status
forager status phil-ma
This shows learned parameters, crawl statistics, top domains, and score distributions. It is useful for deciding whether to keep running or adjust the config.
6. Query results
Use GQL queries to explore the database directly:
forager query phil-ma "MATCH (p:Page) WHERE p.score > 0.3 RETURN p.url, p.score ORDER BY p.score DESC LIMIT 20"
forager query phil-ma "MATCH (d:Domain) RETURN d.name, d.fetches, d.reward_sum ORDER BY d.fetches DESC"
For richer analysis from Julia (or any language with C FFI), forager-db builds as a shared library. See Database Model for details.
Other useful commands
forager list # show all packages
forager log phil-ma # run history for this package
forager tune phil-ma semantic_weight 0.8 # override a learned parameter
For a full command reference, see CLI Commands.