Wiki: Crawler/Utilities / File Location Listing



Crawls are performed by crawler/app/crawler.py.

The crawler needs write access to the DB directory, the location of which can be defined with the environment variable DB_PATH.

The crawler also needs access to its configuration files. By default, these are expected to be in $DB_PATH/config; however, the location can be overridden via the env var CONFIG_BASE.
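
For illustration only, the config directory could be resolved roughly as in the Python sketch below (the function name is made up, and the precedence/fallback behaviour is an assumption based on the description above, not the crawler's actual code):

import os

def get_config_dir():
    # Assumption: CONFIG_BASE, when set, replaces DB_PATH as the base
    # directory, and the config files live under <base>/config (matching
    # the CONFIG_BASE/config/sites.txt path used in the Crawling section).
    base = os.environ.get("CONFIG_BASE") or os.environ["DB_PATH"]
    return os.path.join(base, "config")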


Configuration Files

do_not_crawl.txt (optional)

Introduced here.

To blocklist a domain, add it to config/do_not_crawl.txt

The file is expected to be a list of domain names (it didn't make sense to include scheme/path etc. for this).
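
For example (the domains below are placeholders):

social.example.invalid
news.example.invalid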

The intention is that this list should always be kept quite short, consisting only of domains that crawled content is known to link out to regularly. Using do_not_crawl for those domains bypasses a deletion check, making crawls a little more efficient, but only so long as the blocklist is small (domains not in the blocklist still won't be crawled if they're not in sites.txt).

Adding a previously indexed domain to do_not_crawl.txt will not lead to entries being deleted during the next crawl (although, obviously, re-validation will still be able to remove them).

ignoretags.txt (optional)

A file of tags, with one tag per line.

Any tag listed here will be skipped when building the tags index - you won't be able to search for it.

However, at time of writing, the tag will still be displayed in search results.
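
For example, an ignoretags.txt might contain (tag names here are purely illustrative):

draft
todo
private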

path-hints.txt (optional)

Path Hints are a way to tell the crawler about URLs that it might not otherwise be able to discover.

The config is a simple text file with one URL per line.
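
For example (placeholder URLs):

https://example.invalid/unlinked/page1.html
https://example.invalid/unlinked/page2.html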

site-allowregexes.txt (optional)

A configuration file of site-specific regular expressions. If at least one allowregex is provided for a domain, then only URLs matching that domain's allowregexes will be crawled.

Rules are prefixed with the domain name to which they should be applied, followed by !#!. There should be one rule per line.

dom1.example.com!#!https://dom1.example.com/Indexed/.*
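
As a sketch only, such rules could be loaded and applied along these lines (function names, and whether matching is anchored at the start of the URL, are assumptions rather than a description of crawler.py itself):

import re
from collections import defaultdict
from urllib.parse import urlparse

def load_allowregexes(path):
    # One rule per line: "<domain>!#!<regex>".
    rules = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                domain, _, pattern = line.partition("!#!")
                rules[domain].append(re.compile(pattern))
    return rules

def allowed_by_site_rules(url, rules):
    # Domains with no allowregexes are unrestricted; otherwise the URL
    # must match at least one of that domain's patterns.
    patterns = rules.get(urlparse(url).netloc)
    return not patterns or any(p.match(url) for p in patterns)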

sites.txt (required)

A list of schemes and domains to index.

Any URL discovered during crawl must fall within one of the listed domains to be queued for crawling.

Lines beginning with a # are treated as comments and ignored.

https://example.invalid
http://subdomain.example.invalid
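
Purely as an illustration of that constraint, a scope check might look something like this (the exact matching semantics, e.g. subdomain handling, are not specified here):

from urllib.parse import urlparse

def in_scope(url, sites):
    # sites: the non-comment lines of sites.txt, e.g. "https://example.invalid"
    parsed = urlparse(url)
    for site in sites:
        listed = urlparse(site)
        if parsed.scheme == listed.scheme and parsed.netloc == listed.netloc:
            return True
    return False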

skipregexes.txt (optional)

A list of regular expressions to test URLs against. If a URL matches one of the listed regular expressions, it will not be crawled.

https:\/\/wiki\.example\.invalid\/.*?.*action=diff.*
https:\/\/wiki\.example\.invalid\/.*?.*action=attr.*

Note: these are very powerful, but need to be tested against every URL. If possible, it's better to use skipstrings instead.

skipstrings.txt (optional)

A list of strings, one per line.

If any of the listed strings appears within a URL, that URL will not be crawled.

.txt~
.bak
.swp
.kate-swp
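
To show why the plain substring check is cheaper than skipregexes, a combined filter could look roughly like this (names and values are illustrative only, not the crawler's actual implementation):

import re

# Illustrative values; in practice these come from skipstrings.txt
# and skipregexes.txt.
SKIP_STRINGS = [".bak", ".swp"]
SKIP_REGEXES = [re.compile(r"action=diff")]

def should_skip(url):
    # Substring membership is a cheap scan per string...
    if any(s in url for s in SKIP_STRINGS):
        return True
    # ...whereas each regular expression has to be evaluated against every URL.
    return any(r.search(url) for r in SKIP_REGEXES)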

Crawling

The crawler's default behaviour is to read a list of sites from CONFIG_BASE/config/sites.txt and then recursively crawl them.

Invocation can therefore be as simple as

export DB_PATH="/opt/file-indexdb"
./crawler/app/crawler.py

Once the crawl is complete, the crawler will trigger an index rebuild.

It is also possible to provide the crawler with a single site to crawl

export DB_PATH="/opt/file-indexdb"
./crawler/app/crawler.py [url]

Note that this site must still exist in sites.txt (otherwise it will fail crawl constraints).

Crawls are intended to run as a cron job, and the Kubernetes config includes a CronJob to that effect.
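
Outside of Kubernetes, an equivalent crontab entry might look something like this (the schedule and install path are placeholders):

0 2 * * * DB_PATH="/opt/file-indexdb" /opt/crawler/app/crawler.py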