I have some domains where I only want to crawl a specific prefix.
The old crawler allowed that prefix to be included in the site list:
http://example.invalid/foo/
however, it made the site list a bit clunky.
Instead, I want to create a list of site-specific regular expressions that'll be used to allow crawling if a URL matches.
The rules should be site-specific so that they aren't tested against other domains: there's no point slowing crawls of other domains down by testing regexes that can never match.
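Roughly, the per-domain check might look something like the sketch below. This is an illustration of the idea only; the names (`ALLOW_REGEXES`, `should_crawl`) and the rule structure are assumptions rather than the implemented code:

```python
import re
from urllib.parse import urlparse

# Hypothetical structure: domain -> list of compiled allow regexes.
# Only domains listed here have their URLs tested against regexes.
ALLOW_REGEXES = {
    "example.invalid": [re.compile(r"^/foo/")],
}

def should_crawl(url: str) -> bool:
    parsed = urlparse(url)
    rules = ALLOW_REGEXES.get(parsed.netloc)
    if rules is None:
        # No site-specific rules for this domain: crawling isn't
        # restricted, and no regexes are ever tested against it
        return True
    # Only this domain's rules are tested, so other crawls stay fast
    return any(rule.search(parsed.path) for rule in rules)

print(should_crawl("http://example.invalid/foo/bar.html"))  # True
print(should_crawl("http://example.invalid/other/"))        # False
print(should_crawl("http://another.invalid/anything"))      # True
```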
Activity
30-Dec-23 11:27
assigned to @btasker
30-Dec-23 11:27
The rules will go into `config/site-allowregexes.txt` with a rule per line. Rule format is:
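Purely as an illustration (the separator and anchoring below are assumptions rather than the confirmed format), a ruleset pairing a domain with a path regex on each line might look like:

```
# config/site-allowregexes.txt - hypothetical example
example.invalid ^/foo/
docs.example.invalid ^/manual/
```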
30-Dec-23 11:32
The rules parser will, by default, allow the very root of the domain to be indexed. Without this default rule, the crawler will refuse to crawl the domain at all.
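A minimal sketch of that default behaviour, assuming rules are path regexes and the parser simply prepends an implicit root rule (the regex and function name here are assumptions):

```python
import re

def build_ruleset(user_rules):
    """Compile a site's allow rules, always including an implicit rule
    for the bare domain root so the crawl has somewhere to start."""
    rules = [re.compile(r"^/$")]  # implicit default: allow the root path
    rules.extend(re.compile(r) for r in user_rules)
    return rules
```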
It's also worth noting that an allow rule, by definition, has to match at every level of recursion.
For example, take a directory structure where `bar` sits inside `foo`. If the applied ruleset only contained a rule matching `bar`, the contents of `bar` would not be indexed: in order to get to `bar`, we'd first need to crawl `foo`, and there isn't a rule which matches `foo` alone. We'd need either a more complex regex, or two rules.
30-Dec-23 11:32
mentioned in commit 3299289c55d517f10f64f22565970581f57f7e7d
Message
feat: implement support for domain specific allowlisting (utilities/file_location_listing#17)