project Utilities / File Location Listing

utilities/file_location_listing#17: Site specific path allowlist

Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.1
Created: 30-Dec-23 11:27


I have some domains where I only want to crawl a specific prefix.

The old crawler allowed that prefix to be included in the site list, but it made the sitelist a bit clunky.

Instead, I want to create a list of site-specific regular expressions that will be used to allow crawling when a URL matches.

It should be site-specific so that the rules aren't tested against other domains: there's no point slowing crawls of other domains down by testing regexes that can never match.
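A minimal sketch of that idea, assuming rules are stored per-domain in a dict keyed by hostname (the names `RULES` and `should_crawl` are illustrative, not from the codebase):

```python
import re

# Illustrative structure: compiled regexes grouped by domain, so a URL is
# only ever tested against its own domain's rules and other domains pay
# no regex-matching cost.
RULES = {
    "example.com": [re.compile(r"^/docs/")],
}

def should_crawl(domain: str, path: str) -> bool:
    """Return True if the path matches one of the domain's allow rules.

    Assumption for this sketch: domains with no rules configured are
    allowed unconditionally, while domains with rules require a match.
    """
    site_rules = RULES.get(domain)
    if site_rules is None:
        return True  # no restrictions configured for this domain
    return any(rule.search(path) for rule in site_rules)
```

Because the lookup is keyed by domain, a crawl of `other.com` never touches `example.com`'s patterns at all.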



assigned to @btasker

The rules will go into config/site-allowregexes.txt with a rule per line.

Rule format is


The rules parser will, by default, allow the very root of the domain to be indexed:


Without this rule, the crawler will refuse to crawl the domain at all.

It's also worth noting that the allow, by definition, has to match on every level of recursion.

For example, with the following dir structure:

  - foo
    - bar

If the following ruleset were applied


The contents of bar would not be indexed. In order to get to bar, we'd first need to crawl foo, and there isn't a rule which matches it alone. We'd need either a more complex regex, or two rules.
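The effect can be demonstrated with a small sketch (the `^/foo/bar` pattern and the `allowed` helper are illustrative, not from the codebase):

```python
import re

def allowed(path: str, patterns: list) -> bool:
    """True if any allow pattern matches the path."""
    return any(re.search(p, path) for p in patterns)

# A single rule targeting only the nested directory:
single = [r"^/foo/bar"]
# /foo itself doesn't match, so the crawler never descends far enough
# to discover /foo/bar in the first place.
assert allowed("/foo", single) is False
assert allowed("/foo/bar", single) is True

# Fix 1: two explicit rules, one per level of recursion.
two_rules = [r"^/foo$", r"^/foo/bar"]
assert allowed("/foo", two_rules) is True

# Fix 2: a single, more complex regex covering both levels.
complex_rule = [r"^/foo(/bar.*)?$"]
assert allowed("/foo", complex_rule) is True
assert allowed("/foo/bar", complex_rule) is True
```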


mentioned in commit 3299289c55d517f10f64f22565970581f57f7e7d

Commit: 3299289c55d517f10f64f22565970581f57f7e7d 
Author: B Tasker                            
Date: 2023-12-30T11:28:11.000+00:00 


feat: implement support for domain specific allowlisting (utilities/file_location_listing#17)

+46 -2 (48 lines changed)