
utilities/file_location_listing#17: Site specific path allowlist



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.1
Created: 30-Dec-23 11:27



Description

I have some domains where I only want to crawl a specific prefix.

The old crawler allowed that prefix to be included in the site list:

http://example.invalid/foo/

However, this made the sitelist a bit clunky.

Instead, I want to create a list of site specific regular expressions that'll be used to allow crawling if a URL matches.

It should be site specific so that the rules aren't tested against other domains: there's no point slowing down crawls of other domains by testing regexes that can never match.
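As a rough illustration of the site-specific lookup described above, here is a minimal sketch. The `RULES` mapping and the `url_permitted` helper are hypothetical names, not taken from the crawler, and it assumes that a domain with no rules is left unrestricted.

```python
import re
from urllib.parse import urlsplit

# Hypothetical in-memory ruleset: hostname -> list of compiled allow regexes
RULES = {
    "example.invalid": [re.compile(r"^https?://example\.invalid/foo/")],
}


def url_permitted(url, rules=RULES):
    """Return True if the URL may be crawled under the per-site rules."""
    host = urlsplit(url).hostname
    site_rules = rules.get(host)
    if site_rules is None:
        # Assumption: domains without rules are not restricted,
        # so no regex is evaluated for them at all
        return True
    return any(pattern.match(url) for pattern in site_rules)
```

Keying the rules by hostname means a crawl of an unrelated domain costs one dictionary lookup rather than a pass over every regex.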




Activity


assigned to @btasker

The rules will go into config/site-allowregexes.txt with a rule per line.

The rule format is:

domainname!#!regex
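A minimal sketch of how lines in this format might be parsed. The `parse_rules` function is hypothetical, and skipping blank or malformed lines is an assumption rather than something the commit is known to do.

```python
import re
from collections import defaultdict

DELIMITER = "!#!"


def parse_rules(lines):
    """Group compiled regexes by domain name."""
    rules = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and lines without the delimiter are ignored
        if not line or DELIMITER not in line:
            continue
        # Split on the first delimiter only, so the regex may contain "!#!"
        domain, pattern = line.split(DELIMITER, 1)
        rules[domain.strip().lower()].append(re.compile(pattern))
    return dict(rules)


example_lines = [
    r"example.invalid!#!https://example\.invalid/foo/$",
    r"example.invalid!#!https://example\.invalid/foo/bar/.*",
    "",
]
rules = parse_rules(example_lines)
```

In practice the lines would be read from config/site-allowregexes.txt, for example by passing the open file object to `parse_rules`.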

The rules parser will, by default, allow the very root of the domain to be indexed:

f'^http(s)?://{dom}$'

Without this rule, the crawler will refuse to crawl the domain at all.
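The behaviour of that default pattern can be checked in isolation. This sketch wraps it in a hypothetical `default_root_rule` helper and assumes matching is done with `re.match`.

```python
import re


def default_root_rule(dom):
    """Build the default root rule, using the pattern quoted above."""
    # dom is interpolated unescaped, as in the original pattern
    return re.compile(f'^http(s)?://{dom}$')


root = default_root_rule("example.invalid")

matches_https = root.match("https://example.invalid") is not None
matches_http = root.match("http://example.invalid") is not None
matches_subpath = root.match("https://example.invalid/foo/") is not None
matches_trailing_slash = root.match("https://example.invalid/") is not None
```

As written, the pattern matches the bare root over either scheme but not a sub-path, and it also does not match the root URL with a trailing slash.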

It's also worth noting that an allow rule, by definition, has to match at every level of recursion.

For example, with the following dir structure:

/ 
  - foo
    -- bar 

If the following ruleset were applied:

example.invalid!#!https://example\.invalid/foo/bar/.*

The contents of bar would not be indexed. In order to get to bar, we'd first need to crawl foo, and there isn't a rule which matches foo on its own. We'd need either a more complex regex or two rules:

example.invalid!#!https://example\.invalid/foo/$
example.invalid!#!https://example\.invalid/foo/bar/.*
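The recursion requirement can be sketched as follows. The helper names are hypothetical, matching with `re.match` is an assumption, and the crawl path is modelled simply as the chain of URLs that must each be allowed in turn.

```python
import re


def is_allowed(url, patterns):
    """True if any allow regex matches the URL."""
    return any(re.match(p, url) for p in patterns)


def reachable(chain, patterns):
    """True only if every URL on the way down from the root is allowed."""
    return all(is_allowed(url, patterns) for url in chain)


# Default root rule plus the rules from the example above
root_rule = r"^http(s)?://example.invalid$"
one_rule = [root_rule, r"https://example\.invalid/foo/bar/.*"]
two_rules = [
    root_rule,
    r"https://example\.invalid/foo/$",
    r"https://example\.invalid/foo/bar/.*",
]

# URLs the crawler has to pass through to reach a file under bar
chain = [
    "https://example.invalid",
    "https://example.invalid/foo/",
    "https://example.invalid/foo/bar/",
    "https://example.invalid/foo/bar/file.txt",
]

single_rule_reaches_bar = reachable(chain, one_rule)
two_rules_reach_bar = reachable(chain, two_rules)
```

With the single rule, the chain breaks at foo, so nothing under bar is ever visited. Adding the rule for foo makes the whole chain allowed.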

mentioned in commit 3299289c55d517f10f64f22565970581f57f7e7d

Commit: 3299289c55d517f10f64f22565970581f57f7e7d 
Author: B Tasker                            
                            
Date: 2023-12-30T11:28:11.000+00:00 

Message

feat: implement support for domain specific allowlisting (utilities/file_location_listing#17)

+46 -2 (48 lines changed)