I have some domains where I only want to crawl a specific prefix.
The old crawler allowed that prefix to be included in the site list:
http://example.invalid/foo/
however, it made the site list a bit clunky.
Instead, I want to create a list of site-specific regular expressions that'll be used to allow crawling if a URL matches.
The rules should be site-specific so that they aren't tested against other domains: there's no point slowing crawls of other domains down by testing regexes that can never match.
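Roughly, the per-domain check might look something like the sketch below. This is an illustration of the idea only; the names (`ALLOW_REGEXES`, `should_crawl`) and the rule structure are assumptions rather than the implemented code:

```python
import re
from urllib.parse import urlparse

# Hypothetical structure: domain -> list of compiled allow regexes.
# Only domains listed here have their URLs tested against regexes.
ALLOW_REGEXES = {
    "example.invalid": [re.compile(r"^/foo/")],
}

def should_crawl(url: str) -> bool:
    parsed = urlparse(url)
    rules = ALLOW_REGEXES.get(parsed.netloc)
    if rules is None:
        # No site-specific rules for this domain: crawling isn't
        # restricted, and no regexes are ever tested against it
        return True
    # Only this domain's rules are tested, so other crawls stay fast
    return any(rule.search(parsed.path) for rule in rules)

print(should_crawl("http://example.invalid/foo/bar.html"))  # True
print(should_crawl("http://example.invalid/other/"))        # False
print(should_crawl("http://another.invalid/anything"))      # True
```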
Activity
30-Dec-23 11:27
assigned to @btasker
30-Dec-23 11:27
The rules will go into `config/site-allowregexes.txt` with a rule per line. Rule format is:
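Purely as an illustration (the separator and anchoring below are assumptions rather than the confirmed format), a ruleset pairing a domain with a path regex on each line might look like:

```
# config/site-allowregexes.txt - hypothetical example
example.invalid ^/foo/
docs.example.invalid ^/manual/
```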
30-Dec-23 11:32
The rules parser will, by default, allow the very root of the domain to be indexed. Without this default rule, the crawler will refuse to crawl the domain at all.
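A minimal sketch of that default behaviour, assuming rules are path regexes and the parser simply prepends an implicit root rule (the regex and function name here are assumptions):

```python
import re

def build_ruleset(user_rules):
    """Compile a site's allow rules, always including an implicit rule
    for the bare domain root so the crawl has somewhere to start."""
    rules = [re.compile(r"^/$")]  # implicit default: allow the root path
    rules.extend(re.compile(r) for r in user_rules)
    return rules
```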
It's also worth noting that an allow rule, by definition, has to match at every level of recursion.
For example, take a directory structure where `bar` sits inside `foo`. If the applied ruleset only contained a rule matching `bar`, the contents of `bar` would not be indexed: in order to get to `bar`, we'd first need to crawl `foo`, and there isn't a rule which matches `foo` alone. We'd need either a more complex regex, or two rules.
30-Dec-23 11:32
mentioned in commit 3299289c55d517f10f64f22565970581f57f7e7d
Message
feat: implement support for domain specific allowlisting (utilities/file_location_listing#17)