Utilities / File Location Listing

utilities/file_location_listing#45: Domain Blocklist



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.6
Created: 26-Jan-24 18:16



Description

This is intended to deliver a (smallish) efficiency gain to the crawler.

In my notes, I tend to link to Github a lot - it plays host to a lot of the things that I might want to refer on to.

Github, obviously, isn't in my site allowlists, but every Github link picked up during a crawl needs to be checked against a list of approved domains and regexes (admittedly, we do it a bit more efficiently than that).

What'd be good is if it were possible to have a very small list of "blocked" domains so that we can short-circuit past the allow checks (or, at least, look at whether that delivers any kind of saving at all).




Activity


assigned to @btasker

Of course, I guess the counter-argument is that it could also be viewed as adding overhead for all the URLs that are permitted.

Looking at shouldCrawlURL() there is the potential for a little bit of saving.

There's not much of a saving to be made on the lookup against authorised sites:

if parsed_host.lower() not in SITE_LIST:

realistically, we'd be searching a list of 1 or 2 items instead of a list of 10. Technically a saving, but likely to be immeasurably small.
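For illustration, a minimal sketch of where that short-circuit could sit. It's a sketch only: DO_NOT_CRAWL and the example values are assumptions, and the real shouldCrawlURL() does more (regex checks etc.) than is shown here.

from urllib.parse import urlparse

# Assumed/illustrative values - the real config comes from the project's config
DO_NOT_CRAWL = {"github.com"}        # the proposed blocklist
SITE_LIST = {"www.example.com"}      # stand-in for the existing allowlist

def deleteStoredItem(url):
    ...  # existing cleanup (hash, check, remove) described further down

def shouldCrawlURL(url):
    parsed_host = (urlparse(url).hostname or "").lower()

    # Proposed short-circuit: explicitly blocked domains are rejected
    # immediately, skipping the allow checks and the delete handling below
    if parsed_host in DO_NOT_CRAWL:
        return False

    # Existing behaviour (simplified): anything not in the allowlist is
    # rejected, and a delete is queued in case it was previously indexed
    if parsed_host not in SITE_LIST:
        deleteStoredItem(url)
        return False

    return True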

But... if the URL isn't authorised, the function will then queue up a delete for any item stored against that URL - the idea being that if the URL was previously authorised, we want to remove it.

deleteStoredItem() will then

  • Hash the URL
  • Check whether an entry exists in storage for the URL
  • Remove it if it does
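Purely as a sketch of that flow - the hash algorithm, the storage layout and the STORAGE_DIR name are illustrative assumptions, not the project's actual scheme:

import hashlib
import os

STORAGE_DIR = "storage"   # assumed location, for illustration only

def deleteStoredItem(url):
    # 1. Hash the URL to derive the storage key
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()

    # 2. Check whether an entry exists in storage for that key
    path = os.path.join(STORAGE_DIR, url_hash)

    # 3. Remove it if it does
    if os.path.exists(path):
        os.remove(path)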

If we added a blocklist, we'd remove the need for hashing.

However, it might be better to simply negate the need for hashing in the first place: we could instead attempt to look the item up in the index (notes on that in the next comment)

The challenge with that approach, though, is what if there isn't an index?

  1. We could assume that it doesn't exist, so not attempt to pursue a delete
  2. We could fall back to the original behaviour

The problem with 1) is that it means that, if the index has been removed (for whatever reason), recently removed sites won't get purged from storage until a subsequent crawl. It's an edge case, but definitely undesirable behaviour.

The problem with 2) is that, whilst it would probably work, we're adding complexity (and therefore cost) to the handling of stuff that we're not going to index anyway.
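To make that concrete, option 2 would mean an extra branch along these lines for every unauthorised URL (the helper name and the URL-keyed index are assumptions):

def deleteIfPreviouslyStored(url, index):
    # Option 2 sketch: consult the index where we can, fall back to the
    # original hash-and-check behaviour where we can't
    if index is None:
        # No index available: original behaviour, which hashes the URL
        deleteStoredItem(url)
        return

    # Index available: only pay the hashing/storage cost for URLs that
    # the index says we've actually stored
    if url in index:
        deleteStoredItem(url)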

I guess the other option, and I'm not sure I like the sound of it, would be to move the deletion check to be part of the reindexing process:

  • Iterate through storage items (as now)
  • Check each against the permissions list
  • Delete any that don't match

It would add an inordinate amount of expense to index builds (because we'd have to check every key in the DB rather than simply checking stuff as it's crawled), but it would mean that stuff was removed more quickly after a domain was removed from the crawl-list.
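For comparison, the reindex-time version would be something along these lines (iterateStorageItems() is an assumed helper yielding stored URLs; SITE_LIST and deleteStoredItem() are as sketched earlier):

from urllib.parse import urlparse

def purgeUnauthorisedDuringReindex():
    # Reindex-time alternative: every stored item gets checked against the
    # allowlist, not just URLs seen during a crawl - hence the extra cost
    for url in iterateStorageItems():   # assumed helper
        parsed_host = (urlparse(url).hostname or "").lower()
        if parsed_host not in SITE_LIST:
            deleteStoredItem(url)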

I think the logical answer is probably to add the blocklist.

If we later found it added too much overhead, the cost of dropping it is relatively low (the final result wouldn't change: links still wouldn't be indexed, because the allow check would fail where it does now).

verified

mentioned in commit c36f229e36f1452046b8343c5192bc5305bd6757

Commit: c36f229e36f1452046b8343c5192bc5305bd6757
Author: B Tasker
Date: 2024-01-27T17:21:28.000+00:00

Message

feat: skip links if they appear in config/do_not_crawl.txt (utilities/file_location_listing#45)

+21 -0 (21 lines changed)

To blocklist a domain, it needs to be added to config/do_not_crawl.txt

The file is expected to be a list of domain names (it didn't make sense to include scheme/path etc for this).
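As an illustration, a file covering the Github case from the description would contain just the line github.com, and loading it could be as simple as the sketch below (the function name and the empty-set fallback for a missing file are assumptions, not necessarily what the commit does):

def loadDoNotCrawlList(path="config/do_not_crawl.txt"):
    # Assumed loader sketch: one bare domain per line, normalised to lowercase
    try:
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}
    except FileNotFoundError:
        # Treat a missing file as an empty blocklist (an assumption here)
        return set()

The resulting set could then be consulted near the top of shouldCrawlURL(), before the allowlist lookup.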

The intention is that this list should always be quite short, consisting only of domains that crawled content is known to link out to regularly. Using do_not_crawl.txt for those domains bypasses the deletion check, making crawls a little more efficient.

Adding a previously indexed domain to do_not_crawl.txt will not lead to entries being deleted during the next crawl (although, obviously, re-validation will still be able to remove them).