This is intended to deliver a (smallish) efficiency gain to the crawler.
In my notes, I tend to link to Github a lot - it plays host to a lot of the things that I might want to refer back to.
Github, obviously, isn't in my site allowlists, but every Github link picked up during a crawl needs to be checked against a list of approved domains and regexes (admittedly, we do it a bit more efficiently than that).
What'd be good is if it were possible to have a very small list of "blocked" domains so that we can short-circuit past the allow checks (or at least look at whether that delivers any kind of saving at all).
Activity
26-Jan-24 18:16
assigned to @btasker
26-Jan-24 18:17
Of course, I guess the counter-argument is that it could also be viewed as adding overhead to all the URLs that are permitted.
26-Jan-24 18:22
Looking at shouldCrawlURL() there is the potential for a little bit of saving.

There's not much to be made in terms of the lookup against authorised sites - realistically, we'd be searching a list of 1 or 2 items instead of a list of 10. Technically a saving, but likely to be immeasurably small.
But... if the URL isn't authorised, the function will then queue up a delete for any item stored against that URL - the idea being that if the URL was previously authorised, we want to remove it.
deleteStoredItem() will then hash the URL in order to locate the stored copy and remove it. If we added a blocklist, we'd remove the need for that hashing.
However, it might be better to simply negate the need for hashing in the first place: we could instead attempt to look the item up in the index (notes on that in the next comment)
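To make the ordering concrete, here's a minimal sketch of how the blocklist short-circuit could sit within shouldCrawlURL(). Only the function names come from the notes above; the BLOCKLIST/ALLOWLIST structures and the deleteStoredItem() stub are assumptions for illustration, not the project's actual code.

```python
from urllib.parse import urlparse

# Illustrative data only: a tiny set of blocked domains and the existing
# allow-list of approved domains (the real config will differ).
BLOCKLIST = {"github.com"}
ALLOWLIST = {"www.bentasker.co.uk", "projects.bentasker.co.uk"}


def deleteStoredItem(url):
    # Stand-in for the existing function, which (per the notes above)
    # involves hashing the URL to locate any previously stored copy.
    pass


def shouldCrawlURL(url):
    ''' Decide whether a URL should be crawled.

    Checking the (tiny) blocklist first lets us bail out before the
    allow check and, more importantly, before queueing the deletion
    that would otherwise involve hashing the URL.
    '''
    domain = urlparse(url).hostname

    # Short-circuit: domains we never crawl are rejected immediately
    if domain in BLOCKLIST:
        return False

    # Authorised domains get crawled
    if domain in ALLOWLIST:
        return True

    # Not authorised: remove anything previously stored against this
    # URL (this is the path that triggers the hashing)
    deleteStoredItem(url)
    return False
```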
26-Jan-24 18:26
The challenge with that approach, though, is what if there isn't an index?
The problem with 1) is that it means that, if the index has been removed (for whatever reason), recently removed sites won't get purged from storage until a subsequent crawl. It's an edge case, but definitely undesirable behaviour.

The problem with 2) is that, whilst it would probably work, we're adding complexity (and therefore cost) to the handling of stuff that we're not going to index anyway.

27-Jan-24 12:16
I guess the other option, and I'm not sure I like the sound of it, would be to move the deletion check to be part of the reindexing process.
It would add an inordinate amount of expense to index builds (because we'd have to check every key in the DB rather than simply checking stuff as it's crawled), but it would mean that stuff was removed more quickly after a domain was removed from the crawl-list.
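For comparison, a rough sketch of what a reindex-time purge could look like, assuming the store can be treated as a simple mapping of storage key to URL. The helper name and the data structure are illustrative assumptions, not the project's real API.

```python
from urllib.parse import urlparse

# Illustrative allow-list (the real one lives in config)
ALLOWLIST = {"www.bentasker.co.uk", "projects.bentasker.co.uk"}


def purge_unlisted_domains(stored_items):
    ''' Hypothetical reindex-time check.

    stored_items: dict of storage key -> original URL.

    Every stored key gets inspected on every index build - which is
    where the expense comes from - and anything whose domain is no
    longer in the allow list is deleted.
    '''
    for key, url in list(stored_items.items()):
        if urlparse(url).hostname not in ALLOWLIST:
            del stored_items[key]
```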
27-Jan-24 17:03
I think the logical answer is probably to add the blocklist.
If we later found it added too much overhead, the cost of dropping it is relatively low (the final result wouldn't change: links still wouldn't be indexed, because the allow check would fail instead).
27-Jan-24 17:21
mentioned in commit c36f229e36f1452046b8343c5192bc5305bd6757
Message

feat: skip links if they appear in config/do_not_crawl.txt (utilities/file_location_listing#45)

27-Jan-24 17:26
To blocklist a domain, it needs to be added to config/do_not_crawl.txt.

The file is expected to be a list of domain names (it didn't make sense to include scheme/path etc for this).

The intention is that this list should always be quite short, consisting only of domains that crawled content is known to regularly link out to. Using do_not_crawl for those domains bypasses a deletion check, making crawls a little more efficient.

Adding a previously indexed domain to do_not_crawl.txt will not lead to entries being deleted during the next crawl (although, obviously, re-validation will still be able to remove them).
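As a rough illustration of how the file might be consumed: only the file path and its one-domain-per-line format come from the notes above; the loading and checking helpers below are hypothetical, not the project's actual implementation.

```python
from urllib.parse import urlparse


def load_blocklist(path="config/do_not_crawl.txt"):
    ''' Read the do-not-crawl list: one domain per line, with blank
    lines and surrounding whitespace ignored. '''
    try:
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}
    except FileNotFoundError:
        # No file simply means no blocked domains
        return set()


def is_blocked(url, blocklist):
    ''' True if the URL's hostname appears in the blocklist. '''
    domain = (urlparse(url).hostname or "").lower()
    return domain in blocklist


# Example usage
blocklist = load_blocklist()
print(is_blocked("https://github.com/example/repo", blocklist))
```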