This is intended to deliver a (smallish) efficiency gain to the crawler.
In my notes, I tend to link to Github a lot - it plays host to a lot of the things that I might want to refer back to.
Github, obviously, isn't in my site allowlists, but every Github link picked up during a crawl needs to be checked against a list of approved domains and regexes (admittedly, we do it a bit more efficiently than that).
What'd be good is if it were possible to have a very small list of "blocked" domains so that we can short-circuit past the allow checks (or at least look at whether that delivers any kind of saving at all).
Activity
26-Jan-24 18:16
assigned to @btasker
26-Jan-24 18:17
Of course, I guess the counter-argument is that it could also be viewed as adding overhead to all the URLs that are permitted.
26-Jan-24 18:22
Looking at shouldCrawlURL() there is the potential for a little bit of saving.

There's not much to be made in terms of the lookup against authorised sites - realistically, we'd be searching a list of 1 or 2 items instead of a list of 10. Technically a saving, but likely to be immeasurably small.
But... if the URL isn't authorised, the function will then queue up a delete for any item stored against that URL - the idea being that if the URL was previously authorised, we want to remove it.
deleteStoredItem() will then hash the URL in order to locate the stored copy and remove it. If we added a blocklist, we'd remove the need for that hashing.
However, it might be better to simply negate the need for hashing in the first place: we could instead attempt to look the item up in the index (notes on that in the next comment)
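To make the ordering concrete, here's a minimal sketch of how the blocklist short-circuit could sit within shouldCrawlURL(). Only the function names come from the notes above; the BLOCKLIST/ALLOWLIST structures and the deleteStoredItem() stub are assumptions for illustration, not the project's actual code.

```python
from urllib.parse import urlparse

# Illustrative data only: a tiny set of blocked domains and the existing
# allow-list of approved domains (the real config will differ).
BLOCKLIST = {"github.com"}
ALLOWLIST = {"www.bentasker.co.uk", "projects.bentasker.co.uk"}


def deleteStoredItem(url):
    # Stand-in for the existing function, which (per the notes above)
    # involves hashing the URL to locate any previously stored copy.
    pass


def shouldCrawlURL(url):
    ''' Decide whether a URL should be crawled.

    Checking the (tiny) blocklist first lets us bail out before the
    allow check and, more importantly, before queueing the deletion
    that would otherwise involve hashing the URL.
    '''
    domain = urlparse(url).hostname

    # Short-circuit: domains we never crawl are rejected immediately
    if domain in BLOCKLIST:
        return False

    # Authorised domains get crawled
    if domain in ALLOWLIST:
        return True

    # Not authorised: remove anything previously stored against this
    # URL (this is the path that triggers the hashing)
    deleteStoredItem(url)
    return False
```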
26-Jan-24 18:26
The challenge with that approach, though, is what if there isn't an index?
The problem with 1) is that it means that, if the index has been removed (for whatever reason), recently removed sites won't get purged from storage until a subsequent crawl. It's an edge case, but definitely undesirable behaviour.

The problem with 2) is that, whilst it would probably work, we're adding complexity (and therefore cost) to the handling of stuff that we're not going to index anyway.

27-Jan-24 12:16
I guess the other option, and I'm not sure I like the sound of it, would be to move the deletion check to be part of the reindexing process.
It would add an inordinate amount of expense to index builds (because we'd have to check every key in the DB rather than simply checking stuff as it's crawled), but it would mean that stuff was removed more quickly after a domain was removed from the crawl-list.
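For comparison, a rough sketch of what a reindex-time purge could look like, assuming the store can be treated as a simple mapping of storage key to URL. The helper name and the data structure are illustrative assumptions, not the project's real API.

```python
from urllib.parse import urlparse

# Illustrative allow-list (the real one lives in config)
ALLOWLIST = {"www.bentasker.co.uk", "projects.bentasker.co.uk"}


def purge_unlisted_domains(stored_items):
    ''' Hypothetical reindex-time check.

    stored_items: dict of storage key -> original URL.

    Every stored key gets inspected on every index build - which is
    where the expense comes from - and anything whose domain is no
    longer in the allow list is deleted.
    '''
    for key, url in list(stored_items.items()):
        if urlparse(url).hostname not in ALLOWLIST:
            del stored_items[key]
```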
27-Jan-24 17:03
I think the logical answer is probably to add the blocklist.
If we later found it added too much overhead, the cost of dropping it is relatively low (the final result wouldn't change: links still wouldn't be indexed, because the allow check would fail instead).
27-Jan-24 17:21
mentioned in commit c36f229e36f1452046b8343c5192bc5305bd6757
Message

feat: skip links if they appear in config/do_not_crawl.txt (utilities/file_location_listing#45)

27-Jan-24 17:26
To blocklist a domain, it needs to be added to config/do_not_crawl.txt.

The file is expected to be a list of domain names (it didn't make sense to include scheme/path etc for this).

The intention is that this list should always be quite short, consisting only of domains that crawled content is known to regularly link out to. Using do_not_crawl for those domains bypasses a deletion check, making crawls a little more efficient.

Adding a previously indexed domain to do_not_crawl.txt will not lead to entries being deleted during the next crawl (although, obviously, re-validation will still be able to remove them).
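As a rough illustration of how the file might be consumed: only the file path and its one-domain-per-line format come from the notes above; the loading and checking helpers below are hypothetical, not the project's actual implementation.

```python
from urllib.parse import urlparse


def load_blocklist(path="config/do_not_crawl.txt"):
    ''' Read the do-not-crawl list: one domain per line, with blank
    lines and surrounding whitespace ignored. '''
    try:
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}
    except FileNotFoundError:
        # No file simply means no blocked domains
        return set()


def is_blocked(url, blocklist):
    ''' True if the URL's hostname appears in the blocklist. '''
    domain = (urlparse(url).hostname or "").lower()
    return domain in blocklist


# Example usage
blocklist = load_blocklist()
print(is_blocked("https://github.com/example/repo", blocklist))
```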