The crawler will discover and add new content automatically, removing the stored items for any links it finds to be broken.
However, if a file goes away and is no longer linked to, the crawler won't be able to discover that it's broken.
We need a revalidation utility - it needs to be able to pull URLs out of the database, check whether they're still valid and remove them if not.
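In rough terms, the flow would be something like the sketch below (the db object, iter_stored_urls() and delete_stored_item() are hypothetical stand-ins for illustration, not the project's actual API):

```python
# Minimal sketch of a revalidation pass: pull URLs from the database,
# check whether each still resolves, and delete the stored item if not.
# The helper names here are hypothetical.
import requests

def revalidate(db):
    for url in db.iter_stored_urls():              # pull URLs out of the database
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            still_valid = resp.status_code < 400
        except requests.RequestException:
            still_valid = False                     # unreachable counts as broken
        if not still_valid:
            db.delete_stored_item(url)              # remove the defunct entry
```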
Activity
29-Dec-23 14:24
assigned to @btasker
29-Dec-23 14:26
There are two possible ways of going at this one:

- iterate through the stored items on disk, checking each one's valid-at
- iterate through the index and revalidate anything older than a given threshold

The first is more accurate but also much, much more expensive. An index scan is (by design) pretty cheap, whereas loading every bit of data on disk won't be.
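As a sketch of the second (index based) option - assuming the index can yield (url, last_validated) pairs, which is an illustrative layout rather than the real index format:

```python
# Sketch of the cheaper, index-based option: scan the index and pick out
# anything not validated within a threshold, without reading the stored
# items from disk. The (url, last_validated) tuple layout is assumed.
import time

THRESHOLD = 7 * 86400  # revalidate anything older than a week (example value)

def stale_urls(index_entries):
    cutoff = time.time() - THRESHOLD
    return [url for url, last_validated in index_entries if last_validated < cutoff]
```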
29-Dec-23 15:11
I think the answer is probably to start with using the index and then go from there
29-Dec-23 15:12
mentioned in issue #15
29-Dec-23 15:58
mentioned in commit 7a5687ca1c0929d50bbc33d9c50d79d8dc908766
Message
feat: implement script to trigger revalidation (utilities/file_location_listing#14)
Note that defunct items will not currently be removed
29-Dec-23 16:11
mentioned in commit 1cc508ab22a9b248a02ac21b76d164d011887faa
Message
feat: implement deletion of stored items (utilities/file_location_listing#14)
29-Dec-23 16:16
We now have revalidation support - revalidate/revalidate.py will take a random selection of URLs from the index and then trigger a crawl of them. Deletion has been implemented, so defunct (or now blocked by robots/skipstrings/etc) URLs will be removed from the DB.
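In outline, the script's behaviour is along these lines (a sketch only - the function names are stand-ins, not the project's real code):

```python
# Sketch of the revalidation approach: sample some URLs from the index,
# re-crawl them, and delete anything that comes back defunct (or is now
# excluded by robots/skipstrings). Function names are hypothetical.
import random

SAMPLE_SIZE = 100  # example value

def revalidate_sample(index_urls, crawl_url, delete_stored_item):
    sample = random.sample(index_urls, min(SAMPLE_SIZE, len(index_urls)))
    for url in sample:
        result = crawl_url(url)          # re-crawl; crawl applies robots/skipstring checks
        if not result:                   # defunct or now blocked
            delete_stored_item(url)
```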
The docker container has a new mode, reval, which can be used to trigger revalidation: