Project: Utilities / File Location Listing

utilities/file_location_listing#14: Revalidation Support



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.1
Created: 29-Dec-23 14:24



Description

The crawler will discover and add new content automatically, removing stored items for any broken links.

However, if a file goes away and is no longer linked to, the crawler won't be able to discover that it's broken.

We need a revalidation utility - it needs to be able to pull URLs out of the database and then check whether they're still valid (removing them if not).
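As a rough illustration of what that utility needs to do, here's a minimal sketch. The {url: record} mapping and the requests-based check are assumptions for the example, not the project's actual storage layer or crawler:

import requests

def revalidate_all(db: dict) -> None:
    """Check every stored URL and drop entries that no longer resolve.

    `db` is assumed to be a simple {url: record} mapping; the real
    storage layer will differ.
    """
    for url in list(db):  # copy keys so we can delete while iterating
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            still_valid = resp.status_code < 400
        except requests.RequestException:
            still_valid = False  # unreachable hosts count as broken
        if not still_valid:
            del db[url]  # remove the defunct entry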




Activity


assigned to @btasker

There are two possible ways of going about this one:

  • Scan through the entire database, selecting pages with a valid-at older than a given threshold
  • Read in the index and pick a random selection

The first is more accurate but also much, much more expensive: an index scan is (by design) pretty cheap, whereas loading every bit of data on disk won't be.

I think the answer is probably to start with the index and go from there.
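For example, sampling from the index might look something like this. This is a sketch only: it assumes a one-URL-per-line index file, which may not match the real index format:

import random

def pick_revalidation_batch(index_path: str, count: int) -> list:
    """Read the (cheap) index and return a random batch of URLs to recheck."""
    with open(index_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    # random.sample raises if count exceeds the population, so clamp it
    return random.sample(urls, min(count, len(urls)))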

mentioned in issue #15


mentioned in commit 7a5687ca1c0929d50bbc33d9c50d79d8dc908766

Commit: 7a5687ca1c0929d50bbc33d9c50d79d8dc908766
Author: B Tasker
Date: 2023-12-29T15:58:08.000+00:00

Message

feat: implement script to trigger revalidation (utilities/file_location_listing#14)

Note that defunct items will not currently be removed

+79 -25 (104 lines changed)

mentioned in commit 1cc508ab22a9b248a02ac21b76d164d011887faa

Commit: 1cc508ab22a9b248a02ac21b76d164d011887faa
Author: B Tasker
Date: 2023-12-29T16:11:08.000+00:00

Message

feat: implement deletion of stored items (utilities/file_location_listing#14)

+36 -3 (39 lines changed)

We now have revalidation support: revalidate/revalidate.py takes a random selection of URLs from the index and triggers a crawl of them. Deletion has also been implemented, so defunct URLs (or ones now blocked by robots/skipstrings/etc.) will be removed from the DB.
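The deletion decision has to cover more than just dead links. A hedged sketch of that check follows; the function name and parameters are illustrative, not the actual revalidate.py internals:

def should_remove(url, status_code, robots_disallowed, skipstrings):
    """Decide whether a revalidated URL should be dropped from the DB."""
    if status_code >= 400:
        return True  # the link is dead
    if robots_disallowed:
        return True  # now blocked by robots.txt
    # also drop URLs matching a configured skipstring
    return any(s in url for s in skipstrings)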

The Docker container has a new mode, reval, which can be used to trigger revalidation:

docker run --rm -it \
-v /home/ben/tmp/search_db/:/search_db \
-e DB_PATH=/search_db \
-e MODE=reval \
-e REVAL_COUNT=1000 \
test