The crawler will discover and add new content automatically, removing the stored items for any links it finds to be broken.
However, if a file goes away and is no longer linked to, the crawler won't be able to discover that it's broken.
We need a revalidation utility - it needs to be able to pull URLs out of the database, check whether they're still valid and remove them if not.
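In rough terms, the flow would be something like the sketch below (the db object, iter_stored_urls() and delete_stored_item() are hypothetical stand-ins for illustration, not the project's actual API):

```python
# Minimal sketch of a revalidation pass: pull URLs from the database,
# check whether each still resolves, and delete the stored item if not.
# The helper names here are hypothetical.
import requests

def revalidate(db):
    for url in db.iter_stored_urls():              # pull URLs out of the database
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            still_valid = resp.status_code < 400
        except requests.RequestException:
            still_valid = False                     # unreachable counts as broken
        if not still_valid:
            db.delete_stored_item(url)              # remove the defunct entry
```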
Activity
29-Dec-23 14:24
assigned to @btasker
29-Dec-23 14:26
There are two possible ways of going at this one:

- iterate through the stored items on disk, checking each one's valid-at
- iterate through the index and revalidate anything older than a given threshold

The first is more accurate but also much, much more expensive. An index scan is (by design) pretty cheap, whereas loading every bit of data on disk won't be.
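As a sketch of the second (index based) option - assuming the index can yield (url, last_validated) pairs, which is an illustrative layout rather than the real index format:

```python
# Sketch of the cheaper, index-based option: scan the index and pick out
# anything not validated within a threshold, without reading the stored
# items from disk. The (url, last_validated) tuple layout is assumed.
import time

THRESHOLD = 7 * 86400  # revalidate anything older than a week (example value)

def stale_urls(index_entries):
    cutoff = time.time() - THRESHOLD
    return [url for url, last_validated in index_entries if last_validated < cutoff]
```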
29-Dec-23 15:11
I think the answer is probably to start with using the index and then go from there
29-Dec-23 15:12
mentioned in issue #15
29-Dec-23 15:58
mentioned in commit 7a5687ca1c0929d50bbc33d9c50d79d8dc908766
Message
feat: implement script to trigger revalidation (utilities/file_location_listing#14)
Note that defunct items will not currently be removed
29-Dec-23 16:11
mentioned in commit 1cc508ab22a9b248a02ac21b76d164d011887faa
Message
feat: implement deletion of stored items (utilities/file_location_listing#14)
29-Dec-23 16:16
We now have revalidation support - revalidate/revalidate.py will take a random selection of URLs from the index and then trigger a crawl of them. Deletion has been implemented, so defunct (or now blocked by robots/skipstrings/etc) URLs will be removed from the DB.
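In outline, the script's behaviour is along these lines (a sketch only - the function names are stand-ins, not the project's real code):

```python
# Sketch of the revalidation approach: sample some URLs from the index,
# re-crawl them, and delete anything that comes back defunct (or is now
# excluded by robots/skipstrings). Function names are hypothetical.
import random

SAMPLE_SIZE = 100  # example value

def revalidate_sample(index_urls, crawl_url, delete_stored_item):
    sample = random.sample(index_urls, min(SAMPLE_SIZE, len(index_urls)))
    for url in sample:
        result = crawl_url(url)          # re-crawl; crawl applies robots/skipstring checks
        if not result:                   # defunct or now blocked
            delete_stored_item(url)
```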
The docker container has a new mode, reval, which can be used to trigger revalidation: