project Utilities / File Location Listing avatar

utilities/file_location_listing#46: Revalidation should take crawler version into account

Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.6
Created: 05-Feb-24 14:59


Currently, we revalidate URLs by

  • Checking they exist in the index
  • If so, reading last-modified
  • Comparing that to the URL's Last-Modified header

This means that, if a page hasn't changed, we won't re-fetch it.

However, that also means that already crawled pages do not benefit from any future enhancements in the crawler. If, for example, we decided to start recording whether an image had alt-text - we'd have to blow storage away and recrawl everything from scratch.

If we start storing crawler version when storing data, revalidations could check whether the stored item was created by an older version and re-fetch if it was (perhaps even being smart and only refetching if the newer version means there'll have been a change in the data?)

Toggle State Changes


assigned to @btasker


mentioned in commit cef32e6c1580ce09d248993d80045936f60f2e24

Commit: cef32e6c1580ce09d248993d80045936f60f2e24 
Author: B Tasker                            
Date: 2024-03-03T09:11:55.000+00:00 


feat: record the generating engine version in stored files (utilities/file_location_listing#46)

+21 -1 (22 lines changed)

mentioned in commit 6193114f8190e27232b7227db9f40b3cb60ab7c4

Commit: 6193114f8190e27232b7227db9f40b3cb60ab7c4 
Author: B Tasker                            
Date: 2024-03-03T10:28:06.000+00:00 


feat: storage.getStoredInfo() returns the version which generated the stored file (utilities/file_location_listing#46)

+9 -4 (13 lines changed)

mentioned in commit b8fdbd7681a3c57d2fe79aa445be002f53c1d97b

Commit: b8fdbd7681a3c57d2fe79aa445be002f53c1d97b 
Author: B Tasker                            
Date: 2024-03-03T10:40:07.000+00:00 


feat: refetch pages if the underlying engine version has changed (utilities/file_location_listing#46)

It's worth noting that, as it stands, deploying this will lead to the next crawl doing a full re-crawl.

We should probably look at making sure that that's not the case

+22 -3 (25 lines changed)

The crawler will now look at the stored version when revalidating a page. If that's changed, it overrides the following checks

  • Has the page been indexed today?
  • Has last-modified changed

The first is overridden so that re-crawls can be done after a software update.

The second is overridden because the stored metadata needs to be updated (otherwise, the version check will fail on every crawl). We could perhaps have added the ability to edit files and update the metadata but that adds a load of complexity compared to simply re-fetching.

There is, however, a drawback to both.

Previous version (obviously) didn't store a version number, so any pre-existing stored info will return a "version" of 0.0.0.

This means that, if I were to cut a release now, a full re-crawl would end up being triggered.

Within the use-case for this change, that's what we want - the whole idea is that we're signalling a change in stored attributes.

In this case, that's undesirable - other than adding the version number, we haven't actually made any changes to the stored attributes (and won't), so it's not worth the additional load.

To avoid that, I think we should hardcode the default version value to be the same as is currently returned by config.getEngineVersion() (0.2.6) - that way all existing storage files will, effectively, be grandfathered into this version.

The other thing I want to look at before closing this out, is naming.

At the moment, we refer to the newly added version as the Engine Version (getEngineVersion() etc).

I'm not sure that that's actually what we want though.

In the next release, we might (for example) release improvements to index navigation. Technically, the engine version will have changed, but there will have been no change to stored attributes.

In that scenario, we wouldn't want files to be re-fetched. But, the naming of the version string does imply that it should be updated with each release.

I think, before releasing, we should re-name this to something like Metadata Version (or StoreFile version?) to avoid any confusion in that area


mentioned in commit 1b4f614e7d99fee366a9bdbbc073ce1f134a9b2c

Commit: 1b4f614e7d99fee366a9bdbbc073ce1f134a9b2c 
Author: B Tasker                            
Date: 2024-03-03T10:57:08.000+00:00 


fix: prevent an unnecessary recrawl as a result of version introduction (utilities/file_location_listing#46)

+9 -2 (11 lines changed)

I've taken a slightly different approach to prevent a full recrawl.

I've added logic to the crawler rather than to the storage module:

    if last_mod != 'FFFFFF' and crawler_ver != ENGINE_VERSION:
        # This is a temporary check added when this logica was 
        # introduced (in '0.2.6'). It acts to prevent an unnecessary
        # recrawl of files that don't have the attribute.
        # TODO: remove this when the engine version next changes
        if ENGINE_VERSION != '0.2.6':
            print(f"\tIndexed by different engine version ({crawler_ver}) triggering recrawl")
            force_recrawl = True

If it were added to storage it'd likely need to be there forever, leaving a weird legacy of the default value being some old version.

Doing it this way means that the weirdness can safely be removed in a later release - if 0.2.7 ends up changing storage attributes then that full recrawl will be both necessary & desirable.


mentioned in commit 0d185c89aca015b423e589ae526f671c7ac0c783

Commit: 0d185c89aca015b423e589ae526f671c7ac0c783 
Author: B Tasker                            
Date: 2024-03-03T11:04:01.000+00:00 


chore: rename EngineVersion to StoreFileVersion (utilities/file_location_listing#46)

+17 -14 (31 lines changed)

In the latest commit, EngineVersion becomes StoreFileVersion fetched by calling config.getStoreFileVersion()

mentioned in issue #49

Re-opening - #49 introduces a change to storage headers, so it now makes sense to allow the new release to trigger a recrawl, need to take that conditional back out.


mentioned in commit 7cdde3dcd0fbe0b1143813fd0e01ab300a8c15c9

Commit: 7cdde3dcd0fbe0b1143813fd0e01ab300a8c15c9 
Author: B Tasker                            
Date: 2024-03-03T12:29:53.000+00:00 


chore: remove temporary conditional.

It's no longer required as storefile metadata changes are being made in this release

Revert "fix: prevent an unnecessary recrawl as a result of version introduction (utilities/file_location_listing#46)"

This reverts commit 1b4f614e7d99fee366a9bdbbc073ce1f134a9b2c.

+3 -12 (15 lines changed)

Change reverted. Re-closing.