Currently, we revalidate URLs using the Last-Modified header.
This means that, if a page hasn't changed, we won't re-fetch it.
However, that also means that already crawled pages do not benefit from any future enhancements in the crawler. If, for example, we decided to start recording whether an image had alt-text - we'd have to blow storage away and recrawl everything from scratch.
If we start storing crawler version when storing data, revalidations could check whether the stored item was created by an older version and re-fetch if it was (perhaps even being smart and only refetching if the newer version means there'll have been a change in the data?)
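As a rough illustration of the idea above, the revalidation decision could look something like the sketch below. Everything in it is an assumption made for illustration: the StoredItem type, the needs_refetch() helper and the CRAWLER_VERSION constant are hypothetical names, not taken from the codebase, and versions are assumed to be simple dotted integers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical: the version of the crawler currently running
CRAWLER_VERSION = "0.2.6"


@dataclass
class StoredItem:
    """Hypothetical shape of the metadata kept for a crawled page."""
    last_modified: Optional[str]  # Last-Modified value seen at crawl time
    version: str                  # crawler version that wrote the item


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '0.2.6' into (0, 2, 6) for comparison."""
    return tuple(int(part) for part in version.split("."))


def needs_refetch(stored: Optional[StoredItem],
                  remote_last_modified: Optional[str],
                  current_version: str = CRAWLER_VERSION) -> bool:
    """Decide whether a page should be fetched again."""
    if stored is None:
        # Never crawled before
        return True
    if parse_version(stored.version) < parse_version(current_version):
        # Written by an older crawler: re-fetch even if the page is unchanged
        return True
    if stored.last_modified is None or remote_last_modified is None:
        # Nothing to revalidate against, so play safe
        return True
    # Otherwise fall back to the existing Last-Modified comparison
    return stored.last_modified != remote_last_modified
```

The version check deliberately runs before the Last-Modified comparison, so an unchanged page is still re-fetched when the stored item predates the running crawler.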
Activity
05-Feb-24 14:59
assigned to @btasker
03-Mar-24 10:24
mentioned in commit cef32e6c1580ce09d248993d80045936f60f2e24
Message
feat: record the generating engine version in stored files (utilities/file_location_listing#46)
03-Mar-24 10:28
mentioned in commit 6193114f8190e27232b7227db9f40b3cb60ab7c4
Message
feat: storage.getStoredInfo() returns the version which generated the stored file (utilities/file_location_listing#46)
03-Mar-24 10:41
mentioned in commit b8fdbd7681a3c57d2fe79aa445be002f53c1d97b
Message
feat: refetch pages if the underlying engine version has changed (utilities/file_location_listing#46)
It's worth noting that, as it stands, deploying this will lead to the next crawl doing a full re-crawl.
We should probably look at making sure that that's not the case
03-Mar-24 10:48
The crawler will now look at the stored version when revalidating a page. If that's changed, it overrides the following checks
The first is overridden so that re-crawls can be done after a software update.
The second is overridden because the stored metadata needs to be updated (otherwise, the version check will fail on every crawl). We could perhaps have added the ability to edit files and update the metadata but that adds a load of complexity compared to simply re-fetching.
There is, however, a drawback to both.
Previous versions (obviously) didn't store a version number, so any pre-existing stored info will return a "version" of 0.0.0.
This means that, if I were to cut a release now, a full re-crawl would end up being triggered.
Within the use-case for this change, that's what we want - the whole idea is that we're signalling a change in stored attributes.
In this case, that's undesirable - other than adding the version number, we haven't actually made any changes to the stored attributes (and won't), so it's not worth the additional load.
To avoid that, I think we should hardcode the default version value to be the same as is currently returned by config.getEngineVersion() (0.2.6) - that way all existing storage files will, effectively, be grandfathered into this version.
03-Mar-24 10:52
The other thing I want to look at before closing this out, is naming.
At the moment, we refer to the newly added version as the Engine Version (getEngineVersion() etc).
I'm not sure that that's actually what we want though.
In the next release, we might (for example) release improvements to index navigation. Technically, the engine version will have changed, but there will have been no change to stored attributes.
In that scenario, we wouldn't want files to be re-fetched. But, the naming of the version string does imply that it should be updated with each release.
I think, before releasing, we should re-name this to something like Metadata Version (or StoreFile version?) to avoid any confusion in that area
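To make the distinction concrete, here is a minimal sketch of keeping the two version strings separate, so that only a change to stored attributes can trigger a re-fetch. The constant and function names are hypothetical, and the version values are illustrative only.

```python
# Hypothetical config values: illustrative only
ENGINE_VERSION = "0.2.7"      # bumped with every release
STORE_FILE_VERSION = "0.2.6"  # bumped only when stored attributes change


def get_engine_version() -> str:
    """Version of the software as a whole."""
    return ENGINE_VERSION


def get_store_file_version() -> str:
    """Version of the stored-file format / metadata."""
    return STORE_FILE_VERSION


def stored_file_is_stale(stored_version: str) -> bool:
    """A stored file is only stale if the store-file version has moved on.

    The engine version is intentionally not consulted, so a release that
    only improves (for example) index navigation causes no re-fetching.
    """
    return stored_version != get_store_file_version()
```

With this split, a release can bump the engine version freely while the store-file version stays put until the stored attributes genuinely change.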
03-Mar-24 10:57
mentioned in commit 1b4f614e7d99fee366a9bdbbc073ce1f134a9b2c
Message
fix: prevent an unnecessary recrawl as a result of version introduction (utilities/file_location_listing#46)
03-Mar-24 10:59
I've taken a slightly different approach to prevent a full recrawl.
I've added logic to the crawler rather than to the storage module:
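As a hedged illustration of what a crawler-side conditional of this kind could look like (this is a sketch, not the code from the commit; the function and constant names are hypothetical):

```python
# Version reported for stored files that pre-date version tracking
LEGACY_DEFAULT_VERSION = "0.0.0"
# The one release for which legacy files are treated as up to date
GRANDFATHERED_VERSION = "0.2.6"


def version_requires_refetch(stored_version: str, current_version: str) -> bool:
    """Return True if a version mismatch should force a re-fetch."""
    if stored_version == current_version:
        return False
    if (stored_version == LEGACY_DEFAULT_VERSION
            and current_version == GRANDFATHERED_VERSION):
        # Temporary special case: introducing the version number changed
        # nothing else in storage, so don't recrawl just for that
        return False
    return True
```

Because the special case only matches while the current version is 0.2.6, it stops having any effect as soon as the version moves on, and can then be deleted.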
If it were added to storage it'd likely need to be there forever, leaving a weird legacy of the default value being some old version.
Doing it this way means that the weirdness can safely be removed in a later release - if 0.2.7 ends up changing storage attributes then that full recrawl will be both necessary & desirable.
03-Mar-24 11:04
mentioned in commit 0d185c89aca015b423e589ae526f671c7ac0c783
Message
chore: rename EngineVersion to StoreFileVersion (utilities/file_location_listing#46)
03-Mar-24 11:04
In the latest commit, EngineVersion becomes StoreFileVersion, fetched by calling config.getStoreFileVersion()
03-Mar-24 11:59
mentioned in issue #49
03-Mar-24 12:27
Re-opening - #49 introduces a change to storage headers, so it now makes sense to allow the new release to trigger a recrawl. We need to take that conditional back out.
03-Mar-24 12:30
mentioned in commit 7cdde3dcd0fbe0b1143813fd0e01ab300a8c15c9
Message
chore: remove temporary conditional.
It's no longer required as storefile metadata changes are being made in this release
Revert "fix: prevent an unnecessary recrawl as a result of version introduction (utilities/file_location_listing#46)"
This reverts commit 1b4f614e7d99fee366a9bdbbc073ce1f134a9b2c.
03-Mar-24 12:31
Change reverted. Re-closing.