#43 Index Handling Improvements : utilities/file_location

btasker Permalink
21-Jan-24 10:33

assigned to @btasker

btasker Permalink
21-Jan-24 10:34

Re-reads of the index flush the storage cache - if nothing's changed, this does nothing but harm cache performance

This one was actually partially addressed yesterday in d2335032533a89448f008ea32ce8b6a0beb0d172

feat: don't replace index unless reindex has actually led to changes
Storage now generates a checksum for each index file and will only swap in the newly generated index if it includes changes.
This means that the search front-end won't need to reload indexes unless something actually changed

The only issue is, the commit has a bug in it:

if hashIndex(f"{DB_PATH}/{index_type}") != hashIndex(f"{DB_PATH}/{index_type}"):

It's comparing the index to itself (rather than the temporary file) so will never update the index. Fixing that in a moment.

btasker Permalink
21-Jan-24 10:36

The read in file isn't verified at all, so we can't be sure that it hasn't become corrupted on disk etc

I think what I'd like to do, is to add a "header" line to the index - a single line containing metadata about the index.

Amongst those could be a checksum (it probably doesn't need to be much more than a CRC32) of the rest of the data, perhaps along with a line count.

That way, after loading, we can check whether we've loaded the data that the index claims to contain.

btasker Permalink
21-Jan-24 10:45

verified

mentioned in commit 9dc24de2b6ec22279a2d0ea8ab4d90f3a8c7fab4

Commit: 9dc24de2b6ec22279a2d0ea8ab4d90f3a8c7fab4 
Author: B Tasker                            
                            
Date: 2024-01-21T10:45:05.000+00:00

Message

fix: compare index to temporary file, not to itself (see utilities/file_location_listing#43)

+1 -1 (2 lines changed)

btasker Permalink
21-Jan-24 11:25

verified

mentioned in commit 4bb054be80e97c32b2cbf5754d5fc2c5cca76cf7

Commit: 4bb054be80e97c32b2cbf5754d5fc2c5cca76cf7 
Author: B Tasker                            
                            
Date: 2024-01-21T11:23:52.000+00:00

Message

feat: Read hash from index file header (utilities/file_location_listing#43)

+20 -3 (23 lines changed)

btasker Permalink
21-Jan-24 11:32

verified

mentioned in commit e51765f06f2e7c1ca08d8969ad94f71c4cd6b3a6

Commit: e51765f06f2e7c1ca08d8969ad94f71c4cd6b3a6 
Author: B Tasker                            
                            
Date: 2024-01-21T11:31:51.000+00:00

Message

feat: add header to tags index (utilities/file_location_listing#43)

+9 -3 (12 lines changed)

btasker Permalink
21-Jan-24 12:06

verified

mentioned in commit 43323a2bc057d0d236f4c311d81c7c4cc7ae9ac6

Commit: 43323a2bc057d0d236f4c311d81c7c4cc7ae9ac6 
Author: B Tasker                            
                            
Date: 2024-01-21T12:01:41.000+00:00

Message

feat: Read index headers at load and compare checksums (utilities/file_location_listing#43)

+81 -5 (86 lines changed)

btasker Permalink
21-Jan-24 12:07

The most recent commit adds stats on checksum mismatches:

    "warnings": {
      "index_checksum_mismatches": 1,
      "tags_checksum_mismatches": 0
    }

We obviously want that to always be 0. We can get a failure rate by dividing by index.loads

btasker Permalink
21-Jan-24 12:13

OK, obviously needs testing before we can think about cutting a release, but I think we're probably ready to merge that branch.

btasker Permalink
21-Jan-24 12:15

mentioned in merge request !4

btasker Permalink
21-Jan-24 12:16

mentioned in commit 05c9703b15bfe4c5440baa2bca74933b1f3e5d8b

Commit: 05c9703b15bfe4c5440baa2bca74933b1f3e5d8b 
Author: Ben Tasker                            
                            
Date: 2024-01-21T12:16:47.000+00:00

Message

feat: Introduce index headers

feat: insert a header line when generating indexes
fix: Ensure that reindexing doesn't throw an exception when older indexes exist
feat: add header to tags index ([utilities/file_location_listing#43](/issue/utilities/file_location_listing/43.html))
feat: prefix header with fixed value When reading, this allows us to test the first character of the first line to see if it's a header (ensuring we don't bork if loading an older index)
feat: Read index headers at load and compare checksums ([utilities/file_location_listing#43](/issue/utilities/file_location_listing/43.html))
feat: add count of index checksum mismatches to stats
chore: remove defunct function hashIndex()

+169 -16 (185 lines changed)

btasker Permalink
21-Jan-24 12:21

The temporary file is simply called index.tmp - it's not safe for two processes to try and build an index at the same time

There are two ways to go at this (though not strictly mutually exclusive):

Generate a unique name for the temporary file
Use locks

Option 1 is much simpler to implement but means we can still get unexpected behaviour - if an index build runs slow (for whatever reason) and another fast run starts, we might ultimately end up overwriting a new up-to-date index with an out of date one (when the slow run completes).

Option 2 prevents that, because the second run simply wouldn't start. It's the same outcome but with less resource wastage but does leave open the possibility of deadlock (we could perhaps expose a stat to show the presence of locks, allowing alerting to be implemented).

btasker Permalink
21-Jan-24 14:04

I think, for now, making sure the filename is unique will be more than sufficient.

I don't want to over-engineer this either. Whilst we'd give a strong guarantee of uniqueness by importing something like uuid (or shortuuid) adding that to storage would mean the server importing those at runtime for no good reason.

Whilst we could inline the import, for our purposes, it's probably actually enough to call time.time_ns() and use that as a filename suffix.

btasker Permalink
21-Jan-24 14:06

verified

mentioned in commit 7c847cc75720decf264862c411a03067c0399866

Commit: 7c847cc75720decf264862c411a03067c0399866 
Author: B Tasker                            
                            
Date: 2024-01-21T14:05:56.000+00:00

Message

fix: apply a unique(ish) suffix to temp files (utilities/file_location_listing#43)

+8 -5 (13 lines changed)

btasker Permalink
21-Jan-24 14:14

Talking of locking though, there is potential for a race with the tombstone:

indexStorage() runs
new index is compiled
A delete is triggered for an entry that's already been examined, removing file and updating tombstone
new index is written out
tombstone is deleted

When the new index is loaded, it'll include an item that no longer exists in storage. Whilst storage should handle this gracefully, it's still less than ideal.

There's a bit of a catch-22 here though:

If we remove the tombstone first, searches run whilst reindexing will lose the benefit of the tombstone (at least, if the index reloads in the meantime)
If we leave it until afterwards, we run into the issue above

What we probably need to do is

take a copy of the tombstone
do the indexing operation
Check whether the main tombstone has changed on disk
if not, remove it and the copy
if so, identify changes and write only those out (or, maybe take the easy route and don't remove the tombstone - it'll likely get rationalised away next reindex)

Either way, I think this probably deserves it's own issue - raised as #44

utilities/file_location_listing#43: Index Handling Improvements

Issue Information

Activity