project Utilities / File Location Listing avatar

utilities/file_location_listing#43: Index Handling Improvements



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.6
Created: 21-Jan-24 10:33



Description

At the moment, (for each index type) we write values out to a temporary index file then swap it into place. It's then read in and generally assumed to be correct.

This leaves us with a few issues

  • The temporary file is simply called index.tmp - it's not safe for two processes to try and build an index at the same time
  • The read in file isn't verified at all, so we can't be sure that it hasn't become corrupted on disk etc
  • Re-reads of the index flush the storage cache - if nothing's changed, this does nothing but harm cache performance

There are, undoubtedly, other issues lurking in there too. This issue is being raised to find and fix those as well as the above (obviously, any showstoppers should get their own issues)



Toggle State Changes

Activity


assigned to @btasker

Re-reads of the index flush the storage cache - if nothing's changed, this does nothing but harm cache performance

This one was actually partially addressed yesterday in d2335032533a89448f008ea32ce8b6a0beb0d172

feat: don't replace index unless reindex has actually led to changes
Storage now generates a checksum for each index file and will only swap in the newly generated index if it includes changes.
This means that the search front-end won't need to reload indexes unless something actually changed

The only issue is, the commit has a bug in it:

if hashIndex(f"{DB_PATH}/{index_type}") != hashIndex(f"{DB_PATH}/{index_type}"):

It's comparing the index to itself (rather than the temporary file) so will never update the index. Fixing that in a moment.

The read in file isn't verified at all, so we can't be sure that it hasn't become corrupted on disk etc

I think what I'd like to do, is to add a "header" line to the index - a single line containing metadata about the index.

Amongst those could be a checksum (it probably doesn't need to be much more than a CRC32) of the rest of the data, perhaps along with a line count.

That way, after loading, we can check whether we've loaded the data that the index claims to contain.

verified

mentioned in commit 9dc24de2b6ec22279a2d0ea8ab4d90f3a8c7fab4

Commit: 9dc24de2b6ec22279a2d0ea8ab4d90f3a8c7fab4 
Author: B Tasker                            
                            
Date: 2024-01-21T10:45:05.000+00:00 

Message

fix: compare index to temporary file, not to itself (see utilities/file_location_listing#43)

+1 -1 (2 lines changed)
verified

mentioned in commit 4bb054be80e97c32b2cbf5754d5fc2c5cca76cf7

Commit: 4bb054be80e97c32b2cbf5754d5fc2c5cca76cf7 
Author: B Tasker                            
                            
Date: 2024-01-21T11:23:52.000+00:00 

Message

feat: Read hash from index file header (utilities/file_location_listing#43)

+20 -3 (23 lines changed)
verified

mentioned in commit e51765f06f2e7c1ca08d8969ad94f71c4cd6b3a6

Commit: e51765f06f2e7c1ca08d8969ad94f71c4cd6b3a6 
Author: B Tasker                            
                            
Date: 2024-01-21T11:31:51.000+00:00 

Message

feat: add header to tags index (utilities/file_location_listing#43)

+9 -3 (12 lines changed)
verified

mentioned in commit 43323a2bc057d0d236f4c311d81c7c4cc7ae9ac6

Commit: 43323a2bc057d0d236f4c311d81c7c4cc7ae9ac6 
Author: B Tasker                            
                            
Date: 2024-01-21T12:01:41.000+00:00 

Message

feat: Read index headers at load and compare checksums (utilities/file_location_listing#43)

+81 -5 (86 lines changed)

The most recent commit adds stats on checksum mismatches:

    "warnings": {
      "index_checksum_mismatches": 1,
      "tags_checksum_mismatches": 0
    }

We obviously want that to always be 0. We can get a failure rate by dividing by index.loads

OK, obviously needs testing before we can think about cutting a release, but I think we're probably ready to merge that branch.

mentioned in merge request !4

mentioned in commit 05c9703b15bfe4c5440baa2bca74933b1f3e5d8b

Commit: 05c9703b15bfe4c5440baa2bca74933b1f3e5d8b 
Author: Ben Tasker                            
                            
Date: 2024-01-21T12:16:47.000+00:00 

Message

feat: Introduce index headers

+169 -16 (185 lines changed)

The temporary file is simply called index.tmp - it's not safe for two processes to try and build an index at the same time

There are two ways to go at this (though not strictly mutually exclusive):

  1. Generate a unique name for the temporary file
  2. Use locks

Option 1 is much simpler to implement but means we can still get unexpected behaviour - if an index build runs slow (for whatever reason) and another fast run starts, we might ultimately end up overwriting a new up-to-date index with an out of date one (when the slow run completes).

Option 2 prevents that, because the second run simply wouldn't start. It's the same outcome but with less resource wastage but does leave open the possibility of deadlock (we could perhaps expose a stat to show the presence of locks, allowing alerting to be implemented).

I think, for now, making sure the filename is unique will be more than sufficient.

I don't want to over-engineer this either. Whilst we'd give a strong guarantee of uniqueness by importing something like uuid (or shortuuid) adding that to storage would mean the server importing those at runtime for no good reason.

Whilst we could inline the import, for our purposes, it's probably actually enough to call time.time_ns() and use that as a filename suffix.

verified

mentioned in commit 7c847cc75720decf264862c411a03067c0399866

Commit: 7c847cc75720decf264862c411a03067c0399866 
Author: B Tasker                            
                            
Date: 2024-01-21T14:05:56.000+00:00 

Message

fix: apply a unique(ish) suffix to temp files (utilities/file_location_listing#43)

+8 -5 (13 lines changed)

Talking of locking though, there is potential for a race with the tombstone:

  1. indexStorage() runs
  2. new index is compiled
  3. A delete is triggered for an entry that's already been examined, removing file and updating tombstone
  4. new index is written out
  5. tombstone is deleted

When the new index is loaded, it'll include an item that no longer exists in storage. Whilst storage should handle this gracefully, it's still less than ideal.

There's a bit of a catch-22 here though:

  • If we remove the tombstone first, searches run whilst reindexing will lose the benefit of the tombstone (at least, if the index reloads in the meantime)
  • If we leave it until afterwards, we run into the issue above

What we probably need to do is

  1. take a copy of the tombstone
  2. do the indexing operation
  3. Check whether the main tombstone has changed on disk
  4. if not, remove it and the copy
  5. if so, identify changes and write only those out (or, maybe take the easy route and don't remove the tombstone - it'll likely get rationalised away next reindex)

Either way, I think this probably deserves it's own issue - raised as #44