I've not yet figured out why, but we seem to be making repeated reads of the same files - it showed up on my test system because I'd purged a couple of files from the store but not the index.
Closer examination, though, shows that we're loading each file (successful or not) about 3 times.
Things checked:
`time.time()` calls in there show the reads are about a 10th of a second apart.

Need to figure out why.
Activity
30-Dec-23 16:56
assigned to @btasker
30-Dec-23 16:59
mentioned in commit 81f76ed59ee275e5e5385fa8619e6ae561735b69
Message
feat: implement simple file cache (utilities/file_location_listing#21)
30-Dec-23 16:59
Just to prove that it is an issue and not some weird logging bug, commit 81f76ed59ee275e5e5385fa8619e6ae561735b69 hacks a simple storage layer cache so that we only read a given file from disk once.
It's halved query times
30-Dec-23 16:59
mentioned in issue #16
30-Dec-23 17:08
Well....
This, it seems, was not entirely correct.
I rewrote the index to contain a single URL and then added a print to `searchIndexChunk()`.
The same URL does indeed appear multiple times:
It definitely only exists once in the on-disk index.
There must be some kind of issue when reading it in - for whatever reason we're adding each item into the compiled index 3 times...
30-Dec-23 17:11
When `loadIndex` completes, there's definitely only one copy in there:

```python
{'ALL': [], 'IMAGE': [{'u': 'https://example.com/foo/bar.jpg', 'h': 'a877a608a58ffcef9fe70dcb0454e9b44bf7a104ae30cb829deb1e4120b8cc22', 't': 'image/jpeg', 'n': '', 'i': 0}], 'DOC': []}
```
30-Dec-23 17:16
By the time we're setting up the futures, though, there are 3 copies in there.
I wondered whether, perhaps, we were calling `loadIndex` multiple times, but adding a print suggests it's only being called once (still, that's something we should handle).
30-Dec-23 17:30
Oh for fuck sake...
I don't know why the print wasn't showing up, but we're definitely calling `storage.loadIndex()` more than once.
If we stand the server up
etc
It's because of this conditional in the server
The reason that's failing is because index reads do not update the global - they update a local scope variable.
Clearly, this ticket is the punishment I get for using globals :'(
30-Dec-23 17:32
Of course, there's a secondary issue: when we reload the index, we want it to reload, not to append an additional copy to itself.
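
A rough sketch of the reload behaviour we want, assuming the in-memory index is a dict of lists shaped like the `loadIndex` output earlier in this ticket. The `INDEX` global and the `load_index(entries)` signature are hypothetical, used only to show the clear-then-fill pattern.

```python
# Hypothetical in-memory index, keyed by category
INDEX = {"ALL": [], "IMAGE": [], "DOC": []}


def load_index(entries):
    """Rebuild INDEX from `entries`, an iterable of (category, item) pairs."""
    # Empty each list in place first, so a reload replaces rather than appends
    for items in INDEX.values():
        items.clear()
    for category, item in entries:
        INDEX[category].append(item)
```

Clearing in place (rather than rebinding `INDEX`) also sidesteps the scoping problem above, because no assignment to the global name is needed.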
30-Dec-23 17:32
mentioned in commit 01f712646e6b2fb77f13074b9948a508b1558e33
Message
fix: correctly update the time of last index load (utilities/file_location_listing#21)
30-Dec-23 17:38
mentioned in commit b837b126393a19a2ba40ff6d632a790885108413
Message
fix: clear in-memory index before reloading (utilities/file_location_listing#21)
30-Dec-23 18:02
mentioned in issue #22