project Utilities / File Location Listing avatar

utilities/file_location_listing#21: Repeated reads of files



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.1
Created: 30-Dec-23 16:56



Description

I've not yet figured out why, but we seem to be making repeated reads of the same files - it showed up on my test system because I'd purged a couple of files from the store but not the index.

Closer examination, though, shows that we're loading each file (successful or not) about 3 times.

Things checked:

  • The threading stuff is definitely only firing the function once per thread
  • It's not simply loglines repeating for some reason - inserting a time.time() in there shows they're about a 10th of a second apart
  • There's no repetition of URLs in the index (I added a counter just to be sure)

Need to figure out why.



Toggle State Changes

Activity


assigned to @btasker

verified

mentioned in commit 81f76ed59ee275e5e5385fa8619e6ae561735b69

Commit: 81f76ed59ee275e5e5385fa8619e6ae561735b69 
Author: B Tasker                            
                            
Date: 2023-12-30T16:56:23.000+00:00 

Message

feat: implement simple file cache (utilities/file_location_listing#21)

+25 -2 (27 lines changed)

Just to prove that it is an issue and not some weird logging bug, commit 81f76ed59ee275e5e5385fa8619e6ae561735b69 hacks a simple storage layer cache so that we only read a given file from disk once.

It's halved query times

mentioned in issue #16

Well....

There's no repetition of URLs in the index (I added a counter just to be sure)

This, it seems, was not entirely correct.

I rewrote the index to contain a single URL and then added a print to searchIndexChunk()

def searchIndexChunk(idx, filters, matchtypes, idx_num, limit):
    ''' Run a search against a portion of the index

    '''
    results = []
    count = 0
    print(f"Triggered {idx_num}")
    print(idx)
    for e in idx:

The same URL does indeed appear multiple times:

[{'u': 'https://example.com/foo/bar.jpg', 'h': 'a877a608a58ffcef9fe70dcb0454e9b44bf7a104ae30cb829deb1e4120b8cc22', 't': 'image/jpeg', 'n': '', 'i': 0},
 {'u': 'https://example.com/foo/bar.jpg', 'h': 'a877a608a58ffcef9fe70dcb0454e9b44bf7a104ae30cb829deb1e4120b8cc22', 't': 'image/jpeg', 'n': '', 'i': 0},
 {'u': 'https://example.com/foo/bar.jpg', 'h': 'a877a608a58ffcef9fe70dcb0454e9b44bf7a104ae30cb829deb1e4120b8cc22', 't': 'image/jpeg', 'n': '', 'i': 0}]

It definitely only exists once in the on-disk index.

There must be some kind of issue when reading it in - for whatever reason we're adding each item into the compiled index 3 times...

When loadIndex completes, there's definitely only one copy in there

            idx_num += 1
            if idx_num >= CONCURRENCY:
                idx_num = 0

    LOADED_AT = time.time()
    print(INDEXES)

```python {'ALL': [], 'IMAGE': [{'u': 'https://example.com/foo/bar.jpg', 'h': 'a877a608a58ffcef9fe70dcb0454e9b44bf7a104ae30cb829deb1e4120b8cc22', 't': 'image/jpeg', 'n': '', 'i': 0}], 'DOC': []} ````

By the time we're setting up the futures, though, there are 3 copies in there.

I wondered whether, perhaps, we were calling loadIndex multiple times, but adding a print suggests it's only being called once (still, that's something we should handle).

Oh for fuck sake...

I don't know why the print wasn't showing up, but we're definitely calling storage.loadIndex() more than once.

If we stand the server up

  • 1st search: there are 2 copies in the index
  • 2nd search: there are 3 copies in the index
  • 3rd search: there are 4 copies in the index

etc

It's because of this conditional in the server

    if time.time() - storage.LOADED_AT > 3600:
        print("RELOAD INDEX")
        storage.loadIndex()

The reason that's failing is because index reads do not update the global - they update a local scope variable.

Clearly, this ticket is the punishment I get for using globals :'(

Of course, there's a secondary issue: when we reload the index, we want it to reload not to append an additional copy to itself

verified

mentioned in commit 01f712646e6b2fb77f13074b9948a508b1558e33

Commit: 01f712646e6b2fb77f13074b9948a508b1558e33 
Author: B Tasker                            
                            
Date: 2023-12-30T17:31:54.000+00:00 

Message

fix: correctly update the time of last index load (utilities/file_location_listing#21)

+5 -0 (5 lines changed)
verified

mentioned in commit b837b126393a19a2ba40ff6d632a790885108413

Commit: b837b126393a19a2ba40ff6d632a790885108413 
Author: B Tasker                            
                            
Date: 2023-12-30T17:38:06.000+00:00 

Message

fix: clear in-memory index before reloading (utilities/file_location_listing#21)

+11 -1 (12 lines changed)

mentioned in issue #22