Activity
30-Dec-23 18:12
assigned to @btasker
30-Dec-23 18:12
mentioned in issue #22
06-Jan-24 16:06
If we were to implement this, I think we'd probably want to take a fairly simple approach to invalidation:

- Cache loaded files (in redis, maybe?)
- If the index is re-read, flush the cache or set some kind of revalidation flag

A revalidation flag could work in much the same way as we check for index changes: has the reported `mtime` changed? If it hasn't, then the cached item can be considered revalidated (there's a rough sketch of this below).

The problem with the full-flush option is that, bearing in mind we're running in k8s, there might be multiple pods accessing the same shared cache. You don't really want pods blowing the cache away every time they come up (or load the index).
All that said, I think it'd probably be better to first look at whether we can speed file access up at all - caching only helps on subsequent accesses (which, if we're returning results well, should be quite rare).
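For concreteness, here's a minimal sketch of the revalidation-flag idea. The helper names are hypothetical and a plain dict stands in for the shared cache - the real thing would talk to redis (or whatever backend we settle on):

```python
import os
import time

_cache = {}  # path -> (cached_mtime, cached_at, content); dict stands in for redis

REVALIDATE_AFTER = 60  # seconds before we re-check the file's mtime on disk

def read_file_cached(path):
    now = time.time()
    entry = _cache.get(path)
    if entry is not None:
        cached_mtime, cached_at, content = entry
        if now - cached_at < REVALIDATE_AFTER:
            # Recently (re)validated: trust the cached copy outright
            return content
        if os.path.getmtime(path) == cached_mtime:
            # mtime hasn't changed, so the cached item can be
            # considered revalidated - just refresh the timestamp
            _cache[path] = (cached_mtime, now, content)
            return content
    # Miss (or the file changed on disk): read and repopulate the cache
    with open(path, "rb") as f:
        content = f.read()
    _cache[path] = (os.path.getmtime(path), now, content)
    return content
```

Swapping the dict for redis would make the cache shared across pods, at which point per-entry revalidation (rather than a full flush) means a pod coming up doesn't blow away entries its siblings are still using.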
06-Jan-24 20:03
Maybe I'm overthinking it though - we could start by just using `functools.lru_cache()` and seeing if that helps.
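As a rough illustration of that simpler route (the function name is a made-up stand-in, not the project's actual code):

```python
# Hypothetical sketch: functools.lru_cache() as a zero-effort cache
import functools

@functools.lru_cache(maxsize=1024)
def read_file(path):
    # Cached per-process, so each k8s pod holds its own copy; entries
    # are never revalidated, so stale reads are possible until eviction
    with open(path, "rb") as f:
        return f.read()
```

`read_file.cache_clear()` would then stand in for the full-flush option whenever the index is re-read.

07-Jan-24 15:59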
mentioned in commit 207673c034ebc8f2a75ae93dd0a803facaf4a810
Message
feat: cache files read from disk (utilities/file_location_listing#23)
07-Jan-24 16:32
Even with my (currently) small test database, this makes a significant difference: the first query takes 358ms whilst the second takes 7ms, even though the second search is run with a shortened term (and so matches more candidates).
I've also added caching to `processSearchTerm()` - the search portal submits a second search (to get related images) - the terms processing for that will be exactly the same, so there's no point wasting CPU time recomputing filters from it.
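In the same spirit, memoising the term processing could look something like this (the function body here is a made-up stand-in; the real `processSearchTerm()` does more than this):

```python
# Hypothetical sketch of memoising term processing
import functools

@functools.lru_cache(maxsize=256)
def process_search_term(term):
    # Computed once per unique term; the portal's follow-up
    # "related images" search then reuses the cached result
    return tuple(term.lower().split())
```

Returning an immutable tuple keeps the cached value safe to share between the original search and the follow-up one.

07-Jan-24 16:46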
It seems OK so far, so I'm going to close this issue out (so we can do a release) and treat anything that follows as a bug.