After running a full crawl, the index is currently 7.1MB with 31,000 entries in it.
Searches are responsive, but not startlingly fast - although it's certainly usable, I think it's worth looking at the performance characteristics to see where time is spent (and what improvements can be made).
Activity
29-Dec-23 23:56
assigned to @btasker
29-Dec-23 23:56
mentioned in commit fe86162861b4d27a538e8afdfa78cda6e91f7122
Commit: fe86162861b4d27a538e8afdfa78cda6e91f7122
Author: B Tasker
Date: 2023-12-29T23:56:32.000+00:00
Message:
feat: print timing information when handling searches (utilities/file_location_listing#16)
30-Dec-23 00:03
The commit above captures nanosecond level timings during the index and file scan process.
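For illustration, the shape of that instrumentation in Python might look something like this (a minimal sketch - `scanIndex` and `loadResults` are hypothetical stand-ins, not the project's actual function names):

```python
import time

# Hypothetical stand-ins for the two phases - the real functions differ
def scanIndex(term):
    return []

def loadResults(candidates):
    return []

def doSearch(term):
    # Capture nanosecond-level timings for each phase of the search
    start = time.perf_counter_ns()
    candidates = scanIndex(term)
    index_scan_ns = time.perf_counter_ns() - start

    load_start = time.perf_counter_ns()
    results = loadResults(candidates)
    disk_load_ns = time.perf_counter_ns() - load_start

    print(f"index scan: {index_scan_ns}ns, disk load: {disk_load_ns}ns")
    return results
```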
Searching `a` in my local DB (3.1MB index) returns 9058 results and gives the following timings:

Not overly surprisingly, disk loads are a significant proportion of time spent. As a relatively easy win, it may be worth looking at adding a layer of file caching so that subsequent searches reading a similar subset of files do not have to load them all from disk again.
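As a rough sketch of what that caching might look like (the helper name and cache size here are assumptions, not the project's actual code), Python's `functools.lru_cache` can memoise reads keyed on filename:

```python
from functools import lru_cache

@lru_cache(maxsize=2048)  # cache size is an arbitrary choice for illustration
def readStorageFile(fname):
    # Hypothetical helper: repeat reads of the same storage file are served
    # from memory rather than hitting the disk (or NFS mount) again.
    # Entries would need invalidating if a re-crawl rewrites files.
    with open(fname, "rb") as f:
        return f.read()
```

The trade-off is memory: with ~31,000 entries in the index, an unbounded cache could grow significantly, so the cap matters.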
30-Dec-23 00:07
Although it's not going to contribute much, we are hashing the URL in the course of running a search - we'll do that for every URL that passes the index-based checks.

`sha256` is fast, but over the course of lots of URLs, that adds up. It might be better to store the hash in the index in place of the path (which isn't currently being used). The path can be calculated from the hash (in fact, it already is), so we're not gaining anything by having that in there.
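As a sketch of the idea (the directory layout is an assumption for illustration, not necessarily the project's actual scheme): the on-disk path is a pure function of the URL's hash, so storing the hash in the index means candidate URLs never need re-hashing at search time:

```python
import hashlib

def urlToPath(url):
    # Hash once, at index-build time, and store the result in the index
    h = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Derive the storage location from the hash; the two-level directory
    # split is illustrative - the real layout may differ
    return f"{h[:2]}/{h[2:4]}/{h}"
```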
30-Dec-23 00:11
mentioned in commit 7ff5d0233f06942d29217c7723b4559c0d8a9a63
Commit: 7ff5d0233f06942d29217c7723b4559c0d8a9a63
Author: B Tasker
Date: 2023-12-30T00:10:38.000+00:00
Message:
fix: switch index to storing fname and use that when loading during search (utilities/file_location_listing#16)
30-Dec-23 00:13
The index has been switched to storing the `fname`, and search now passes that into `getStoredItem()`. Running the same search (after the index has been rebuilt):
There's obviously some variation, but we are generally faster.
30-Dec-23 00:27
Although it'd mean re-crawling, the other thing to consider would be gzipping the storage files.
That way, physical storage would need to send fewer bytes (particularly important when you consider the storage is being exposed via NFS) at the cost of higher CPU usage locally.
It's quite easily achieved with the `gzip` module.
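Something along these lines (a sketch; the helper names are made up and the project's real read/write functions will differ):

```python
import gzip

def writeStoredItem(fname, data: bytes):
    # Compress on write: fewer bytes to push across the NFS mount,
    # at the cost of some local CPU time
    with gzip.open(fname, "wb") as f:
        f.write(data)

def readStoredItem(fname) -> bytes:
    # Decompress transparently on read
    with gzip.open(fname, "rb") as f:
        return f.read()
```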
30-Dec-23 00:50
mentioned in commit bd70a7e70f0a55f18fa01921106a3a25b93f62cc
Commit: bd70a7e70f0a55f18fa01921106a3a25b93f62cc
Author: B Tasker
Date: 2023-12-30T00:49:48.000+00:00
Message:
feat: implement use of gzip for storage (utilities/file_location_listing#16)
There's currently support for both this and plaintext, leading to some awful code. That's intended as a temporary situation whilst I test performance.
30-Dec-23 12:26
Using gzip seems to work happily enough, so I'm going to strip out the dual mode stuff.
30-Dec-23 12:49
mentioned in commit 444e54a279bb32bd32df21a2fe28ca182456b969
Commit: 444e54a279bb32bd32df21a2fe28ca182456b969
Author: B Tasker
Date: 2023-12-30T12:35:36.000+00:00
Message:
chore: move storage over to gzip only (utilities/file_location_listing#16)
A recrawl will be required after this change - any existing plaintext files will be moved out of the way
30-Dec-23 12:53
I've switched the test deployment over and purged the old storage - re-crawl is running at the moment.
30-Dec-23 15:02
Although it's reasonably fast, I don't think there's a way to simply restructure the index to speed up index scans.
If we were looking up exact matches it'd probably be possible to splinter the index so that you only have to search a fragment. But, we're checking whether a substring exists within the indexed key, so we need to iterate through every key in the index.
Of course, that is something that could be spun out to threads in order to parallelise the workload. That'd also mean that we're loading files in parallel, which could reduce the response time quite a bit.
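A sketch of how that might look (the index shape, helper names, and chunking strategy are all assumptions), splitting the index keys across a small thread pool so each worker does its own substring checks and file loads:

```python
import os
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = int(os.environ.get("NUM_THREADS", "4"))

def loadItem(fname):
    # Stand-in for the real loader (getStoredItem)
    with open(fname, "rb") as f:
        return f.read()

def searchChunk(term, chunk):
    # Each worker iterates its own slice of the index: substring-check
    # the key, then load the backing file for any match
    return [loadItem(fname) for key, fname in chunk if term in key]

def parallelSearch(term, index):
    # index: iterable of (key, fname) pairs - the shape is an assumption
    items = list(index)
    size = max(1, len(items) // NUM_THREADS)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]

    results = []
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        for partial in pool.map(lambda c: searchChunk(term, c), chunks):
            results.extend(partial)
    return results
```

Under CPython's GIL the substring checks won't truly run concurrently, but file reads release the GIL, so overlapping the disk I/O is where most of the win should come from.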
30-Dec-23 15:25
The post-gzip recrawl is coming to an end (index is being built atm), so it should be possible to do some searches before deploying the multi-threading changes.
30-Dec-23 15:34
Index stats
Search term: `b`
Note: it failed in the browser - the gateway hit its timeout.

Search term: `concrete`
Note: 5 results

Search term: `win`
Note: 150 results
30-Dec-23 15:41
Running the new image, with `NUM_THREADS` at `4`.

It's no longer possible to break time down between index scan and disk read, so the system simply logs the total run time.
Search term: `b`

Search term: `concrete`
5 results

Search term: `win`
150 results
30-Dec-23 15:43
I think I'd describe that as wildly successful then.
However, there probably are still some improvements that can be made:
- `LIMIT` to cap the number of results that can be returned (utilities/file_location_listing#20)

30-Dec-23 15:50
mentioned in issue #20
30-Dec-23 16:11
mentioned in commit 1a8c4f5a69e120c6f07f503744e7797aca02e7bf
Commit: 1a8c4f5a69e120c6f07f503744e7797aca02e7bf
Author: B Tasker
Date: 2023-12-30T15:01:54.000+00:00
Message:
feat: use multiple threads to process searches (utilities/file_location_listing#16)
30-Dec-23 16:59
There's something odd going on with reads (utilities/file_location_listing#21) - we seem to be reading files from disk more than once per search
30-Dec-23 17:39
Need to step away for a bit, but it'll be interesting to see what impact fixing #21 will have had on response times.
30-Dec-23 17:43
Results are still uncapped:

Search term: `b`

Search term: `concrete`
5 results

Search term: `win`
150 results
Just for reference, searching `b` with results capped at 300 took 2 seconds the first time (and 0.003911246s the second - file caching is still currently in place).

30-Dec-23 18:48
I'm going to close this issue down - I'm about ready to tag and build a release, so it doesn't make sense to leave this hanging open.
03-Mar-24 11:21
mentioned in issue #49