After running a full crawl, the index is currently 7.1MB with 31,000 entries in it.
Searches are responsive, but not exactly startlingly fast - although it's certainly usable, I think it'd be worth looking at the performance characteristics to see where time is spent (and what improvements can be made).
Activity
29-Dec-23 23:56
assigned to @btasker
29-Dec-23 23:56
mentioned in commit fe86162861b4d27a538e8afdfa78cda6e91f7122
Message
feat: print timing information when handling searches (utilities/file_location_listing#16)
30-Dec-23 00:03
The commit above captures nanosecond-level timings during the index and file scan process.
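For reference, capturing those numbers needs little more than time.monotonic_ns() around the two phases. A minimal sketch of the idea (the function name, index layout and load_item helper are illustrative, not taken from the actual commit):

```python
import time

def timed_search(term, index_entries, load_item):
    # Illustrative only: split a search into its two phases and time each
    # at nanosecond resolution. load_item stands in for whatever reads a
    # stored file back off disk; the index layout is a guess.
    start = time.monotonic_ns()
    candidates = [entry for entry in index_entries if term in entry["key"]]
    index_scan_ns = time.monotonic_ns() - start

    start = time.monotonic_ns()
    results = [load_item(entry["path"]) for entry in candidates]
    disk_load_ns = time.monotonic_ns() - start

    print(f"index scan: {index_scan_ns} ns, disk loads: {disk_load_ns} ns")
    return results
```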
Searching "a" in my local DB (3.1MB index) returns 9058 results and gives the following timings:

Not overly surprisingly, disk loads are a significant proportion of time spent. As a relatively easy win, it may be worth looking at adding a layer of file caching so that subsequent searches reading a similar subset of files do not have to load them all from disk again.
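A minimal sketch of what that caching layer could look like, using functools.lru_cache (the wrapper name and cache size are assumptions rather than anything in the codebase):

```python
import functools

@functools.lru_cache(maxsize=4096)
def load_stored_item_cached(fname):
    # Hypothetical wrapper around the real loader: the first search to
    # touch a file pays the disk/NFS cost, later searches get the copy
    # back from memory. Entries would need invalidating after a re-crawl
    # (lru_cache's cache_clear() is enough for that).
    with open(fname, "r", encoding="utf-8") as f:
        return f.read()
```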
30-Dec-23 00:07
Although it's not going to contribute much, we are hashing the URL in the course of running a search - we'll do that for every URL that passes the index-based checks.
sha256 is fast, but over the course of lots of URLs, that adds up. It might be better to store the hash in the index in place of the path (which isn't currently being used). The path can be calculated from the hash (in fact, it already is), so we're not gaining anything by having that in there.
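A rough sketch of that change - recording the hash-derived filename at index build time so the search path never has to call sha256 itself (the naming scheme below is an assumption about how storage paths are derived):

```python
import hashlib

def fname_for_url(url):
    # Assumption: storage filenames are derived from a sha256 of the URL;
    # the project's real scheme may differ in detail.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def build_index_entry(url):
    # Record the filename alongside the searchable key at index build time,
    # so a search can go straight from an index hit to a disk read without
    # re-hashing every matching URL.
    return {"key": url, "fname": fname_for_url(url)}
```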
30-Dec-23 00:11
mentioned in commit 7ff5d0233f06942d29217c7723b4559c0d8a9a63
Message
fix: switch index to storing fname and use that when loading during search (utilities/file_location_listing#16)
30-Dec-23 00:13
The index has been switched to storing the fname and search now passes that into getStoredItem().
Running the same search (after the index has been rebuilt):
There's obviously some variation, but we are generally faster.
30-Dec-23 00:27
Although it'd mean re-crawling, the other thing to consider would be gzipping the storage files.
That way, physical storage would need to send fewer bytes (particularly important when you consider the storage is being exposed via NFS) at the cost of higher CPU usage locally.
It's quite easily achieved with the gzip module.
30-Dec-23 00:50
mentioned in commit bd70a7e70f0a55f18fa01921106a3a25b93f62cc
Message
feat: implement use of gzip for storage (utilities/file_location_listing#16)
There's currently support for both this and plaintext, leading to some awful code. That's intended as a temporary situation whilst I test performance
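For reference, reading and writing gzipped storage files really is only a few lines with the standard library (the helper names and JSON payload below are assumptions, not the project's actual functions):

```python
import gzip
import json

def write_stored_item(path, item):
    # Text-mode gzip handles both the encoding and the compression;
    # compresslevel could be lowered if the CPU cost becomes noticeable.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(item, f)

def read_stored_item(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```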
30-Dec-23 12:26
Using gzip seems to work happily enough, so I'm going to strip out the dual mode stuff.
30-Dec-23 12:49
mentioned in commit 444e54a279bb32bd32df21a2fe28ca182456b969
Message
chore: move storage over to gzip only (utilities/file_location_listing#16)
A recrawl will be required after this change - any existing plaintext files will be moved out of the way
30-Dec-23 12:53
I've switched the test deployment over and purged the old storage - re-crawl is running at the moment.
30-Dec-23 15:02
Although it's reasonably fast, I don't think there's a way to simply restructure the index to speed up index scans.
If we were looking up exact matches it'd probably be possible to splinter the index so that you only have to search a fragment. But, we're checking whether a substring exists within the indexed key, so we need to iterate through every key in the index.
Of course, that is something that could be spun out to threads in order to parallelise the workload. That'd also mean that we're loading files in parallel, which could reduce the response time quite a bit.
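A rough sketch of that parallelisation with concurrent.futures, assuming the index can be treated as a list of (key, fname) pairs and that matching is a plain substring check (both assumptions on my part):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_search(index_entries, term, load_item, num_threads=4):
    # index_entries: list of (key, fname) pairs; load_item: callable that
    # reads one stored file from disk. Both are stand-ins rather than the
    # project's real signatures.
    def scan_chunk(chunk):
        # Substring-match the keys, then load the matches. The loads are
        # I/O bound (NFS-backed storage), so threads overlap well here.
        return [load_item(fname) for key, fname in chunk if term in key]

    chunk_size = max(1, len(index_entries) // num_threads)
    chunks = [index_entries[i:i + chunk_size]
              for i in range(0, len(index_entries), chunk_size)]

    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for hits in pool.map(scan_chunk, chunks):
            results.extend(hits)
    return results
```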
30-Dec-23 15:25
The post-gzip recrawl is coming to an end (index is being built atm), so it should be possible to do some searches before deploying the multi-threading changes.
30-Dec-23 15:34
Index stats
Search term: b (note: it failed in the browser - the gateway hit its timeout)
Search term: concrete (note: 5 results)
Search term: win (note: 150 results)
30-Dec-23 15:41
Running the new image, with NUM_THREADS at 4.
It's no longer possible to break the time down between the index scan and the disk reads, so the system simply logs the total run time.
Search term: b
Search term: concrete (5 results)
Search term: win (150 results)
30-Dec-23 15:43
I think I'd describe that as wildly successful then.
However, there probably are still some improvements that can be made:
A LIMIT to cap the number of results that can be returned (utilities/file_location_listing#20)
30-Dec-23 15:50
mentioned in issue #20
30-Dec-23 16:11
mentioned in commit 1a8c4f5a69e120c6f07f503744e7797aca02e7bf
Message
feat: use multiple threads to process searches (utilities/file_location_listing#16)
30-Dec-23 16:59
There's something odd going on with reads (utilities/file_location_listing#21) - we seem to be reading files from disk more than once per search.
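Whatever the underlying cause turns out to be, one way to guard against duplicate reads is to memoise loads for the lifetime of a single search - a defensive sketch only, not necessarily how #21 will actually be fixed:

```python
def make_search_loader(load_item):
    # Wrap the real loader so that, within one search, each file is read
    # from disk at most once. load_item is a stand-in for the project's
    # actual read function.
    seen = {}

    def load_once(fname):
        if fname not in seen:
            seen[fname] = load_item(fname)
        return seen[fname]

    return load_once
```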
30-Dec-23 17:39
Need to step away for a bit, but it'll be interesting to see what impact fixing #21 will have had on response times.
30-Dec-23 17:43
Results below are still uncapped:
Search term: b
Search term: concrete (5 results)
Search term: win (150 results)
Just for reference, searching "b" with results capped at 300 took 2 seconds the first time (and 0.003911246s the second - file caching is still currently in place).
30-Dec-23 18:48
I'm going to close this issue down - I'm about ready to tag and build a release, so it doesn't make sense to leave this hanging open.
03-Mar-24 11:21
mentioned in issue #49