#16 Search Performance : utilities/file_location

btasker Permalink
29-Dec-23 23:56

assigned to @btasker

btasker Permalink
29-Dec-23 23:56

verified

mentioned in commit fe86162861b4d27a538e8afdfa78cda6e91f7122

Commit: fe86162861b4d27a538e8afdfa78cda6e91f7122 
Author: B Tasker                            
                            
Date: 2023-12-29T23:56:32.000+00:00

Message

feat: print timing information when handling searches (utilities/file_location_listing#16)

+14 -1 (15 lines changed)

btasker Permalink
30-Dec-23 00:03

The commit above captures nanosecond level timings during the index and file scan process.

Searching a in my local DB (3.1MB index) returns 9058 results and gives the following timings

Disk loading time: 1654886462
Index scan time:   89588519
Total:             1744474981

Not overly surprisingly, disk loads are a significant proportion of time spent. As a relatively easy win, it may be worth looking at adding a layer of file caching so that subsequent searches reading a similar subset of files do not have to load them all from disk again.

btasker Permalink
30-Dec-23 00:07

Although it's not going to contribute much, we are hashing the URL in the course of running a search - we'll do that for every URL that passes the index based checks. sha256 is fast, but over the course of lots of URLs, that adds up.

It might be better to store the hash in the index in place of the path (which isn't currently being used). The path can be calculated from the hash (in fact, it already is) so we're not gaining anything by having that in there.

btasker Permalink
30-Dec-23 00:11

verified

mentioned in commit 7ff5d0233f06942d29217c7723b4559c0d8a9a63

Commit: 7ff5d0233f06942d29217c7723b4559c0d8a9a63 
Author: B Tasker                            
                            
Date: 2023-12-30T00:10:38.000+00:00

Message

fix: switch index to storing fname and use that when loading during search (utilities/file_location_listing#16)

+14 -10 (24 lines changed)

btasker Permalink
30-Dec-23 00:13

The index has been switched to storing the fname and search now passes that into getStoredItem().

Running the same search (after the index has been rebuilt):

Disk loading time: 523845678
Index scan time:   85503961
Total:             609349639

Disk loading time: 1100157644
Index scan time:   145567073
Total:             1245724717

Disk loading time: 1618240465
Index scan time:   224302875
Total:             1842543340

There's obviously some variation, but we are generally faster.

btasker Permalink
30-Dec-23 00:27

Although it'd mean re-crawling, the other thing to consider would be gzipping the storage files.

That way, physical storage would need to send fewer bytes (particularly important when you consider the storage is being exposed via NFS) at the cost of higher CPU usage locally.

It's quite easily achieved with the gzip module.

btasker Permalink
30-Dec-23 00:50

verified

mentioned in commit bd70a7e70f0a55f18fa01921106a3a25b93f62cc

Commit: bd70a7e70f0a55f18fa01921106a3a25b93f62cc 
Author: B Tasker                            
                            
Date: 2023-12-30T00:49:48.000+00:00

Message

feat: implement use of gzip for storage (utilities/file_location_listing#16)

There's currently support for both this and plaintext, leading to some awful code. That's intended as a temporary situation whilst I test performance

+76 -46 (122 lines changed)

btasker Permalink
30-Dec-23 12:26

Using gzip seems to work happily enough, so I'm going to strip out the dual mode stuff.

btasker Permalink
30-Dec-23 12:49

verified

mentioned in commit 444e54a279bb32bd32df21a2fe28ca182456b969

Commit: 444e54a279bb32bd32df21a2fe28ca182456b969 
Author: B Tasker                            
                            
Date: 2023-12-30T12:35:36.000+00:00

Message

chore: move storage over to gzip only (utilities/file_location_listing#16)

A recrawl will be required after this change - any existing plaintext files will be moved out of the way

+28 -42 (70 lines changed)

btasker Permalink
30-Dec-23 12:53

I've switched the test deployment over and purged the old storage - re-crawl is running at the moment.

btasker Permalink
30-Dec-23 15:02

Although it's reasonably fast, I don't think there's a way to simply restructure the index to speed up index scans.

If we were looking up exact matches it'd probably be possible to splinter the index so that you only have to search a fragment. But, we're checking whether a substring exists within the indexed key, so we need to iterate through every key in the index.

Of course, that is something that could be spun out to threads in order to parallelise the workload. That'd also mean that we're loading files in parallel, which could reduce the response time quite a bit.

btasker Permalink
30-Dec-23 15:25

The post-gzip recrawl is coming to an end (index is being built atm), so it should be possible to do some searches before deploying the multi-threading changes.

btasker Permalink
30-Dec-23 15:34

Index stats

$ wc -l ~/Downloads/index\(2\) 
27235 /home/ben/Downloads/index(2)

$ ls -sh ~/Downloads/index\(2\) 
5.4M '/home/ben/Downloads/index(2)'

Searchterm: b

Disk loading time: 134173032561
Index scan time:   210488047
Total:             134,383,520,608

Note: it failed in browser, the gateway hit timeout.

Searchterm: concrete

Disk loading time: 50763774
Index scan time:   39881149
Total:             90644923

Note: 5 results

Searchterm: win

Disk loading time: 748707392
Index scan time:   63475615
Total:             812183007

Note: 150 Results

btasker Permalink
30-Dec-23 15:41

Running the new image, with NUM_THREADS at 4

It's no longer possible to break time down between index scan and disk read, so the system simply logs the total run time.

Searchterm: b

Search Time:             49.636055764s

Searchterm: concrete

Search Time:             0.085569302s

5 results

Searchterm: win

Search Time:             0.39869117s

150 Results

btasker Permalink
30-Dec-23 15:43

I think I'd describe that as wildly successful then.

However, there probably are still some improvements that can be made:

The shorter the search term, the longer a search is going to take: it'd probably be prudent to implement a minimum search phrase length (utilities/file_location_listing#19)
It's extremely unlikely that thousands of results are actually going to be read, should implement a (customisable) LIMIT to cap the number of results that can be returned (utilities/file_location_listing#20)

btasker Permalink
30-Dec-23 15:50

mentioned in issue #20

btasker Permalink
30-Dec-23 16:11

verified

mentioned in commit 1a8c4f5a69e120c6f07f503744e7797aca02e7bf

Commit: 1a8c4f5a69e120c6f07f503744e7797aca02e7bf 
Author: B Tasker                            
                            
Date: 2023-12-30T15:01:54.000+00:00

Message

feat: use multiple threads to process searches (utilities/file_location_listing#16)

+86 -30 (116 lines changed)

btasker Permalink
30-Dec-23 16:59

There's something odd going on with reads (utilities/file_location_listing#21) - we seem to be reading files from disk more than once per search

btasker Permalink
30-Dec-23 17:39

Need to step away for a bit, but it'll be interesting to see what impact fixing #21 will have had on response times.

btasker Permalink
30-Dec-23 17:43

Results are still using uncapped

Searchterm: b

Search Time:             26.567205858s

Searchterm: concrete

Search Time:             0.014505064s

5 results

Searchterm: win

Search Time:             0.014529924s

150 Results

Just for reference, searching b with results capped at 300 took 2 seconds the first time (and 0.003911246s the second - file caching is still currently in place)

btasker Permalink
30-Dec-23 18:48

I'm going to close this issue down - I'm about ready to tag and build a release, so it doesn't make sense to leave this hanging open.

btasker Permalink
03-Mar-24 11:21

mentioned in issue #49

utilities/file_location_listing#16: Search Performance

Issue Information

Activity