project Utilities / File Location Listing avatar

utilities/file_location_listing#16: Search Performance

Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.1
Created: 29-Dec-23 23:56


After running a full crawl, the index is currently 7.1MB with 31,000 entries in it.

Searches are responsive, but not exactly startlingly fast - although it's certainly usable, I think it'd be worth looking at the performance characteristics to see where time is spent (and what improvements can be made).

Toggle State Changes


assigned to @btasker


mentioned in commit fe86162861b4d27a538e8afdfa78cda6e91f7122

Commit: fe86162861b4d27a538e8afdfa78cda6e91f7122 
Author: B Tasker                            
Date: 2023-12-29T23:56:32.000+00:00 


feat: print timing information when handling searches (utilities/file_location_listing#16)

+14 -1 (15 lines changed)

The commit above captures nanosecond level timings during the index and file scan process.

Searching a in my local DB (3.1MB index) returns 9058 results and gives the following timings

Disk loading time: 1654886462
Index scan time:   89588519
Total:             1744474981

Not overly surprisingly, disk loads are a significant proportion of time spent. As a relatively easy win, it may be worth looking at adding a layer of file caching so that subsequent searches reading a similar subset of files do not have to load them all from disk again.

Although it's not going to contribute much, we are hashing the URL in the course of running a search - we'll do that for every URL that passes the index based checks. sha256 is fast, but over the course of lots of URLs, that adds up.

It might be better to store the hash in the index in place of the path (which isn't currently being used). The path can be calculated from the hash (in fact, it already is) so we're not gaining anything by having that in there.


mentioned in commit 7ff5d0233f06942d29217c7723b4559c0d8a9a63

Commit: 7ff5d0233f06942d29217c7723b4559c0d8a9a63 
Author: B Tasker                            
Date: 2023-12-30T00:10:38.000+00:00 


fix: switch index to storing fname and use that when loading during search (utilities/file_location_listing#16)

+14 -10 (24 lines changed)

The index has been switched to storing the fname and search now passes that into getStoredItem().

Running the same search (after the index has been rebuilt):

Disk loading time: 523845678
Index scan time:   85503961
Total:             609349639

Disk loading time: 1100157644
Index scan time:   145567073
Total:             1245724717

Disk loading time: 1618240465
Index scan time:   224302875
Total:             1842543340

There's obviously some variation, but we are generally faster.

Although it'd mean re-crawling, the other thing to consider would be gzipping the storage files.

That way, physical storage would need to send fewer bytes (particularly important when you consider the storage is being exposed via NFS) at the cost of higher CPU usage locally.

It's quite easily achieved with the gzip module.


mentioned in commit bd70a7e70f0a55f18fa01921106a3a25b93f62cc

Commit: bd70a7e70f0a55f18fa01921106a3a25b93f62cc 
Author: B Tasker                            
Date: 2023-12-30T00:49:48.000+00:00 


feat: implement use of gzip for storage (utilities/file_location_listing#16)

There's currently support for both this and plaintext, leading to some awful code. That's intended as a temporary situation whilst I test performance

+76 -46 (122 lines changed)

Using gzip seems to work happily enough, so I'm going to strip out the dual mode stuff.


mentioned in commit 444e54a279bb32bd32df21a2fe28ca182456b969

Commit: 444e54a279bb32bd32df21a2fe28ca182456b969 
Author: B Tasker                            
Date: 2023-12-30T12:35:36.000+00:00 


chore: move storage over to gzip only (utilities/file_location_listing#16)

A recrawl will be required after this change - any existing plaintext files will be moved out of the way

+28 -42 (70 lines changed)

I've switched the test deployment over and purged the old storage - re-crawl is running at the moment.

Although it's reasonably fast, I don't think there's a way to simply restructure the index to speed up index scans.

If we were looking up exact matches it'd probably be possible to splinter the index so that you only have to search a fragment. But, we're checking whether a substring exists within the indexed key, so we need to iterate through every key in the index.

Of course, that is something that could be spun out to threads in order to parallelise the workload. That'd also mean that we're loading files in parallel, which could reduce the response time quite a bit.

The post-gzip recrawl is coming to an end (index is being built atm), so it should be possible to do some searches before deploying the multi-threading changes.

Index stats

$ wc -l ~/Downloads/index\(2\) 
27235 /home/ben/Downloads/index(2)

$ ls -sh ~/Downloads/index\(2\) 
5.4M '/home/ben/Downloads/index(2)'

Searchterm: b

Disk loading time: 134173032561
Index scan time:   210488047
Total:             134,383,520,608

Note: it failed in browser, the gateway hit timeout.

Searchterm: concrete

Disk loading time: 50763774
Index scan time:   39881149
Total:             90644923

Note: 5 results

Searchterm: win

Disk loading time: 748707392
Index scan time:   63475615
Total:             812183007

Note: 150 Results

Running the new image, with NUM_THREADS at 4

It's no longer possible to break time down between index scan and disk read, so the system simply logs the total run time.

Searchterm: b

Search Time:             49.636055764s

Searchterm: concrete

Search Time:             0.085569302s

5 results

Searchterm: win

Search Time:             0.39869117s

150 Results

I think I'd describe that as wildly successful then.

However, there probably are still some improvements that can be made:

  • The shorter the search term, the longer a search is going to take: it'd probably be prudent to implement a minimum search phrase length (utilities/file_location_listing#19)
  • It's extremely unlikely that thousands of results are actually going to be read, should implement a (customisable) LIMIT to cap the number of results that can be returned (utilities/file_location_listing#20)

mentioned in issue #20


mentioned in commit 1a8c4f5a69e120c6f07f503744e7797aca02e7bf

Commit: 1a8c4f5a69e120c6f07f503744e7797aca02e7bf 
Author: B Tasker                            
Date: 2023-12-30T15:01:54.000+00:00 


feat: use multiple threads to process searches (utilities/file_location_listing#16)

+86 -30 (116 lines changed)

There's something odd going on with reads (utilities/file_location_listing#21) - we seem to be reading files from disk more than once per search

Need to step away for a bit, but it'll be interesting to see what impact fixing #21 will have had on response times.

Results are still using uncapped

Searchterm: b

Search Time:             26.567205858s

Searchterm: concrete

Search Time:             0.014505064s

5 results

Searchterm: win

Search Time:             0.014529924s

150 Results

Just for reference, searching b with results capped at 300 took 2 seconds the first time (and 0.003911246s the second - file caching is still currently in place)

I'm going to close this issue down - I'm about ready to tag and build a release, so it doesn't make sense to leave this hanging open.

mentioned in issue #49