Should add a new prefix operator to fu in order to allow results to be filtered by the beginning of the filepath
So, if we have
https://www.example.com/foo/bar/sed.html
https://www.example.com/aaa/sed.html
Searching for
sed prefix:/foo/bar
Should only return the 1st
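To make the intended behaviour concrete, here is a minimal sketch of the filter semantics only, not of how the tool implements it. The helper name matches_prefix is hypothetical, and matching the search term as a simple substring of the URL is an assumption.

```python
from urllib.parse import urlsplit


def matches_prefix(url: str, prefix: str) -> bool:
    """Return True if the path component of url starts with prefix."""
    return urlsplit(url).path.startswith(prefix)


urls = [
    "https://www.example.com/foo/bar/sed.html",
    "https://www.example.com/aaa/sed.html",
]

# Rough equivalent of searching for: sed prefix:/foo/bar
results = [u for u in urls if "sed" in u and matches_prefix(u, "/foo/bar")]
print(results)
```

Only the first URL survives the filter, because only its path begins with /foo/bar.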
Activity
31-Dec-23 11:03
assigned to @btasker
31-Dec-23 11:06
The filepath isn't in the index as an individual field, so we have a few options here:
domain dork - check the index key contains the string, then check the exact prefix after loading candidates from disk
31-Dec-23 11:11
Thoughts:
Option 2 isn't really a practical option - we'd need to parse every identified key, adding a lot of overhead to searches.
Option 3 achieves the same effect, but adds quite a lot of expense at index load time.
Option 4 will add quite a lot of size to the index - arguably unnecessarily, given that the information is already in the index, even if in a less accessible form.
I suspect the answer, ultimately, will be option 3 (as that'll also improve performance for domain checks), but to begin with I think we go with option 1 rather than reinventing the wheel.
31-Dec-23 11:21
mentioned in commit 664f3857349a41fd15d0305f416321cf8f2528f2
Commit: 664f3857349a41fd15d0305f416321cf8f2528f2 Author: B Tasker Date: 2023-12-31T11:20:16.000+00:00
Message
feat: implement prefix operator (utilities/file_location_listing#25)
31-Dec-23 11:21
mentioned in commit 311ea6fe2b162bba3e47d58236c6064d8310210c
Commit: 311ea6fe2b162bba3e47d58236c6064d8310210c Author: B Tasker Date: 2023-12-31T11:21:03.000+00:00
Message
docs: update help text (utilities/file_location_listing#25)
31-Dec-23 11:33
OK, so option 1 is in place.
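As a rough illustration of that two-stage approach (a cheap substring check on the index key, then an exact prefix check once the candidate has been loaded), something like the following would work. This is a sketch under assumptions: the key format, the STORAGE dict and the load_entry and search_with_prefix names are all hypothetical stand-ins rather than the project's real code.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the on-disk store: index key -> stored entry.
STORAGE = {
    "www.example.com/foo/bar/sed.html": {
        "url": "https://www.example.com/foo/bar/sed.html"},
    "www.example.com/aaa/foo/bar/sed.html": {
        "url": "https://www.example.com/aaa/foo/bar/sed.html"},
    "www.example.com/aaa/sed.html": {
        "url": "https://www.example.com/aaa/sed.html"},
}


def load_entry(key: str) -> dict:
    """Hypothetical loader; a real one would read the entry from disk."""
    return STORAGE[key]


def search_with_prefix(index_keys, term: str, prefix: str) -> list:
    results = []
    for key in index_keys:
        # Stage 1: cheap substring checks against the raw index key.
        if term not in key or prefix not in key:
            continue
        # Stage 2: load the candidate, then check the exact path prefix.
        entry = load_entry(key)
        if urlsplit(entry["url"]).path.startswith(prefix):
            results.append(entry["url"])
    return results


matches = search_with_prefix(STORAGE.keys(), "sed", "/foo/bar")
print(matches)
```

The second key shows why stage 2 is needed: it contains the string /foo/bar, so it passes the substring check, but its path does not begin with it.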
I want to take a look at response times so that we can then put together a rough implementation of option 3 and see how that does.
Searching without prefix filter
mean: 0.791392511333s
Adding a prefix filter
mean: 0.959639716s
It being a little longer makes sense - it's still got to load the same number of entries from storage, even if it then returns fewer results.
Ahead of hacking in option 3, I've added a timings print to index loading and restarted a few times to get a mean.
Mean load time is 13747906.3333 nanoseconds (the index is pretty small - it has 5257 entries)
31-Dec-23 11:50
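The load-time figures here come from a timings print wrapped around index loading. A minimal sketch of that kind of measurement is below; load_index is a hypothetical placeholder for the real loader, and the use of time.perf_counter_ns is an assumption rather than what the tool necessarily does.

```python
import time


def load_index(path: str) -> dict:
    """Hypothetical index loader; stands in for the real one."""
    return {}


def timed_load(path: str):
    """Load the index and report how long it took, in nanoseconds."""
    start = time.perf_counter_ns()
    index = load_index(path)
    elapsed_ns = time.perf_counter_ns() - start
    print(f"Index load took {elapsed_ns} nanoseconds ({len(index)} entries)")
    return index, elapsed_ns


# Repeat the load a few times and average the results.
samples = [timed_load("index-file")[1] for _ in range(3)]
mean_ns = sum(samples) / len(samples)
print(f"Mean load time is {mean_ns} nanoseconds")
```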
Hacking option 3 in.
Same index file, load times:
mean: 33362546.3333 nanoseconds
That's an increase of 19614640.0033 nanoseconds (about a 143% increase, so roughly 2.4x the original load time - fairly substantial).
What's the impact on query time though?
Searching without prefix filter
mean: 0.778359837333s
Adding a prefix filter
mean: 0.0350479153333s
That's a pretty significant improvement for that search type
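To put the tradeoff in relative terms, the measured means can be compared directly. The snippet below only does arithmetic on the figures already quoted in this ticket; the variable names are illustrative.

```python
# Means quoted in this ticket.
load_before_ns = 13747906.3333   # index load, before option 3
load_after_ns = 33362546.3333    # index load, with option 3 hacked in
query_before_s = 0.959639716     # prefix search, option 1
query_after_s = 0.0350479153333  # prefix search, option 3

load_ratio = load_after_ns / load_before_ns
load_increase_pct = (load_after_ns - load_before_ns) / load_before_ns * 100
query_speedup = query_before_s / query_after_s

print(f"index load: {load_ratio:.2f}x ({load_increase_pct:.0f}% increase)")
print(f"prefix query: {query_speedup:.1f}x faster")
```

On these numbers, index load takes roughly 2.4 times as long, while the prefix-filtered search is roughly 27 times faster.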
31-Dec-23 11:59
The index load time has increased significantly, but we're still talking much less than a second.
The question, of course, is: how long does a larger index take to load?
I've built a (much) larger index:
Load times
mean: 7218523334ns (7.21s).
Realistically, that's probably much larger than we expect the DB to grow, and we'd want index format changes/improvements before we got there anyway.
Search time at that size isn't as bad as it could be though
31-Dec-23 12:04
So, the question is: is it worth the tradeoff in index load times?
Realistically, I'm probably not going to be using prefix all that regularly. But most of that extra time is spent parsing the key - we get more than just the path from that, so the domain dork can also benefit. I can well imagine that I'll be using that fairly regularly.
At the extreme end: although a 7s index load (really) isn't great, without the hack in place, that query likely wouldn't have completed in a meaningful timeframe.
I think, with a bit of tidying, it's probably worth keeping in place.
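The idea being kept can be sketched as follows: parse each key once at index load time, keep the URL components in memory, and use them to narrow candidates before anything is read from disk. This is an illustration under assumptions, not the committed code: it assumes index keys are full URLs, and IndexEntry, parse_key, load_index and filter_candidates are hypothetical names.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class IndexEntry(NamedTuple):
    key: str
    domain: str
    path: str


def parse_key(key: str) -> IndexEntry:
    """Split an index key into URL components (assumes the key is a URL)."""
    parts = urlsplit(key)
    return IndexEntry(key=key, domain=parts.netloc, path=parts.path)


def load_index(keys) -> list:
    # The parsing cost is paid once here, at load time, not on every search.
    return [parse_key(k) for k in keys]


def filter_candidates(index, prefix: Optional[str] = None,
                      domain: Optional[str] = None) -> list:
    """Narrow candidates in memory, before anything is read from disk."""
    out = []
    for entry in index:
        if prefix is not None and not entry.path.startswith(prefix):
            continue
        if domain is not None and entry.domain != domain:
            continue
        out.append(entry.key)
    return out


index = load_index([
    "https://www.example.com/foo/bar/sed.html",
    "https://www.example.com/aaa/sed.html",
])
print(filter_candidates(index, prefix="/foo/bar"))
```

Because the domain is extracted in the same pass, a domain filter can reuse the parsed components at no extra load-time cost, which is the benefit noted above.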
31-Dec-23 12:10
mentioned in commit 9401635de9513e43ee2d93c96c0594ff33bf12cb
Commit: 9401635de9513e43ee2d93c96c0594ff33bf12cb Author: B Tasker Date: 2023-12-31T12:10:29.000+00:00
Message
feat: extract url components when reading index (utilities/file_location_listing#25)
31-Dec-23 12:11
Closing as done - I'll raise a separate ticket to track moving domain over to using it
31-Dec-23 12:13
mentioned in issue #26