Should add a new prefix operator to fu in order to allow results to be filtered by the beginning of the filepath.
So, if we have
https://www.example.com/foo/bar/sed.html
https://www.example.com/aaa/sed.html
Searching for
sed prefix:/foo/bar
Should only return the first URL.
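A minimal sketch of the intended behaviour, assuming results are plain URL strings and that the prefix is compared against the URL's path component. The function name matches_prefix is hypothetical and not taken from the codebase:

```python
from urllib.parse import urlparse


def matches_prefix(url: str, prefix: str) -> bool:
    """Return True if the path component of the URL begins with prefix."""
    return urlparse(url).path.startswith(prefix)


urls = [
    "https://www.example.com/foo/bar/sed.html",
    "https://www.example.com/aaa/sed.html",
]

# Rough equivalent of searching for: sed prefix:/foo/bar
results = [u for u in urls if "sed" in u and matches_prefix(u, "/foo/bar")]
print(results)
```

Only the URL under /foo/bar survives the filter, even though both URLs match the search term itself.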
Activity
31-Dec-23 11:03
assigned to @btasker
31-Dec-23 11:06
The filepath isn't in the index as an individual field, so we have a few options here:
Option 1: as with the domain dork, check that the index key contains the string, then check the exact prefix after loading candidates from disk
31-Dec-23 11:11
Thoughts:
Option 2 isn't really a practical option - we'd need to parse every identified key, adding a lot of overhead to searches.
Option 3 achieves the same effect, but adds quite a lot of expense at index load time.
Option 4 will add quite a lot of size to the index - arguably unnecessarily, given that the information is already in the index, even if in a less accessible form.
I suspect the answer, ultimately, will be option 3 (as that'll also improve performance for domain checks), but to begin with I think we go with option 1 rather than reinventing the wheel
31-Dec-23 11:21
mentioned in commit 664f3857349a41fd15d0305f416321cf8f2528f2
Message
feat: implement prefix operator (utilities/file_location_listing#25)
31-Dec-23 11:21
mentioned in commit 311ea6fe2b162bba3e47d58236c6064d8310210c
Message
docs: update help text (utilities/file_location_listing#25)
31-Dec-23 11:33
OK, so option 1 is in place.
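For illustration, option 1 amounts to a two-stage filter. The sketch below is not the committed implementation: it assumes the index key is the full URL, and STORAGE, load_from_disk and search are hypothetical stand-ins for the real storage layer.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the on-disk store: index key -> stored record
STORAGE = {
    url: {"url": url}
    for url in (
        "https://www.example.com/foo/bar/sed.html",
        "https://www.example.com/aaa/sed.html",
        "https://www.example.com/aaa/foo/bar/sed.html",
    )
}

loads = []  # records which keys were fetched, to show the cost of stage 2


def load_from_disk(key):
    """Pretend to read a record from storage (assumed to be the slow step)."""
    loads.append(key)
    return STORAGE[key]


def search(term, prefix):
    results = []
    for key in STORAGE:
        # Stage 1: cheap substring checks against the index key
        if term not in key or prefix not in key:
            continue
        # Stage 2: load the candidate, then check the exact path prefix
        record = load_from_disk(key)
        if urlparse(record["url"]).path.startswith(prefix):
            results.append(record["url"])
    return results


matches = search("sed", "/foo/bar")
print(matches)
print(f"{len(loads)} candidates loaded for {len(matches)} result(s)")
```

The third URL contains /foo/bar but does not start with it, so it passes the substring check and is only rejected after it has been loaded.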
I want to take a look at response times so that we can then put together a rough implementation of option 3 and see how that does.
Searching without prefix filter
mean: 0.791392511333s
Adding a prefix filter
mean: 0.959639716s
It being a little longer makes sense - it's still got to load the same number of entries from storage even if it then returns fewer results
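Means like these can be gathered by repeating a measurement and averaging. A rough sketch of that kind of harness follows; mean_runtime_ns and fake_search are hypothetical names, and fake_search is only a placeholder for whatever is being timed (a search, or an index load):

```python
import time
from statistics import mean


def mean_runtime_ns(fn, runs=3):
    """Call fn `runs` times and return the mean wall-clock time in nanoseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return mean(samples)


def fake_search():
    # Placeholder workload standing in for the real operation
    sum(range(10_000))


elapsed_ns = mean_runtime_ns(fake_search)
print(f"mean: {elapsed_ns / 1e9:.9f}s")
```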
Ahead of hacking in option 3, I've added a timings print to index loading.
Restarted a few times to get a mean:
Mean load time is 13747906.3333 nanoseconds (index is pretty small - it has 5257 entries)
31-Dec-23 11:50
Hacking option 3 in.
Same index file, load times:
mean: 33362546.3333 nanoseconds
That's an increase of 19614640.0033 nanoseconds (about a 143% increase, so roughly 2.4x the original load time, fairly substantial).
What's the impact on query time though?
Searching without prefix filter
mean: 0.778359837333s
Adding a prefix filter
mean: 0.0350479153333s
That's a pretty significant improvement for that search type
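The improvement is consistent with filtering on a pre-parsed path before anything is read from storage. A minimal sketch of that idea, assuming index keys are full URLs; IndexEntry, load_index and candidates are hypothetical names rather than the project's own:

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class IndexEntry:
    key: str
    domain: str
    path: str


def load_index(keys):
    """Parse each key once at load time (this is where the extra cost lands)."""
    entries = []
    for key in keys:
        parsed = urlparse(key)
        entries.append(IndexEntry(key=key, domain=parsed.netloc, path=parsed.path))
    return entries


def candidates(index, term, prefix=None):
    """Filter in memory, so only true matches need loading from storage."""
    return [
        entry.key
        for entry in index
        if term in entry.key and (prefix is None or entry.path.startswith(prefix))
    ]


index = load_index([
    "https://www.example.com/foo/bar/sed.html",
    "https://www.example.com/aaa/sed.html",
    "https://www.example.com/aaa/foo/bar/sed.html",
])
print(candidates(index, "sed", prefix="/foo/bar"))
```

Because the domain is extracted in the same pass, a domain check could reuse the parsed entries in the same way.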
31-Dec-23 11:59
The index load time has increased significantly, but we're still talking much less than a second.
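As a quick check of the arithmetic on the means quoted in this ticket (the variable names are illustrative only):

```python
# Mean index load times reported above, in nanoseconds
load_before_ns = 13747906.3333
load_after_ns = 33362546.3333

increase_ns = load_after_ns - load_before_ns
increase_pct = increase_ns / load_before_ns * 100
load_ratio = load_after_ns / load_before_ns

# Mean query times with a prefix filter, in seconds
query_before_s = 0.959639716
query_after_s = 0.0350479153333
speedup = query_before_s / query_after_s

print(f"load time: +{increase_ns / 1e6:.1f} ms ({increase_pct:.0f}% increase, {load_ratio:.2f}x)")
print(f"prefix query: {speedup:.1f}x faster")
```

So the load gets a little under 20 ms slower on this index, while the prefix-filtered query gets over 27 times faster.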
The question, of course, is: how long does a larger index take to load?
I've built a (much) larger index:
Load times
mean: 7218523334 ns (7.22s).
Realistically, that's probably much larger than we expect the DB to grow, and we'd want index format changes/improvements before we got there anyway.
Search time at that size isn't as bad as it could be though
31-Dec-23 12:04
So, the question is: is it worth the tradeoff in index load times?
Realistically, I'm probably not going to be using prefix all that regularly. But, most of that extra time is spent parsing the key - we get more than just the path from that, so the domain dork can also benefit. I can well imagine that I'll be using that fairly regularly.
At the extreme end: although a 7s index load (really) isn't great, without the hack in place, that query likely wouldn't have completed in a meaningful timeframe.
I think, with a bit of tidying, it's probably worth keeping in place.
31-Dec-23 12:10
mentioned in commit 9401635de9513e43ee2d93c96c0594ff33bf12cb
Message
feat: extract url components when reading index (utilities/file_location_listing#25)
31-Dec-23 12:11
Closing as done - I'll raise a separate ticket to track moving domain over to using it
31-Dec-23 12:13
mentioned in issue #26