At the moment, results are returned in (roughly) the order they're found. So, a result set will contain results from aaa.example.com
ahead of those from ddd.example.com
etc.
So far, that's not been an issue - assuming my search term is OK, I've not had any real issues finding what I need.
However, as pages get archived, things start to get more complex.
For example, I recently generated a static archive of our wiki and then moved any active pages over to using Gitlab's wiki.
The problem with this, is there are now multiple pages with the same title, and the older (non-updated) stuff is returned first:
What I'd like, is for the most recently updated files to be returned first. This could even be the default (I'm almost always going to want the most up to date)
Activity
03-Mar-24 11:17
assigned to @btasker
03-Mar-24 11:21
First, it's probably worth explaining the current ordering.
It's essentially "as it comes".
Results are found by iterating through the indexes. Indexes are sorted alphabetically, results will often be returned alphabetically.
However, since utilities/file_location_listing#16 the index is divided into chunks, each of which are then iterated through by different threads. The eventual results are then merged without sorting.
This means that results aren't strictly alphabetical.
03-Mar-24 11:41
At a simple level, ordering by
last-mod
shouldn't be too hard to add.But, there's some complexity around how
limit
is enforced.Currently, when a request comes in, it'll express a result limit (say 300). We then take this limit and pass it to each of the search threads (each is given the full limit, for reasons explained in [#20](/issue/utilities/file_location_listing/20.html))
Once results have been returned, we then truncate to the limit length.
Obviously, we need to perform the sort before that final truncation, but we're still not necessarily going to end up with the expected result set.
Imagine that we have the following pages and last-mods:
To keep things simple, we'll assume there's a single search thread.
A search for
example
is passed in, with a limit of 5. The thread will iterate through and select the first 5 matchesThese will then be sorted in descending order of last-mod, giving the searcher the following resultset
The problem is, these are not the most recent files.
bbb.example.com
has more recent files but they weren't checked because the resultcount was hit first.To address this, we'd need to make changes so that search threads read and return every possible match so that the full set can be ordered and then truncated.
As we saw in [#20](/issue/utilities/file_location_listing/20.html) though, that's hugely inefficient and can significantly affect search response times.
03-Mar-24 11:51
Obviously, using better quality search terms can help with this - the fewer matches there are to load from disk, the better the response time.
But, I don't think that's a full answer.
That means that the options are:
last-modified
to the index (would need supporting changes so that files aren't loaded from disk until after the index has been iterated through: that's quite a significant change)The only argument in favour of the third option is that I'm currently almost certainly the only user of the software - it'd be a quick and easy way to gauge impact, avoiding us throwing away a tonne of effort if it turns out it doesn't work as well as hoped.
The second option is potentially a lot of work. It'd also mean that we wouldn't later be able to allow searches to filter by data that only exists within stored files.
The first option means trading the current behaviour for a new default, with no real ability to switch between the two. It should, however, be relatively easy to implement (and the cost of sorting moves to index rebuild time rather than being added to searches).
I'm inclined to say that we should go with the 1st option, on the proviso that it may be rolled back if it turns out to have an undesirable effect on the utility of search results
03-Mar-24 11:59
Ah, the catch is - last-modified isn't currently available to the indexer, it works entirely off the storage file headers.
We'd need to do one of the following:
last modified
Doing the second is quite undesirable. The indexer uses headers because payloads have the potential to be extremely large - the idea being to keep indexing as cheap as possible.
If we did this in the next release (
0.2.6
) we'd need to roll back a small change made in #46 (though, that's not a bad thing as it'd be removing something that we'd need to roll back in a later version anyway).I think that's the answer - add the last modified date (as an epoch timestamp, making sorting much easier) to headers
03-Mar-24 12:18
mentioned in commit d28ffefe7f66c97d37f386aaae388b954cde32b3
Message
feat: order the index by last-modified date (utilities/file_location_listing#49)
03-Mar-24 12:20
The commit above adjusts so that the index is ordered by last-modified date.
If a file doesn't have a last modified, a value of
0
is used. The result of this is that it will get ordered last.One thing I've already noticed during experimentation - category/tag index pages now tend to float to the top (because they tend to get updated when something they link to changes). Time will tell whether that's a pro or a con.
03-Mar-24 12:26
mentioned in commit 8cd6d728b7eae3d87256db46b571709efbf7365c
Message
feat: re-sort results at search time to ensure ordering (utilities/file_location_listing#49)
This ensures that the search being split across worker threads does not result in indeterministic ordering
03-Mar-24 12:26
Given that this isn't really "custom" ordering, I'm going to update the issue title.
03-Mar-24 12:26
changed title from {-Custom-} Result ordering to {+Deterministic+} Result ordering
03-Mar-24 12:27
mentioned in issue #46
03-Mar-24 12:32
In theory this is now done.
Just waiting on a crawl to complete in dev so that I can test the searches that I noticed this with.
10-Mar-24 11:51
Busy week...
Just checked, this is working as expected with the original search.
10-Mar-24 11:57
mentioned in commit e29721fc925c1dfadd90042118d696998be45938
Message
fix: ensure
last-mod-s
is written as an integer (utilities/file_location_listing#49)10-Mar-24 11:58
However, testing with some other terms, I ran into an exception
dumping out the items being returned:
It's because we've used
time.strftime
to get the epoch, but not cast to an integer.10-Mar-24 12:14
Commit e29721fc fixes the way it's written into storage.
It's quite curious that it was working for the Wifi search though... those are all strings too.
Ahh, that's not why the other test failed though.
When searching for
Visitor Wifi
everything has alast-mod-s
value. They're strings, but what matter is that they are all strings (and so can all be compared).With the second search, that's not true. There are pages where
last-mod-s
is a string, but there are also entries where it's an integer:0
(indicating that the page didn't return a last-mod) - the result is we try and compare a string to an integer and throw an exception.The commit above should resolve that - we'll be storing as an integer.
Arguably, we could also add a cast to integer to the the sort line, but that would be introducing unnecessary overhead to searches.
Side note: I wondered how much more expensive calling
int()
on something that was already an int might be, turns out it's comparatively expensive:Result:
Given that that's for 10,000 iterations a human would never actually notice the difference whilst searching.
But, it's still unnecessary: no released version exists where
last-mod-s
has been stored as a string, the only system that would need a recrawl is my dev instance10-Mar-24 12:28
Recrawl running at the moment so that I can re-test
10-Mar-24 12:49
Yep, that's done the trick.
One other option for the future: when loading the index, we could conceivably build an alpha-ordered in-memory copy - that would allow us to offer alphabetic ordered results. I'm not convinced we'll need it though.
Either way, closing this change as Done.
18-Mar-24 18:59
Worth explicitly noting here: after upgrade, the changes made in this issue/release won't take effect until after a recrawl (and the subsequent re-index) - without that, the last-modified date won't be in the storage headers.
02-Jun-24 09:32
mentioned in issue #54
02-Jun-24 09:40
mentioned in issue #55