project Utilities / File Location Listing avatar

utilities/file_location_listing#49: Deterministic Result ordering



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.6
Created: 03-Mar-24 11:17



Description

At the moment, results are returned in (roughly) the order they're found. So, a result set will contain results from aaa.example.com ahead of those from ddd.example.com etc.

So far, that's not been an issue - assuming my search term is OK, I've not had any real issues finding what I need.

However, as pages get archived, things start to get more complex.

For example, I recently generated a static archive of our wiki and then moved any active pages over to using Gitlab's wiki.

The problem with this, is there are now multiple pages with the same title, and the older (non-updated) stuff is returned first:

Screenshot_20240303_111524

What I'd like, is for the most recently updated files to be returned first. This could even be the default (I'm almost always going to want the most up to date)



Toggle State Changes

Activity


assigned to @btasker

First, it's probably worth explaining the current ordering.

It's essentially "as it comes".

Results are found by iterating through the indexes. Indexes are sorted alphabetically, results will often be returned alphabetically.

However, since utilities/file_location_listing#16 the index is divided into chunks, each of which are then iterated through by different threads. The eventual results are then merged without sorting.

This means that results aren't strictly alphabetical.

At a simple level, ordering by last-mod shouldn't be too hard to add.

But, there's some complexity around how limit is enforced.

Currently, when a request comes in, it'll express a result limit (say 300). We then take this limit and pass it to each of the search threads (each is given the full limit, for reasons explained in [#20](/issue/utilities/file_location_listing/20.html))

Once results have been returned, we then truncate to the limit length.

Obviously, we need to perform the sort before that final truncation, but we're still not necessarily going to end up with the expected result set.

Imagine that we have the following pages and last-mods:

http://aaa.example.com/foo1 2021-02-05
http://aaa.example.com/foo2 2021-03-05
http://aaa.example.com/foo3 2021-04-05
http://aaa.example.com/foo4 2021-05-05
http://aaa.example.com/foo5 2021-06-05
http://bbb.example.com/bar1 2022-02-05
http://bbb.example.com/bar2 2022-02-05
http://bbb.example.com/bar3 2022-02-05
http://bbb.example.com/bar4 2023-02-05
http://bbb.example.com/bar5 2024-02-05

To keep things simple, we'll assume there's a single search thread.

A search for example is passed in, with a limit of 5. The thread will iterate through and select the first 5 matches

http://aaa.example.com/foo1 2021-02-05
http://aaa.example.com/foo2 2021-03-05
http://aaa.example.com/foo3 2021-04-05
http://aaa.example.com/foo4 2021-05-05
http://aaa.example.com/foo5 2021-06-05

These will then be sorted in descending order of last-mod, giving the searcher the following resultset

http://aaa.example.com/foo5 2021-06-05
http://aaa.example.com/foo4 2021-05-05
http://aaa.example.com/foo3 2021-04-05
http://aaa.example.com/foo2 2021-03-05
http://aaa.example.com/foo1 2021-02-05

The problem is, these are not the most recent files. bbb.example.com has more recent files but they weren't checked because the resultcount was hit first.

To address this, we'd need to make changes so that search threads read and return every possible match so that the full set can be ordered and then truncated.

As we saw in [#20](/issue/utilities/file_location_listing/20.html) though, that's hugely inefficient and can significantly affect search response times.

Obviously, using better quality search terms can help with this - the fewer matches there are to load from disk, the better the response time.

But, I don't think that's a full answer.

That means that the options are:

  • Order the index by last-modified date rather than alphabetically (with the drawback that alphabetical searches then become harder)
  • Add last-modified to the index (would need supporting changes so that files aren't loaded from disk until after the index has been iterated through: that's quite a significant change)
  • Add the ordering "as is" (meaning we'd be releasing with a weird and unintuitive limitation - not ideal by any stretch of the imagination)

The only argument in favour of the third option is that I'm currently almost certainly the only user of the software - it'd be a quick and easy way to gauge impact, avoiding us throwing away a tonne of effort if it turns out it doesn't work as well as hoped.

The second option is potentially a lot of work. It'd also mean that we wouldn't later be able to allow searches to filter by data that only exists within stored files.

The first option means trading the current behaviour for a new default, with no real ability to switch between the two. It should, however, be relatively easy to implement (and the cost of sorting moves to index rebuild time rather than being added to searches).

I'm inclined to say that we should go with the 1st option, on the proviso that it may be rolled back if it turns out to have an undesirable effect on the utility of search results

Ah, the catch is - last-modified isn't currently available to the indexer, it works entirely off the storage file headers.

We'd need to do one of the following:

  • update headers to include last modified
  • parse the JSON payload

Doing the second is quite undesirable. The indexer uses headers because payloads have the potential to be extremely large - the idea being to keep indexing as cheap as possible.

If we did this in the next release (0.2.6) we'd need to roll back a small change made in #46 (though, that's not a bad thing as it'd be removing something that we'd need to roll back in a later version anyway).

I think that's the answer - add the last modified date (as an epoch timestamp, making sorting much easier) to headers

verified

mentioned in commit d28ffefe7f66c97d37f386aaae388b954cde32b3

Commit: d28ffefe7f66c97d37f386aaae388b954cde32b3 
Author: B Tasker                            
                            
Date: 2024-03-03T12:17:57.000+00:00 

Message

feat: order the index by last-modified date (utilities/file_location_listing#49)

+22 -6 (28 lines changed)

The commit above adjusts so that the index is ordered by last-modified date.

If a file doesn't have a last modified, a value of 0 is used. The result of this is that it will get ordered last.

One thing I've already noticed during experimentation - category/tag index pages now tend to float to the top (because they tend to get updated when something they link to changes). Time will tell whether that's a pro or a con.

verified

mentioned in commit 8cd6d728b7eae3d87256db46b571709efbf7365c

Commit: 8cd6d728b7eae3d87256db46b571709efbf7365c 
Author: B Tasker                            
                            
Date: 2024-03-03T12:25:25.000+00:00 

Message

feat: re-sort results at search time to ensure ordering (utilities/file_location_listing#49)

This ensures that the search being split across worker threads does not result in indeterministic ordering

+8 -0 (8 lines changed)

Given that this isn't really "custom" ordering, I'm going to update the issue title.

changed title from {-Custom-} Result ordering to {+Deterministic+} Result ordering

mentioned in issue #46

In theory this is now done.

Just waiting on a crawl to complete in dev so that I can test the searches that I noticed this with.

Busy week...

Just checked, this is working as expected with the original search.

Screenshot_20240310_115138

verified

mentioned in commit e29721fc925c1dfadd90042118d696998be45938

Commit: e29721fc925c1dfadd90042118d696998be45938 
Author: B Tasker                            
                            
Date: 2024-03-10T11:57:31.000+00:00 

Message

fix: ensure last-mod-s is written as an integer (utilities/file_location_listing#49)

+1 -1 (2 lines changed)

However, testing with some other terms, I ran into an exception

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2213, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2193, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/ben/Documents/src.old/file_location_listing/./server/server.py", line 61, in search
    "results" : dosearch.search(srchterm, search_type, limit, count_only)
  File "/home/ben/Documents/src.old/file_location_listing/./server/../lib/dosearch.py", line 281, in search
    results.sort(reverse=True, key=lambda x: x.get("last-mod-s", 0))
TypeError: '<' not supported between instances of 'str' and 'int'

dumping out the items being returned:

'last-mod-s': '1580980886', 

It's because we've used time.strftime to get the epoch, but not cast to an integer.

Commit e29721fc fixes the way it's written into storage.

It's quite curious that it was working for the Wifi search though... those are all strings too.

Ahh, that's not why the other test failed though.

When searching for Visitor Wifi everything has a last-mod-s value. They're strings, but what matter is that they are all strings (and so can all be compared).

With the second search, that's not true. There are pages where last-mod-s is a string, but there are also entries where it's an integer: 0 (indicating that the page didn't return a last-mod) - the result is we try and compare a string to an integer and throw an exception.

The commit above should resolve that - we'll be storing as an integer.

Arguably, we could also add a cast to integer to the the sort line, but that would be introducing unnecessary overhead to searches.

Side note: I wondered how much more expensive calling int() on something that was already an int might be, turns out it's comparatively expensive:

#!/usr/bin/env python3

import time

# Define common bits
x = range(10000)
l = {
 "last-mod-s" : 12345,
 "flt" : 3.142
}

# Do a fetch with no conversion
start=time.time_ns()
for i in x:
  a = l.get("last-mod-s")
stop=time.time_ns() 
print(f"assignment: {stop - start}")

# Cast the int to an int
start=time.time_ns()
for i in x:
  a = int(l.get("last-mod-s"))
stop=time.time_ns()
print(f"int -> int: {stop - start}")

# Cast a float to an int
start=time.time_ns()
for i in x:
  a = int(l.get("flt"))
stop=time.time_ns()
print(f"float -> int: {stop - start}")

Result:

$ for i in {1..3}; do python3 t.py; done
assignment: 1405429
int -> int: 3870779
float -> int: 3887890
assignment: 1727261
int -> int: 3959828
float -> int: 4231792
assignment: 1786696
int -> int: 3617113
float -> int: 3946418

Given that that's for 10,000 iterations a human would never actually notice the difference whilst searching.

But, it's still unnecessary: no released version exists where last-mod-s has been stored as a string, the only system that would need a recrawl is my dev instance

Recrawl running at the moment so that I can re-test

Yep, that's done the trick.

One other option for the future: when loading the index, we could conceivably build an alpha-ordered in-memory copy - that would allow us to offer alphabetic ordered results. I'm not convinced we'll need it though.

Either way, closing this change as Done.

Worth explicitly noting here: after upgrade, the changes made in this issue/release won't take effect until after a recrawl (and the subsequent re-index) - without that, the last-modified date won't be in the storage headers.

mentioned in issue #54

mentioned in issue #55