utilities/file_location_listing#49 introduced deterministic result ordering. Search results are returned in reverse chronological order, based on the last-modified date that was returned when the page was last indexed.
This works relatively well for things like my notes (which are rarely, if ever, updated).
It works much less well for things that are routinely regenerated (like https://projects.bentasker.co.uk/gils_projects/index.html). The entire site is regenerated a couple of times a day, with the effect that those pages are almost always going to be ordered first.
We should add support for parsing metadata to extract a published date from HTML pages.
Looking at www.bentasker.co.uk, Nikola inserts microdata:
<meta property="article:published_time" content="2024-05-31T09:56:00Z">
So we could perhaps look at extracting that. Crawl or search time config could then be used to determine (per site) whether last-mod or published should be used for ordering purposes.
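As a rough illustration of the idea, the sketch below pulls article:published_time out of a page with Python's standard html.parser and then picks an ordering date based on a per-site mode. The names PublishedTimeParser, extract_published and ordering_date are hypothetical, as are the mode strings "published" and "last-mod"; none of this reflects how the project actually stores config.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedTimeParser(HTMLParser):
    """Record the first article:published_time meta value seen in a page."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.published is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "article:published_time":
            self.published = attr.get("content")


def extract_published(html):
    """Return the published time as an aware datetime, or None if absent/invalid."""
    parser = PublishedTimeParser()
    parser.feed(html)
    if not parser.published:
        return None
    try:
        # Rewrite a trailing "Z" so fromisoformat() also copes on Python < 3.11
        return datetime.fromisoformat(parser.published.replace("Z", "+00:00"))
    except ValueError:
        return None


def ordering_date(mode, last_modified, published):
    """Pick the date to sort on; mode is an assumed per-site setting."""
    if mode == "published" and published is not None:
        return published
    # Fall back to last-modified when no published date could be extracted
    return last_modified
```

Falling back to last-modified keeps pages without any markup orderable, at the cost of mixing two kinds of date within one site.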
Activity
02-Jun-24 09:40
assigned to @btasker
02-Jun-24 09:59
The catch with this is that extracting a published date isn't quite as simple as it first sounds.
There's no one standard in semantic markup for publication dates. Nikola is using opengraph-esque markup, but some others use schema.org style itemprop="dateCreated" to denote a creation date.
If microdata has been used, there may be multiple creation dates within a page, because it's possible to nest entities.
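To make the nesting problem concrete, here is a minimal sketch with a made-up page in which an article contains a comment that carries its own dateCreated. The parser counts open itemscope elements and only accepts dates attached to the outermost item. It assumes reasonably well-formed markup (every non-void element is closed), treats the first top-level item as the main content, and only reads content or datetime attributes; the class name and sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

DATE_PROPS = {"dateCreated", "datePublished"}


class TopLevelDateParser(HTMLParser):
    """Collect date properties that belong to a top-level microdata item."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one bool per open element: did it open an itemscope?
        self.depth = 0   # number of itemscopes currently open
        self.dates = {}  # property name -> first value seen at the top level

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        prop = attr.get("itemprop")
        # An itemprop belongs to the enclosing item, so check the depth
        # before this element's own itemscope (if any) is counted
        if prop in DATE_PROPS and self.depth == 1:
            value = attr.get("content") or attr.get("datetime")
            if value:
                self.dates.setdefault(prop, value)
        if tag in VOID:
            return
        opens_scope = "itemscope" in attr
        self.stack.append(opens_scope)
        if opens_scope:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        if self.stack.pop():
            self.depth -= 1


SAMPLE = """
<article itemscope itemtype="https://schema.org/Article">
  <meta itemprop="datePublished" content="2024-01-10T08:00:00Z">
  <div itemprop="comment" itemscope itemtype="https://schema.org/Comment">
    <time itemprop="dateCreated" datetime="2024-03-01T12:00:00Z">1 March</time>
  </div>
  <time itemprop="dateCreated" datetime="2024-01-09T08:00:00Z">9 January</time>
</article>
"""

parser = TopLevelDateParser()
parser.feed(SAMPLE)
print(parser.dates)
```

The comment's dateCreated sits two scopes deep and is ignored, while the article's own dates are kept.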
So, it's not simply a case of looking for dateCreated or datePublished: we'd need to track which scope we're in and only use the date that relates to the main content.
Once you get past that, you also need to consider modification dates: if I wrote an article in January and then updated it yesterday, should it be ordered first or further down?
If it should be first, we'd need to be able to extract an update date from the markup (because the server header can't be trusted in the scenarios this feature caters to).
There's a range of possible vocabs to choose from there too though:
- schema.org markup tends to have dateModified
- og:updated_time
- article:modified_time (assuming, of course, the page is of type article in the first place), though Nikola doesn't appear to add these when a post is updated
- itemprop="dateUpdated", used to mark the update date
All told, it's a bit hard not to think of XKCD 927.
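One way of coping with the competing vocabularies would be to collect every candidate listed above and take the first one present, in a fixed priority order. The sketch below does that with html.parser. The priority order is purely an assumption, the names MetaCollector and extract_modified are hypothetical, and it deliberately ignores microdata scoping, so on its own it could still pick up a nested entity's date.

```python
from html.parser import HTMLParser

# Candidate keys taken from the list above; the order is an assumed preference
MODIFIED_KEYS = [
    "dateModified",
    "article:modified_time",
    "og:updated_time",
    "dateUpdated",
]


class MetaCollector(HTMLParser):
    """Collect the first value seen for each candidate modification key."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # OpenGraph-style markup uses property=, microdata uses itemprop=
        key = attr.get("property") or attr.get("itemprop")
        value = attr.get("content") or attr.get("datetime")
        if key in MODIFIED_KEYS and value:
            self.found.setdefault(key, value)


def extract_modified(html):
    """Return (key, value) for the highest-priority key present, else None."""
    collector = MetaCollector()
    collector.feed(html)
    for key in MODIFIED_KEYS:
        if key in collector.found:
            return key, collector.found[key]
    return None
```

Returning the matching key alongside the value makes it possible to see which vocabulary a given site actually uses before settling on an order.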
I think this is probably something I'm going to want to do eventually, but it's more complex than it sounds and better not rushed into.