
utilities/file_location_listing#55: Allow published date to override last-mod for ordering purposes



Issue Information

Issue Type: issue
Status: opened
Reported By: btasker
Assigned To: btasker

Milestone: backlog
Created: 02-Jun-24 09:40



Description

utilities/file_location_listing#49 introduced deterministic result ordering. Search results are returned in reverse chronological order, based on the last-modified date that the server returned when the page was last indexed.

This works relatively well for things like my notes (which are rarely, if ever, updated).

It works much less well for things that are routinely regenerated (like https://projects.bentasker.co.uk/gils_projects/index.html). The entire site is regenerated a couple of times a day, with the effect that those pages are almost always going to be ordered first.

We should add support for parsing metadata to extract a published date from HTML pages.

Looking at www.bentasker.co.uk, Nikola inserts Open Graph-style metadata:

<meta property="article:published_time" content="2024-05-31T09:56:00Z">

So we could perhaps look at extracting that. Crawl-time or search-time config could then be used to determine (per site) whether last-mod or published should be used for ordering purposes.
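
As a rough sketch of what that might look like (not the project's actual code): the parser below pulls the first article:published_time value out of a page using only the standard library, and a per-site setting decides which date drives ordering. The names PublishedTimeParser, extract_published, ORDER_BY and ordering_date are all hypothetical, as is the shape of the config.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedTimeParser(HTMLParser):
    """Records the first article:published_time meta value seen in a page."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.published is not None:
            return
        attrs = dict(attrs)
        if attrs.get("property") == "article:published_time":
            self.published = attrs.get("content")


def extract_published(html):
    """Return the published date as a datetime, or None if absent or unparseable."""
    parser = PublishedTimeParser()
    parser.feed(html)
    if not parser.published:
        return None
    try:
        # Older Pythons (before 3.11) reject a trailing "Z", so normalise it
        return datetime.fromisoformat(parser.published.replace("Z", "+00:00"))
    except ValueError:
        return None


# Hypothetical per-site config: which date should be used for ordering
ORDER_BY = {"www.bentasker.co.uk": "published"}


def ordering_date(site, last_modified, published):
    """Prefer the published date only where the site is configured for it."""
    if ORDER_BY.get(site, "last-mod") == "published" and published is not None:
        return published
    return last_modified
```

Falling back to last-mod when no published date is found means pages without the metadata keep their current ordering behaviour.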




Activity


assigned to @btasker

The catch with this is that extracting a published date isn't quite as simple as it first sounds.

There's no one standard in semantic markup for publication dates. Nikola is using opengraph-esque markup, but some others use schema.org style itemprop="dateCreated" to denote a creation date.

If microdata has been used, there may be multiple creation dates within a page, because it's possible to nest entities:

<div itemscope itemtype="https://schema.org/Article">
   <h1 itemprop="name">Foo</h1>
   <time itemprop="datePublished" datetime="2024-01-01">Jan 1, 2024</time>

  <div class="comment" itemscope itemtype="https://schema.org/Comment">
    <time itemprop="datePublished" datetime="2024-01-05">Jan 5, 2024</time>
    <span itemprop="text">This is great</span>
  </div>
</div>

So, it's not simply a case of looking for dateCreated or datePublished: we'd need to track which scope we're in and only use the date that relates to the main content.
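
A minimal sketch of that scope tracking, assuming the main content is the first top-level itemscope in the page (which won't hold everywhere) and ignoring itemref entirely. TopLevelDateParser and top_level_date are hypothetical names; the parser keeps a stack of open elements and only accepts a date whose nearest enclosing scope is a top-level item.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so must not be pushed onto the stack
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelDateParser(HTMLParser):
    """Finds a date property belonging to a top-level microdata item."""

    def __init__(self, prop="datePublished"):
        super().__init__()
        self.prop = prop
        self.stack = []   # (tag, opens_itemscope) for each open element
        self.date = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Count how many itemscopes enclose this element
        depth = sum(1 for _, scoped in self.stack if scoped)
        props = (attrs.get("itemprop") or "").split()
        # depth == 1: the property belongs to a top-level item, not a nested one
        if depth == 1 and self.prop in props and self.date is None:
            self.date = attrs.get("datetime") or attrs.get("content")
        if tag not in VOID_TAGS:
            self.stack.append((tag, "itemscope" in attrs))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def top_level_date(html, prop="datePublished"):
    parser = TopLevelDateParser(prop)
    parser.feed(html)
    return parser.date
```

Run against the example above, this picks up the article's 2024-01-01 and skips the comment's 2024-01-05, even if the comment happens to appear first in the markup.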

Once you get past that, you also need to consider modification dates: If I wrote an article in January and then updated it yesterday, should it be ordered first or further down?

If it should be first, we'd need to be able to extract an update date from the markup (because the server header can't be trusted in the scenarios this feature caters to).

There's a range of possible vocabularies to choose from there too, though:

  • schema.org markup tends to have dateModified
  • Opengraph previously had og:updated_time
  • Opengraph now uses article:modified_time (assuming, of course, the page is of type article in the first place), though Nikola doesn't appear to add these when a post is updated
  • Nikola uses itemprop="dateUpdated" to mark the update date
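
One way to cope with that spread might be a fallback chain: collect whichever markers are present, then take the first match in a fixed order of preference. The sketch below does that for the four markers listed above. The preference order is an assumption rather than anything settled, extract_modified and friends are hypothetical names, and for brevity it ignores the scope nesting problem described earlier.

```python
from html.parser import HTMLParser

# (attribute name, attribute value) pairs, in an assumed order of preference
MODIFIED_MARKERS = [
    ("property", "article:modified_time"),
    ("property", "og:updated_time"),
    ("itemprop", "dateModified"),
    ("itemprop", "dateUpdated"),
]


class ModifiedDateParser(HTMLParser):
    """Collects the first value seen for each known modification marker."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <meta> carries the value in content, <time> in datetime
        value = attrs.get("content") or attrs.get("datetime")
        if not value:
            return
        for marker in MODIFIED_MARKERS:
            name, expected = marker
            if expected in (attrs.get(name) or "").split():
                # Keep only the first occurrence of each marker
                self.found.setdefault(marker, value)


def extract_modified(html):
    """Return the raw date string for the most preferred marker present, or None."""
    parser = ModifiedDateParser()
    parser.feed(html)
    for marker in MODIFIED_MARKERS:
        if marker in parser.found:
            return parser.found[marker]
    return None
```

The value is returned as a raw string, so it would still need parsing and validating before being used for ordering.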

All told it's a bit hard not to think of XKCD 927:

[Image: XKCD "Standards". Situation: there are 14 competing standards. "14?! Ridiculous! We need to develop one universal standard that covers everyone's use cases." Soon: Situation: there are 15 competing standards.]

I think this is probably something I'm going to want to do eventually, but it's more complex than it sounds and better not rushed into.