
utilities/file_location_listing#65: Add support for YAML frontmatter to markdown parsing



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.9
Created: 22-Aug-24 09:07



Description

The spider supports indexing markdown files (introduced in utilities/file_location_listing#5)

However, it relies on the file title being the first line, i.e.

# My file title

blah blah

I currently use Obsidian to manage notes and I'd quite like to be able to add some YAML frontmatter to my templates (because I can then query it with the dataview plugin).

---
category: foo
somevar: somevalue
---
# My file title

blah blah

However, if I were to do this, the indexer would stop collecting titles from these files, because it expects the title to be on the first line and checks that it looks like a heading:

    # The title should be on the first line
    if lines[0].startswith('#') or lines[1].startswith("---") or lines[1].startswith("==="):
        title = lines[0].strip("#").strip()

I'd like to update the logic so that frontmatter can be handled (it'd be an added bonus if we could collect it into storage, but the most pressing need is to make sure it doesn't break anything)
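
A minimal sketch of one way to do that, assuming the file has already been split into lines. The `split_frontmatter` helper is hypothetical (it isn't the project's code): it pops a leading `---` delimited block so that title detection sees the heading as the first line again.

```python
def split_frontmatter(lines):
    """Separate a leading YAML frontmatter block from the rest of the file.

    Returns (frontmatter_lines, remaining_lines). If the file doesn't
    open with a '---' line, or the block is never closed, the
    frontmatter list is empty and the lines are returned untouched.
    """
    if not lines or lines[0].strip() != "---":
        return [], lines

    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            return lines[1:idx], lines[idx + 1:]

    # No closing delimiter: treat the whole file as ordinary markdown
    return [], lines


doc = """---
category: foo
somevar: somevalue
---
# My file title

blah blah""".splitlines()

frontmatter, body = split_frontmatter(doc)

# Title detection can then run against the remaining lines
title = None
if body and body[0].startswith("#"):
    title = body[0].strip("#").strip()

print(frontmatter)
print(title)
```

Files without frontmatter pass through unchanged, so the existing first-line check keeps working for them.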




Activity


assigned to @btasker

verified

mentioned in commit 4d6f1d3186afda7530cd1a9dc0c3835c570c6ce1

Commit: 4d6f1d3186afda7530cd1a9dc0c3835c570c6ce1 
Author: B Tasker                            
                            
Date: 2024-08-22T14:06:40.000+01:00 

Message

feat: collect YAML frontmatter if it exists (utilities/file_location_listing#65)

+22 -1 (23 lines changed)
verified

mentioned in commit 36e8227c4d91f6acedda5f9f5be3e8a376609871

Commit: 36e8227c4d91f6acedda5f9f5be3e8a376609871 
Author: B Tasker                            
                            
Date: 2024-08-22T14:08:43.000+01:00 

Message

feat: Markdown parser returns frontmatter info (utilities/file_location_listing#65)

Note: nothing currently happens with this, there isn't really an equivalent for other filetypes so we don't currently have anywhere to store it.

Needs a bit of thought to decide

  • Do we add it to storage in its raw form (perhaps creating a generic metadata attrib)?
  • Do we extract well-known values (such as Category) and inject tags?
+2 -2 (4 lines changed)
verified

mentioned in commit ede29961f4ea416e611a1ee99127cad2bceb4d1d

Commit: ede29961f4ea416e611a1ee99127cad2bceb4d1d 
Author: B Tasker                            
                            
Date: 2024-08-22T14:53:25.000+01:00 

Message

feat: add generic metadata attribute to storage files and write frontmatter in under it (utilities/file_location_listing#65)

+4 -1 (5 lines changed)

The commit above introduces a new metadata attribute to storage files.

The idea is that this is an extensible section which can have arbitrary attributes added under it - in this case, we've added one called frontmatter.

Although it's not currently implemented (and won't be under the heading of this issue), the plan is that we'll add a new index key which indicates what metadata attributes a stored item has.

In this case that might look something like

METADATA-ATTRIBS: ['frontmatter']

That'll allow indexes to be used to speed up searches which match against something that only exists in metadata. Not that there's currently syntax to support it, but if we're doing a search for documents whose frontmatter includes the key "foo" with value "bar" we'd be able to quickly narrow the search set down to only include documents that actually have frontmatter.
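
As a rough illustration of that two-stage lookup (nothing here exists in the codebase yet): the `index` and `storage` dicts below are hypothetical in-memory stand-ins, and `search_frontmatter` is an invented name.

```python
# Hypothetical stand-in for the index: one entry per stored item
index = {
    "notes/a.md": {"METADATA-ATTRIBS": ["frontmatter"]},
    "notes/b.md": {"METADATA-ATTRIBS": []},
    "notes/c.md": {"METADATA-ATTRIBS": ["frontmatter"]},
}

# Hypothetical stand-in for the (more expensive to read) storage files
storage = {
    "notes/a.md": {"metadata": {"frontmatter": {"foo": "bar"}}},
    "notes/b.md": {"metadata": {}},
    "notes/c.md": {"metadata": {"frontmatter": {"foo": "sed"}}},
}


def search_frontmatter(key, value):
    """Return paths whose frontmatter has the given key/value pair."""
    # Stage 1: use the index to discard items with no frontmatter at all
    candidates = [
        path for path, entry in index.items()
        if "frontmatter" in entry["METADATA-ATTRIBS"]
    ]

    # Stage 2: only the surviving candidates need their storage read
    return [
        path for path in candidates
        if storage[path]["metadata"]["frontmatter"].get(key) == value
    ]


print(search_frontmatter("foo", "bar"))
```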

It might also be that we want to make the storage headers extensible, perhaps including something like

X-METADATA-ATTRIB-frontmatter: ['foo:bar']

But I think that path probably leads to pain once you start thinking about how to incorporate that into indexes

For now, we keep this simple - the changes should mean that indexing files doesn't break if they've got front-matter.

That's the main change that I need.

Once that's tested and definitely working, we can look at collapsing key/value pairs into tag values to make them searchable, but the main thing is I want to be able to release soon (this weekend ideally) so that I can start adding front-matter to my notes.

Edit: actually, it looks like that injection is a 1 line change, so I'll do it now.

verified

mentioned in commit b9fece2d7fe7f91fd6db2e9607bc94d735553494

Commit: b9fece2d7fe7f91fd6db2e9607bc94d735553494 
Author: B Tasker                            
                            
Date: 2024-08-23T08:28:35.000+01:00 

Message

feat: use frontmatter to inject tags (utilities/file_location_listing#65)

+4 -1 (5 lines changed)

OK, test crawl running now

There's a test file in there with the following content:

---
Category: foo
CatName: foo Bar Sed
age: 40
---
# Test file

This is my test file

With a couple of typos fixed, it looks to have worked:

screenshot of test file in search result

Additional things checked

  • searches/indexing still working correctly for pages without front-matter
  • search parsing doesn't treat those tags as a dork, so I can search for them

There is one small problem though - the current dork processing looks like this

    dorks = [
        'content-type', 
        'domain', 
        'ext', 
        'hastitle', 
        'matchtype', 
        'mode',
        'prefix'
        ]

...

        # It's a search-fu
        t_sp = t.split(":")
        if t_sp[0] in dorks:
            filters["fu"][t_sp[0]] = t_sp[1].strip()

In the example above, the values are Category:foo, CatName:foo_Bar_Sed and age:40. None of those prefixes exist in dorks, so it's treated as a normal search.

However, that may not always be true. If our markdown doc looked like this

---
Category: foo
domain: www.bentasker.co.uk
---
# Some Doc

We'd run into trouble. It wouldn't be possible to search for the tag value domain:www.bentasker.co.uk because it would get interpreted as a dork and only results from www.bentasker.co.uk would be displayed (so if this file were on a different domain, it wouldn't appear in results).

The result would potentially be quite confusing: At best, you'd get no results back despite knowing that there was something there, at worst you'd get incomplete results back (meaning you may not notice it was inaccurate).

It might be better to collapse frontmatter into something more akin to scoped tags - i.e. Category::foo rather than Category:foo. It'd be trivial to have the dork logic skip anything with a double colon
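
A sketch of what that skip could look like, reusing the dork list from above. The `parse_terms` function and the `terms` key are hypothetical names for illustration; only `filters["fu"]` comes from the existing snippet.

```python
DORKS = [
    'content-type',
    'domain',
    'ext',
    'hastitle',
    'matchtype',
    'mode',
    'prefix',
]


def parse_terms(search):
    """Split a search string into dork filters and ordinary terms."""
    filters = {"fu": {}, "terms": []}

    for t in search.split():
        # A double colon marks a scoped tag, so never treat it as a dork
        if "::" not in t and ":" in t:
            name, value = t.split(":", 1)
            if name in DORKS:
                filters["fu"][name] = value.strip()
                continue

        # Everything else (including scoped tags) is a normal search term
        filters["terms"].append(t)

    return filters


print(parse_terms("domain::foo ext:md hello"))
```

With this, `domain::foo` stays a plain search term while `ext:md` is still picked up as a dork.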

The other thing that we need to consider, though, is that the current implementation doesn't actually parse the YAML.

So there are all sorts of YAML supported things which aren't currently handled, for example:

---
foo: ['bar','sed']
a:
 - b
 - c
 - d

It might instead be prudent to pass it into Python's YAML parser, then we can infer type when injecting tags (injecting multiple for relevant lists).

We should also add special support for tag and tags as Obsidian explicitly includes special handling for those.

verified

mentioned in commit 91e615305b6bcef257ea5a56ef5359aaad5c451e

Commit: 91e615305b6bcef257ea5a56ef5359aaad5c451e 
Author: B Tasker                            
                            
Date: 2024-08-24T10:34:49.000+01:00 

Message

feat: attempt to parse YAML frontmatter and handle entries with multiple values (utilities/file_location_listing#65)

+37 -8 (45 lines changed)

The test file now has the following contents

---
Category: foo
CatName: foo Bar Sed
age: 40
tagset:
  - a
  - b
  - c
taglist: ['foo', 'bar', 'sed']


---
# Test file

This is my test file

As of the commit above, the stored tags for this page are

TAGS:["Category::foo", "CatName::foo_Bar_Sed", "tagset::a", "tagset::b", "tagset::c", "taglist::foo", "taglist::bar", "taglist::sed"]

Within the JSON payload, the metadata attribute has the following:

{"frontmatter": {"Category": "foo", "CatName": "foo Bar Sed", "age": 40, "tagset": ["a", "b", "c"], "taglist": ["foo", "bar", "sed"]}}

Which is pretty much what we wanted.
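
For illustration, the mapping from parsed frontmatter to tags can be sketched like this. It's an assumption-laden reconstruction rather than the committed code: `frontmatter_to_tags` is a hypothetical name, and the rules (string values only, spaces replaced with underscores, one tag per list entry) are inferred from the stored output above.

```python
def frontmatter_to_tags(frontmatter):
    """Collapse parsed frontmatter into scoped tags (key::value)."""
    tags = []
    for key, value in frontmatter.items():
        # Treat a single value as a one-item list so both shapes share a path
        values = value if isinstance(value, list) else [value]
        for v in values:
            # Assumption: only strings become tags at this stage, so
            # numbers and nested objects are skipped
            if not isinstance(v, str):
                continue
            tags.append(f"{key}::{v.replace(' ', '_')}")
    return tags


frontmatter = {
    "Category": "foo",
    "CatName": "foo Bar Sed",
    "age": 40,
    "tagset": ["a", "b", "c"],
    "taglist": ["foo", "bar", "sed"],
}

print(frontmatter_to_tags(frontmatter))
```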

We don't currently support nested objects in the YAML though:

foo:
  bar: 1
  sed: 2
  zoo: 3

Although it'll still appear under metadata, it won't result in any injected tags.

We should also add special support for tag and tags as Obsidian explicitly includes special handling for those.

I'm intending to do this once I've special cased :: in search term parsing.

verified

mentioned in commit 3301b312f871a046ad60be4cc2d6d92fb203a5e7

Commit: 3301b312f871a046ad60be4cc2d6d92fb203a5e7 
Author: B Tasker                            
                            
Date: 2024-08-24T10:43:36.000+01:00 

Message

feat: special-case a double colon so it can't conflict with dorks (utilities/file_location_listing#65)

This means that a markdown document can include domain: foo in its YAML frontmatter without the resulting tagsearch being interpreted as a dork

+28 -27 (55 lines changed)

OK, on to special casing of YAML items.

Obsidian's doc lists a set of defaults:

Default YAML properties in Obsidian

It also notes that it used to support singular versions of these (tag, alias and cssclass) but that these were deprecated and should not be used.

Realistically, deprecated or not, we should probably special case them to ensure historic docs are still supported.

Although not listed in the defaults table, the doc also references title - it'd make sense to support setting the page title from this.

Down the line, it'd be nice if we could index aliases too, but that's not for today - for now, we just won't special case them (so they'll be indexed as tags)

So, there are three changes that need to be made as a result of this

  • if the attribute name is tag or tags, don't prefix the tag value
  • if the title attribute exists, set the page title from that
  • if the attribute is cssclasses or cssclass, don't inject a tag
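
Those three rules could be sketched roughly as follows. The function and constant names are hypothetical, and skipping the tag for title is an extra assumption on top of the list above.

```python
# Properties whose values are injected as-is, without a "name::" prefix
NO_PREFIX = {"tag", "tags"}

# Properties which never result in a tag (title here is an assumption)
NO_TAG = {"cssclass", "cssclasses", "title"}


def apply_frontmatter(frontmatter, default_title):
    """Return (title, tags) after applying the special cases."""
    title = default_title
    if isinstance(frontmatter.get("title"), str):
        title = frontmatter["title"]

    tags = []
    for key, value in frontmatter.items():
        if key in NO_TAG:
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            if not isinstance(v, str):
                continue
            v = v.replace(" ", "_")
            tags.append(v if key in NO_PREFIX else f"{key}::{v}")

    return title, tags


print(apply_frontmatter(
    {
        "title": "Real title",
        "tags": ["a", "b"],
        "tag": ["foo"],
        "cssclass": "foobar",
        "Category": "foo",
    },
    "Test file",
))
```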
verified

mentioned in commit 32d658aedf320effce88255043160d428193358a

Commit: 32d658aedf320effce88255043160d428193358a 
Author: B Tasker                            
                            
Date: 2024-08-24T10:57:38.000+01:00 

Message

feat: special case specific YAML property names (utilities/file_location_listing#65)

+15 -3 (18 lines changed)
verified

mentioned in commit 94ca7298c3c5eb9def1c669cf028565545310ae2

Commit: 94ca7298c3c5eb9def1c669cf028565545310ae2 
Author: B Tasker                            
                            
Date: 2024-08-24T11:00:20.000+01:00 

Message

feat: special case the title attribute (utilities/file_location_listing#65)

If a title property exists within YAML front matter we'll use its value as the page title. We also won't inject a tag with prefix title::

+7 -3 (10 lines changed)
verified

mentioned in commit 7ec30172f8351f4413b27716b9a9188be711a6da

Commit: 7ec30172f8351f4413b27716b9a9188be711a6da 
Author: B Tasker                            
                            
Date: 2024-08-24T11:05:26.000+01:00 

Message

fix: convert dates in frontmatter to string (utilities/file_location_listing#65)

This ensures that we'll be able to serialise to JSON for storage later

+4 -1 (5 lines changed)
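
For context: a YAML parser will typically turn an unquoted ISO timestamp into a datetime object, which `json.dumps` refuses to serialise. A minimal sketch of one way around that, using the standard library's `default` hook (the `json_default` name is hypothetical, and this is not necessarily how the commit does it):

```python
import json
from datetime import date, datetime, timezone


def json_default(obj):
    """Fallback for values json.dumps can't serialise natively."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f"Cannot serialise {type(obj).__name__}")


metadata = {
    "frontmatter": {
        "date": datetime(2024, 7, 16, tzinfo=timezone.utc),
        "age": 40,
    }
}

serialised = json.dumps(metadata, default=json_default)
print(serialised)
```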
verified

mentioned in commit 74d748bd7619abaf2f792ca6bf8dfd3da9394bc6

Commit: 74d748bd7619abaf2f792ca6bf8dfd3da9394bc6 
Author: B Tasker                            
                            
Date: 2024-08-24T11:20:18.000+01:00 

Message

feat: implement support for float, int, bool etc (utilities/file_location_listing#65)

Note: this also moves the previous datetime serialisation fix to storage

+17 -9 (26 lines changed)
verified

mentioned in commit 46b62754f5e2d9a20f88c5221e70623530636367

Commit: 46b62754f5e2d9a20f88c5221e70623530636367 
Author: B Tasker                            
                            
Date: 2024-08-24T11:30:09.000+01:00 

Message

chore: make YAML frontmatter settings configurable (utilities/file_location_listing#65)

+16 -3 (19 lines changed)
verified

mentioned in commit 715ce0ba941bfae0e151f9bfb631f0a6f5071373

Commit: 715ce0ba941bfae0e151f9bfb631f0a6f5071373 
Author: B Tasker                            
                            
Date: 2024-08-24T11:32:28.000+01:00 

Message

feat: allow frontmatter parsing to be disabled (utilities/file_location_listing#65)

Setting env var YML_LOAD_FRONTMATTER to a value other than true will prevent parsing of front matter.

Note: it'll still be popped out to ensure that markdown title detection works

+2 -1 (3 lines changed)

I think we might now be done here.

The following test file works just fine

---
title: Real title
Category: foo
CatName: foo Bar Sed
age: 40
tags:
  - a
  - b
  - c
tag: ['foo', 'bar', 'sed']
cssclass: foobar
cssclasses: ['class1', 'class2']
date: 2024-07-16T00:00:00Z
float: 1.3
mixeds:
  - 2024-07-16T00:00:00Z
  - a
  - 1.5
  - 1
  - true

---
# Test file

This is my test file

The new functionality is controlled by environment variables:

  • YML_LOAD_FRONTMATTER: should frontmatter be parsed (default: True)
  • YML_NO_PREFIX: which YAML properties shouldn't be prefixed when injecting tags (default: tag,tags)
  • YML_NO_TAG: which YAML properties shouldn't result in a tag being injected (default: cssclass,cssclasses,title,date)
  • YML_LOAD_TITLE: should a YAML title property be used to set the page title?
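
A sketch of how settings like these might be read, for anyone wiring up something similar. The `load_yaml_settings` helper is hypothetical; the `true` default for YML_LOAD_TITLE and the case-insensitive comparison are both assumptions, since neither is stated above.

```python
import os


def load_yaml_settings(env=os.environ):
    """Read the frontmatter settings from a mapping of environment variables."""

    def as_bool(name, default):
        # Anything other than "true" (case-insensitive, an assumption) is False
        return env.get(name, default).strip().lower() == "true"

    def as_list(name, default):
        # Comma-separated list, ignoring stray whitespace and empty entries
        return [p.strip() for p in env.get(name, default).split(",") if p.strip()]

    return {
        "load_frontmatter": as_bool("YML_LOAD_FRONTMATTER", "true"),
        "no_prefix": as_list("YML_NO_PREFIX", "tag,tags"),
        "no_tag": as_list("YML_NO_TAG", "cssclass,cssclasses,title,date"),
        # ASSUMPTION: the default for YML_LOAD_TITLE isn't documented above
        "load_title": as_bool("YML_LOAD_TITLE", "true"),
    }


print(load_yaml_settings({"YML_LOAD_FRONTMATTER": "false"}))
```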