
utilities/file_location_listing#37: Markdown Specific Indexing (was Markdown rendering)



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.5
Created: 07-Jan-24 11:07



Description

Although we support indexing markdown files, we currently just treat them as text files with special rules applied.

This means that we can extract title, tags and outbound links from markdown files.

However, if a markdown file references images, we don't currently pick up on them (and must instead find each image, if possible, through directory navigation).

If the file were, instead, HTML then for each of those images we'd

  • Pick up on the image
  • Associate it with alt-text (if present; if not, we take the page title)
  • Note that it's embedded in a HTML page at [url]
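For reference, the HTML behaviour described in the list above can be sketched roughly like this. It uses the standard library's `html.parser` rather than BeautifulSoup (purely to keep the example dependency-free), and the names `ImageCollector` and `images_from_html` are hypothetical, not taken from the project.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect <img> tags and the page <title> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []  # list of (src, alt) tuples
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            attrs = dict(attrs)
            if attrs.get("src"):
                self.images.append((attrs["src"], attrs.get("alt")))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def images_from_html(html, page_url):
    """Return one record per image, falling back to the page title for alt."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    return [
        {"src": src, "alt": alt or title, "embedded_in": page_url}
        for src, alt in parser.images
    ]
```

Each record carries the image, its alt-text (or the page title when alt is missing or empty) and the URL of the page it was embedded in.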

Given that the majority of my notes are in md, it'd be nice if we could do the same with those.

We could just expand the ruleset to include supporting images, but at a certain point we may find we're just building a markdown parser - it probably makes as much sense to use an off-the-shelf parser to turn it into a blob of HTML so that we can treat it as if it were HTML.




Activity


assigned to @btasker

Of course, the issue with rendering is that it does potentially present a small security risk: we'll be passing arbitrary content into a parser - whilst the risk of that should be small, it is not 0.

But, to a certain extent, that same risk exists when we pass a HTML file into BeautifulSoup.

Although importing markdown to render is pretty straightforward, if we do this it'll break the markdown tagging support introduced in #10

A lot of my notes (probably the main thing I'm interested in locating) use the Obsidian tagging layout:

----
### Tags

#foo #bar #sed
----

If we process the file as if it's HTML we'll lose the ability to (reliably) extract those.
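To illustrate why the raw markdown matters here, this is a minimal sketch of line-based tag extraction for that layout. It is not the implementation from #10: the function name `extract_tags` and both patterns are assumptions made for the example.

```python
import re

# Assumed conventions: a heading whose text is "Tags", followed by
# whitespace-separated #tag tokens, ended by a horizontal rule or a heading.
TAG_HEADING = re.compile(r"^#{1,6}\s+Tags\s*$", re.IGNORECASE)
ANY_HEADING = re.compile(r"^#{1,6}\s")
TAG_TOKEN = re.compile(r"(?<!\S)#([A-Za-z0-9_/-]+)")


def extract_tags(markdown_text):
    """Return the tags listed under a 'Tags' heading (hypothetical helper)."""
    tags = []
    in_tag_section = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if not in_tag_section:
            in_tag_section = bool(TAG_HEADING.match(stripped))
            continue
        # A horizontal rule or a new heading closes the tag section
        if stripped.startswith("---") or ANY_HEADING.match(stripped):
            break
        tags.extend(TAG_TOKEN.findall(stripped))
    return tags
```

This relies on seeing the heading, the tag line and the closing rule exactly as written, which is the information that gets harder to recover once the file has been rendered.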

So, I think it might be better, after all, to look at teaching the markdown logic to be able to parse something like

![This is some alt text](http://example.com/foo.jpg)

so that the alt-text and/or the document title gets associated with the image.
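A rough sketch of that idea, using the image-only pattern quoted later in this issue. The function name `extract_images` and the returned dict layout are hypothetical; only the regex comes from the thread.

```python
import re

# Image-only pattern as quoted in this issue: optional alt-text, then target
IMAGE_RE = re.compile(r"!\[([^\]]+)?\]\(([^\)]+)\)")


def extract_images(markdown_text, doc_title):
    """Return images found in markdown, using the title when alt is empty."""
    images = []
    for match in IMAGE_RE.finditer(markdown_text):
        alt_text, target = match.group(1), match.group(2)
        images.append({"src": target.strip(), "alt": alt_text or doc_title})
    return images
```

For example, `extract_images("![This is some alt text](http://example.com/foo.jpg)", "Some doc")` yields a single record carrying the alt-text, while `![](foo.jpg)` would fall back to the document title.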

verified

mentioned in commit f1373e82a85b2dbb583f80e46b25a02fbb147ffb

Commit: f1373e82a85b2dbb583f80e46b25a02fbb147ffb
Author: B Tasker
Date: 2024-01-13T00:18:47.000+00:00

Message

feat: give extractMetaFromMarkdown() the ability to extract images and alt-tags (utilities/file_location_listing#37)

+25 -5 (30 lines changed)

The feature branch has the basic functionality in now: it can read a markdown file and extract links and image anchors.

I'll play around with it a bit before merging tomorrow

As we've got things cordoned off in a branch anyway, I sort of wonder whether it isn't worth taking the time to also get relative links (as opposed to images) working.

At the moment, if we had the following markdown

# Some doc

----
### Tags

#foo #bar #sed

----

![Intro image](../image.png)

This is a random document mentioning https://www.example.com

It also has a [link](../foo/bar.md) to another doc

We would extract

  • The title
  • The tags
  • The image (and its alt-text)
  • A link to https://www.example.com

Which, on reflection, is a little weird: we haven't actually linked to example.com at all (although some renderers might auto-link it), yet we've extracted that and not the actual link to bar.md.

Although normally (and certainly in my notes) we'd end up finding bar.md during directory traversal, that might not always be the case: if MD is served without directory indexing enabled, for example.

So, yeah, I think it's probably worth spending a little bit of time implementing support for that too.
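Resolving a relative target such as ../foo/bar.md needs the URL of the document it appeared in. A minimal sketch using `urllib.parse` follows; the helper names and the notes.example.com URL in the usage note are made up for illustration and are not the project's own code.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(target):
    """True when the target carries its own scheme (http://, https://, ...)."""
    return bool(urlparse(target).scheme)


def resolve_target(page_url, target):
    """Turn a markdown link or image target into an absolute URL."""
    if is_absolute(target):
        return target
    # urljoin resolves ../ segments relative to the page's directory
    return urljoin(page_url, target)
```

With a page at https://notes.example.com/docs/sub/page.md, the target ../foo/bar.md resolves to https://notes.example.com/docs/foo/bar.md, whilst an absolute target is returned unchanged.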

This turned out simpler to do than expected.

I changed the regex used for the image extraction to collect the first char, going from !\[([^\]]+)?\]\(([^\)]+)\) to (.)?\[([^\]]+)?\]\(([^\)]+)\). The code can then check the 1st group to see whether we're processing an image or a link.
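The approach can be sketched like this. The pattern is the one quoted above; the function name `extract_links_and_images` and the tuple layout are assumptions for the example, not the code that was committed.

```python
import re

# Group 1: the character before "[" (if any), group 2: text, group 3: target
LINK_OR_IMAGE_RE = re.compile(r"(.)?\[([^\]]+)?\]\(([^\)]+)\)")


def extract_links_and_images(markdown_text):
    """Split [text](target) constructs into images and ordinary links."""
    images, links = [], []
    for match in LINK_OR_IMAGE_RE.finditer(markdown_text):
        prefix, text, target = match.groups()
        if prefix == "!":
            # A leading "!" marks the construct as an image
            images.append((text, target))
        else:
            # Any other preceding character (or none at all) means a link
            links.append((text, target))
    return images, links
```

Run against the sample document from earlier in this issue, this picks up ../image.png as an image and ../foo/bar.md as a link. The bare https://www.example.com mention doesn't match this pattern at all, so it would need separate handling.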

Preparing to merge now.

changed title from Markdown Rendering to Markdown Specific Indexing (was Markdown rendering)

mentioned in merge request !3

mentioned in commit bf7d4f0f322c530e019c8e3d25da531af1d63482

Commit: bf7d4f0f322c530e019c8e3d25da531af1d63482
Author: Ben Tasker
Date: 2024-01-13T11:34:32.000+00:00

Message

feat: implement proper markdown processing

  • tidy: we're not actually consuming domain from queued items, remove it
  • Prevent queueing of any markdown source URLs/images that we wouldn't be allowed to crawl
  • Implement support for markdown links

    Previously, we extracted links by using a regex to find URLs (just as we do with plain text). This commit implements support for processing the [title](target) construct, including handling relative links

  • refactor to remove other examples of repeatedly calculating absoluteness

  • handle relative image paths
  • Extract normal links, ignoring anything that's already been extracted as an image
  • feat: give extractMetaFromMarkdown() the ability to extract images and alt-tags (utilities/file_location_listing#37)

+70 -31 (101 lines changed)

mentioned in commit ee21b4657c68a848e3d3bc09851b55161a0112e9

Commit: ee21b4657c68a848e3d3bc09851b55161a0112e9
Author: Ben Tasker
Date: 2024-01-13T11:34:32.000+00:00

Message

Merge branch 'markdown-processing' into 'main'

feat: implement proper markdown processing

Closes #37

See merge request utilities/file_location_listing!3

+70 -31 (101 lines changed)