#37 Markdown Specific Indexing (was Markdown rendering) : utilities/file_location

btasker Permalink
07-Jan-24 11:07

assigned to @btasker

btasker Permalink
07-Jan-24 11:09

Of course, the issue with rendering is that it potentially presents a small security risk: we'll be passing arbitrary content into a parser. Whilst that risk should be small, it is not zero.

But, to a certain extent, that same risk exists when we pass an HTML file into BeautifulSoup.

btasker Permalink
13-Jan-24 00:01

Although importing markdown to render is pretty straightforward, if we do this it'll break the markdown tagging support introduced in #10

A lot of my notes (probably the main thing I'm interested in locating) use the Obsidian tagging layout:

----
### Tags

#foo #bar #sed
----

If we process the file as if it's HTML we'll lose the ability to (reliably) extract those.
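For context, a minimal sketch of what pulling those tags out of the raw markdown might look like. This is not the code from #10: the function name `extract_obsidian_tags` and both regexes are assumptions for illustration, and it only looks at lines that follow a `### Tags` heading.

```python
import re

# Hypothetical pattern: a '#' at the start of a word, followed by tag characters
TAG_RE = re.compile(r"(?<!\S)#([A-Za-z0-9_/-]+)")
# A markdown heading is one to six '#' characters followed by whitespace
HEADING_RE = re.compile(r"^#{1,6}\s")


def extract_obsidian_tags(markdown_text):
    """Return the tags listed under a '### Tags' heading in raw markdown."""
    tags = []
    in_tags = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.lower() == "### tags":
            in_tags = True
            continue
        if not in_tags:
            continue
        # A horizontal rule or another heading ends the tag block
        if stripped.startswith("---") or HEADING_RE.match(stripped):
            in_tags = False
            continue
        tags.extend(TAG_RE.findall(stripped))
    return tags
```

Because this works on the raw text, it relies on the `### Tags` heading and the `----` rules still being present, which is exactly what would be lost after rendering.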

So, I think it might be better, after all, to look at teaching the markdown logic to be able to parse something like

![This is some alt text](http://example.com/foo.jpg)

so that the alt-text and/or the document title gets associated with the image.

btasker Permalink
13-Jan-24 00:40

verified

mentioned in commit f1373e82a85b2dbb583f80e46b25a02fbb147ffb

Commit: f1373e82a85b2dbb583f80e46b25a02fbb147ffb
Author: B Tasker
Date: 2024-01-13T00:18:47.000+00:00

Message

feat: give extractMetaFromMarkdown() the ability to extract images and alt-tags (utilities/file_location_listing#37)

+25 -5 (30 lines changed)

btasker Permalink
13-Jan-24 00:41

The feature branch now has the basic functionality: it can read a markdown file and extract links and image anchors.

I'll play around with it a bit before merging tomorrow

btasker Permalink
13-Jan-24 11:09

As we've got things cordoned off in a branch anyway, I sort of wonder whether it isn't worth taking the time to also get relative links (as opposed to images) working.

At the moment, if we had the following markdown

# Some doc

----
### Tags

#foo #bar #sed

----

![Intro image](../image.png)

This is a random document mentioning https://www.example.com

It also has a [link](../foo/bar.md) to another doc

We would extract

The title
The tags
The image (and its alt-text)
A link to https://www.example.com

Which, on reflection, is actually a little weird: we haven't actually linked to example.com at all (although some renderers might auto-link it), but we've extracted that and not an actual link.

Although normally (and certainly in my notes) we'd end up finding bar.md during directory traversal, that might not always be the case: for example, if the markdown is served without directory indexing enabled.

So, yeah, I think it's probably worth spending a little bit of time implementing support for that too.
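As a rough sketch of what handling a relative target involves: the link needs resolving against the URL of the markdown file it was found in. The helper name `resolve_markdown_link` and the example URLs below are assumptions for illustration, not the project's actual code; `urljoin` and `urlparse` are from the Python standard library.

```python
from urllib.parse import urljoin, urlparse


def resolve_markdown_link(page_url, target):
    """Resolve a link target found in a markdown page against that page's URL."""
    # Targets that already carry a scheme (https:, mailto:, ...) are left alone
    if urlparse(target).scheme:
        return target
    # Relative targets are resolved against the page they were found in
    return urljoin(page_url, target)
```

With a page at `https://notes.example.com/docs/sub/page.md`, the target `../foo/bar.md` would resolve to `https://notes.example.com/docs/foo/bar.md`, which can then be queued like any other URL.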

btasker Permalink
13-Jan-24 11:31

This turned out simpler to do than expected.

I changed the regex used for image extraction to also capture the character preceding the opening bracket, going from !\[([^\]]+)?\]\(([^\)]+)\) to (.)?\[([^\]]+)?\]\(([^\)]+)\). The code can then check the first group to see whether we're processing an image (the character is !) or a link.
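A minimal sketch of how that check on the first group can work, using the regex quoted above. The function name `extract_links_and_images` and the returned structure are assumptions for illustration rather than the code that was merged.

```python
import re

# The updated regex: group 1 is the (optional) character before '[',
# group 2 is the link text / alt-text, group 3 is the target
LINK_OR_IMAGE_RE = re.compile(r"(.)?\[([^\]]+)?\]\(([^\)]+)\)")


def extract_links_and_images(markdown_text):
    """Split [text](target) constructs into images and ordinary links."""
    images = []
    links = []
    for match in LINK_OR_IMAGE_RE.finditer(markdown_text):
        prefix, text, target = match.groups()
        entry = {"text": text or "", "target": target}
        # A leading '!' marks an image; anything else (or nothing) is a link
        if prefix == "!":
            images.append(entry)
        else:
            links.append(entry)
    return images, links
```

The first group is optional so that a link at the very start of a line, where there is no preceding character to capture, still matches.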

Preparing to merge now.

btasker Permalink
13-Jan-24 11:31

changed title from "Markdown Rendering" to "Markdown Specific Indexing (was Markdown rendering)"

btasker Permalink
13-Jan-24 11:32

mentioned in merge request !3

btasker Permalink
13-Jan-24 11:34

mentioned in commit bf7d4f0f322c530e019c8e3d25da531af1d63482

Commit: bf7d4f0f322c530e019c8e3d25da531af1d63482
Author: Ben Tasker
Date: 2024-01-13T11:34:32.000+00:00

Message

feat: implement proper markdown processing

tidy: we're not actually consuming domain from queued items, remove it
Prevent queueing of any markdown source URLs/images that we wouldn't be allowed to crawl
Implement support for markdown links

Previously, we extracted links by using a regex to find URLs (just as we do with plain text). This commit implements support for processing the [title](target) construct, including handling relative links
refactor to remove other examples of repeatedly calculating absoluteness
handle relative image paths
Extract normal links, ignoring anything that's already been extracted as an image
feat: give extractMetaFromMarkdown() the ability to extract images and alt-tags (utilities/file_location_listing#37)

+70 -31 (101 lines changed)

btasker Permalink
13-Jan-24 11:34

mentioned in commit ee21b4657c68a848e3d3bc09851b55161a0112e9

Commit: ee21b4657c68a848e3d3bc09851b55161a0112e9
Author: Ben Tasker
Date: 2024-01-13T11:34:32.000+00:00

Message

Merge branch 'markdown-processing' into 'main'

feat: implement proper markdown processing

Closes #37

See merge request utilities/file_location_listing!3

+70 -31 (101 lines changed)
