Including meta keywords in the index will likely increase index size pretty dramatically, so the question is, does it bring any benefit that I'd take advantage of?
If I search for kapa I get 5 Kapacitor related hits back:
But, if I look at the Kapacitor tag on my site, there are more posts than that tagged with it.
So, I guess the question is: do we actually want those to be returned?
One option could be to maintain a second index for tags, with something along the lines of
tag1,key1,key2,key3,key4
tag2,key2,key3
This would allow tags to quickly + easily be searched without blowing up the main index.
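A minimal sketch of what building and reading that second index might look like, assuming the crawler can hand over (key, tags) pairs. The function names and the lowercasing of tags are assumptions for illustration, not anything taken from the codebase.

```python
from collections import defaultdict


def build_tag_index(entries):
    """Map each tag to the sorted list of index keys that carry it.

    `entries` is an iterable of (key, tags) pairs (hypothetical shape).
    """
    index = defaultdict(set)
    for key, tags in entries:
        for tag in tags:
            # Normalising case/whitespace is an assumption, not a requirement
            index[tag.strip().lower()].add(key)
    return {tag: sorted(keys) for tag, keys in index.items()}


def dump_tag_index(index):
    """Serialise as one `tag,key1,key2,...` line per tag."""
    return "\n".join(
        ",".join([tag, *keys]) for tag, keys in sorted(index.items())
    )


def load_tag_index(text):
    """Parse the line format back into a dict of tag -> list of keys."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, *keys = line.split(",")
        index[tag] = keys
    return index
```

One caveat with this layout: a tag containing a comma would break the line format, so it would need escaping or a different separator.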
Once we're extracting keywords, building that index would be relatively easy. The real challenge is merging use of that index into the main search.
I think the order of execution would be:
Iterate through the main index as usual
Iterate through the tags first, extracting keys for matching items and seeing if they exist in the result set
For keys that don't, look up their main index entry and see whether they meet constraints
That's potentially not very efficient though. It might be that an index entry fails (say) the domain constraint - we've no way to check that from the keyword index, so would have to scan the index again to find the entry, essentially looking up (and testing) that item twice.
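A rough sketch of that two-pass approach, to make the double look-up visible. It assumes the main index is a list of dicts with hypothetical `key`, `path` and `domain` fields, and that the only constraint is an optional domain filter.

```python
def search(main_index, tag_index, term, domain=None):
    """Two-pass search: main index first, then the tag index."""

    def meets_constraints(entry):
        return domain is None or entry["domain"] == domain

    # Step 1: iterate through the main index as usual
    results = {}
    for entry in main_index:
        if term in entry["path"] and meets_constraints(entry):
            results[entry["key"]] = entry

    # Step 2: iterate through the tags, keeping keys for matching tags
    # that are not already in the result set
    candidates = []
    for tag, keys in tag_index.items():
        if term in tag:
            candidates.extend(k for k in keys if k not in results)

    # Step 3: look each remaining key up in the main index and test it.
    # This rescans the index, so an entry that already failed a constraint
    # in step 1 gets found and tested a second time.
    for key in candidates:
        entry = next((e for e in main_index if e["key"] == key), None)
        if entry is not None and key not in results and meets_constraints(entry):
            results[key] = entry

    return results
```

The rescan in step 3 is the cost described above: the tag index only holds keys, so there is nothing to check constraints against until the entry has been found again.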
The other option might be to maintain a second index on disk, but effectively fold it into the first at index load time (so that keywords can be checked when iterating the index).
Simply copying keywords into each index entry isn't very memory efficient, so we might want to instead populate a main list of keywords, assigning each an ID. A list of IDs can then be stored in the index entry.
At search time, we'd precalculate a list of matching keyword IDs to pass into the search worker.
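As a sketch of the ID-based approach (all names here are hypothetical, and substring matching is just one possible choice): keywords are interned into a single table, each index entry stores only a set of IDs, and the matching IDs are worked out once before the index is scanned.

```python
class KeywordTable:
    """Main list of keywords, each assigned a small integer ID."""

    def __init__(self):
        self.ids = {}        # keyword -> ID
        self.keywords = []   # ID -> keyword

    def intern(self, keyword):
        if keyword not in self.ids:
            self.ids[keyword] = len(self.keywords)
            self.keywords.append(keyword)
        return self.ids[keyword]

    def matching_ids(self, term):
        """Precalculate the IDs of every keyword matching the search term."""
        return {i for i, kw in enumerate(self.keywords) if term in kw}


def fold_tags(main_index, tag_index, table):
    """Fold the on-disk tag index into the main index at load time."""
    for tag, keys in tag_index.items():
        tag_id = table.intern(tag)
        for key in keys:
            if key in main_index:
                main_index[key].setdefault("tag_ids", set()).add(tag_id)


def search_worker(main_index, term, wanted_ids):
    """Single pass over the index, checking path and keyword IDs together."""
    return [
        key
        for key, entry in main_index.items()
        if term in entry["path"]
        or not wanted_ids.isdisjoint(entry.get("tag_ids", ()))
    ]
```

Usage would be along the lines of `search_worker(index, "kapa", table.matching_ids("kapa"))`, so the per-entry check is a set intersection rather than a string comparison against every keyword.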
I think I'll spin out a branch now to give this a whirl
We have a working proof of concept, able to return the hand-edited entry when searching for vgbftdeee.
It needs some refinement, obviously, and we still need to have the crawler actually extract keywords, but it hasn't added as much complexity as I was concerned it might.
As a precaution, I've added an Off switch - tag index build (and consumption) can be disabled by setting Env var TAGS_ENABLED to anything except y or Y
Activity
29-Dec-23 11:33
assigned to @btasker
29-Dec-23 11:37
Including meta keywords in the index will likely increase index size pretty dramatically, so the question is, does it bring any benefit that I'd take advantage of?
If I search for kapa I get 5 Kapacitor related hits back:
But, if I look at the Kapacitor tag on my site, there are more posts than that tagged with it.
So, I guess the question is: do we actually want those to be returned?
29-Dec-23 11:40
The second thing we'd need to consider with this is what kind of matching we use.
If I've got an article with the following tags
kapacitor
unfungled
technical
balloon
We potentially run into an issue.
I probably want that kapacitor keyword to be picked up on if I've searched for kapa.
But, if I've searched for ball or fun, do I really want it to return unfungled or balloon?
Conversely though, I probably don't want to have to remember exactly which keyword was used - partial matches save me from that.
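To make the trade-off concrete, here's a small sketch comparing three candidate match types against those example tags. The mode names are made up for illustration and aren't meant to reflect whatever the search ends up supporting.

```python
def tag_matches(term, tag, mode="prefix"):
    """Return True if `term` matches `tag` under the given (hypothetical) mode."""
    term, tag = term.lower(), tag.lower()
    if mode == "exact":
        return term == tag
    if mode == "prefix":
        return tag.startswith(term)
    if mode == "substring":
        return term in tag
    raise ValueError(f"unknown mode: {mode}")


TAGS = ["kapacitor", "unfungled", "technical", "balloon"]


def matching(term, mode):
    return [t for t in TAGS if tag_matches(term, t, mode)]


print(matching("kapa", "prefix"))     # ['kapacitor']
print(matching("fun", "substring"))   # ['unfungled']
print(matching("fun", "prefix"))      # []
print(matching("ball", "prefix"))     # ['balloon']
print(matching("ball", "exact"))      # []
```

Prefix matching keeps kapa working and stops fun from returning unfungled, but ball still returns balloon, so it only addresses half of the concern.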
29-Dec-23 11:42
If we do implement this, it's probably worth also considering processing tags listed in Markdown files.
I use Obsidian and so quite often have a section that looks something like
If we're going to be indexing tags for HTML, it'd probably be prudent to see if we can pull those out as well.
29-Dec-23 11:42
I think this is probably worth messing around with in a branch to see what the impact on the index is.
31-Dec-23 16:44
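One option could be to maintain a second index for tags, with something along the lines of
tag1,key1,key2,key3,key4
tag2,key2,key3
This would allow tags to quickly + easily be searched without blowing up the main index.
Once we're extracting keywords, building that index would be relatively easy. The real challenge is merging use of that index into the main search.
I think the order of execution would be:
Iterate through the main index as usual
Iterate through the tags first, extracting keys for matching items and seeing if they exist in the result set
For keys that don't, look up their main index entry and see whether they meet constraints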
That's potentially not very efficient though. It might be that an index entry fails (say) the domain constraint - we've no way to check that from the keyword index, so would have to scan the index again to find the entry, essentially looking up (and testing) that item twice.
The other option might be to maintain a second index on disk, but effectively fold it into the first at index load time (so that keywords can be checked when iterating the index).
Simply copying keywords into each index entry isn't very memory efficient, so we might want to instead populate a main list of keywords, assigning each an ID. A list of IDs can then be stored in the index entry.
At search time, we'd precalculate a list of matching keyword IDs to pass into the search worker.
I think I'll spin out a branch now to give this a whirl
31-Dec-23 16:48
I've hand edited some keywords into store/fa/5b/fa5be46b3b1c481c260a687abb0b5924728048e45f664659be9e46d9759e8680
31-Dec-23 17:38
mentioned in commit f6d2bb6237ed3c96b327b300f6677a1513d44572
Message
feat: use tag index when searching (utilities/file_location_listing#10)
31-Dec-23 17:39
We have a working proof of concept, able to return the hand-edited entry when searching for vgbftdeee.
It needs some refinement, obviously, and we still need to have the crawler actually extract keywords, but it hasn't added as much complexity as I was concerned it might.
31-Dec-23 19:03
OK, in terms of extracting tags from HTML pages, there are two things we want to look at:
the meta keywords item
<meta property="article:tag"> items in the <head>
If one or both is present, they should be de-duped and stored.
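A minimal sketch of that extraction using only the standard library's html.parser. The class and function names are hypothetical, and lowercasing tags before de-duping is an assumption rather than something decided above.

```python
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collect meta keywords and article:tag values found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attrs = dict(attrs)
            content = attrs.get("content") or ""
            if (attrs.get("name") or "").lower() == "keywords":
                # meta keywords is a single comma-separated value
                self.tags.extend(content.split(","))
            elif attrs.get("property") == "article:tag":
                # opengraph uses one meta element per tag
                self.tags.append(content)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_tags(html):
    """Return de-duped tags in first-seen order."""
    parser = TagExtractor()
    parser.feed(html)
    seen = {}
    for raw in parser.tags:
        tag = raw.strip().lower()  # assumption: tags are case-insensitive
        if tag:
            seen.setdefault(tag, None)
    return list(seen)
```

Because both sources feed the same list before de-duping, a page that lists kapacitor in both the keywords item and an article:tag element only yields it once.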
31-Dec-23 19:17
mentioned in commit e2ffb7dfbde37a3e816f101283a45ecabd9f375a
Message
feat: add support for extracting meta keywords and opengraph tags from crawled pages (utilities/file_location_listing#10)
31-Dec-23 19:17
mentioned in commit 66ae847db679486fb2600fdb7e2521d90cc3a10e
Message
feat: add support for matchtype:tag (utilities/file_location_listing#10)
31-Dec-23 20:08
mentioned in commit defeb79d3e0ea0edcc603f8ce9c697e5e9339198
Message
feat: teach crawler to extract tags from markdown files (utilities/file_location_listing#10)
31-Dec-23 20:09
mentioned in merge request !1
31-Dec-23 20:10
Changes have been merged in.
01-Jan-24 10:47
Re-opening for follow up.
I ran the crawler against a real dataset overnight - tag extraction works but is somewhat overeager.
We've indexed tags like "if" because of blocks like this in some notes
There isn't really a good way to avoid picking up on things like that.
I think, realistically, we're going to have to implement an arbitrary ruleset:
Tags are only extracted if the word "tags" has been observed in a prior line
Once "tags" has been observed, extraction attempts will stop at the first horizontal rule (----)
It's not exactly a generalised ruleset, but it'll work with my Obsidian notes because most of my templates contain something like
01-Jan-24 10:58
mentioned in commit 0b9fb275b47c6394b93cce2077dba61b950e2d75
Message
fix: only extract obsidian style tags if the word "tags" has been observed (utilities/file_location_listing#10)
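The ruleset behind that fix (only extract once the word "tags" has been seen on an earlier line, then stop at the first horizontal rule) could be sketched roughly as below. The regexes and function name are assumptions: the tag pattern is a simplified guess at Obsidian-style #tags, and any run of three or more hyphens is treated as a horizontal rule.

```python
import re

# Simplified Obsidian-style tag: '#' followed by a letter, not preceded by
# a word character or another '#' (so markdown headings are skipped)
TAG_RE = re.compile(r"(?<![\w#])#([A-Za-z][\w/-]*)")
RULE_RE = re.compile(r"-{3,}")


def extract_markdown_tags(text):
    """Extract #tags that appear after a line mentioning "tags",
    stopping at the first horizontal rule."""
    tags = []
    seen_marker = False
    for line in text.splitlines():
        if not seen_marker:
            # Nothing is extracted until "tags" has been observed;
            # the marker line itself is not scanned for tags
            seen_marker = "tags" in line.lower()
            continue
        if RULE_RE.fullmatch(line.strip()):
            break
        for tag in TAG_RE.findall(line):
            if tag.lower() not in tags:
                tags.append(tag.lower())
    return tags


# Invented note: the '#if DEBUG' line is a made-up example of text that a
# naive '#word' match would pick up
note = "# Notes\n\n#if DEBUG\n\n### Tags\n\n#kapacitor #alerting\n\n----\n\n#ignored\n"
print(extract_markdown_tags(note))  # ['kapacitor', 'alerting']
```

A note with no "tags" line at all yields nothing, which is the trade-off of such an arbitrary rule: it suppresses false positives at the cost of missing tags in notes that don't follow the template.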
01-Jan-24 11:46
That issue is now fixed.
After re-crawling, the tag index looks much more along the lines of what I was expecting.
Tested:
Partial matching works (kapa will match kapacitor)
modetype:or and modetype:exact work correctly
The tags index is a fairly sensible size:
The index currently primarily consists of tags pulled from markdown though - need to trigger a wider crawl so we pick up some HTML tags.
01-Jan-24 12:16
I've just crawled www.bentasker.co.uk (which uses opengraph tags) - the index now contains 693 tags.
01-Jan-24 12:24
mentioned in commit 472be7245e013098a0b66bd7f33a319ab8d63de8
Message
feat: add env var to turn tag indexes off (utilities/file_location_listing#10)
01-Jan-24 12:25
As a precaution, I've added an Off switch - tag index build (and consumption) can be disabled by setting Env var TAGS_ENABLED to anything except y or Y
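A check along those lines might look like the sketch below. The helper name is hypothetical, and treating an unset variable as enabled is an assumption (it isn't stated above what happens when TAGS_ENABLED is absent).

```python
import os


def tags_enabled(environ=os.environ):
    """Tag indexes stay on only if TAGS_ENABLED is y or Y.

    Assumption: an unset variable means enabled.
    """
    return environ.get("TAGS_ENABLED", "y") in ("y", "Y")
```

With this, something like `TAGS_ENABLED=n` in the environment would skip both building and consuming the tag index, while `TAGS_ENABLED=yes` would also disable it, since only the exact values y and Y count.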
13-Jan-24 00:01
mentioned in issue #37