project Utilities / File Location Listing

utilities/file_location_listing#10: Meta Keywords Support



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.1
Created: 29-Dec-23 11:33



Description

A huge majority of the files being indexed are likely to be HTML pages.

It might be worth looking at extracting and indexing meta-keywords so they can be considered for search results.




Activity


assigned to @btasker

Including meta keywords in the index will likely increase index size pretty dramatically, so the question is, does it bring any benefit that I'd take advantage of?

If I search for kapa I get 5 Kapacitor-related hits back:

Screenshot_20231229_113538

But, if I look at the Kapacitor tag on my site, there are more posts than that tagged with it:

Screenshot_20231229_113607

So, I guess the question is: do we actually want those to be returned?

The second thing we'd need to consider with this is what kind of matching we use.

If I've got an article with the following tags:

  • kapacitor
  • unfungled
  • technical
  • balloon

We potentially run into an issue.

I probably want that kapacitor keyword to be picked up on if I've searched for kapa.

But, if I've searched for ball or fun, do I really want it to return unfungled or balloon?

Conversely though, I probably don't want to have to remember exactly which keyword was used - partial matches save me from that.
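To make that trade-off concrete, here's a minimal sketch (not the project's actual matcher) of naive substring matching over a tag list - it gives the desired kapa behaviour but also the questionable ball / fun behaviour:

```python
# Illustrative sketch only: simple substring matching shows why partial
# matches are a double-edged sword. Function name is hypothetical.
tags = ["kapacitor", "unfungled", "technical", "balloon"]

def matching_tags(term: str, tags: list[str]) -> list[str]:
    """Return every tag that contains the search term as a substring."""
    term = term.lower()
    return [t for t in tags if term in t.lower()]

print(matching_tags("kapa", tags))  # ['kapacitor'] - desired
print(matching_tags("ball", tags))  # ['balloon']   - maybe not desired
print(matching_tags("fun", tags))   # ['unfungled'] - maybe not desired
```

Prefix matching (`t.startswith(term)`) would keep kapa → kapacitor but still pick up ball → balloon, so neither mode alone resolves the question.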

If we do implement this, it's probably worth also considering processing tags listed in Markdown files.

I use Obsidian and so quite often have a section that looks something like:

    ----
    ### Tags

    #foo #bar #sed
    ----

If we're going to be indexing tags for HTML, it'd probably be prudent to see if we can pull those out as well.

I think this is probably worth messing around with in a branch to see what the impact on the index is.

One option could be to maintain a second index for tags, with something along the lines of:

tag1,key1,key2,key3,key4
tag2,key2,key3

This would allow tags to quickly + easily be searched without blowing up the main index.

Once we're extracting keywords, building that index would be relatively easy. The challenge, really, is merging use of that index into the main search.

I think the order of execution would be something like:

  • Iterate through the main index as usual
  • Then iterate through the tag index, extracting keys for matching tags and checking whether they already exist in the result set
  • For keys that don't, look up their main index entry and see whether they meet the search constraints

That's potentially not very efficient though. It might be that an index entry fails (say) the domain constraint - we've no way to check that from the keyword index, so would have to scan the index again to find the entry, essentially looking up (and testing) that item twice.

The other option might be to maintain a second index on disk, but effectively fold it into the first at index load time (so that keywords can be checked when iterating the index).

Simply copying keywords into each index entry isn't very memory efficient, so we might want to instead populate a main list of keywords, assigning each an ID. A list of IDs can then be stored in the index entry.

At search time, we'd precalculate a list of matching keyword IDs to pass into the search worker.
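A minimal sketch of that interning approach (names are illustrative, not the project's actual structures): each keyword gets a small integer ID, index entries carry only ID sets, and the search precomputes which IDs match before the workers run.

```python
# Sketch: intern keywords into one global table so each index entry
# stores only integer IDs rather than repeated strings.
keywords: list[str] = []          # ID -> keyword
keyword_ids: dict[str, int] = {}  # keyword -> ID

def intern_keyword(kw: str) -> int:
    """Assign (or reuse) an integer ID for a keyword."""
    if kw not in keyword_ids:
        keyword_ids[kw] = len(keywords)
        keywords.append(kw)
    return keyword_ids[kw]

def matching_keyword_ids(term: str) -> set[int]:
    """Precompute IDs of keywords partially matching the search term,
    done once per search rather than once per index entry."""
    return {i for i, kw in enumerate(keywords) if term in kw}

# Each index entry would then carry e.g. entry_tag_ids = {0, 3}, and a
# search worker only needs a set intersection:
#     if entry_tag_ids & wanted_ids: ...
```

The set-intersection test in the worker is O(min(len(a), len(b))), so the per-entry cost stays small even with hundreds of distinct keywords.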

I think I'll spin out a branch now to give this a whirl.

I've hand-edited some keywords into store/fa/5b/fa5be46b3b1c481c260a687abb0b5924728048e45f664659be9e46d9759e8680:

    TAGS:["foo","bar", "vgbftdeee"]

mentioned in commit f6d2bb6237ed3c96b327b300f6677a1513d44572

Commit: f6d2bb6237ed3c96b327b300f6677a1513d44572
Author: B Tasker
Date: 2023-12-31T17:37:49.000+00:00

Message

feat: use tag index when searching (utilities/file_location_listing#10)

+65 -19 (84 lines changed)

We have a working proof of concept, able to return the hand edited entry when searching for vgbftdeee.

It needs some refinement, obviously, and we still need to have the crawler actually extract keywords, but it hasn't added as much complexity as I was concerned it might.

OK, in terms of extracting tags from HTML pages, there are two things we want to look at:

  • The meta keywords item
  • Any <meta property="article:tag"> items in the <head>

If one or both is present, they should be de-duped and stored.
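As a rough sketch of that extraction (using only the stdlib html.parser; the crawler's real implementation may differ), with de-duping handled by collecting into a set:

```python
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    """Collect tags from <meta name="keywords"> (comma-separated) and
    <meta property="article:tag"> (one tag per element), de-duped.
    Class name is illustrative, not the project's actual code."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "keywords":
            # keywords live in a single comma-separated content attribute
            for kw in attrs.get("content", "").split(","):
                if kw.strip():
                    self.tags.add(kw.strip().lower())
        elif attrs.get("property", "").lower() == "article:tag":
            # opengraph tags appear one per <meta> element
            if attrs.get("content"):
                self.tags.add(attrs["content"].strip().lower())

doc = ('<head><meta name="keywords" content="Kapacitor, InfluxDB">'
       '<meta property="article:tag" content="kapacitor"></head>')
p = TagExtractor()
p.feed(doc)
print(sorted(p.tags))  # ['influxdb', 'kapacitor']
```

Lower-casing at extraction time means the dedupe also catches tags that only differ in case across the two sources, as in the example above.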


mentioned in commit e2ffb7dfbde37a3e816f101283a45ecabd9f375a

Commit: e2ffb7dfbde37a3e816f101283a45ecabd9f375a
Author: B Tasker
Date: 2023-12-31T19:17:12.000+00:00

Message

feat: add support for extracting meta keywords and opengraph tags from crawled pages (utilities/file_location_listing#10)

+19 -3 (22 lines changed)

mentioned in commit 66ae847db679486fb2600fdb7e2521d90cc3a10e

Commit: 66ae847db679486fb2600fdb7e2521d90cc3a10e
Author: B Tasker
Date: 2023-12-31T19:16:40.000+00:00

Message

feat: add support for matchtype:tag (utilities/file_location_listing#10)

+3 -3 (6 lines changed)

mentioned in commit defeb79d3e0ea0edcc603f8ce9c697e5e9339198

Commit: defeb79d3e0ea0edcc603f8ce9c697e5e9339198
Author: B Tasker
Date: 2023-12-31T20:08:15.000+00:00

Message

feat: teach crawler to extract tags from markdown files (utilities/file_location_listing#10)

+60 -7 (67 lines changed)

mentioned in merge request !1

Changes have been merged in.

Re-opening for follow-up.

I ran the crawler against a real dataset overnight - tag extraction works but is somewhat overeager.

We've indexed tags like if because of blocks like this in some notes:

    #if [[ "$OFFSET" -gt "0" ]]
    #then
    #    tail="tail -n $LIMIT"
    #    LIMIT=$(( $LIMIT + $OFFSET ))
    #    partitions=`find ./ -name '*lp.gz' | awk -F '_' '{print $NF}' | sort | uniq | head -n $LIMIT | $tail`
    #fi

There isn't really a good way to avoid picking up on things like that.

I think, realistically, we're going to have to implement an arbitrary ruleset:

  • Obsidian-style tags will only be extracted if the word tags has been observed on a prior line
  • Once tags has been observed, extraction attempts will stop at the first horizontal rule (----)

It's not exactly a generalised ruleset, but it'll work with my Obsidian notes because most of my templates contain something like:

    ----
    ### Tags

    #tag1 #tag2

    ----
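The ruleset can be sketched like this (function name and regexes are illustrative; the actual crawler code may differ):

```python
import re

def extract_obsidian_tags(markdown: str) -> set[str]:
    """Sketch of the ruleset: only collect #tags once the word 'tags'
    has been seen on a prior line, and stop at the first horizontal
    rule after that point - avoiding commented-out shell like '#if'."""
    tags: set[str] = set()
    seen_tags_word = False
    for line in markdown.splitlines():
        if not seen_tags_word:
            if "tags" in line.lower():
                seen_tags_word = True
            continue
        if re.match(r"^-{3,}\s*$", line):  # horizontal rule ends the block
            break
        tags.update(m.lower() for m in re.findall(r"#([\w-]+)", line))
    return tags

note = "----\n### Tags\n\n#foo #bar #sed\n\n----\n#if [[ ... ]]\n"
print(sorted(extract_obsidian_tags(note)))  # ['bar', 'foo', 'sed']
```

Note that the commented-out #if after the closing rule never gets scanned, which is exactly the false positive the ruleset is meant to prevent.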

mentioned in commit 0b9fb275b47c6394b93cce2077dba61b950e2d75

Commit: 0b9fb275b47c6394b93cce2077dba61b950e2d75
Author: B Tasker
Date: 2024-01-01T10:57:37.000+00:00

Message

fix: only extract obsidian style tags if the word "tags" has been observed (utilities/file_location_listing#10)

+10 -1 (11 lines changed)

That issue is now fixed.

After re-crawling, the tag index looks much more along the lines of what I was expecting.

Tested:

  • Tags use partial matching by default (so kapa will match kapacitor)
  • Tag matching uses AND by default (i.e. all terms must exist within a tag)
  • NOT filters work correctly (as of the latest commit, fixing #27)
  • modetype:or and modetype:exact work correctly

The tags index is a fairly sensible size:

    $ du -sh *
    1.7M    index
    124M    store
    12K     tags

The tag index currently consists primarily of tags pulled from markdown though - we need to trigger a wider crawl so that we pick up some HTML tags.

I've just crawled www.bentasker.co.uk (which uses opengraph tags) - the index now contains 693 tags.

    $ du -sh *
    2.6M    index
    194M    store
    56K     tags

mentioned in commit 472be7245e013098a0b66bd7f33a319ab8d63de8

Commit: 472be7245e013098a0b66bd7f33a319ab8d63de8
Author: B Tasker
Date: 2024-01-01T12:24:24.000+00:00

Message

feat: add env var to turn tag indexes off (utilities/file_location_listing#10)

+14 -7 (21 lines changed)

As a precaution, I've added an off switch - tag index build (and consumption) can be disabled by setting the environment variable TAGS_ENABLED to anything except y or Y.
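The described check boils down to something like this (the variable name comes from the issue; the function name and exact comparison in the codebase may differ):

```python
import os

def tags_enabled() -> bool:
    """Tag index build/consumption stays on unless TAGS_ENABLED is set
    to something other than y/Y. Unset defaults to enabled."""
    return os.environ.get("TAGS_ENABLED", "y").lower() == "y"
```

Defaulting to enabled means existing deployments keep their tag behaviour, while the switch only needs setting when something goes wrong.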

mentioned in issue #37