
utilities/file_location_listing#60: Experimental: indexing headings



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.8
Created: 16-Jun-24 09:12



Description

This one is currently more out of interest than any identified direct need.

As an example, if we had a markdown document at https://example.invalid/foo.txt with the following content

# Foo Bar Foo Bar

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

### Dignissim diam quis

Amet nisl suscipit adipiscing bibendum est

### Consequat mauris nunc

Amet mauris commodo quis imperdiet massa tincidunt nunc pulvinar sapien.

We'd currently index https://example.invalid/foo.txt and Foo Bar Foo Bar.

Which is fine if the whole document relates to those things. But, what if Dignissim diam quis is a complete tangent? It's not going to be represented in the index.

So, there's a reasonable argument that maybe we should index headings as well.
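As a rough illustration of what heading extraction could look like for the example document above, here is a minimal sketch. It only handles `#`-style headings and skips fenced code blocks; the function name `extract_markdown_headings` is hypothetical and is not taken from the project's code.

```python
import re

# ATX heading: 1-6 "#" characters, whitespace, the text, optional closing "#"s
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)(?:\s+#+)?\s*$")
FENCE = "`" * 3  # marker that opens/closes a fenced code block


def extract_markdown_headings(text):
    """Return (level, heading) tuples for each ATX heading in text."""
    headings = []
    in_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        # Inside a fenced code block a leading "#" is usually a comment
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(stripped)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


doc = """# Foo Bar Foo Bar

Lorem ipsum dolor sit amet

### Dignissim diam quis

Amet nisl suscipit adipiscing bibendum est

### Consequat mauris nunc
"""

for level, heading in extract_markdown_headings(doc):
    print(level, heading)
```

With this, the two `###` headings become candidates for indexing alongside the title that is already captured.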




Activity


assigned to @btasker

Extracting headings isn't any particular bother, but it leads us onto the question of how best to index them. Do we:

  1. Treat them like tags and have a dedicated index for them
  2. Effectively roll them into the main index

Option 1 works best if we assume that (like tags) there's a high level of duplication between pages. That's certainly likely to be true to some extent (with headers like Background and Conclusion being pretty common).

That, though, is a big part of why I'm curious to play around with this - I've no idea whether duplicated headers are the exception or the rule.

If they are the rule, then we probably want to create an ignoretags equivalent so that we don't bother indexing the uninteresting ones (like Conclusion).


mentioned in commit 1b6455f296fe683e6d595d40dd31a1b74f4f3580

Commit: 1b6455f296fe683e6d595d40dd31a1b74f4f3580 
Author: B Tasker                            
                            
Date: 2024-07-21T10:48:41.000+01:00 

Message

feat: extract and store headings from markdown documents (utilities/file_location_listing#60)

+25 -4 (29 lines changed)

mentioned in commit c3311ca545c81c58c4ec0c06bad5dc0b6f54bee4

Commit: c3311ca545c81c58c4ec0c06bad5dc0b6f54bee4 
Author: B Tasker                            
                            
Date: 2024-07-21T10:54:35.000+01:00 

Message

feat: extract and store headings from HTML pages (utilities/file_location_listing#60)

+8 -2 (10 lines changed)

mentioned in commit acc4816e936c8eedcf070088fa97380477a47949

Commit: acc4816e936c8eedcf070088fa97380477a47949 
Author: B Tasker                            
                            
Date: 2024-07-21T10:57:52.000+01:00 

Message

fix: don't include heading if it matches the title (utilities/file_location_listing#60)

If we allowed the title to be included (particularly in markdown) we'd drive up the heading index cardinality for no benefit - titles are already in the main index

+10 -2 (12 lines changed)

mentioned in commit 8d0cebb8194ff38e287129f7e96e2f8b39179002

Commit: 8d0cebb8194ff38e287129f7e96e2f8b39179002 
Author: B Tasker                            
                            
Date: 2024-07-21T11:04:55.000+01:00 

Message

feat: add headings to storage file headers (utilities/file_location_listing#60)

+6 -0 (6 lines changed)

To control whether or not headings get indexed, I've added support for a new config env var (HEADINGS_ENABLED).

It defaults to y; setting it to n will disable building of the headings index (or will, once I've implemented that bit)


mentioned in commit 49c3b08e18ca9b48fc1a79110bd411678960cdf5

Commit: 49c3b08e18ca9b48fc1a79110bd411678960cdf5 
Author: B Tasker                            
                            
Date: 2024-07-21T11:06:53.000+01:00 

Message

feat: add config var HEADINGS_ENABLED to determine whether to index headings (utilities/file_location_listing#60)

+3 -0 (3 lines changed)

Commit 227f5620e4a587eb9df304ca7aad900673939a5b implements creation of the index file - headings.

I don't intend to make the search code consume this quite yet. I've implemented creation of the index so that it's easy to look over and identify whether it's unnecessary cardinality or if there's some benefit to it.
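Going by the shell pipelines used later in this issue, each line of the gzipped headings file appears to be the heading, a `!#!` delimiter, then a JSON list of pages. The sketch below writes and reads that layout. The function names are hypothetical, and treating each page entry as a bare URL string is an assumption rather than something the issue confirms.

```python
import gzip
import json
import os
import tempfile

DELIM = "!#!"  # delimiter seen in the zgrep/awk pipelines in this issue


def write_headings_index(path, headings):
    """Write {heading: urls} as one 'heading!#![json list]' line per heading."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for heading in sorted(headings):
            pages = json.dumps(sorted(headings[heading]))
            fh.write(f"{heading}{DELIM}{pages}\n")


def read_headings_index(path):
    """Read the file back into a {heading: [url, ...]} dict."""
    index = {}
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if DELIM not in line:
                continue  # skip anything that is not an index entry
            heading, pages = line.rstrip("\n").split(DELIM, 1)
            index[heading] = json.loads(pages)
    return index


# Assumed shape: page entries are plain URL strings
collected = {
    "dignissim diam quis": {"https://example.invalid/foo.txt"},
    "conclusion": {"https://example.invalid/a", "https://example.invalid/b"},
}

with tempfile.TemporaryDirectory() as tmpdir:
    index_path = os.path.join(tmpdir, "headings")
    write_headings_index(index_path, collected)
    roundtrip = read_headings_index(index_path)

for heading, pages in roundtrip.items():
    print(heading, len(pages))
```

Keeping one heading per line is what makes the index easy to eyeball with `zgrep` and `awk`.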


mentioned in commit 227f5620e4a587eb9df304ca7aad900673939a5b

Commit: 227f5620e4a587eb9df304ca7aad900673939a5b 
Author: B Tasker                            
                            
Date: 2024-07-21T11:33:15.000+01:00 

Message

feat: write collected headings to an index file (utilities/file_location_listing#60)

+32 -3 (35 lines changed)

mentioned in commit 0a7f151a2712d4f936b4a2c2201e4bd8564c829c

Commit: 0a7f151a2712d4f936b4a2c2201e4bd8564c829c 
Author: B Tasker                            
                            
Date: 2024-07-21T11:37:23.000+01:00 

Message

fix: ignore heading if it appears in the title (utilities/file_location_listing#60)

Previously we skipped if it matched the title, but that meant that headings were still collected if the page title had a suffix (for example the domain).

There's no benefit to indexing a heading which appears within the title because it'll already match against the main index
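A minimal sketch of the "appears in the title" check described in this commit message. The function name is hypothetical, and the lowercasing of both sides is an assumption about how the comparison ought to behave rather than a copy of the project's code.

```python
def should_skip_heading(heading, title):
    """Return True if the heading adds nothing beyond the page title."""
    heading_norm = heading.strip().lower()
    title_norm = title.strip().lower()
    # An empty heading carries no information
    if not heading_norm:
        return True
    # Substring test rather than equality, so a title suffix
    # (for example the domain) does not defeat the check
    return heading_norm in title_norm


title = "Foo Bar Foo Bar | example.invalid"
for heading in ["Foo Bar Foo Bar", "Dignissim diam quis"]:
    print(heading, should_skip_heading(heading, title))
```

An equality check would have kept "Foo Bar Foo Bar" here because of the " | example.invalid" suffix; the substring check drops it.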

+3 -3 (6 lines changed)

mentioned in commit 0e5a5e1d73da4c93e1fa7c3bd6c1c560af329922

Commit: 0e5a5e1d73da4c93e1fa7c3bd6c1c560af329922 
Author: B Tasker                            
                            
Date: 2024-07-21T11:42:35.000+01:00 

Message

chore: update storeFileVersion to reflect changes made in utilities/file_location_listing#60

Deployment of this will result in a full recrawl as a result of changes to data in the storage files

+1 -4 (5 lines changed)

OK, so taking a look at heading duplication.

For the sake of time, I'm crawling a relatively small set of sites

https://snippets.bentasker.co.uk
https://shot.83n.uk/
https://recipebook.bentasker.co.uk/
https://www.bentasker.co.uk

Note: importantly, this won't currently include my notes. Should probably set another crawl running to include those (in fact, I'll take a copy of the index and run one now)

There are quite a few

ben@optimus:~/tmp$ zgrep '!#!' headings | awk -F '!#!' '{print $1}' | wc -l
3056

That's partly because I screwed up in 0a7f151a2712d4f936b4a2c2201e4bd8564c829c by forgetting to compare against a lowercased copy of the title. Fixed in 0ca3c7bc47fa05d8b9983a36e826e29193ac36a0

Rather than waiting for a recrawl, though, I'll start by listing headings to see how many low-quality ones stand out

zgrep '!#!' headings | awk -F '!#!' '{print $1}' | less

I can see

about
aim
aims
background
categories
conclusion
contents
cooking time

I got fed up of making my eyes bleed so stopped in the c's.

zgrep '!#!' headings | python -c '
import sys
import json

# Count pages per heading, keeping only headings seen on more than one page
items = {}
for line in sys.stdin:
    heading, pages_json = line.split("!#!", 1)
    page_count = len(json.loads(pages_json))
    if page_count > 1:
        items[heading] = page_count

# Print headings in descending order of page count
for heading, count in sorted(items.items(), key=lambda item: item[1], reverse=True):
    print(f"{heading}: {count}")
' | head -n 20

related snippets: 1638
keywords: 317
conclusion: 227
latest posts: 197
description: 174
snippet: 172
language: 171
latest recipes: 166
categories: 144
method: 143
ingredients: 140
cooking time: 137
usage example: 127
based on: 75
project info: 59
archives: 49
read only service: 49
release notes: 45
license: 40
requires: 40

Which makes sense - the Related Snippets header is in the site template of www.bentasker.co.uk so it's going to show up everywhere.

That's easily excluded, but what happens if I add something to the template?


Automatic heading exclusion

Rather than relying solely on manual excludes (which we should still add support for), maybe we should include logic to exclude an item if it has more than x URLs associated with it?

The problem with that, though, is that headings will start dropping out as the size of the index grows. That's fine if the heading is "Introduction" but not so fine if it's because I've been writing about foo a lot

But then... that's probably not in line with the use-case laid out in the issue description: if I'm writing about foo a lot, you'd expect the titles (or at least tags) to be associated with it.

The aim of this feature was to try and catch instances where I mention bar in passing whilst writing about foo.

So, arguably, the keywords that are most interesting are the ones with the fewest associated URLs?

That does still carry the risk of losing stuff from the index - if I mention bar in passing but later gain an interest in it, we wouldn't want that first reference to be lost from the index. But for that to happen, I'd have to be using the same heading over and over, which seems unlikely.

zgrep '!#!' headings | python -c '
import sys
import json

# Print only the headings that are associated with exactly one page
for line in sys.stdin:
    heading, pages_json = line.split("!#!", 1)
    if len(json.loads(pages_json)) == 1:
        print(heading)
' | sort

There's a whole bunch of cruft in there because of the title screw-up, but this does look like it should be more useful (capturing things like windscreen drain holes, for example).

So, I think we want 2 config sets for this

  • A max threshold: if the number of urls with this heading is over the threshold, it's left out of the index
  • A set of ignored headings: This'll allow me to quickly exclude a heading if I notice that a not-useful one has snuck in

I don't think the threshold necessarily needs to be 1, we could probably start with 5 and go from there.
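The two config sets could combine along these lines. This is only a sketch: `filter_headings` and its parameters are hypothetical names, and the ignore list is assumed to hold lowercased headings.

```python
def filter_headings(headings, max_urls=5, ignored=frozenset()):
    """Keep headings that are not ignored and have at most max_urls URLs."""
    kept = {}
    for heading, urls in headings.items():
        # Manual exclusion: quick way to drop a known not-useful heading
        if heading.lower() in ignored:
            continue
        # Automatic exclusion: too many URLs suggests template or boilerplate text
        if len(urls) > max_urls:
            continue
        kept[heading] = urls
    return kept


collected = {
    "related snippets": [f"https://example.invalid/page{i}" for i in range(20)],
    "conclusion": ["https://example.invalid/a", "https://example.invalid/b"],
    "windscreen drain holes": ["https://example.invalid/car"],
}

result = filter_headings(collected, max_urls=5, ignored={"conclusion"})
print(sorted(result))
```

Here "related snippets" falls out because of the threshold, "conclusion" because of the ignore list, and only the rarely used heading survives.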


mentioned in commit fcca1fe910714a90d0895fda06520aea9a29db28

Commit: fcca1fe910714a90d0895fda06520aea9a29db28 
Author: B Tasker                            
                            
Date: 2024-07-21T12:29:07.000+01:00 

Message

feat: add config setting HEADINGS_MAX_URLS (utilities/file_location_listing#60)

This specifies the maximum number of URLs a heading can be associated with to be included in the headings index.

If a heading appears in more URLs than this, it will not be included in the index

+10 -1 (11 lines changed)

The crawl's not done and there are still far too many entries

$ zcat ~/tmp/search_db/headings | wc -l
15838

There are more than a few domains where skipping based on title matching isn't working because of differences in formatting.

So, I sort of wonder whether the answer might be to have the crawler not look at <h1>.

There is some scope for it to miss stuff if the title and <h1> are wildly different, but that's probably the sort of edge-case that I don't really want to have to worry about
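For illustration, skipping `<h1>` while still collecting the lower-level headings might look like the sketch below. It uses the standard library's `html.parser`; the issue doesn't say which parser the crawler actually uses, and the class name is hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h2-h6 elements, deliberately skipping h1."""

    WANTED = {"h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # tag name of the heading being read, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED and self._current is None:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        # Only accumulate text while inside a wanted heading
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left by nested inline tags
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append(text)
            self._current = None


page = """<html><head><title>Foo Bar | example.invalid</title></head>
<body><h1>Foo Bar</h1><p>Intro</p>
<h2>Dignissim <em>diam</em> quis</h2><p>Text</p>
<h3>Consequat mauris nunc</h3></body></html>"""

collector = HeadingCollector()
collector.feed(page)
print(collector.headings)
```

The `<h1>` never reaches the list, so there is no need to reconcile its formatting with the page title.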


mentioned in commit 9ed71876fec26a6c66cb3693bba0a8dae16070f6

Commit: 9ed71876fec26a6c66cb3693bba0a8dae16070f6 
Author: B Tasker                            
                            
Date: 2024-07-21T12:39:22.000+01:00 

Message

fix: don't collect h1 from html pages (utilities/file_location_listing#60)

This is to minimise the likelihood of duplicating information we already have from the page title

+1 -1 (2 lines changed)

Following the big crawl, that's resulted in a pleasing drop in the size of the index

$ zcat ~/tmp/search_db/headings | wc -l
4675

That number actually still looks a little high: it seems title exclusion isn't working on some markdown files.


mentioned in commit b91d802cb1d8e433c7d9977df91ab5c1d1556a6b

Commit: b91d802cb1d8e433c7d9977df91ab5c1d1556a6b 
Author: B Tasker                            
                            
Date: 2024-07-21T14:06:47.000+01:00 

Message

fix: compare lowercased heading to lowercased title (utilities/file_location_listing#60)

+1 -1 (2 lines changed)

One thing that I notice whilst looking through is there are a bunch of common tokens that I'm unlikely to search for.

For example, there are a ton of headings which start with "what is".

It probably doesn't make sense to include those phrases in the index, it's basically just wasted bytes.

So, we could perhaps have an excluded phrase list, with its constituents being stripped out of any indexed heading (although we'd probably then need to dedupe, because "What is Scunthorpe" and "Where is Scunthorpe" would suddenly become the same thing).

Something to think about for later rather than something I want to implement now
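Purely as a note for later, the strip-then-dedupe idea could be sketched like this. Nothing here is implemented in the project; the phrase list and both function names are hypothetical.

```python
import re

# Hypothetical phrase list; the real one would be built from observed headings
EXCLUDED_PHRASES = ["what is", "where is"]


def strip_excluded_phrases(heading, phrases=EXCLUDED_PHRASES):
    """Lowercase the heading and remove any excluded phrase (whole words only)."""
    cleaned = heading.lower()
    for phrase in phrases:
        cleaned = re.sub(rf"\b{re.escape(phrase)}\b", " ", cleaned)
    return " ".join(cleaned.split())


def merge_stripped(headings, phrases=EXCLUDED_PHRASES):
    """Strip phrases from each heading, merging URL sets when headings collide."""
    merged = {}
    for heading, urls in headings.items():
        key = strip_excluded_phrases(heading, phrases)
        if key:  # drop headings that consisted only of excluded phrases
            merged.setdefault(key, set()).update(urls)
    return merged


collected = {
    "What is Scunthorpe": {"https://example.invalid/a"},
    "Where is Scunthorpe": {"https://example.invalid/b"},
}

merged = merge_stripped(collected)
print(merged)
```

Both headings collapse to the single key "scunthorpe" with the two URLs merged, which is the dedupe step mentioned above.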


mentioned in commit 79155baa36191d0f81b0cba0553d536324ba4001

Commit: 79155baa36191d0f81b0cba0553d536324ba4001 
Author: B Tasker                            
                            
Date: 2024-07-21T15:08:54.000+01:00 

Message

feat: add support for ignored headings (utilities/file_location_listing#60)

+14 -2 (16 lines changed)

As we're not yet implementing support for using the generated index, I'm going to turn headings support off by default.

That way, I can enable it on my systems but anyone crazy enough to be pulling my images won't get hit by any issues it brings with it.


mentioned in commit 395c0f1f294ea54137b46cf3de4a7244ec615ac3

Commit: 395c0f1f294ea54137b46cf3de4a7244ec615ac3 
Author: B Tasker                            
                            
Date: 2024-07-21T15:10:20.000+01:00 

Message

chore: turn headings support off by default (utilities/file_location_listing#60)

The ability to create the indexes is being released before the system is actually able to read them

The purpose of that is experimentation, so don't have it enabled by default

+1 -1 (2 lines changed)

OK, to summarise this issue then

  • Setting environment variable HEADINGS_ENABLED to y will enable indexing of page headings
  • Only headings with no more than HEADINGS_MAX_URLS (default 5) associated URLs will be indexed
  • Specific headings can be ignored by adding them to config/ignoreheadings.txt
  • The resulting index is headings which has index type 2
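
To make the summary concrete, reading those settings might look roughly like the sketch below. The function name and the returned dict are hypothetical, and the one-heading-per-line layout of config/ignoreheadings.txt is an assumption; only the variable names, the file path and the defaults come from this issue.

```python
import os
from pathlib import Path


def load_headings_config(environ=os.environ, ignore_file="config/ignoreheadings.txt"):
    """Collect the headings-related settings into a single dict."""
    # Off by default: the index is generated for experimentation only
    enabled = environ.get("HEADINGS_ENABLED", "n").strip().lower() == "y"
    max_urls = int(environ.get("HEADINGS_MAX_URLS", "5"))

    ignored = set()
    path = Path(ignore_file)
    if path.is_file():
        # Assumed layout: one heading per line, compared case-insensitively
        for line in path.read_text(encoding="utf-8").splitlines():
            entry = line.strip().lower()
            if entry:
                ignored.add(entry)

    return {"enabled": enabled, "max_urls": max_urls, "ignored": ignored}


config = load_headings_config({"HEADINGS_ENABLED": "y"}, ignore_file="does-not-exist.txt")
print(config)
```

Passing the environment in as a mapping keeps the sketch easy to exercise without touching the real process environment.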

Before closing, I'm going to update the title of this issue to reflect the fact that we're only generating the index (and not yet using it).

I'll likely implement use of the index in a feature branch so that I can play around with it over an extended time.

changed title from {-Look at feasibility of-} indexing headings to {+Experimental:+} indexing headings

mentioned in issue #64

mentioned in issue #66