This one is currently more out of interest than any identified direct need.
As an example, if we had a markdown document at https://example.invalid/foo.txt with the following content:
# Foo Bar Foo Bar
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
### Dignissim diam quis
Amet nisl suscipit adipiscing bibendum est
### Consequat mauris nunc
Amet mauris commodo quis imperdiet massa tincidunt nunc pulvinar sapien.
We'd currently index https://example.invalid/foo.txt and Foo Bar Foo Bar.
Which is fine if the whole document relates to those things. But what if Dignissim diam quis is a complete tangent? It's not going to be represented in the index.
So, there's a reasonable argument that maybe we should index headings as well.
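As a rough illustration of what pulling headings out of the example document might involve, here is a minimal sketch. It assumes ATX-style (`#`) headings only, and the function name `extract_markdown_headings` is hypothetical rather than anything taken from the project's code.

```python
import re

# ATX-style headings only: 1-6 '#' characters, whitespace, then the heading text.
# Optional trailing '#' characters are dropped.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
FENCE = "`" * 3  # marker for fenced code blocks


def extract_markdown_headings(text):
    """Return (level, heading text) pairs, ignoring lines inside fenced code."""
    headings = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


doc = """# Foo Bar Foo Bar
Lorem ipsum dolor sit amet

### Dignissim diam quis
Amet nisl suscipit adipiscing bibendum est

### Consequat mauris nunc
"""
print(extract_markdown_headings(doc))
```

The level is kept alongside the text so that a caller could treat the first-level heading (which usually duplicates the title) differently from the rest.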
Activity
16-Jun-24 09:12
assigned to @btasker
16-Jun-24 09:14
Extracting headings isn't any particular bother, but it leads us onto the question of how best to index them. Do we:
Option 1 works best if we assume that (like tags) there's a high level of duplication between pages. That's certainly likely to be true to some extent (with headers like Background and Conclusion being pretty common).
That, though, is a big part of why I'm curious to play around with this - I've no idea whether duplicated headers are the exception or the rule.
If they are the rule, then we probably want to create an ignoretags equivalent so that we don't bother indexing the uninteresting ones (like Conclusion).
21-Jul-24 09:49
mentioned in commit 1b6455f296fe683e6d595d40dd31a1b74f4f3580
Message
feat: extract and store headings from markdown documents (utilities/file_location_listing#60)
21-Jul-24 09:54
mentioned in commit c3311ca545c81c58c4ec0c06bad5dc0b6f54bee4
Message
feat: extract and store headings from HTML pages (utilities/file_location_listing#60)
21-Jul-24 09:59
mentioned in commit acc4816e936c8eedcf070088fa97380477a47949
Message
fix: don't include heading if it matches the title (utilities/file_location_listing#60)
If we allowed the title to be included (particularly in markdown) we'd drive up the heading index cardinality for no benefit - titles are already in the main index
21-Jul-24 10:06
mentioned in commit 8d0cebb8194ff38e287129f7e96e2f8b39179002
Message
feat: add headings to storage file headers (utilities/file_location_listing#60)
21-Jul-24 10:08
To control whether or not headings get indexed, I've added support for a new config env var (HEADINGS_ENABLED).
It defaults to y; setting it to n will disable building of the headings index (or will, once I've implemented that bit).
21-Jul-24 10:08
mentioned in commit 49c3b08e18ca9b48fc1a79110bd411678960cdf5
Message
feat: add config var HEADINGS_ENABLED to determine whether to index headings (utilities/file_location_listing#60)
21-Jul-24 10:35
Commit 227f5620e4a587eb9df304ca7aad900673939a5b implements creation of the index file - headings.
I don't intend to make the search code consume this quite yet. I've implemented creation of the index so that it's easy to look over and identify whether it's unnecessary cardinality or if there's some benefit to it.
21-Jul-24 10:35
mentioned in commit 227f5620e4a587eb9df304ca7aad900673939a5b
Message
feat: write collected headings to an index file (utilities/file_location_listing#60)
21-Jul-24 10:38
mentioned in commit 0a7f151a2712d4f936b4a2c2201e4bd8564c829c
Message
fix: ignore heading if it appears in the title (utilities/file_location_listing#60)
Previously we skipped if it matched the title, but that meant that headings were still collected if the page title had a suffix (for example the domain).
There's no benefit to indexing a heading which appears within the title because it'll already match against the main index
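The check described in this commit message could be sketched roughly as follows. This is an illustration only: `should_index_heading` is a hypothetical name, and the case-insensitive comparison is an assumption about how the match is made.

```python
def should_index_heading(heading, title):
    """Return False if the heading is empty or appears anywhere in the title.

    A substring test (rather than equality) copes with titles that carry a
    suffix such as the site's domain.
    """
    candidate = heading.strip().lower()
    if not candidate:
        return False
    return candidate not in title.lower()


print(should_index_heading("Foo Bar", "Foo Bar | example.invalid"))
print(should_index_heading("Dignissim diam quis", "Foo Bar Foo Bar"))
```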
21-Jul-24 10:43
mentioned in commit 0e5a5e1d73da4c93e1fa7c3bd6c1c560af329922
Message
chore: update storeFileVersion to reflect changes made in utilities/file_location_listing#60
Deployment of this will result in a full recrawl as a result of changes to data in the storage files
21-Jul-24 11:22
OK, so taking a look at heading duplication.
For the sake of time, I'm crawling a relatively small set of sites
Note: importantly, this won't currently include my notes. Should probably set another crawl running to include those (in fact, I'll take a copy of the index and run one now)
There are quite a few
That's partly because I screwed up in 0a7f151a2712d4f936b4a2c2201e4bd8564c829c by forgetting to compare against a lowercased copy of the title. Fixed in 0ca3c7bc47fa05d8b9983a36e826e29193ac36a0
Rather than waiting for a recrawl, though, I'll start by listing headings to see how many low-quality ones stand out
I can see
I got fed up of making my eyes bleed so stopped in the c's.
Which makes sense - the Related Snippets header is in the site template of www.bentasker.co.uk so it's going to show up everywhere.
That's easily excluded, but what happens if I add something to the template?
Automatic heading exclusion
Rather than relying solely on manual excludes (which we should still add support for), maybe we should include logic to exclude an item if it has more than x URLs associated with it?
The problem with that, though, is that headings will start dropping out as the size of the index grows. That's fine if the heading is "Introduction" but not so fine if it's because I've been writing about foo a lot.
But then... that's probably not in line with the use-case laid out in the issue description: if I'm writing about foo a lot, you'd expect the titles (or at least tags) to be associated with it.
The aim of this feature was to try and catch instances where I mention bar in passing whilst writing about foo.
So, arguably, the keywords that are most interesting are the ones with the fewest associated URLs?
That does still carry the risk of losing stuff from the index - if I mention bar in passing but later gain an interest in it, we wouldn't want that first reference to be lost from the index. But for that to happen, I'd have to be using the same heading over and over, which seems unlikely.
There's a whole bunch of cruft in there because of the title screw-up, but this does look like it should be more useful (capturing things like windscreen drain holes, for example).
So, I think we want 2 config sets for this
I don't think the threshold necessarily needs to be 1, we could probably start with 5 and go from there.
21-Jul-24 11:30
mentioned in commit fcca1fe910714a90d0895fda06520aea9a29db28
Message
feat: add config setting HEADINGS_MAX_URLS (utilities/file_location_listing#60)
This specifies the maximum number of URLs a heading can be associated with to be included in the headings index.
If a heading appears in more URLs than this, it will not be included in the index
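To make the threshold concrete, here is a minimal sketch of how such a cut-off might be applied when building the index. The function name, the in-memory `pages` mapping and the lowercasing of index keys are all assumptions made for illustration, not the project's actual data structures.

```python
from collections import defaultdict


def build_headings_index(pages, max_urls=5):
    """Map each heading to the URLs it appears on, dropping common headings.

    pages: dict of URL -> list of headings found on that page (hypothetical shape).
    max_urls: stands in for HEADINGS_MAX_URLS; headings seen on more URLs
              than this are left out of the result.
    """
    index = defaultdict(set)
    for url, headings in pages.items():
        for heading in headings:
            index[heading.strip().lower()].add(url)
    return {h: sorted(urls) for h, urls in index.items() if len(urls) <= max_urls}


pages = {
    "https://example.invalid/a": ["Related Snippets", "Windscreen drain holes"],
    "https://example.invalid/b": ["Related Snippets"],
    "https://example.invalid/c": ["Related Snippets"],
}
print(build_headings_index(pages, max_urls=2))
```

With `max_urls=2`, the template heading that appears on all three pages is dropped while the one-off heading survives, which is the behaviour the comment above argues for.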
21-Jul-24 11:38
The crawl's not done and there are still far too many entries
There are more than a few domains where skipping based on title matching isn't working because of differences in formatting.
So, I sort of wonder whether the answer might be to have the crawler not look at <h1>.
There is some scope for it to miss stuff if the title and <h1> are wildly different, but that's probably the sort of edge-case that I don't really want to have to worry about
21-Jul-24 11:39
mentioned in commit 9ed71876fec26a6c66cb3693bba0a8dae16070f6
Message
fix: don't collect h1 from html pages (utilities/file_location_listing#60)
This is to minimise the likelihood of duplicating information we already have from the page title
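A sketch of what skipping h1 could look like, using only the standard library's `html.parser`. The class and function names are hypothetical, and the crawler's real parsing approach may well differ.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h2-h6 elements, deliberately ignoring h1."""

    LEVELS = {"h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # list of text chunks while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._current is not None:
            # Join the chunks and collapse runs of whitespace
            text = " ".join("".join(self._current).split())
            if text:
                self.headings.append(text)
            self._current = None


def extract_html_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = "<h1>Page title</h1><h2>Related <em>Snippets</em></h2><h3> Windscreen drain holes </h3>"
print(extract_html_headings(sample))
```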
21-Jul-24 13:04
Following the big crawl, that's resulted in a pleasing drop in the size of the index
It looks like that number's actually still a little high: title exclusion isn't working on some markdown files.
21-Jul-24 13:07
mentioned in commit b91d802cb1d8e433c7d9977df91ab5c1d1556a6b
Message
fix: compare lowercased heading to lowercased title (utilities/file_location_listing#60)
21-Jul-24 13:18
One thing that I notice whilst looking through is there are a bunch of common tokens that I'm unlikely to search for.
For example, there are a ton of headings which start with "what is".
It probably doesn't make sense to include those phrases in the index, it's basically just wasted bytes.
So, we could perhaps have an excluded phrase list, with its constituents being stripped out of any indexed heading (although we'd probably then need to dedupe because "What is Scunthorpe" and "Where is Scunthorpe" suddenly become the same thing).
Something to think about for later rather than something I want to implement now
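For when this is revisited, the idea might look something like the sketch below. It is purely illustrative: the phrase list and function names are hypothetical, and it only strips phrases from the start of a heading, which is narrower than stripping them wherever they occur.

```python
# Hypothetical phrase list; nothing like this exists in the project yet
EXCLUDED_PHRASES = ("what is", "where is")


def strip_excluded_phrases(heading, phrases=EXCLUDED_PHRASES):
    """Lowercase the heading and remove one leading excluded phrase, if present."""
    cleaned = heading.strip().lower()
    for phrase in phrases:
        if cleaned.startswith(phrase + " "):
            cleaned = cleaned[len(phrase):].strip()
            break
    return cleaned


def merge_stripped(index):
    """Re-key a heading -> URLs mapping, merging headings that collapse together."""
    merged = {}
    for heading, urls in index.items():
        key = strip_excluded_phrases(heading)
        if key:
            merged.setdefault(key, set()).update(urls)
    return merged


index = {
    "What is Scunthorpe": {"https://example.invalid/a"},
    "Where is Scunthorpe": {"https://example.invalid/b"},
}
print(merge_stripped(index))
```

Merging the URL sets is one way to handle the dedupe concern: both headings end up under a single key that points at both pages.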
21-Jul-24 14:09
mentioned in commit 79155baa36191d0f81b0cba0553d536324ba4001
Message
feat: add support for ignored headings (utilities/file_location_listing#60)
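A minimal sketch of how an ignore list might be loaded and applied. The file format is an assumption (one heading per line, blank lines and lines starting with # skipped, matching done case-insensitively), and the function names are hypothetical.

```python
from pathlib import Path


def load_ignored_headings(path="config/ignoreheadings.txt"):
    """Read ignored headings from a text file, one per line (assumed format)."""
    ignore_file = Path(path)
    if not ignore_file.exists():
        return set()
    ignored = set()
    for line in ignore_file.read_text(encoding="utf-8").splitlines():
        entry = line.strip().lower()
        # Skip blank lines and (assumed) comment lines
        if entry and not entry.startswith("#"):
            ignored.add(entry)
    return ignored


def filter_ignored(headings, ignored):
    """Drop any heading whose lowercased form is in the ignore set."""
    return [h for h in headings if h.strip().lower() not in ignored]


print(filter_ignored(["Conclusion", "Windscreen drain holes"], {"conclusion"}))
```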
21-Jul-24 14:10
As we're not yet implementing support for using the generated index, I'm going to turn headings support off by default.
That way, I can enable it on my systems but anyone crazy enough to be pulling my images won't get hit by any issues it brings with it.
21-Jul-24 14:11
mentioned in commit 395c0f1f294ea54137b46cf3de4a7244ec615ac3
Message
chore: turn headings support off by default (utilities/file_location_listing#60)
The ability to create the indexes is being released before the system is actually able to read them
The purpose of that is experimentation, so don't have it enabled by default
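For illustration, reading the two settings with headings support off by default might look roughly like this. The helper name and the returned dict are hypothetical, and falling back to 5 when HEADINGS_MAX_URLS isn't a valid integer is an assumption.

```python
import os


def load_headings_config(env=None):
    """Read heading-related settings from environment variables.

    HEADINGS_ENABLED: "y" enables the headings index; anything else leaves it off.
    HEADINGS_MAX_URLS: maximum URLs per heading, defaulting to 5.
    """
    env = os.environ if env is None else env
    enabled = env.get("HEADINGS_ENABLED", "n").strip().lower() == "y"
    try:
        max_urls = int(env.get("HEADINGS_MAX_URLS", "5"))
    except ValueError:
        # Assumed behaviour: fall back to the default on a malformed value
        max_urls = 5
    return {"enabled": enabled, "max_urls": max_urls}


print(load_headings_config({"HEADINGS_ENABLED": "y"}))
```

Passing a plain dict in place of `os.environ` keeps the helper easy to exercise without touching the real environment.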
21-Jul-24 14:14
OK, to summarise this issue then:
Setting HEADINGS_ENABLED to y will enable indexing of page headings
Only headings with at most HEADINGS_MAX_URLS (default 5) associated URLs will be indexed
Headings can be ignored by listing them in config/ignoreheadings.txt
Headings are written to an index file called headings, which has index type 2
Before closing, I'm going to update the title of this issue to reflect the fact that we're only generating (but not using) the index.
I'll likely implement use of the index in a feature branch so that I can play around with it over an extended time.
21-Jul-24 14:14
changed title from {-Look at feasibility of-} indexing headings to {+Experimental:+} indexing headings
21-Jul-24 14:17
mentioned in issue #64
23-Aug-24 09:49
mentioned in issue #66