Now that we have the ability to submit into ArchiveBox, we need to generate a list of URLs to submit.
These will be
One thing that needs to be decided is whether point 2 should be all URLs (i.e. include internal links) or only external links.
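To make the internal vs external distinction concrete, here is a minimal sketch of one way links could be classified. It assumes "internal" means "same hostname as the page the link was found on"; the function name split_links and that definition are illustrative assumptions, not something taken from the project.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Resolve hrefs against page_url and split them into (internal, external).

    Assumption: a link is "internal" when its hostname matches the page's.
    """
    page_host = urlparse(page_url).hostname
    internal, external = [], []
    for href in hrefs:
        # Turn relative links such as "/about" into absolute URLs
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parsed.hostname == page_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

Whichever way the decision goes, keeping the two lists separate means the choice can later be made a config option rather than being hard-coded.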
It's worth noting too that ArchiveBox supports bulk submission of URLs - they just need to be separated by CRLF when being passed into archivebox_scrap_add_url().
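A rough sketch of building that CRLF-separated string from a list of extracted URLs. The helper name build_bulk_payload is hypothetical, and the de-duplication step is an added assumption rather than something ArchiveBox requires; the result would then be passed to archivebox_scrap_add_url(), whose exact signature isn't shown here.

```python
def build_bulk_payload(urls):
    """Join URLs with CRLF, dropping blanks and duplicates (order preserved)."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    # One URL per line, separated by carriage return + line feed
    return "\r\n".join(unique)
```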
Activity
10-Aug-24 14:12
assigned to @btasker
10-Aug-24 14:14
changed the description
10-Aug-24 14:26
mentioned in commit 6ca6bb0d2e4fe0060fa889ebf5d9f8a8bfd86be9
Message
feat: extract links from page content (utilities/auto-blog-link-preserver#4)
10-Aug-24 14:29
The commit above looks for `a` tags with a `href` attribute. This works, but extracts a bunch of unwanted stuff too, because it's picking up on the links used for social icons etc.
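For illustration, the kind of extraction described here (every anchor tag that carries an href) can be sketched with Python's standard library alone. This is not the code from the commit; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because it looks at the whole page, this approach returns navigation, footer and social icon links alongside the ones in the article body.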
So, it'd probably be worth adding the ability to provide an xpath filter to use to identify an element to search within
For example, on bentasker.co.uk, article content lives within a specific div. So we'd want to search within that.
As it would vary per site, this should be a per-feed option in the feeds config file
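A minimal sketch of the idea, assuming the page is well-formed enough for xml.etree.ElementTree to parse (real-world HTML often isn't, and ElementTree only supports a subset of XPath, so a fuller implementation would likely need a dedicated HTML/XPath library). The config layout, the xpath_filter key and the div class used below are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-feed config entry; key names are illustrative only
FEED = {
    "url": "https://example.com/feed.xml",
    "xpath_filter": ".//div[@class='article']",
}


def extract_links_within(markup, xpath_filter=None):
    """Return hrefs of <a> tags, limited to elements matching xpath_filter.

    With no filter, the whole document is searched.
    """
    root = ET.fromstring(markup)
    containers = root.findall(xpath_filter) if xpath_filter else [root]
    links = []
    for container in containers:
        # iter() walks the container and all of its descendants
        for anchor in container.iter("a"):
            href = anchor.get("href")
            if href:
                links.append(href)
    return links
```

Calling extract_links_within(page, FEED["xpath_filter"]) would then only return links found inside the matched element, leaving out anything in headers, footers or social icon blocks.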
10-Aug-24 14:54
mentioned in commit ddd9f0a5adeadab725fb3e09f429e8b348ac9d5a
Message
feat: introduce ability to specify an XPath filter per feed (utilities/auto-blog-link-preserver#4)
10-Aug-24 14:55
Took me a bit of fiddling about to build the right xpath for my site, but the above commit provides the ability to define one per feed.
That won't help, of course, if the feed being monitored links out to lots of different sites. But, that's not something I'm planning on doing, so I'm going to ignore that for now.
10-Aug-24 15:17
mentioned in issue #6