utilities/auto-blog-link-preserver#4: Extract URLs from pages listed in feed



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 10-Aug-24 14:12



Description

Now that we have the ability to submit into Archivebox, we need to generate a list of urls to submit.

These will be

  • The URL of the page in the feed itself
  • Each URL linked out to from within that page

One thing that needs to be decided is whether point 2 should be all URLs (i.e. include internal links) or only external links.
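As a rough sketch of the two options, the list could be built with the internal/external decision behind a single switch. The function name and the `include_internal` flag are hypothetical and are not taken from the project's code; only Python's standard library is used.

```python
from urllib.parse import urljoin, urlparse


def build_url_list(page_url, hrefs, include_internal=False):
    """Return the page URL followed by the links found in that page.

    hrefs are the raw href values taken from the page; relative values
    are resolved against page_url. include_internal is a hypothetical
    switch for the open question of whether same-host links are kept.
    """
    page_host = urlparse(page_url).netloc.lower()
    urls = [page_url]
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if not include_internal and parsed.netloc.lower() == page_host:
            continue  # same host as the page, so treat as internal
        if absolute not in urls:
            urls.append(absolute)  # keep order, drop duplicates
    return urls
```

Comparing hostnames is only one possible definition of "internal"; a site spread across several subdomains would need a looser comparison.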

It's also worth noting that ArchiveBox supports bulk submission of URLs - they just need to be separated by CRLF when passed into archivebox_scrap_add_url()
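A minimal sketch of preparing that bulk payload follows. The helper name is hypothetical, and the call to archivebox_scrap_add_url() is left out because its signature isn't shown in this issue.

```python
def join_urls_for_bulk_submit(urls):
    """Join URLs into a single CRLF-separated string for bulk submission."""
    return "\r\n".join(urls)
```

The returned string would then be the value handed to the submission function in place of a single URL.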




Activity


assigned to @btasker

changed the description


mentioned in commit 6ca6bb0d2e4fe0060fa889ebf5d9f8a8bfd86be9

Commit: 6ca6bb0d2e4fe0060fa889ebf5d9f8a8bfd86be9 
Author: B Tasker
Date: 2024-08-10T15:26:19.000+01:00 

Message

feat: extract links from page content (utilities/auto-blog-link-preserver#4)

+37 -3 (40 lines changed)

The commit above looks for <a> tags with an href attribute.

This works, but extracts a bunch of unwanted stuff too - because it's picking up on the links used for social icons etc.
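The approach described above can be sketched with Python's standard-library HTML parser. This is an illustration rather than the code from the commit: the class and function names are hypothetical, and the real implementation may use a different parser.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without a (non-empty) href are ignored
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because every <a> tag in the page is considered, links in navigation and social-icon blocks come back alongside the links in the article body.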

So, it'd probably be worth adding the ability to provide an XPath filter that identifies the element to search within.

For example, on bentasker.co.uk, article content lives within this element:

<article class="post-text h-entry hentry postpage" itemscope="itemscope" itemtype="http://schema.org/Article">

So we'd want to search within that.

As it would vary per site, this should be a per-feed option in the feeds config file
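A per-feed filter of this kind might look like the sketch below. The config keys, the feed URL and the function name are all hypothetical, since the project's config format isn't shown here. The example uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup, so real-world HTML would probably need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-feed config entry; the key names are illustrative only.
FEED = {
    "url": "https://example.com/feed.xml",
    "xpath_filter": ".//article[@itemtype='http://schema.org/Article']",
}


def links_within(xhtml, xpath_filter):
    """Return hrefs of <a> tags inside elements matched by xpath_filter."""
    root = ET.fromstring(xhtml)
    links = []
    for container in root.findall(xpath_filter):
        # iter() walks the matched element and all of its descendants
        for anchor in container.iter("a"):
            href = anchor.get("href")
            if href:
                links.append(href)
    return links


# Well-formed sample page: one social link outside the article, one inside.
PAGE = """<html><body>
<div class="social"><a href="https://social.example/me">icon</a></div>
<article class="post-text h-entry hentry postpage" itemscope="itemscope" itemtype="http://schema.org/Article">
<p>See <a href="https://example.org/ref">this</a>.</p>
</article>
</body></html>"""

article_links = links_within(PAGE, FEED["xpath_filter"])
```

With this filter, only the link inside the article element is collected; the social link in the surrounding markup is skipped.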


mentioned in commit ddd9f0a5adeadab725fb3e09f429e8b348ac9d5a

Commit: ddd9f0a5adeadab725fb3e09f429e8b348ac9d5a 
Author: B Tasker
Date: 2024-08-10T15:54:04.000+01:00 

Message

feat: introduce ability to specify an XPath filter per feed (utilities/auto-blog-link-preserver#4)

+15 -6 (21 lines changed)

Took me a bit of fiddling about to build the right XPath for my site, but the above commit provides the ability to define one per feed.

That won't help, of course, if the feed being monitored links out to lots of different sites. But that's not something I'm planning on doing, so I'm going to ignore that for now.

mentioned in issue #6