utilities/auto-blog-link-preserver#1: Initial Build & Design



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 10-Aug-24 11:09



Description

This project was born out of websites/BEN#25.

The overall plan is to create a small docker image which periodically polls the RSS feed of a site, looking for new items.

When a new item appears, it should:

  • Fetch the HTML and extract all links
  • Feed each of those into ArchiveBox
  • Feed the page itself into ArchiveBox

The basic idea is to preserve a copy of anything that has been linked to. If those links ever go dead, the referencing page can then be updated with a screenshot, quote, etc.
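
For reference, the core loop is roughly the sketch below (illustrative only: feedparser and BeautifulSoup are assumed choices here, not confirmed dependencies, and the feed URL is a placeholder).

    # Sketch: poll a feed, pull the links out of each item, and report what
    # would be submitted. Skipping already-processed items is omitted here.
    import feedparser
    import requests
    from bs4 import BeautifulSoup

    FEED_URL = "https://example.com/rss.xml"  # placeholder feed URL

    def extract_links(page_url: str) -> list[str]:
        """Fetch a page and return all absolute links found within it."""
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].startswith("http")]

    for entry in feedparser.parse(FEED_URL).entries:
        for link in extract_links(entry.link):
            print("would submit to ArchiveBox:", link)
        print("would submit the page itself:", entry.link)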

As an additional benefit, ArchiveBox automatically submits to the Internet Archive, so it might also be possible to update dead links with a pointer to the Wayback Machine.




Activity


assigned to @btasker

mentioned in issue websites/BEN#25

The intention is that this'll be a docker image - it can then even be triggered by cron or run as a k8s CronJob (which is what I'll be doing).

With that in mind, I'll probably crib quite heavily from stuff that I've built in the past (in fact, I'll probably take a chunk of this almost verbatim).


mentioned in commit 0b207d048ed2d3d59173c69aed6fc8b6c1dfb90c

Commit: 0b207d048ed2d3d59173c69aed6fc8b6c1dfb90c
Author: B Tasker
Date: 2024-08-10T12:16:07.000+01:00

Message

chore: add copy of python-mastodon-rss-bot to use as a base (utilities/auto-blog-link-preserver#1)

+241 -0 (241 lines changed)

Ah, I've run into an issue - maybe I should have checked this first.

I assumed that ArchiveBox had a REST API, because there's a (broken) link to one in the docs:

Screenshot of archivebox docs

But, although there's one in the works, it's not been included in a release yet.

So there are two options here:

  • Move my ArchiveBox install to the 0.8.1 RC (0.8.0 seems to have disappeared)
  • Generate a k8s Job to run archivebox add [url] for each URL

The latter would allow me to stay on ArchiveBox stable, but it'd also make this implementation Kubernetes-only. Not sure I want to do that.
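
For the k8s Job route, each run would need to create a Job per submission - something like this sketch using the kubernetes Python client (the namespace is a placeholder, the image's entrypoint is assumed to be the archivebox CLI, and the data volume is omitted for brevity, though it'd be required in practice).

    # Sketch: create a one-shot Job that runs `archivebox add <url>`.
    from kubernetes import client, config

    def archive_url(url: str) -> None:
        config.load_kube_config()  # or load_incluster_config() in-cluster
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(generate_name="archivebox-add-"),
            spec=client.V1JobSpec(
                backoff_limit=1,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="archivebox",
                                image="archivebox/archivebox:latest",
                                # assumes the entrypoint is the archivebox CLI;
                                # the data volume mount is omitted here
                                args=["add", url],
                            )
                        ],
                    )
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace="archivebox", body=job)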

mentioned in commit sysconfigs/bumblebee-kubernetes-charts@cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a

Commit: sysconfigs/bumblebee-kubernetes-charts@cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a
Author: ben
Date: 2024-08-10T12:43:57.000+01:00

Message

chore: Update archivebox to use the 0.8.1RC (utilities/auto-blog-link-preserver#1)

This is to make the API available

+2 -2 (4 lines changed)

That's worked: hitting /api/v1/docs on my instance displays the Swagger docs

archivebox api docs

So, it looks like there are two options around auth:

  • Manually click into the web interface and create an API Key
  • Call /api/v1/auth/get_api_token to get a token for the current user

I suspect the second will create a new token each time, so we just need to support having the user give us a token.
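
Client side, the token flow would look something like the sketch below (the request payload and the response field name are assumptions from a quick skim of the Swagger docs, so they'd need verifying).

    # Sketch: exchange credentials for a token to send on later requests.
    import requests

    ARCHIVEBOX_URL = "http://archivebox.local:8000"  # placeholder instance URL

    def get_api_token(username: str, password: str) -> str:
        resp = requests.post(
            f"{ARCHIVEBOX_URL}/api/v1/auth/get_api_token",
            json={"username": username, "password": password},  # assumed payload
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["token"]  # assumed response field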


mentioned in commit 4f1c412f8988b5831e41fa503c6a006d860140a3

Commit: 4f1c412f8988b5831e41fa503c6a006d860140a3
Author: B Tasker
Date: 2024-08-10T12:53:14.000+01:00

Message

chore: switch to token as an auth method (utilities/auto-blog-link-preserver#1)

+2 -2 (4 lines changed)

Ah, I can't currently create an API token.


Well, crap. From the ArchiveBox project:

    Sorry guys :dev is under heavy active work right now, might be broken a bit as I work on the new schemas. Stick with the tagged :0.8.0-rc or :stable for now.

But the 0.8.0-rc tag no longer exists in the package registry.

I'll spin out a separate issue to track temporarily creating my own image - I'll get it to pull down and build the 0.8.0-rc tag from the repo.

I've closed #2 as Won't Fix.

The images don't want to build because of some dependency issues.

Rather than chasing those down, it occurred to me that there's a third option I'd not considered: turning off auth for link submission (ArchiveBox is only accessible locally anyway) and scraping the web form to submit.

If I can get that working, I can revisit if/when ArchiveBox makes its next stable release.
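
In rough terms, the scraping approach is just a form POST against the instance (the path and field names below are guesses and would need checking against the rendered add page).

    # Sketch: with auth disabled, submit a URL straight to the add form.
    import requests

    ARCHIVEBOX_URL = "http://archivebox.local:8000"  # placeholder instance URL

    def submit_url(url: str) -> None:
        resp = requests.post(
            f"{ARCHIVEBOX_URL}/add/",
            data={"url": url, "depth": "0"},  # assumed form field names
            timeout=120,  # archiving is slow, allow plenty of time
        )
        resp.raise_for_status()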

mentioned in commit sysconfigs/bumblebee-kubernetes-charts@0ccc609a7fdf2bccb02d206bf544437910dbdba6

Commit: sysconfigs/bumblebee-kubernetes-charts@0ccc609a7fdf2bccb02d206bf544437910dbdba6
Author: ben
Date: 2024-08-10T14:26:39.000+01:00

Message

chore: Revert "chore: Update archivebox to use the 0.8.1RC (utilities/auto-blog-link-preserver#1)"

This reverts commit cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a.

The image in use is broken

+2 -2 (4 lines changed)

Note: it didn't like moving between versions, so I've blown away the PV to start again from scratch.

I've run into some quite significant issues with ArchiveBox in utilities/auto-blog-link-preserver#6

I initially thought the issues were the result of me running on K8S with NFS-backed storage, but I've moved to a Docker container backed by locally attached NVMe and it's still happening.

The system goes quite unresponsive after URLs have been submitted, with further submissions (or attempts to view the UI) sometimes resulting in the container logging errors about lock contention:

    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.OperationalError: database is locked

This is pretty problematic for a script that's trying to submit multiple URLs. Although I could concatenate the URLs into one long list and submit that, I'd still need to do it per referencing page (otherwise, how will I know whether to mark success locally?)

I like the look and feel of ArchiveBox, but I fear this is going to be untenable.

If I want to stick with ArchiveBox, I think the options are:

  • Look at whether I can force it to use Postgres instead of SQLite: it seems to be relying on Django's ORM, so in theory it might be possible (depends on a ton of stuff though) - see the sketch after this list
  • Turn most of the ArchiveBox functionality off (so that it spends less time archiving)
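
For reference, the first option would amount to the standard Django DATABASES switch - roughly the sketch below, with placeholder credentials (whether ArchiveBox actually exposes a way to override its settings like this is the open question).

    # Sketch: pointing Django's ORM at Postgres instead of SQLite.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "archivebox",
            "USER": "archivebox",      # placeholder credentials
            "PASSWORD": "changeme",
            "HOST": "postgres.local",  # placeholder host
            "PORT": "5432",
        }
    }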

If I'm OK with moving off ArchiveBox, then I could look at alternatives, including:

  • Reminiscence
  • LinkWarden
  • Wallabag (I already run an instance of this for other reasons, but it uses Reader View, so wouldn't be much use for screenshots)

I don't really want to switch if I can help it.

So, let's start by making ArchiveBox do less work per URL.

The formats I'm likely to care about for this project are:

  • Single page (or wget, I don't really need both)
  • Screenshot

The screenshot requires Chrome to be fired up, so it probably makes more sense to have single page + screenshot than it does wget + screenshot.

Stuff like PDF and a DOM dump is nice to have, but I don't really need it.

So, let's try:

-e SAVE_WARC="false" \
-e SAVE_WGET="false" \
-e SAVE_FAVICON="false" \
-e SAVE_DOM="false" \
-e SAVE_SINGLEFILE="true" \
-e SAVE_READABILITY="true" \
-e SAVE_MERCURY="false" \
-e SAVE_GIT="false" \
-e SAVE_MEDIA="false" \

It's still failing sometimes, but it's definitely much better than it was.

The CPUs are busy as hell (all Chrome and Node), but I think we can probably work with this: have the script collapse each page's URL submissions into a single request and then wait a bit between them (realistically, there'd normally only be a new page or two in an RSS feed per run).
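
Something like the sketch below (the helper names and the delay are illustrative, and it assumes the add form accepts newline-separated URL lists, which needs verifying).

    # Sketch: one submission per referencing page, with a pause between pages
    # so ArchiveBox gets some breathing room.
    import time
    from typing import Callable

    SUBMISSION_DELAY_SECONDS = 60  # arbitrary settling time between pages

    def process_pages(pages: dict[str, list[str]],
                      submit: Callable[[str], None]) -> None:
        for page_url, links in pages.items():
            # Collapse the page and all of its outbound links into a single
            # newline-separated submission (assumed to be accepted by /add)
            submit("\n".join([page_url] + links))
            # Stand-in for the real per-page success tracking
            print("marked done locally:", page_url)
            time.sleep(SUBMISSION_DELAY_SECONDS)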

Although we could see if wget is faster, I think we'd lose out in the long run: I picked a couple of the archived links to check, and the screenshot + single page aren't much use because they both contain a tracking consent banner (which, of course, can't be dismissed). If I ever needed to grab a quote from one of these pages, the PDF and readability versions would be essential.

Still seeing failures for roughly 1 in 10 of the pages listed in the test RSS feed :(

I think I'm going to bite the bullet and stand up a linkwarden instance to see how that compares (jira-projects/LAN#172)

Linkwarden seems to be behaving much better (and is backed by PostgreSQL so won't run into the same lock contention issues).

I'm going to close out the existing issues in this project so that I can tag/label the ArchiveBox-related stuff appropriately before we abandon it.

I'll raise new tickets under a new milestone for making this code work with LinkWarden's API.