utilities/auto-blog-link-preserver#1: Initial Build & Design



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 10-Aug-24 11:09



Description

This project was born out of websites/BEN#25.

The overall plan is to create a small docker image which periodically polls the RSS feed of a site, looking for new items.

When a new item appears, it should:

  • Fetch the HTML and extract all links
  • Feed each of those into ArchiveBox
  • Feed the page itself into ArchiveBox

The basic idea is to preserve a copy of anything that has been linked to. If those links ever go dead, the referencing page can then be updated with a screenshot, quote, etc.
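
For reference, the core loop is roughly the sketch below (illustrative only: feedparser and BeautifulSoup are assumed choices here, not confirmed dependencies, and the feed URL is a placeholder).

    # Sketch: poll a feed, pull the links out of each item, and report what
    # would be submitted. Skipping already-processed items is omitted here.
    import feedparser
    import requests
    from bs4 import BeautifulSoup

    FEED_URL = "https://example.com/rss.xml"  # placeholder feed URL

    def extract_links(page_url: str) -> list[str]:
        """Fetch a page and return all absolute links found within it."""
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].startswith("http")]

    for entry in feedparser.parse(FEED_URL).entries:
        for link in extract_links(entry.link):
            print("would submit to ArchiveBox:", link)
        print("would submit the page itself:", entry.link)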

As an additional benefit, ArchiveBox automatically submits to the Internet Archive, so it might also be possible to update dead links with a pointer to the Wayback Machine.




Activity


assigned to @btasker

mentioned in issue websites/BEN#25

The intention is that this'll be a docker image - it can then even be triggered by cron or run as a k8s CronJob (which is what I'll be doing).

With that in mind, I'll probably crib quite heavily from stuff that I've built in the past (in fact, I'll probably take a chunk of this almost verbatim).


mentioned in commit 0b207d048ed2d3d59173c69aed6fc8b6c1dfb90c

Commit: 0b207d048ed2d3d59173c69aed6fc8b6c1dfb90c
Author: B Tasker
Date: 2024-08-10T12:16:07.000+01:00

Message

chore: add copy of python-mastodon-rss-bot to use as a base (utilities/auto-blog-link-preserver#1)

+241 -0 (241 lines changed)

Ah, I've run into an issue - maybe I should have checked this first.

I assumed that ArchiveBox had a REST API, because there's a (broken) link to one in the docs:

Screenshot of archivebox docs

But, although there's one in the works, it's not been included in a release yet.

So there are two options here:

  • Move my ArchiveBox install to the 0.8.1 RC (0.8.0 seems to have disappeared)
  • Generate a k8s Job to run archivebox add [url] for each URL

The latter would allow me to stay on ArchiveBox stable, but it'd also make this implementation Kubernetes-only. Not sure I want to do that.
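
For the k8s Job route, each run would need to create a Job per submission - something like this sketch using the kubernetes Python client (the namespace is a placeholder, the image's entrypoint is assumed to be the archivebox CLI, and the data volume is omitted for brevity, though it'd be required in practice).

    # Sketch: create a one-shot Job that runs `archivebox add <url>`.
    from kubernetes import client, config

    def archive_url(url: str) -> None:
        config.load_kube_config()  # or load_incluster_config() in-cluster
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(generate_name="archivebox-add-"),
            spec=client.V1JobSpec(
                backoff_limit=1,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="archivebox",
                                image="archivebox/archivebox:latest",
                                # assumes the entrypoint is the archivebox CLI;
                                # the data volume mount is omitted here
                                args=["add", url],
                            )
                        ],
                    )
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace="archivebox", body=job)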

mentioned in commit sysconfigs/bumblebee-kubernetes-charts@cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a

Commit: sysconfigs/bumblebee-kubernetes-charts@cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a
Author: ben
Date: 2024-08-10T12:43:57.000+01:00

Message

chore: Update archivebox to use the 0.8.1RC (utilities/auto-blog-link-preserver#1)

This is to make the API available

+2 -2 (4 lines changed)

That's worked: hitting /api/v1/docs on my instance displays the Swagger docs

archivebox api docs

So, it looks like there are two options around auth:

  • Manually click into the web interface and create an API Key
  • Call /api/v1/auth/get_api_token to get a token for the current user

I suspect the second will create a new token each time, so we just need to support having the user give us a token.
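
Client side, the token flow would look something like the sketch below (the request payload and the response field name are assumptions from a quick skim of the Swagger docs, so they'd need verifying).

    # Sketch: exchange credentials for a token to send on later requests.
    import requests

    ARCHIVEBOX_URL = "http://archivebox.local:8000"  # placeholder instance URL

    def get_api_token(username: str, password: str) -> str:
        resp = requests.post(
            f"{ARCHIVEBOX_URL}/api/v1/auth/get_api_token",
            json={"username": username, "password": password},  # assumed payload
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["token"]  # assumed response field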


mentioned in commit 4f1c412f8988b5831e41fa503c6a006d860140a3

Commit: 4f1c412f8988b5831e41fa503c6a006d860140a3
Author: B Tasker
Date: 2024-08-10T12:53:14.000+01:00

Message

chore: switch to token as an auth method (utilities/auto-blog-link-preserver#1)

+2 -2 (4 lines changed)

Ah, I can't currently create an API token.


Well, crap. From the ArchiveBox project:

    Sorry guys :dev is under heavy active work right now, might be broken a bit as I work on the new schemas. Stick with the tagged :0.8.0-rc or :stable for now.

But the 0.8.0-rc tag no longer exists in the package registry.

I'll spin out a separate issue to track temporarily creating my own image - I'll get it to pull down and build the 0.8.0-rc tag from the repo.

I've closed #2 as Won't Fix.

The images don't want to build because of some dependency issues.

Rather than chasing those down, it occurred to me that there's a third option I'd not considered: turning off auth for link submission (ArchiveBox is only accessible locally anyway) and scraping the web form to submit.

If I can get that working, I can revisit if/when ArchiveBox makes its next stable release.
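
In rough terms, the scraping approach is just a form POST against the instance (the path and field names below are guesses and would need checking against the rendered add page).

    # Sketch: with auth disabled, submit a URL straight to the add form.
    import requests

    ARCHIVEBOX_URL = "http://archivebox.local:8000"  # placeholder instance URL

    def submit_url(url: str) -> None:
        resp = requests.post(
            f"{ARCHIVEBOX_URL}/add/",
            data={"url": url, "depth": "0"},  # assumed form field names
            timeout=120,  # archiving is slow, allow plenty of time
        )
        resp.raise_for_status()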

mentioned in commit sysconfigs/bumblebee-kubernetes-charts@0ccc609a7fdf2bccb02d206bf544437910dbdba6

Commit: sysconfigs/bumblebee-kubernetes-charts@0ccc609a7fdf2bccb02d206bf544437910dbdba6
Author: ben
Date: 2024-08-10T14:26:39.000+01:00

Message

chore: Revert "chore: Update archivebox to use the 0.8.1RC (utilities/auto-blog-link-preserver#1)"

This reverts commit cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a.

The image in use is broken

+2 -2 (4 lines changed)

Note: it didn't like moving between versions, so I've blown away the PV to start again from scratch.

I've run into some quite significant issues with ArchiveBox in utilities/auto-blog-link-preserver#6

I initially thought the issues were the result of me running on K8S with NFS-backed storage, but I've moved to a Docker container backed by locally attached NVMe and it's still happening.

The system goes quite unresponsive after URLs have been submitted, with further submissions (or attempts to view the UI) sometimes resulting in the container logging errors about lock contention:

    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.OperationalError: database is locked

This is pretty problematic for a script that's trying to submit multiple URLs. Although I could concatenate the URLs into one long list and submit that, I'd still need to do it per referencing page (otherwise, how will I know whether to mark success locally?)

I like the look and feel of ArchiveBox, but I fear this is going to be untenable.

If I want to stick with ArchiveBox, I think the options are:

  • Look at whether I can force it to use Postgres instead of SQLite: it seems to be relying on Django's ORM, so in theory it might be possible (depends on a ton of stuff though) - see the sketch after this list
  • Turn most of the ArchiveBox functionality off (so that it spends less time archiving)
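
For reference, the first option would amount to the standard Django DATABASES switch - roughly the sketch below, with placeholder credentials (whether ArchiveBox actually exposes a way to override its settings like this is the open question).

    # Sketch: pointing Django's ORM at Postgres instead of SQLite.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "archivebox",
            "USER": "archivebox",      # placeholder credentials
            "PASSWORD": "changeme",
            "HOST": "postgres.local",  # placeholder host
            "PORT": "5432",
        }
    }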

If I'm OK with moving off ArchiveBox, then I could look at alternatives, including:

  • Reminiscence
  • LinkWarden
  • Wallabag (I already run an instance of this for other reasons, but it uses Reader View, so wouldn't be much use for screenshots)

I don't really want to switch if I can help it.

So, let's start by making ArchiveBox do less work per URL.

The formats I'm likely to care about for this project are:

  • Single page (or wget, I don't really need both)
  • Screenshot

The screenshot requires Chrome to be fired up, so it probably makes more sense to have single page + screenshot than it does wget + screenshot.

Stuff like PDF and a DOM dump is nice to have, but I don't really need it.

So, let's try:

-e SAVE_WARC="false" \
-e SAVE_WGET="false" \
-e SAVE_FAVICON="false" \
-e SAVE_DOM="false" \
-e SAVE_SINGLEFILE="true" \
-e SAVE_READABILITY="true" \
-e SAVE_MERCURY="false" \
-e SAVE_GIT="false" \
-e SAVE_MEDIA="false" \

It's still failing sometimes, but it's definitely much better than it was.

The CPUs are busy as hell (all Chrome and Node), but I think we can probably work with this: have the script collapse each page's URL submissions into a single request and then wait a bit between them (realistically, there'd normally only be a new page or two in an RSS feed per run).
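
Something like the sketch below (the helper names and the delay are illustrative, and it assumes the add form accepts newline-separated URL lists, which needs verifying).

    # Sketch: one submission per referencing page, with a pause between pages
    # so ArchiveBox gets some breathing room.
    import time
    from typing import Callable

    SUBMISSION_DELAY_SECONDS = 60  # arbitrary settling time between pages

    def process_pages(pages: dict[str, list[str]],
                      submit: Callable[[str], None]) -> None:
        for page_url, links in pages.items():
            # Collapse the page and all of its outbound links into a single
            # newline-separated submission (assumed to be accepted by /add)
            submit("\n".join([page_url] + links))
            # Stand-in for the real per-page success tracking
            print("marked done locally:", page_url)
            time.sleep(SUBMISSION_DELAY_SECONDS)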

Although we could see if wget is faster, I think we'd lose out in the long run: I picked a couple of the archived links to check, and the screenshot + single page aren't much use because they both contain a tracking consent banner (which, of course, can't be dismissed). If I ever needed to grab a quote from one of these pages, the PDF and readability versions would be essential.

Still seeing failures for roughly 1 in 10 of the pages listed in the test RSS feed :(

I think I'm going to bite the bullet and stand up a linkwarden instance to see how that compares (jira-projects/LAN#172)

Linkwarden seems to be behaving much better (and is backed by PostgreSQL so won't run into the same lock contention issues).

I'm going to close out the existing issues in this project so that I can tag/label the ArchiveBox-related stuff appropriately before we abandon it.

I'll raise new tickets under a new milestone for making this code work with LinkWarden's API.