This project is borne out of websites/BEN#25
The overall plan is to create a small docker image which periodically polls the RSS feed of a site, looking for new items.
When a new item appears, it should:
The base idea here is to preserve a copy of things that have been linked to. If those links ever go dead, the referencing page can then potentially be updated with a screenshot/quote etc.
As an additional benefit, ArchiveBox automatically submits to the Internet Archive, so it might also be possible to update with a link to the wayback machine
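The polling loop described above can be sketched roughly like this. This is a minimal illustration, not the project's actual code: it assumes a standard RSS 2.0 feed and tracks item identity by the item's link (the function names and the "seen" set are my own placeholders):

```python
import xml.etree.ElementTree as ET

def extract_links(feed_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]

def new_links(feed_xml: str, seen: set[str]) -> list[str]:
    """Links present in the feed that haven't been processed yet."""
    return [link for link in extract_links(feed_xml) if link not in seen]
```

A real run would fetch the feed over HTTP on a timer, persist the seen set somewhere, and hand each new link off for archiving.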
Activity
10-Aug-24 11:09
assigned to @btasker
10-Aug-24 11:11
mentioned in issue websites/BEN#25
10-Aug-24 11:12
The intention is that this'll be a docker image - it can then even be triggered by cron or run as a k8s CronJob (which is what I'll be doing).

With that in mind, I'll probably crib quite heavily from stuff that I've built in the past (in fact, I'll probably take a chunk of this almost verbatim)
10-Aug-24 11:16
mentioned in commit 0b207d048ed2d3d59173c69aed6fc8b6c1dfb90c
Message
chore: add copy of python-mastodon-rss-bot to use as a base (utilities/auto-blog-link-preserver#1)
10-Aug-24 11:41
Ah, ran into an issue - maybe I should have checked this first.

I assumed that ArchiveBox had a REST API, because there's a (broken) link to it in the docs.

But, although there's one in the works, it hasn't been included in a release yet.
So there are two options here: move to the release candidate that includes the API, or exec into the container and run archivebox add [url] for each job.

The latter would allow me to stay on archivebox stable, but it'd also make this implementation Kubernetes only. Not sure I want to do that.
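The exec-based option would look something like the sketch below. The pod and namespace names are hypothetical placeholders; building the argv list as data keeps it testable without a live cluster:

```python
import subprocess  # used by the commented-out invocation at the bottom

def archive_cmd(url: str, pod: str = "archivebox-0",
                namespace: str = "default") -> list[str]:
    """Build the kubectl command that runs `archivebox add` inside the pod.

    Pod and namespace are placeholders - substitute your own.
    """
    return ["kubectl", "exec", "-n", namespace, pod, "--",
            "archivebox", "add", url]

# subprocess.run(archive_cmd("https://example.com/post"), check=True)
```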
10-Aug-24 11:44
mentioned in commit sysconfigs/bumblebee-kubernetes-charts@cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a
Message
chore: Update archivebox to use the 0.8.1RC (utilities/auto-blog-link-preserver#1)
This is to make the API available
10-Aug-24 11:52
That's worked: hitting the path /api/v1/docs on my instance displays the swagger docs.

So, it looks like there are two options around auth: have the user supply an API token directly, or call /api/v1/auth/get_api_token to get a token for the current user.

I suspect the second will create a new token each time, so we just need to support having the user give us a token.
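With a user-supplied token, the submission call could be built along these lines. Note the endpoint path and the payload shape here are my guesses from the swagger docs at /api/v1/docs, not confirmed - check your instance before relying on them:

```python
import json
import urllib.request

def build_add_request(base_url: str, urls: list[str],
                      token: str) -> urllib.request.Request:
    """Build a POST submitting URLs to the v1 API.

    The /api/v1/cli/add path and the api_key field are assumptions
    based on the instance's swagger docs - verify against /api/v1/docs.
    """
    payload = json.dumps({"urls": urls, "api_key": token}).encode()
    return urllib.request.Request(
        f"{base_url}/api/v1/cli/add",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```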
10-Aug-24 11:54
mentioned in commit 4f1c412f8988b5831e41fa503c6a006d860140a3
Message
chore: switch to token as an auth method (utilities/auto-blog-link-preserver#1)
10-Aug-24 12:04
Ah, can't currently create an API token

well, crap

But the 0.8.0-rc tag no longer exists in the package registry.

I'll spin out a separate issue to track temporarily creating my own image - I'll get it to pull down and build the 0.8.0-rc tag from the repo

10-Aug-24 13:28
I've closed #2 as Won't Fix
The images don't want to build because of some dependency issues.
Rather than chasing those down, it occurred to me that there's a third option I've not considered: turning off auth for link submission (ArchiveBox is only accessible locally anyway) and using scraping to submit.

If I can get that working, I can revisit if/when ArchiveBox makes its next stable release
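The scraping approach would boil down to replaying the web UI's form POST. A sketch, assuming auth is off and the form field is called "url" with one URL per line - both assumptions that need checking against the page source:

```python
import urllib.parse
import urllib.request

def build_form_submit(base_url: str, urls: list[str]) -> urllib.request.Request:
    """Build a form-encoded POST to the /add page.

    Field name "url" (newline-separated URLs) is an assumption -
    inspect the rendered form before using this.
    """
    body = urllib.parse.urlencode({"url": "\n".join(urls)}).encode()
    return urllib.request.Request(f"{base_url}/add/", data=body, method="POST")
```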
10-Aug-24 13:31
mentioned in commit sysconfigs/bumblebee-kubernetes-charts@0ccc609a7fdf2bccb02d206bf544437910dbdba6
Message
chore: Revert "chore: Update archivebox to use the 0.8.1RC (utilities/auto-blog-link-preserver#1)"
This reverts commit cf5bf20ea03e4cd3691a4424e78b3b69f7b4742a.
The image in use is broken
10-Aug-24 13:37
Note: it didn't like the movement between versions, so I've blown away the PV to start again from scratch
10-Aug-24 17:45
I've run into some quite significant issues with ArchiveBox in utilities/auto-blog-link-preserver#6
I initially thought that issues were the result of me running on K8S with NFS backed storage, but I've moved to a docker container backed by locally attached NVMe and it's still happening.
The system becomes quite unresponsive after URLs have been submitted, with further submissions (or attempts to view the UI) sometimes resulting in the container logging about lock contention
This is pretty problematic for a script that's trying to submit multiple URLs. Although I could concatenate the URLs into one long list and submit that, I'd still need to do it per referencing page (otherwise, how will I know whether to mark success locally?)
I like the look and feel of ArchiveBox, but I fear this is going to be untenable.
If I want to stick with ArchiveBox, I think the options are:
If I'm OK with moving off ArchiveBox, then I could look at alternatives, including
10-Aug-24 18:16
I don't really want to switch if I can help it.
So, let's start by making ArchiveBox do less work per URL.
The formats I'm likely to care about for this project are
The screenshot requires Chrome to be fired up, so it probably makes more sense to have single page + screenshot than it does wget + screenshot.
Stuff like PDF and a DOM dump is nice to have, but I don't really need it.
So, let's try
It's still failing sometimes, but is definitely much better than it was
10-Aug-24 18:22
The CPUs are busy as hell (all Chrome and Node) but I think we can probably work with this: have the script collapse the URL submissions into a single request and then wait a bit between them (and, realistically, there'd normally only be one or two new pages in an RSS feed per run).
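The collapse-and-wait idea could be sketched like this; the batch size and pause are arbitrary numbers to tune, and `submit` stands in for whatever actually hands the batch to ArchiveBox:

```python
import time

def submit_in_batches(urls: list[str], submit,
                      batch_size: int = 5, pause: float = 30.0) -> None:
    """Group URLs into batches and pause between submissions.

    `submit` is a callable taking one batch (a list of URLs);
    batch_size and pause are placeholder values to experiment with.
    """
    for i in range(0, len(urls), batch_size):
        submit(urls[i:i + batch_size])
        if i + batch_size < len(urls):  # no pointless sleep after the last batch
            time.sleep(pause)
```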
Although we could see if wget is faster, I think we'd lose out in the long run. I picked a couple of the archived links to check, and the screenshot + single page aren't much use because they both contain a tracking consent banner (which, of course, can't be dismissed) - if I ever needed to grab a quote from it, the PDF and readability outputs would be essential.

10-Aug-24 18:38
Still seeing failures for 1 in 10 pages listed in the test RSS feed :(
I think I'm going to bite the bullet and stand up a linkwarden instance to see how that compares (jira-projects/LAN#172)
10-Aug-24 19:03
Linkwarden seems to be behaving much better (and is backed by PostgreSQL so won't run into the same lock contention issues).
I'm going to close out the existing issues in this project so that I can tag/label the ArchiveBox related stuff appropriately before we abandon it.
I'll raise new tickets under a new milestone for making this code work with LinkWarden's API