utilities/auto-blog-link-preserver#21: Script-side periodic duplicate support



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: vnext
Created: 13-Aug-24 12:11



Description

There are currently two supported modes (depending on whether duplicate protection is enabled in Linkwarden or not):

  • No duplicates
  • Create a duplicate every time we see a link re-included

Neither is necessarily what we want.

If, for some reason, I'm regularly linking out to https://example.com/somepage, I probably don't want 100s of copies of it to appear in Linkwarden (i.e. I want duplicate protection).

If, however, I link to it periodically (over the course of months or years), I might want duplicates because they can show how the destination has changed over time.

So, the idea is that (if this feature is enabled) the script should:

  • Query Linkwarden to see if https://example.com/somepage exists
  • If it does, check when it was last added
  • If now() - added is greater than a configured threshold, add it again

This would obviously rely on Linkwarden's duplicate protection being off (which is actually the default state).
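The age-check step above can be sketched in shell (the variable names and the 90-day threshold are illustrative, not the script's actual configuration; GNU `date` is assumed for parsing the ISO 8601 timestamp):

```shell
#!/bin/sh
# Sketch of the periodic re-add decision, assuming we've already pulled
# the newest lastPreserved value out of Linkwarden's search response.
# GNU date is assumed; names and the threshold are illustrative.
LAST_PRESERVED="2024-08-12T13:00:49.539Z"
THRESHOLD_DAYS=90

last_epoch=$(date -u -d "$LAST_PRESERVED" +%s)
now_epoch=$(date -u +%s)
age_days=$(( (now_epoch - last_epoch) / 86400 ))

if [ "$age_days" -gt "$THRESHOLD_DAYS" ]; then
    echo "stale: submit the link again"
else
    echo "fresh: skip it"
fi
```

If the URL doesn't exist in Linkwarden at all, the search returns no results and the script just submits as normal.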



Activity


assigned to @btasker

There's an API endpoint for searching, but the docs don't give any details on forming searches.

Doing one in the UI results in this request

HTTP request for a Linkwarden search

URL="https://www.bentasker.co.uk/posts/blog/software-development/automatically-preserving-linked-urls-to-defend-against-link-rot.html"

curl \
-G \
-H "Authorization: Bearer $LINKWARDEN_TOKEN" \
-d "searchQueryString=${URL}" \
-d "searchByUrl=true" \
-d "searchByName=false" \
-d "searchByDescription=false" \
-d "searchByTags=false" \
-d "searchByTextContent=false" \
https://$LINKWARDEN_URL/api/v1/links | jq '.response[] | .lastPreserved'

Prints the date.

If I disable duplicate protection and submit that link a second time, running the search gives me two items (and two dates)

$ curl -s \
-G \
-H "Authorization: Bearer $LINKWARDEN_TOKEN" \
-d "searchQueryString=${URL}" \
-d "searchByUrl=true" \
-d "searchByName=false" \
-d "searchByDescription=false" \
-d "searchByTags=false" \
-d "searchByTextContent=false" \
https://$LINKWARDEN_URL/api/v1/links | jq '.response[] | .lastPreserved'
"2024-08-13T12:27:42.733Z"
"2024-08-12T13:00:49.539Z"

It occurs to me that, if we do this, we should also validate the collection (and tags) on anything that comes back.

If I'm using Linkwarden for other stuff and happen to drop a link in, I probably still want the script to preserve a copy, in case I later clear out the stuff I added.
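That validation could be a jq filter over the search response, something like the sketch below (run here against a captured sample; the `collectionId` field name and the payload shape are assumptions about the response, not confirmed against the API docs):

```shell
#!/bin/sh
# Filter search results down to links in the collection the script
# manages, so a manually-added copy elsewhere doesn't suppress
# preservation. The response shape here (collectionId, lastPreserved)
# is an assumption; the sample payload is made up for illustration.
RESPONSE='{"response":[
  {"id":1,"collectionId":7,"lastPreserved":"2024-08-12T13:00:49.539Z"},
  {"id":2,"collectionId":3,"lastPreserved":"2024-08-13T12:27:42.733Z"}
]}'
SCRIPT_COLLECTION=7

# Only results in our collection count towards the "recent copy" check
echo "$RESPONSE" | jq -r --argjson col "$SCRIPT_COLLECTION" \
    '.response[] | select(.collectionId == $col) | .lastPreserved'
```

Anything outside the script's collection would then be treated the same as "no result", so the link still gets submitted.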

There's an unexpected (but logical) caveat with this.

Whilst doing some test runs, I found that some URLs seemed to be bypassing the periodic check because the search returned no results.

For example

https://nearlylegal.co.uk/2024/08/its-not-me-its-you-breaking-up-with-twitter/ returned 0
https://www.theregister.com/2024/07/19/google_to_kill_off_url_shortener/ returned 0

Sure enough, searching in Linkwarden for those URLs doesn't return anything.

But, if I let the script submit the link anyway, Linkwarden's duplicate prevention kicks in.

The reason is that Linkwarden strips the trailing slash from URLs: searching for https://nearlylegal.co.uk/2024/08/its-not-me-its-you-breaking-up-with-twitter (without the slash) instead gets results.

I thought it might be a result of sites serving a redirect, but no, the originals have a trailing slash.
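Before the upstream fix, a possible client-side workaround would have been to normalise the URL the same way before building the search (a sketch; `URL` here is one of the affected links):

```shell
#!/bin/sh
# Drop a single trailing slash (if present) so the search query matches
# what Linkwarden stored.
URL="https://nearlylegal.co.uk/2024/08/its-not-me-its-you-breaking-up-with-twitter/"
SEARCH_URL="${URL%/}"

echo "$SEARCH_URL"
```

The stripped value would then be passed as searchQueryString in place of the original.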

Turns out this was spotted about 20 hours ago and fixed upstream.

Looks like the fix got released in Linkwarden 2.7.0.

Have just successfully tested against it.

mentioned in merge request !1

mentioned in commit ada2ce89d63980b5911b65e9d4e6ef40c83754a7

Commit: ada2ce89d63980b5911b65e9d4e6ef40c83754a7
Author: Ben Tasker
Date: 2024-08-18T17:12:55.000+00:00

Message

feat: implement periodic link duplication support (utilities/auto-blog-link-preserver#21)

  • docs: document periodic duplication
  • fix: don't break if lastPreserved isn't included
  • feat: integrate support for periodic link duplication

    Note: this also adds a new stat - too_new - to record how many links were not submitted because linkwarden already has a recent copy

  • feat: add config option PERIODIC_LINK_DUPLICATION_THRESHOLD to control when links are duplicated
  • chore: finish implementing age check function
  • feat: start implementing checking of resultset
  • feat: implement ability to search linkwarden by URL
+144 -1 (145 lines changed)

mentioned in commit 66afc037f4df4410593345d1a1c85b8e76a4d386

Commit: 66afc037f4df4410593345d1a1c85b8e76a4d386
Author: Ben Tasker
Date: 2024-08-18T17:12:55.000+00:00

Message

Merge branch 'feature-periodic-duplicates' into 'main'

feat: implement periodic link duplication support

Closes #21

See merge request utilities/auto-blog-link-preserver!1

+144 -1 (145 lines changed)