utilities/auto-blog-link-preserver#6: Submit extracted URLs into Archivebox



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 10-Aug-24 15:17



Description

This should just be a case of stitching the work from #3 and #4 together

URLs that have been extracted from pages need to be submitted into ArchiveBox




Activity


assigned to @btasker

verified

mentioned in commit 94e2eb6268234baa45a62a34a083cfab6a353d81

Commit: 94e2eb6268234baa45a62a34a083cfab6a353d81 
Author: B Tasker
Date: 2024-08-10T16:17:47.000+01:00 

Message

feat: submit URLs into archivebox (and apply a timeout) - utilities/auto-blog-link-preserver#6

+18 -3 (21 lines changed)

The commit above introduces submission.

However, requests to ArchiveBox seem to hang sometimes and we end up blocked.

So, I added a timeout to unblock us.
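
A minimal sketch of a submission with a timeout, assuming the URL is POSTed as a form field to ArchiveBox's `/add/` path (the path that appears in the container logs). The function name, the `url` field name, the 10 second default and the injectable `opener` are illustrative assumptions rather than the code in the commit, and a real instance may also need authentication or a CSRF token.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request


def submit_url(base_url, url, timeout=10.0, opener=urllib.request.urlopen):
    """POST one URL to ArchiveBox; True only if a 2xx response arrives in time."""
    # Assumption: the add form accepts the link in a field named "url".
    body = urllib.parse.urlencode({"url": url}).encode()
    request = urllib.request.Request(
        base_url.rstrip("/") + "/add/", data=body, method="POST"
    )
    try:
        # The timeout stops a hung ArchiveBox from blocking the whole run.
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (socket.timeout, urllib.error.URLError):
        # Covers timeouts, refused connections and HTTP errors such as a 502.
        return False
```

The `opener` argument only exists so the function can be exercised without a running ArchiveBox; in normal use it is left at its default.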

At the moment, after a failure we move on to the next page, so a failed URL would only be retried on the next run.

However, that may not be the correct behaviour:

  • Firstly, the next set of URLs tends to fail (I think I've probably overloaded ArchiveBox).
  • Secondly, it looks like the URLs are accepted into ArchiveBox even when the request fails, so a subsequent submission just creates another copy. So, a page with lots of links might fail, and create duplicate ArchiveBox copies, on every run.
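
One way to avoid the duplicate copies described in the second point would be to remember which URLs have already been handed to ArchiveBox and skip them on later runs. A minimal sketch, assuming a local JSON state file; the `SubmittedLog` class and its file layout are hypothetical and not part of the project.

```python
import json
import os


class SubmittedLog:
    """Remembers URLs already handed to ArchiveBox so they are not sent twice."""

    def __init__(self, path):
        self.path = path
        self.urls = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.urls = set(json.load(fh))

    def pending(self, candidates):
        """Return candidates not yet submitted, in order, without repeats."""
        seen = set(self.urls)
        fresh = []
        for url in candidates:
            if url not in seen:
                seen.add(url)
                fresh.append(url)
        return fresh

    def record(self, url):
        """Mark a URL as submitted and persist the log straight away."""
        self.urls.add(url)
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.urls), fh)
```

Calling `record()` at the point a URL is sent, rather than when a response arrives, would fit the observation that ArchiveBox appears to accept URLs even when the request itself fails.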

It's not entirely clear why ArchiveBox seems to lock up; it's not using much CPU

ArchiveBox CPU usage

It looks like the container relies on Django runserver

Yep, container logs confirm it

August 10, 2024 - 13:36:15
Django version 3.1.14, using settings 'core.settings'
Starting development server at http://0.0.0.0:8000/
Quit the server with CONTROL-C.

That should be multi-threaded though

Ah, looks like it's been quietly crapping itself in the background

"POST /add/ HTTP/1.1" 200 4158
Internal Server Error: /add/
Traceback (most recent call last):
  File "/app/archivebox/index/sql.py", line 48, in write_link_to_sql_index
    info["timestamp"] = Snapshot.objects.get(url=link.url).timestamp
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.Snapshot.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 589, in update_or_create
    obj = self.select_for_update().get(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.Snapshot.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: database is locked
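
For reference, `database is locked` is the error SQLite raises when a connection cannot obtain the write lock before its busy timeout expires. The sketch below reproduces it in isolation with two connections to a throwaway database; the file name and `snapshot` table are placeholders, not ArchiveBox's schema, and the zero timeout is only there to make the failure immediate.

```python
import os
import sqlite3
import tempfile


def demonstrate_lock():
    """Return the error message a second writer gets while the first holds the lock."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.sqlite3")

        # isolation_level=None lets us issue BEGIN/ROLLBACK explicitly.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("CREATE TABLE snapshot (url TEXT)")
        # BEGIN IMMEDIATE takes the write lock and holds it until the
        # transaction ends, standing in for a slow write.
        writer.execute("BEGIN IMMEDIATE")
        writer.execute("INSERT INTO snapshot VALUES ('https://example.com/a')")

        # timeout=0 means this connection gives up instead of waiting.
        other = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            other.execute("INSERT INTO snapshot VALUES ('https://example.com/b')")
            return None
        except sqlite3.OperationalError as exc:
            return str(exc)
        finally:
            writer.execute("ROLLBACK")
            other.close()
            writer.close()
```

`demonstrate_lock()` returns the string `database is locked`. With a non-zero timeout the second connection would instead wait for the lock, so the error only surfaces when writes are held for longer than that wait.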

The interface is broken at the moment, so have had to roll the pod

It's not entirely clear why ArchiveBox seems to lock up; it's not using much CPU

Inevitably, it looks like the issue is I/O: the pod's backed by NFS, which doesn't seem to be fast enough for it.

Found this info in the ArchiveBox wiki

The data/archive/ subfolder contains the bulk archived content, and it supports being stored on a slower remote server (SMB/NFS/SFTP/etc.) or object store (S3/B2/R2/etc.). For data integrity and performance reasons, the rest of the data/ directory (data/ArchiveBox.conf, data/logs, etc.) must be stored locally while ArchiveBox is running.

mea culpa

OK, I'll come back to fixing my AB install in a bit then, let's finish submission off

verified

mentioned in commit dd7800fb0f597d2ec85e59e947e814ac346c687e

Commit: dd7800fb0f597d2ec85e59e947e814ac346c687e 
Author: B Tasker
Date: 2024-08-10T17:03:53.000+01:00 

Message

feat: submit URLs into archivebox (utilities/auto-blog-link-preserver#6)

Backoff if things seem to be failing

+17 -33 (50 lines changed)

Swapped to running in a Docker container backed by an NVMe drive.

Submission shows the same issues.

Interestingly, looking at developer tools when doing it through the Web UI, that doesn't seem to bother to wait for a response:

Developer tools of submitting a URL into archivebox

The initiator there shows as being some JS on the add page. Sure enough, it doesn't wait

    document.getElementById('add-form').addEventListener('submit', function(event) {
        document.getElementById('in-progress').style.display = 'block'
        document.getElementById('add-form').style.display = 'none'
        document.getElementById('delay-warning').style.display = 'block'
        setTimeout(function() {
            window.location = '/'
        }, 2000)
        return true
    })

So, although it pains me to do it, it looks like the answer here is actually to fire and forget, because that's what the front-end does.
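
A minimal sketch of what fire and forget could look like on the submitting side: a timeout is reported as success, on the assumption (taken from the front-end behaviour) that ArchiveBox has accepted the URL by then. `fire_and_forget` and the `send` callable are hypothetical names; `send` stands in for whatever performs the POST and is assumed to raise a standard timeout exception when no response arrives.

```python
import socket


def fire_and_forget(send, url):
    """Call send(url); report success if it succeeds or merely times out."""
    try:
        return bool(send(url))
    except (socket.timeout, TimeoutError):
        # Assumption mirrored from the web UI: the URL has been handed over,
        # we simply did not wait around for the response.
        return True
```

Other exceptions are deliberately left to propagate, so a refused connection is still treated as a failure. If the POST is made with a third-party HTTP client, its own timeout exception type would need catching instead.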

Even with the NVMe-backed setup, I'm still seeing 502s.

I guess the answer is that we're trying to push too much in at once, even with a 1s pause between successful submissions (or 20s after a failure).
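
The pacing described there can be sketched roughly as follows. The function and parameter names are illustrative, `submit` is any callable returning True on success, and only the 1s and 20s pauses come from this issue.

```python
import time


def submit_all(urls, submit, sleep=time.sleep, ok_pause=1.0, fail_pause=20.0):
    """Submit URLs one at a time, pausing longer after a failure.

    Returns the list of URLs whose submission failed.
    """
    failed = []
    for url in urls:
        if submit(url):
            # Short breather between successful submissions.
            sleep(ok_pause)
        else:
            # Back off to give ArchiveBox time to recover.
            failed.append(url)
            sleep(fail_pause)
    return failed
```

The `sleep` argument is injectable purely so the loop can be tested without actually waiting. Given the lock contention in the logs, longer pauses or a growing back-off may be needed in practice.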

Container shows lock contention again

    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.OperationalError: database is locked

mentioned in issue #1

verified

mentioned in commit 60c4346afcd3dd302c93d4969357a1a3dd2be723

Commit: 60c4346afcd3dd302c93d4969357a1a3dd2be723 
Author: B Tasker
Date: 2024-08-10T18:26:23.000+01:00 

Message

fix: treat a timeout as success (utilities/auto-blog-link-preserver#6)

This replicates what the UI code does.

Later on, we should improve this - validating that the URL was added.

However, as the intent is to switch to the REST API once available, accepting this for now

+17 -4 (21 lines changed)