However, requests to ArchiveBox sometimes seem to hang and we end up blocked.
So, I added a timeout to unblock us.
At the moment we just proceed on to the next page, so a timed-out URL would end up being retried on a later run.
However, that may not be the correct behaviour:
firstly, the next set of URLs then tends to fail too (I think I've probably overloaded ArchiveBox).
Secondly, it looks like the URLs are accepted into ArchiveBox anyway, so a subsequent submission just creates another copy. So, a page with lots of links might time out and create duplicate ArchiveBox snapshots on every attempt.
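For reference, the timeout-bounded submission can be sketched like this — a minimal stdlib-only sketch, assuming ArchiveBox's `/add/` endpoint accepts a form-encoded `url` field (the helper name, base URL, and field name are illustrative, not the project's actual code):

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

def submit_url(url: str, base_url: str = "http://127.0.0.1:8000",
               timeout: float = 10.0) -> bool:
    """POST a URL to ArchiveBox's /add/ endpoint, bounded by a timeout.

    Hypothetical sketch: the "url" form field is an assumption about
    the /add/ form. Returns False on timeout or connection failure, so
    a hung ArchiveBox can no longer block the whole run.
    """
    data = urllib.parse.urlencode({"url": url}).encode()
    try:
        with urllib.request.urlopen(f"{base_url}/add/", data=data,
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (TimeoutError, socket.timeout, urllib.error.URLError):
        return False
```

With something like this, a timed-out URL is simply reported as failed and gets picked up again on a later pass.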
August 10, 2024 - 13:36:15
Django version 3.1.14, using settings 'core.settings'
Starting development server at http://0.0.0.0:8000/
Quit the server with CONTROL-C.
Ah, looks like it's been quietly crapping itself in the background
"POST /add/ HTTP/1.1" 200 4158
Internal Server Error: /add/
Traceback (most recent call last):
File "/app/archivebox/index/sql.py", line 48, in write_link_to_sql_index
info["timestamp"] = Snapshot.objects.get(url=link.url).timestamp
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 429, in get
raise self.model.DoesNotExist(
core.models.Snapshot.DoesNotExist: Snapshot matching query does not exist.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 589, in update_or_create
obj = self.select_for_update().get(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 429, in get
raise self.model.DoesNotExist(
core.models.Snapshot.DoesNotExist: Snapshot matching query does not exist.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
return Database.Cursor.execute(self, query, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: database is locked
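The `database is locked` at the root of that trace is SQLite's single-writer model at work: while one connection holds the write lock, any other writer gets SQLITE_BUSY and, once its busy timeout expires, this exact error. A self-contained demonstration (illustrative table only, nothing to do with ArchiveBox's actual schema):

```python
import os
import sqlite3
import tempfile

# While one connection holds SQLite's write lock, a second writer
# fails with "database is locked" once its busy timeout expires.
db = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db, isolation_level=None)
writer.execute("CREATE TABLE snapshot (url TEXT)")
writer.execute("BEGIN IMMEDIATE")  # take and hold the write lock
writer.execute("INSERT INTO snapshot VALUES ('https://example.com')")

other = sqlite3.connect(db, timeout=0.2)  # give up after 200ms, not the 5s default
msg = ""
try:
    other.execute("INSERT INTO snapshot VALUES ('https://example.org')")
except sqlite3.OperationalError as exc:
    msg = str(exc)

print(msg)  # → database is locked
writer.execute("ROLLBACK")
other.close()
writer.close()
```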
The interface is broken at the moment, so I've had to roll the pod.
The data/archive/ subfolder contains the bulk archived content, and it supports being stored on a slower remote server (SMB/NFS/SFTP/etc.) or object store (S3/B2/R2/etc.). For data integrity and performance reasons, the rest of the data/ directory (data/ArchiveBox.conf, data/logs, etc.) must be stored locally while ArchiveBox is running.
Even with the NVMe-backed setup, I'm still seeing 502s.
I guess the answer is that we're trying to push too much in at once, even with a 1s pause between successful submissions (or 20s after a failure).
Container shows lock contention again
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/usr/local/lib/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
return Database.Cursor.execute(self, query, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.OperationalError: database is locked
Activity
10-Aug-24 15:17
assigned to @btasker
10-Aug-24 15:18
mentioned in commit 94e2eb6268234baa45a62a34a083cfab6a353d81
Message
feat: submit URLs into archivebox (and apply a timeout) - utilities/auto-blog-link-preserver#6
10-Aug-24 15:20
The commit above introduces submission.
10-Aug-24 15:32
It's not entirely clear why ArchiveBox seems to lock up; it's not using much CPU.
It looks like the container relies on Django's development runserver.
Yep, the container logs confirm it.
That should be multi-threaded, though.
10-Aug-24 15:45
Ah, looks like it's been quietly crapping itself in the background
The interface is broken at the moment, so I've had to roll the pod.
10-Aug-24 15:50
Inevitably, it looks like the issue is I/O: the pod is backed by NFS, which looks like it isn't fast enough for this.
Found this info in the ArchiveBox wiki
mea culpa
10-Aug-24 15:57
OK, I'll come back to fixing my AB install in a bit then; let's finish submission off.
10-Aug-24 16:04
mentioned in commit dd7800fb0f597d2ec85e59e947e814ac346c687e
Message
feat: submit URLs into archivebox (utilities/auto-blog-link-preserver#6)
Backoff if things seem to be failing
10-Aug-24 17:23
Swapped to running in a Docker container backed by an NVMe drive.
Submission shows the same issues.
Interestingly, looking at developer tools when submitting through the Web UI, the front-end doesn't seem to bother waiting for a response:
The initiator there shows as some JS on the add page. Sure enough, it doesn't wait.
So, although it pains me to do it, it looks like the answer here is actually to fire and forget, because that's what the front-end does.
10-Aug-24 17:29
Even with the NVMe-backed setup, I'm still seeing 502s.
I guess the answer is that we're trying to push too much in at once, even with a 1s pause between successful submissions (or 20s after a failure).
Container shows lock contention again
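The pacing logic (a short pause after a success, a much longer back-off after a failure) amounts to something like the following sketch, where `submit` stands in for whatever function performs the POST — the names and structure are illustrative, not the project's actual code:

```python
import time

def submit_all(urls, submit, ok_pause=1.0, fail_pause=20.0):
    """Submit URLs one at a time, pacing requests so ArchiveBox isn't
    flooded: a short pause after each success, a longer back-off after
    a failure. Returns the URLs that still failed."""
    failures = []
    for url in urls:
        if submit(url):
            time.sleep(ok_pause)
        else:
            failures.append(url)
            time.sleep(fail_pause)
    return failures
```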
10-Aug-24 17:45
mentioned in issue #1
10-Aug-24 18:23
mentioned in commit 60c4346afcd3dd302c93d4969357a1a3dd2be723
Message
fix: treat a timeout as success (utilities/auto-blog-link-preserver#6)
This replicates what the UI code does.
Later on, we should improve this by validating that the URL was actually added.
However, as the intent is to switch to the REST API once it's available, this is acceptable for now.
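What "treat a timeout as success" looks like in practice, mirroring the fire-and-forget behaviour of the UI's JS — a stdlib-only, hypothetical sketch (function name, base URL, and the `url` form field are assumptions, not the project's actual code):

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

def submit_fire_and_forget(url, base_url="http://127.0.0.1:8000", timeout=5.0):
    """POST to /add/ and treat a timeout as success, mirroring the web
    UI's fire-and-forget behaviour. Returns False only when the request
    fails outright (connection refused, HTTP error, etc.)."""
    data = urllib.parse.urlencode({"url": url}).encode()
    try:
        with urllib.request.urlopen(f"{base_url}/add/", data=data,
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (TimeoutError, socket.timeout):
        # The server is probably still chewing on the submission;
        # assume it was accepted, as the web UI's JS effectively does.
        return True
    except urllib.error.URLError as exc:
        # A timeout can also surface wrapped inside URLError.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return True
        return False
```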