utilities/auto-blog-link-preserver#3: Submit URLs via scraping the Archivebox UI



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 10-Aug-24 13:31



Description

Note: This is intended to be a temporary measure

The stable release of ArchiveBox doesn't yet support the REST API, and I've not had much luck in getting a non-stable release running.

So, for the time being, the plan is to:

  • Enable anonymous submission (i.e. set PUBLIC_ADD_VIEW to true in the deployment)
  • Scrape the add page to extract the CSRF token
  • Submit the add form along with that token

It's safe(ish) for me to do this because my ArchiveBox install is not publicly accessible.




Activity


assigned to @btasker

mentioned in commit sysconfigs/bumblebee-kubernetes-charts@4bb1eb130d2be58573f327859e8f0854e4bca100

Commit: sysconfigs/bumblebee-kubernetes-charts@4bb1eb130d2be58573f327859e8f0854e4bca100 
Author: ben
Date: 2024-08-10T14:31:29.000+01:00 

Message

feat: Enable public submission (utilities/auto-blog-link-preserver#3)

+1 -1 (2 lines changed)

Taking a packet capture of the request shows this:

[Image: packet capture of archivebox add request] (Ignore the 500, that's been fixed)

This isn't as simple as expected either. lxml doesn't like the HTML because there's a typo in the header:

    <head>
        <title>Archived Sites</title>
        <meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
        <link rel="stylesheet" href="/static/admin/css/base.css">
        <link rel="stylesheet" href="/static/admin.css">
        <link rel="stylesheet" href="/static/bootstrap.min.css">
        <script src="/static/jquery.min.js"></script>
        <link rel="stylesheet" href="/static/add.css" />

Parsing it generates:

lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 11 and head, line 17, column 12

I'll report that upstream in a bit

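For now, I'm working around it by telling lxml to recover from errors:

    from lxml import etree
    # recover=True builds a best-effort tree rather than raising XMLSyntaxError
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(r.text, parser=parser)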
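
As a rough illustration of why the recovering parser helps, here's a minimal sketch run against a made-up snippet. The `SAMPLE` markup and the `parse_strict` / `parse_lenient` helper names are assumptions for illustration; they aren't taken from ArchiveBox or from the commit. It assumes `lxml` is installed.

```python
from lxml import etree

# Hypothetical markup: the <link> and <input> tags are valid HTML,
# but are never closed, so the page is not well-formed XML.
SAMPLE = """<html><head>
<link rel="stylesheet" href="/static/add.css">
</head><body><form>
<input type="hidden" name="csrfmiddlewaretoken" value="example-token">
</form></body></html>"""


def parse_strict(text):
    """Parse with lxml's default XML parser (raises on malformed input)."""
    return etree.fromstring(text)


def parse_lenient(text):
    """Parse with recover=True so lxml builds a best-effort tree instead."""
    parser = etree.XMLParser(recover=True)
    return etree.fromstring(text, parser=parser)


if __name__ == "__main__":
    try:
        parse_strict(SAMPLE)
    except etree.XMLSyntaxError as e:
        print(f"strict parse failed: {e}")

    # The recovered tree still contains the hidden input
    root = parse_lenient(SAMPLE)
    field = root.find(".//input[@name='csrfmiddlewaretoken']")
    print(field.attrib["value"])
```

lxml also provides an HTML parser (`etree.HTMLParser`, or the `lxml.html` module) which is designed for markup like this, so that may be a better fit than recovering XML parsing if this stops being a temporary measure.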

mentioned in commit 2c2b543be1bb344bc9d9961adc5cb211191a6cde

Commit: 2c2b543be1bb344bc9d9961adc5cb211191a6cde 
Author: B Tasker
Date: 2024-08-10T15:06:49.000+01:00 

Message

feat: proof-of-concept scraped submission (utilities/auto-blog-link-preserver#3)

+36 -0 (36 lines changed)

OK, the commit above implements the ArchiveBox side of the flow:

    def archivebox_scrape_add_csrf():
        """Fetch the add page and extract the CSRF token."""
        r = SESSION.get(f"{ARCHIVE_BOX_URL}/add/")
        # The page isn't well-formed XML, so parse it in recovery mode
        parser = etree.XMLParser(recover=True)
        root = etree.fromstring(r.text, parser=parser)

        form_item = root.find(".//input[@name='csrfmiddlewaretoken']")
        return form_item.attrib["value"]


    def archivebox_scrap_add_url(url):
        """Submit a URL to ArchiveBox via the add form."""
        data = {
            "csrfmiddlewaretoken": archivebox_scrape_add_csrf(),
            "url": url,
            "parser": "auto",
            "tag": "",
            "depth": 0,
        }

        r = SESSION.post(f"{ARCHIVE_BOX_URL}/add/", data=data)
        return r

Invocation is:

    url = "https://www.bentasker.co.uk/posts/blog/house-stuff/ecover-dishwasher-tablets-left-white-grit-over-everything.html"
    r = archivebox_scrap_add_url(url)
    print(r.status_code)
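
The scraping function above assumes the token field is always present: `root.find()` returns `None` when it isn't, and the failure then surfaces as an `AttributeError`. Below is a minimal sketch of a more defensive extraction. It uses the standard library's `html.parser` rather than lxml, so it's a swapped-in technique and not what the commit does. `CsrfTokenParser` and `extract_csrf_token` are hypothetical names.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Record the value of the first csrfmiddlewaretoken input seen."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag != "input" or self.token is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "csrfmiddlewaretoken":
            self.token = attr_map.get("value")


def extract_csrf_token(html_text):
    """Return the CSRF token in html_text, or raise ValueError if absent."""
    parser = CsrfTokenParser()
    parser.feed(html_text)
    parser.close()
    if not parser.token:
        raise ValueError("no csrfmiddlewaretoken input found in page")
    return parser.token
```

It could be fed `r.text` from the GET on `/add/` in place of the lxml lookup, so that a missing field fails with a clear message.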

So now I just need to work on having the script extract links from pages in the RSS feed.

mentioned in issue #2

mentioned in issue #6