#3 Submit URLs via scraping the Archivebox UI : utilities/auto-blog-link-preserver#3

Issue Type: issue

Status: closed

Reported By: btasker

Assigned To: btasker

Project: Utilities / Auto Blog Link Preserver

Milestone: 0.01 ArchiveBox Experimentation

Created: 10-Aug-24 13:31

Labels: Fixed/Done New Feature

Description

Note: This is intended to be a temporary measure

The stable release of ArchiveBox doesn't yet support the REST API, and I've not had much luck in getting a non-stable release running.

So, for the time being, the plan is to:

- Enable anonymous submission (i.e. set PUBLIC_ADD_VIEW to true in the deployment)
- Set up some scraping to extract the CSRF token
- Submit the form

It's safe(ish) for me to do this because my ArchiveBox install is not publicly accessible.


Activity

btasker
10-Aug-24 13:31

assigned to @btasker

sysconfiguser
10-Aug-24 13:31

mentioned in commit sysconfigs/bumblebee-kubernetes-charts@4bb1eb130d2be58573f327859e8f0854e4bca100

Commit: sysconfigs/bumblebee-kubernetes-charts@4bb1eb130d2be58573f327859e8f0854e4bca100 
Author: ben
Date: 2024-08-10T14:31:29.000+01:00

Message

feat: Enable public submission (utilities/auto-blog-link-preserver#3)

+1 -1 (2 lines changed)

btasker
10-Aug-24 13:37

Taking a packet capture of the request shows this:

[Image: packet capture of archivebox add request]

(Ignore the 500, that's been fixed.)

btasker
10-Aug-24 13:54

This isn't as simple as expected either: lxml doesn't like the HTML because there's a typo in the header

    <head>
        <title>Archived Sites</title>
        <meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
        <link rel="stylesheet" href="/static/admin/css/base.css">
        <link rel="stylesheet" href="/static/admin.css">
        <link rel="stylesheet" href="/static/bootstrap.min.css">
        <script src="/static/jquery.min.js"></script>
    <link rel="stylesheet" href="/static/add.css" />

Generating

lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 11 and head, line 17, column 12

I'll report that upstream in a bit
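Whatever the typo turns out to be, the mismatch error is also what a strict XML parser reports for any HTML void element: tags such as `<link>` and `<meta>` are never closed in HTML, so the parser still considers `<link>` open when it reaches `</head>`. A minimal sketch of that behaviour, using the standard library's `xml.etree.ElementTree` rather than lxml (the helper name `parses_as_xml` and both sample snippets are illustrative assumptions, not taken from ArchiveBox):

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style void element: <link> is never closed, so </head> mismatches.
html_style = '<head><link rel="stylesheet" href="/static/admin.css"></head>'

# XML-style self-closing element: well-formed.
xml_style = '<head><link rel="stylesheet" href="/static/admin.css" /></head>'

print(parses_as_xml(html_style))
print(parses_as_xml(xml_style))
```

So even a typo-free page may still need a lenient or HTML-aware parser.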

btasker
10-Aug-24 14:07

For now, doing

    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(r.text, parser=parser)
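`recover=True` tells the XML parser to carry on past the errors. An alternative, sketched here as an assumption rather than what the script does, is to use a parser that expects HTML in the first place (lxml also ships `etree.HTMLParser`). The version below sticks to the standard library's `html.parser` so it has no dependencies; the class name `CSRFTokenParser`, the function `extract_csrf_token`, and the sample markup are all hypothetical:

```python
from html.parser import HTMLParser


class CSRFTokenParser(HTMLParser):
    """Record the value of the first csrfmiddlewaretoken input seen."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrfmiddlewaretoken":
            self.token = attributes.get("value")


def extract_csrf_token(html_text):
    """Return the CSRF token found in html_text, or None if absent."""
    parser = CSRFTokenParser()
    parser.feed(html_text)
    parser.close()
    return parser.token


# Made-up page with unclosed void elements, as browsers accept them.
sample = """
<head>
<title>Archived Sites</title>
<link rel="stylesheet" href="/static/admin.css">
</head>
<body>
<form method="POST">
<input type="hidden" name="csrfmiddlewaretoken" value="example-token">
<input type="text" name="url">
</form>
</body>
"""

print(extract_csrf_token(sample))
```

Returning `None` when the input is missing leaves it to the caller to decide whether that is fatal.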

btasker
10-Aug-24 14:07

verified

mentioned in commit 2c2b543be1bb344bc9d9961adc5cb211191a6cde

Commit: 2c2b543be1bb344bc9d9961adc5cb211191a6cde 
Author: B Tasker
Date: 2024-08-10T15:06:49.000+01:00

Message

feat: proof-of-concept scraped submission (utilities/auto-blog-link-preserver#3)

+36 -0 (36 lines changed)

btasker
10-Aug-24 14:09

OK, the commit above implements the archivebox side of the flow

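from lxml import etree

# SESSION and ARCHIVE_BOX_URL are defined elsewhere in the script (not shown here)


def archivebox_scrape_add_csrf():
    ''' Call the add page and extract the CSRF token
    '''
    r = SESSION.get(f"{ARCHIVE_BOX_URL}/add/")

    # recover=True lets the parser continue past the malformed markup
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(r.text, parser=parser)

    form_item = root.find(".//input[@name='csrfmiddlewaretoken']")
    if form_item is None:
        raise ValueError("csrfmiddlewaretoken input not found on the add page")
    return form_item.attrib["value"]


def archivebox_scrap_add_url(url):
    ''' Submit a URL to archivebox
    '''

    # Field names match those seen in the packet capture of the add form
    data = {
        "csrfmiddlewaretoken": archivebox_scrape_add_csrf(),
        "url": url,
        "parser": "auto",
        "tag": "",
        "depth": 0
        }

    r = SESSION.post(f"{ARCHIVE_BOX_URL}/add/", data=data)
    return r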

Invocation is

url = "https://www.bentasker.co.uk/posts/blog/house-stuff/ecover-dishwasher-tablets-left-white-grit-over-everything.html"
r = archivebox_scrap_add_url(url)
print(r.status_code)

So now, just need to work on having the script extract links from pages in the RSS feed.
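A rough sketch of what that next step could look like, using only the standard library. Everything here is an assumption for illustration: that the feed is RSS 2.0, the helper names `rss_item_links` and `page_links`, the `HrefCollector` class, and the sample documents. Fetching the feed and the pages is left out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def rss_item_links(feed_xml):
    """Return the <link> text of each <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


class HrefCollector(HTMLParser):
    """Collect absolute http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(("http://", "https://")):
            self.hrefs.append(href)


def page_links(html_text):
    """Return the absolute links found in a page's HTML."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


# Made-up feed and page, standing in for the real blog content.
sample_feed = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.invalid/</link>
    <item>
      <title>First post</title>
      <link>https://example.invalid/posts/first.html</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.invalid/posts/second.html</link>
    </item>
  </channel>
</rss>
"""

sample_page = (
    '<p><a href="https://example.invalid/a">A</a> '
    '<a href="/relative">B</a> <a name="x">C</a></p>'
)

print(rss_item_links(sample_feed))
print(page_links(sample_page))
```

Each URL returned by `page_links` could then be handed to `archivebox_scrap_add_url`. Relative links are skipped here for simplicity; resolving them against the page URL would be a separate decision.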

btasker
10-Aug-24 14:10

mentioned in issue #2

btasker
10-Aug-24 15:17

mentioned in issue #6
