Although my RSS feed only links to HTML pages, it's not inconceivable that other feeds might not. Trying to pass (say) a JPG into lxml is unlikely to go particularly well.
So, we could check that a page actually claims to be HTML before handing it to the parser.
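A minimal sketch of such a check (the helper name and the exact list of accepted media types are my assumptions, not taken from the project's code):

```python
def is_probably_html(content_type: str) -> bool:
    """Return True if a Content-Type header value looks like an HTML page.

    Hypothetical helper: strips parameters such as "; charset=utf-8" and
    compares the bare media type against known HTML media types.
    """
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

A JPG served as `image/jpeg` would then be skipped before lxml ever sees it.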
For the avoidance of doubt, this only checks when attempting to extract links from a page. So, if we have

<item>
    <title>Something</title>
    <link>https://example.com/foo.pdf</link>
</item>

we'll still add https://example.com/foo.pdf to LinkWarden; we just won't try to extract any links from within it.
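The behaviour described above can be sketched as follows (the function name, its signature, and the injected `extract_links` callable are hypothetical stand-ins, not the project's actual API):

```python
def process_feed_item(link_url: str, content_type: str, extract_links) -> dict:
    """Hypothetical sketch: the feed item's link is always preserved in
    LinkWarden, but link extraction only runs when the target claims
    to be HTML."""
    # The item's own link (even a PDF) is always saved
    result = {"saved_to_linkwarden": True, "extracted_links": []}
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        # extract_links stands in for the real HTML link-extraction step
        result["extracted_links"] = extract_links(link_url)
    return result
```

For a PDF, `saved_to_linkwarden` is still true, but `extracted_links` stays empty.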
Activity
11-Aug-24 08:35
assigned to @btasker
11-Aug-24 08:38
mentioned in commit 16324f5a97bbb08e47505aa4e4811a2c07581cd0

Author: B Tasker
Date: 2024-08-11T09:37:49.000+01:00

Message

fix: check content type before attempting to parse as HTML (utilities/auto-blog-link-preserver#14)