utilities/auto-blog-link-preserver#14: Check that RSS feed items are HTML before attempting to parse



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.1
Created: 11-Aug-24 08:35



Description

Although my RSS feed only links to HTML pages, it's not inconceivable that other people's feeds link to something else.

Trying to pass (say) a JPG into lxml is not likely to go particularly well.

So, we could check that a page actually claims to be HTML (via its content type) first.
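
As a rough illustration of what such a check might look like (a sketch only, not the code from the eventual commit): the hypothetical helper `claims_html` below takes a `Content-Type` header value and reports whether it names an HTML media type. The function name and the set of accepted types are assumptions made for the example.

```python
def claims_html(content_type):
    """Return True if a Content-Type header value names an HTML media type.

    Hypothetical helper: the name and the accepted types are assumptions.
    """
    if not content_type:
        # No header at all: the page isn't claiming to be anything
        return False

    # Drop parameters such as "; charset=utf-8" and normalise the case
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

Only pages for which this returns `True` would then be handed to lxml, so a JPG served as `image/jpeg` never reaches the parser.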




Activity


assigned to @btasker

verified

mentioned in commit 16324f5a97bbb08e47505aa4e4811a2c07581cd0

Commit: 16324f5a97bbb08e47505aa4e4811a2c07581cd0 
Author: B Tasker
Date: 2024-08-11T09:37:49.000+01:00 

Message

fix: check content type before attempting to parse as HTML (utilities/auto-blog-link-preserver#14)

+4 -0 (4 lines changed)

For avoidance of doubt, this only checks when attempting to extract links from a page.

So, if we have

<item>
   <title>Something</title>
   <link>https://example.com/foo.pdf</link>
</item>

We'll still add https://example.com/foo.pdf to LinkWarden, we just won't try and extract any links from within it.
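
The behaviour described here could be sketched roughly as follows. This is not the project's actual code: `process_item` and the three callables passed into it (`fetch`, `archive`, `extract_links`) are hypothetical stand-ins for fetching a page, submitting a URL to LinkWarden, and parsing links out of HTML.

```python
HTML_TYPES = ("text/html", "application/xhtml+xml")


def process_item(link, fetch, archive, extract_links):
    """Archive a feed item's link, then extract links only if it is HTML.

    All names here are hypothetical:
      fetch(url)          -> (content_type, body)
      archive(url)        -> submits the URL for archiving
      extract_links(body) -> list of URLs found in an HTML body
    """
    # The item's own link is always archived, whatever it points at
    archive(link)

    content_type, body = fetch(link)
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type not in HTML_TYPES:
        # Not HTML (a PDF, an image, ...): don't attempt to parse it
        return []

    return extract_links(body)
```

With a feed item pointing at `https://example.com/foo.pdf`, `archive` is still called for that URL, but `extract_links` is never invoked and the function returns an empty list.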