Although my RSS feed only links to HTML pages, it's not inconceivable that other feeds might not. Trying to pass (say) a JPG into lxml is unlikely to go particularly well.
So, we could check that a page actually claims to be HTML before handing it to the parser.
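A minimal sketch of such a check (the helper name and the exact list of accepted media types are my assumptions, not taken from the project's code):

```python
def is_probably_html(content_type: str) -> bool:
    """Return True if a Content-Type header value looks like an HTML page.

    Hypothetical helper: strips parameters such as "; charset=utf-8" and
    compares the bare media type against known HTML media types.
    """
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

A JPG served as `image/jpeg` would then be skipped before lxml ever sees it.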
For the avoidance of doubt, this only checks when attempting to extract links from a page. So, if we have

<item>
    <title>Something</title>
    <link>https://example.com/foo.pdf</link>
</item>

we'll still add https://example.com/foo.pdf to LinkWarden; we just won't try to extract any links from within it.
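The behaviour described above can be sketched as follows (the function name, its signature, and the injected `extract_links` callable are hypothetical stand-ins, not the project's actual API):

```python
def process_feed_item(link_url: str, content_type: str, extract_links) -> dict:
    """Hypothetical sketch: the feed item's link is always preserved in
    LinkWarden, but link extraction only runs when the target claims
    to be HTML."""
    # The item's own link (even a PDF) is always saved
    result = {"saved_to_linkwarden": True, "extracted_links": []}
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        # extract_links stands in for the real HTML link-extraction step
        result["extracted_links"] = extract_links(link_url)
    return result
```

For a PDF, `saved_to_linkwarden` is still true, but `extracted_links` stays empty.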
Activity
11-Aug-24 08:35
assigned to @btasker
11-Aug-24 08:38
mentioned in commit 16324f5a97bbb08e47505aa4e4811a2c07581cd0

Author: B Tasker
Date: 2024-08-11T09:37:49.000+01:00

Message

fix: check content type before attempting to parse as HTML (utilities/auto-blog-link-preserver#14)