
utilities/file_location_listing#62: File types aren't correctly recognised if the server includes a charset



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.8
Created: 21-Jul-24 09:28



Description

The crawler relies on the server's Content-Type header to decide whether (and how) to extract URLs from a page.

For example:

if pd["content-type"] == "text/html":
    pagecontent = BeautifulSoup(pg.content, features="lxml")
    links, title, tags, published = extractUrlsFromHTML(pagecontent, url)

The content-type attribute is a copy of the server's Content-Type header, cast to lowercase.

However, the server might also include the character set in the header, for example:

Content-Type: text/html; charset=utf8

When that happens, pd["content-type"] no longer matches these checks, so neither URLs nor metadata are extracted from that page.
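One way to avoid the mismatch is to separate the media type from any parameters before comparing. The following is only a minimal sketch, not the code from the fixing commit: the helper name split_content_type is hypothetical, and it assumes the header value is available as a plain string.

```python
def split_content_type(header_value):
    """Split a Content-Type header value into (media_type, charset).

    Hypothetical helper for illustration; not taken from the project.
    charset is None when the header does not declare one.
    """
    parts = header_value.split(";")
    # The media type is everything before the first ";"
    media_type = parts[0].strip().lower()

    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            # Parameter values may be quoted, e.g. charset="utf-8"
            charset = value.strip().strip('"').lower() or None
    return media_type, charset


media_type, charset = split_content_type("text/html; charset=utf8")
if media_type == "text/html":
    # The check now matches whether or not a charset was supplied
    print("would extract links; charset =", charset)
```

Returning the charset alongside the media type means it can be kept for later use rather than discarded.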




Activity


assigned to @btasker

Commit 782724d93ae619106a40e170c1dc05a7dc4ceaa8 strips the charset from this value.

verified

mentioned in commit ec860b8c6c92bced217e33c630e71fb3003c268b

Commit: ec860b8c6c92bced217e33c630e71fb3003c268b
Author: B Tasker
Date: 2024-07-21T10:32:37.000+01:00

Message

feat: store charset if it's made available in the content-type header (utilities/file_location_listing#62)

+6 -1 (7 lines changed)