#62 File types aren't correctly recognised if the server includes a charset : utilities/file_location

Issue Type: issue

Status: closed

Reported By: btasker

Assigned To: btasker

Project: Utilities / File Location Listing

Milestone: v0.2.8

Created: 21-Jul-24 09:28

Labels: Bug Fixed/Done

Description

The crawler relies on the server's Content-Type header when deciding whether (and how) to extract URLs from a page.

For example:

if pd["content-type"] == "text/html":
    pagecontent = BeautifulSoup(pg.content, features="lxml")
    links, title, tags, published = extractUrlsFromHTML(pagecontent, url)

The content-type attribute is a copy of the server's Content-Type header, converted to lowercase.

However, the server might include details of the character set in use, for example:

Content-Type: text/html; charset=utf8

This will result in pd["content-type"] no longer matching these checks, so URLs (and metadata) are not extracted from that page.

Activity

btasker
21-Jul-24 09:28

assigned to @btasker

btasker
21-Jul-24 09:28

Commit 782724d93ae619106a40e170c1dc05a7dc4ceaa8 strips the charset from this value.
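The contents of that commit are not reproduced in this ticket, so the following is only a minimal sketch of one way to strip the parameters from the header value. The function name normalise_content_type is hypothetical and is not taken from the codebase.

```python
def normalise_content_type(header_value):
    """Return the bare media type, lowercased, without any parameters.

    Hypothetical helper for illustration; not the project's actual code.
    """
    # Everything after the first ';' is a parameter such as charset
    media_type = header_value.split(";", 1)[0]
    return media_type.strip().lower()


print(normalise_content_type("text/html; charset=utf8"))  # text/html
```

With the value normalised like this, a comparison such as pd["content-type"] == "text/html" matches whether or not the server appends a charset.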

btasker
21-Jul-24 09:33

verified

mentioned in commit ec860b8c6c92bced217e33c630e71fb3003c268b

Commit: ec860b8c6c92bced217e33c630e71fb3003c268b
Author: B Tasker
Date: 2024-07-21T10:32:37.000+01:00

Message

feat: store charset if it's made available in the content-type header (utilities/file_location_listing#62)

+6 -1 (7 lines changed)
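The diff itself is not shown here. As a rough illustration of keeping the charset rather than discarding it, the sketch below uses the standard library's email.message.Message to split the header into its media type and charset. The function name split_content_type and the "charset" key on pd are assumptions made for this example, not names confirmed by the commit.

```python
from email.message import Message


def split_content_type(header_value):
    """Return (media_type, charset) parsed from a Content-Type value.

    Hypothetical helper for illustration; charset is None when the
    server did not send one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # Both accessors return lowercased values. get_content_type()
    # falls back to "text/plain" if the value is malformed.
    return msg.get_content_type(), msg.get_content_charset()


pd = {}
# The "charset" key is an assumed name for where the value is stored
pd["content-type"], pd["charset"] = split_content_type("text/html; charset=utf8")
```

Using a header parser rather than manual string splitting also copes with quoted parameter values and mixed-case parameter names.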
