The crawler relies on the server's Content-Type header to decide whether (and how) to extract links from the page at a URL.
For example:
if pd["content-type"] == "text/html":
    pagecontent = BeautifulSoup(pg.content, features="lxml")
    links, title, tags, published = extractUrlsFromHTML(pagecontent, url)
The content-type attribute is a copy of the server's Content-Type header, cast to lowercase.
However, the server might include details of the character set in use, for example:
Content-Type: text/html; charset=utf8
This results in pd["content-type"] no longer matching these checks, so no URLs (or metadata) are extracted from that page.
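One way to make the check robust is to split the header value into its media type and any charset parameter before comparing. The following is only a minimal sketch: split_content_type is a hypothetical helper name and is not taken from the crawler's codebase, and the actual fix may be implemented differently.

```python
def split_content_type(header_value):
    """Split a Content-Type header value into (media_type, charset).

    Hypothetical helper for illustration; charset is None when the
    header does not carry a charset parameter.
    """
    # The media type is everything before the first ";"
    media_type, _, params = header_value.partition(";")
    media_type = media_type.strip().lower()

    charset = None
    # Remaining parameters are ";"-separated name=value pairs
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            # Drop optional surrounding quotes, e.g. charset="UTF-8"
            charset = value.strip().strip('"').lower()
            break

    return media_type, charset
```

With something like this in place, the crawler could compare only the first element of the returned tuple against "text/html", and keep the second element if the charset is worth storing.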
Activity
21-Jul-24 09:28
assigned to @btasker
21-Jul-24 09:28
Commit 782724d93ae619106a40e170c1dc05a7dc4ceaa8 strips the charset from this value.
21-Jul-24 09:33
mentioned in commit ec860b8c6c92bced217e33c630e71fb3003c268b
Message
feat: store charset if it's made available in the content-type header (utilities/file_location_listing#62)