Generally, we'd expect that the index will only contain URLs from domains included in sites.txt.
However, it is possible for odd pages from other domains to appear.
For example:
Those old weblinks paths are still linked to from within some pages.
When the crawler is crawling, it'll see something like https://www.bentasker.co.uk/photographyprofiles?id=foo, which will pass the allowed sites check.
When it requests it, though, it'll receive a redirect, which it'll follow.
Because we don't re-check whether the URL should be crawled after the redirect is followed, that page will end up being indexed. None of its internal links will be followed (because they will fail the shouldCrawl check).
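To illustrate the flow, here's a minimal sketch (not the crawler's actual code), assuming a hypothetical `should_crawl()` helper backed by sites.txt and requests' default redirect handling:

```python
# Minimal illustrative sketch, not the crawler's actual code.
from urllib.parse import urlparse

import requests

# In practice this would be loaded from sites.txt
ALLOWED_DOMAINS = {"www.bentasker.co.uk"}


def should_crawl(url: str) -> bool:
    """Pre-fetch check: is the URL's domain in the allow list?"""
    return urlparse(url).netloc in ALLOWED_DOMAINS


url = "https://www.bentasker.co.uk/photographyprofiles?id=foo"

if should_crawl(url):            # passes: the domain is allowed
    resp = requests.get(url)     # requests follows the redirect transparently
    # resp.url may now be on a completely different domain, but nothing
    # re-checks it before the page gets indexed.
    print(resp.url)
```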
Activity
21-Jul-24 13:01
assigned to @btasker
21-Jul-24 13:04
I'm putting this in the backlog for now, because I'm undecided whether this is an issue or should just be considered informational.
If I added a shouldCrawl recheck just after the page has been fetched, we'd still have fetched the page from that site.
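As a rough sketch of what that recheck could look like (reusing `url` and the hypothetical `should_crawl()` helper from the sketch above; `index_page()` is also hypothetical), the off-domain fetch has already happened by the time the check runs:

```python
# Hypothetical post-fetch recheck; the request has already gone out
# to the redirect target by this point.
import requests

resp = requests.get(url)        # redirect followed as before
if should_crawl(resp.url):      # re-check the *final* URL, not the one we requested
    index_page(resp)            # hypothetical indexing step
else:
    # Redirect landed outside the allowed domains: don't index it,
    # but we've still fetched it from the other site.
    pass
```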
Technically, we could tell requests not to follow redirects, check the URL, and then follow it only if it's in the allowed list, but that adds a lot of complexity and would break our current behaviour (when a redirect is followed, records for the original URL are purged).
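For completeness, the shape of that alternative might look something like the below (again hypothetical, reusing `url` and `should_crawl()` from the earlier sketch, and not addressing the purge of records for the original URL):

```python
# Hypothetical manual redirect handling; not how the crawler currently behaves.
from urllib.parse import urljoin

import requests

resp = requests.get(url, allow_redirects=False)
if resp.is_redirect:
    # Location may be relative, so resolve it against the original URL
    target = urljoin(url, resp.headers["Location"])
    if should_crawl(target):
        resp = requests.get(target)   # only follow redirects to allowed domains
    else:
        resp = None                   # drop the off-domain target entirely
```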