
utilities/file_location_listing#63: Index may include sites outside the allowed list



Issue Information

Issue Type: issue
Status: opened
Reported By: btasker
Assigned To: btasker

Milestone: backlog
Created: 21-Jul-24 13:01



Description

Generally, we'd expect the index to contain only URLs from domains listed in sites.txt.

However, it is possible for odd pages from other domains to appear.

For example:

  • I used to use Joomla's weblinks component to provide a link to my profile on another site.
  • When I migrated off Joomla, I set up redirects in Nginx for the old weblinks paths.

Those old weblinks paths are still linked to from within some pages.

When the crawler is crawling, it'll see something like https://www.bentasker.co.uk/photographyprofiles?id=foo, which will pass the sites allowed check.

When it requests it though, it'll receive a redirect, which it'll follow.

Because we don't re-check whether the final URL (after the redirect) should be crawled, that page will end up being indexed. None of its internal links will be followed (because they will fail the shouldCrawl check).
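To illustrate the gap, here's a minimal sketch of a hostname-based allow-list check. The function name `should_crawl`, the `ALLOWED_SITES` set (standing in for the contents of sites.txt) and the redirect destination are all assumptions for illustration; the project's real shouldCrawl check may work differently.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the domains listed in sites.txt
ALLOWED_SITES = {"www.bentasker.co.uk"}


def should_crawl(url, allowed=ALLOWED_SITES):
    """Return True if the URL's hostname is in the allowed set."""
    host = urlsplit(url).hostname  # already lower-cased by urlsplit
    return host is not None and host in allowed


# The URL found in a page passes the check...
linked_url = "https://www.bentasker.co.uk/photographyprofiles?id=foo"
# ...but the (hypothetical) redirect destination would not, if it were checked
redirect_target = "https://photos.example.com/profile/foo"

print(should_crawl(linked_url))       # True
print(should_crawl(redirect_target))  # False
```

The check only ever sees the first URL, so the second is fetched and indexed without being tested.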




Activity


assigned to @btasker

I'm putting this in backlog for now, because I'm undecided whether this is an issue or should just be considered informational:

  • On the one hand, we're accessing a site outside the permitted list.
  • On the other, it's only a single page.

If I added a shouldCrawl recheck just after the page has been fetched, we'd still have fetched the page from that site.
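A rough sketch of what that post-fetch recheck might look like. Everything here is hypothetical (`should_crawl`, `handle_fetched`, the dict-based index and the example destination URL). With requests, the final URL after redirects is available as `response.url`; here it's passed in directly so the sketch runs without network access.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the domains listed in sites.txt
ALLOWED_SITES = {"www.bentasker.co.uk"}


def should_crawl(url, allowed=ALLOWED_SITES):
    """Return True if the URL's hostname is in the allowed set."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed


def handle_fetched(final_url, body, index):
    """Index a fetched page only if its final (post-redirect) URL is allowed.

    Returns True if the page was indexed, False if it was discarded.
    Note: by the time this runs, the request has already been made.
    """
    if not should_crawl(final_url):
        return False
    index[final_url] = body
    return True


index = {}
# A page served directly from an allowed site is indexed
handle_fetched("https://www.bentasker.co.uk/posts/example.html", "<html>...</html>", index)
# A page we were redirected to on another (hypothetical) site is dropped
handle_fetched("https://photos.example.com/profile/foo", "<html>...</html>", index)

print(sorted(index))  # ['https://www.bentasker.co.uk/posts/example.html']
```

This keeps the page out of the index, but doesn't stop the request to the other site from being made.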

Technically, we could tell requests not to follow redirects, check the URL, and then follow only if it's in the allowed list. But that adds a lot of complexity and would break our current behaviour (when a redirect is followed, records for the original URL are purged).
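For reference, a sketch of that manual redirect handling, to give a sense of the extra complexity. The names (`should_crawl`, `fetch_within_allowed`, `fetch_once`) and the redirect limit are assumptions. `fetch_once` stands in for a single non-following request; with requests that would be something like `requests.get(url, allow_redirects=False)`, reading the status code and `Location` header from the response. It does not attempt to reproduce the purge of records for the original URL.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical stand-in for the domains listed in sites.txt
ALLOWED_SITES = {"www.bentasker.co.uk"}
REDIRECT_CODES = (301, 302, 303, 307, 308)


def should_crawl(url, allowed=ALLOWED_SITES):
    """Return True if the URL's hostname is in the allowed set."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed


def fetch_within_allowed(url, fetch_once, max_redirects=5):
    """Follow redirects one hop at a time, checking each hop first.

    fetch_once(url) must return (status_code, location_header_or_None)
    without following redirects itself.

    Returns the final URL, or None if a hop left the allowed list or
    the redirect limit was exceeded.
    """
    for _ in range(max_redirects + 1):
        if not should_crawl(url):
            return None  # never send a request to a disallowed host
        status, location = fetch_once(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL
            url = urljoin(url, location)
            continue
        return url
    return None
```

A caller would also need to remember the original URL whenever the returned URL differs from it, so that the existing purge behaviour could be kept.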