Generally, we'd expect that the index will only contain URLs from domains included in sites.txt.
However, it is possible for odd pages from other domains to appear.
For example:
Those old weblinks paths are still linked to from within some pages.
When the crawler is crawling, it'll see something like https://www.bentasker.co.uk/photographyprofiles?id=foo, which will pass the allowed sites check.
When it requests it, though, it'll receive a redirect, which it'll follow.
Because we don't re-check whether the URL should be crawled after the redirect is followed, that page will end up being indexed. None of its internal links will be followed (because they will fail the shouldCrawl check).
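To illustrate the flow, here's a minimal sketch (not the crawler's actual code), assuming a hypothetical `should_crawl()` helper backed by sites.txt and requests' default redirect handling:

```python
# Minimal illustrative sketch, not the crawler's actual code.
from urllib.parse import urlparse

import requests

# In practice this would be loaded from sites.txt
ALLOWED_DOMAINS = {"www.bentasker.co.uk"}


def should_crawl(url: str) -> bool:
    """Pre-fetch check: is the URL's domain in the allow list?"""
    return urlparse(url).netloc in ALLOWED_DOMAINS


url = "https://www.bentasker.co.uk/photographyprofiles?id=foo"

if should_crawl(url):            # passes: the domain is allowed
    resp = requests.get(url)     # requests follows the redirect transparently
    # resp.url may now be on a completely different domain, but nothing
    # re-checks it before the page gets indexed.
    print(resp.url)
```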
Activity
21-Jul-24 13:01
assigned to @btasker
21-Jul-24 13:04
I'm putting this in the backlog for now, because I'm undecided whether this is an issue or should just be considered informational.
If I added a shouldCrawl recheck just after the page has been fetched, we'd still have fetched the page from that site.
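As a rough sketch of what that recheck could look like (reusing `url` and the hypothetical `should_crawl()` helper from the sketch above; `index_page()` is also hypothetical), the off-domain fetch has already happened by the time the check runs:

```python
# Hypothetical post-fetch recheck; the request has already gone out
# to the redirect target by this point.
import requests

resp = requests.get(url)        # redirect followed as before
if should_crawl(resp.url):      # re-check the *final* URL, not the one we requested
    index_page(resp)            # hypothetical indexing step
else:
    # Redirect landed outside the allowed domains: don't index it,
    # but we've still fetched it from the other site.
    pass
```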
Technically, we could tell requests not to follow redirects, check the URL, and then follow it only if it's in the allowed list, but that adds a lot of complexity and would break our current behaviour (when a redirect is followed, records for the original URL are purged).
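For completeness, the shape of that alternative might look something like the below (again hypothetical, reusing `url` and `should_crawl()` from the earlier sketch, and not addressing the purge of records for the original URL):

```python
# Hypothetical manual redirect handling; not how the crawler currently behaves.
from urllib.parse import urljoin

import requests

resp = requests.get(url, allow_redirects=False)
if resp.is_redirect:
    # Location may be relative, so resolve it against the original URL
    target = urljoin(url, resp.headers["Location"])
    if should_crawl(target):
        resp = requests.get(target)   # only follow redirects to allowed domains
    else:
        resp = None                   # drop the off-domain target entirely
```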