We don't apply a score to requests for robots.txt (so a request on its own will never pass subsequent scoring), but we make
the fact that there's been a request available to the later analysis.
This allows later processing to score based on some additional scenarios:
1. Bot didn't request robots.txt at all
2. Bot requested robots.txt (neutral)
3. Bot requested other paths after fetching a robots.txt which disallows all
The log-agent will score requests for /robots.txt as 0 and pass them on for inclusion in later scoring calculations. The exception to this is if the source IP is in the allowlist.
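A minimal sketch of the robots.txt handling described above, not the actual log-agent code. The function name `robots_record`, the record layout and the example IP are hypothetical. It also assumes that "the exception" means requests from allowlisted IPs are simply not passed on; the ticket doesn't spell that out.

```python
from __future__ import annotations

from typing import Optional

ROBOTS_PATH = "/robots.txt"


def robots_record(source_ip: str, path: str, allowlist: set[str]) -> Optional[dict]:
    """Return a zero-scored record for a robots.txt request, or None.

    Hypothetical helper: only covers the robots.txt special case,
    scoring of any other path is out of scope here.
    """
    if path != ROBOTS_PATH:
        # Not a robots.txt request, so normal scoring applies elsewhere
        return None
    if source_ip in allowlist:
        # Assumption: allowlisted sources are not forwarded at all
        return None
    # Score of 0: on its own this can never pass later scoring, but the
    # record lets later analysis see that robots.txt was requested
    return {"ip": source_ip, "path": path, "score": 0}


if __name__ == "__main__":
    print(robots_record("203.0.113.5", "/robots.txt", set()))
    print(robots_record("203.0.113.5", "/robots.txt", {"203.0.113.5"}))
```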
Activity
19-Jan-23 08:27
assigned to @btasker
19-Jan-23 17:56
mentioned in commit misc/python-mastodon-bot-detection@ee5d8a45fbaea63a073e568165b64aa78d319769
Message
Add special handling for robots.txt (project-management-only/scraper-snitch-bot#1)
We don't apply a score to requests for robots.txt (so a request on its own will never pass subsequent scoring), but we make the fact that there's been a request available to the later analysis.
This allows later processing to score based on some additional scenarios:
1. Bot didn't request robots.txt at all
2. Bot requested robots.txt (neutral)
3. Bot requested other paths after fetching a robots.txt which disallows all
Scenarios 1 and 3 are the big concerns.
19-Jan-23 18:11
The log-agent side of this is in place. It'll score requests for /robots.txt as 0 and pass them on for inclusion in later scoring calculations. The exception to this is if the source IP is in the allowlist.
20-Jan-23 18:49
mentioned in commit misc/python-mastodon-snitch-bot@b7433c67026e272f204b0219114054c7fd8fe193
Message
Exclude requests for /robots.txt from scoring for project-management-only/scraper-snitch-bot#1
The log-agent inserts requests for robots.txt so that their existence can be tested for.
However, a score of 0 is applied to these requests, which can have the effect of pulling the average score across a session down.
In practice, whether it should positively or negatively affect the score depends on whether robots.txt allows crawlers.
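To illustrate why the zero-scored entries are excluded, here is a rough sketch of a session average that skips robots.txt requests. The function name `session_average`, the record layout and the scores are made up for the example and aren't taken from the bot's code.

```python
ROBOTS_PATH = "/robots.txt"


def session_average(requests):
    """Average the scores in a session, ignoring robots.txt requests.

    Hypothetical helper: each request is assumed to be a dict with
    "path" and "score" keys.
    """
    scores = [r["score"] for r in requests if r["path"] != ROBOTS_PATH]
    if not scores:
        # Nothing but robots.txt requests (or an empty session)
        return 0.0
    return sum(scores) / len(scores)


if __name__ == "__main__":
    session = [
        {"path": "/robots.txt", "score": 0},
        {"path": "/a", "score": 6},
        {"path": "/b", "score": 8},
    ]
    # Counting the robots.txt entry would give 14 / 3; skipping it gives 14 / 2
    print(session_average(session))
```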
20-Jan-23 19:03
mentioned in commit misc/python-mastodon-snitch-bot@d1ac525d95cfca5975869e171a4e439cb3447ca6
Message
Add flag to denote whether the scraper fetches robots.txt (see project-management-only/scraper-snitch-bot#1)
Should be fairly self-explanatory:
Fetches-robots.txt
Does-not-fetch-robots.txt
20-Jan-23 19:06
mentioned in commit misc/python-mastodon-snitch-bot@2edf5e140b85a02fe7daeea580a3070782c26255
Message
Add calculation of whether a scraper ignores robots.txt (project-management-only/scraper-snitch-bot#1)
This adds an env var (ROBOTS_DISALLOW_ALL) to denote whether robots.txt disallows all bots.
If that's set and the bot made non-robots.txt requests, we set the flag Ignores-robots.txt
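A sketch of how the three flags could be derived from the paths seen in a session. The flag names and ROBOTS_DISALLOW_ALL come from the commits above; the function names and the way the env var's value is interpreted (treating "1", "true", "yes" or "y" as set) are assumptions for illustration only.

```python
import os

ROBOTS_PATH = "/robots.txt"


def disallow_all_from_env(environ=os.environ):
    """Read ROBOTS_DISALLOW_ALL as a boolean.

    Assumption: the accepted truthy values here are a guess, the real
    bot may parse the variable differently.
    """
    value = environ.get("ROBOTS_DISALLOW_ALL", "")
    return value.strip().lower() in {"1", "true", "yes", "y"}


def robots_flags(paths, disallow_all):
    """Return the robots.txt related flags for the paths in one session."""
    fetched = ROBOTS_PATH in paths
    made_other_requests = any(p != ROBOTS_PATH for p in paths)

    flags = ["Fetches-robots.txt" if fetched else "Does-not-fetch-robots.txt"]
    if disallow_all and made_other_requests:
        # robots.txt disallows everything, yet other paths were requested
        flags.append("Ignores-robots.txt")
    return flags


if __name__ == "__main__":
    print(robots_flags(["/robots.txt", "/users/example"], disallow_all_from_env()))
```

As written, Ignores-robots.txt is set whether or not the bot fetched robots.txt first, which follows the wording of the commit message.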