
project-management-only/scraper-snitch-bot#1: robots.txt Fetch Flag



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.12
Created: 19-Jan-23 08:27



Description

After I announced the bot, a commenter gave me a good idea for something to include in the receipts files:

  • Does the bot fetch robots.txt?
  • Does the bot honour robots.txt?



Activity


assigned to @btasker

verified

mentioned in commit misc/python-mastodon-bot-detection@ee5d8a45fbaea63a073e568165b64aa78d319769

Commit: misc/python-mastodon-bot-detection@ee5d8a45fbaea63a073e568165b64aa78d319769 
Author: B Tasker
Date: 2023-01-19T17:51:33.000+00:00 

Message

Add special handling for robots.txt (project-management-only/scraper-snitch-bot#1)

We don't apply a score to requests for robots.txt (so a request on its own will never pass subsequent scoring) but make the fact that there's been a request available to the later analysis.

This allows later processing to score based on some additional scenarios:

  1. Bot didn't request robots.txt at all
  2. Bot requested robots.txt (neutral)
  3. Bot requested other paths after fetching a robots.txt which disallows all

Scenarios 1 and 3 are the big concerns.

+13 -0 (13 lines changed)

The log-agent side of this is in place.

It'll score requests for /robots.txt as 0 and pass them on for inclusion in later scoring calculations. The exception to this is if the source IP is in the allowlist.
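As a rough illustration of the behaviour described above (not the actual log-agent code), the handling could look something like the sketch below. The function name `score_request`, the record layout and the `score_fn` callback are all hypothetical, and it assumes that allowlisted IPs are simply dropped rather than forwarded.

```python
ROBOTS_PATH = "/robots.txt"


def score_request(path, src_ip, allowlist, score_fn):
    """Return a record to pass on for later analysis, or None to drop it.

    Hypothetical sketch: names and record layout are assumptions.
    """
    # Assumption: allowlisted sources are skipped entirely, robots.txt or not
    if src_ip in allowlist:
        return None

    # robots.txt requests get a score of 0 but are still passed on, so that
    # later analysis can see that the fetch happened
    if path == ROBOTS_PATH:
        return {"ip": src_ip, "path": path, "score": 0}

    # Everything else goes through the normal scoring function
    return {"ip": src_ip, "path": path, "score": score_fn(path)}
```

Because the score is 0, a lone request for robots.txt can't push a source over any later threshold, but the record is still there to be tested for.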

verified

mentioned in commit misc/python-mastodon-snitch-bot@b7433c67026e272f204b0219114054c7fd8fe193

Commit: misc/python-mastodon-snitch-bot@b7433c67026e272f204b0219114054c7fd8fe193 
Author: B Tasker
Date: 2023-01-20T18:47:09.000+00:00 

Message

Exclude requests for /robots.txt from scoring for project-management-only/scraper-snitch-bot#1

The log-agent inserts requests for robots.txt so that their existence can be tested for.

However, a score of 0 is applied to these requests, which can have the effect of pulling the average score across a session down.

In practice, whether it should positively or negatively affect the score depends on whether robots.txt allows crawlers or not

+1 -0 (1 lines changed)
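
To show why the zero-scored entries matter, here is a minimal sketch of a session average with and without them. The function names and the request record layout (`path` and `score` keys) are assumptions made for illustration, not taken from the bot's code.

```python
ROBOTS_PATH = "/robots.txt"


def naive_average(requests):
    """Average over every request, including zero-scored robots.txt fetches."""
    if not requests:
        return 0.0
    return sum(r["score"] for r in requests) / len(requests)


def session_average(requests):
    """Average over scored requests only, leaving robots.txt fetches out."""
    scores = [r["score"] for r in requests if r["path"] != ROBOTS_PATH]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


session = [
    {"path": "/robots.txt", "score": 0},
    {"path": "/users/a", "score": 4},
    {"path": "/users/b", "score": 6},
]
```

With this session, `naive_average` divides 10 by 3 requests, while `session_average` divides 10 by the 2 scored requests, so the robots.txt fetch no longer drags the result down.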
verified

mentioned in commit misc/python-mastodon-snitch-bot@d1ac525d95cfca5975869e171a4e439cb3447ca6

Commit: misc/python-mastodon-snitch-bot@d1ac525d95cfca5975869e171a4e439cb3447ca6 
Author: B Tasker
Date: 2023-01-20T19:01:55.000+00:00 

Message

Add flag to denote whether the scraper fetches robots.txt (see project-management-only/scraper-snitch-bot#1)

Should be fairly self-explanatory:

  • Fetches-robots.txt
  • Does-not-fetch-robots.txt
+22 -2 (24 lines changed)
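
A minimal sketch of how the two flags could be derived from a session's request paths. The function name `robots_fetch_flag` is hypothetical, and it assumes paths are compared exactly (no query strings or case differences).

```python
ROBOTS_PATH = "/robots.txt"


def robots_fetch_flag(paths):
    """Pick the receipt flag based on whether robots.txt was ever requested.

    Hypothetical helper: assumes `paths` is an iterable of request paths
    seen from one scraper.
    """
    if ROBOTS_PATH in set(paths):
        return "Fetches-robots.txt"
    return "Does-not-fetch-robots.txt"
```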
verified

mentioned in commit misc/python-mastodon-snitch-bot@2edf5e140b85a02fe7daeea580a3070782c26255

Commit: misc/python-mastodon-snitch-bot@2edf5e140b85a02fe7daeea580a3070782c26255 
Author: B Tasker
Date: 2023-01-20T19:06:21.000+00:00 

Message

Add calculation of whether a scraper ignores robots.txt (project-management-only/scraper-snitch-bot#1)

This adds an env var (ROBOTS_DISALLOW_ALL) to denote whether robots.txt disallows all bots.

If that's set and the bot made non-robots.txt requests, we set the flag Ignores-robots.txt.

+7 -0 (7 lines changed)
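
The check described in that commit message could be sketched roughly as follows. Only the env var name ROBOTS_DISALLOW_ALL and the flag name come from the commit message; the function names and the set of values treated as "set" are assumptions.

```python
import os

ROBOTS_PATH = "/robots.txt"


def robots_disallow_all(environ=os.environ):
    """Read ROBOTS_DISALLOW_ALL from the environment.

    Assumption: common truthy strings count as "set"; the real bot may
    parse the value differently.
    """
    value = environ.get("ROBOTS_DISALLOW_ALL", "")
    return value.strip().lower() in ("1", "true", "yes", "y")


def ignores_robots_flag(paths, disallow_all):
    """Return "Ignores-robots.txt" if the flag applies, otherwise None."""
    # Any request for something other than robots.txt counts against the
    # scraper when robots.txt disallows everything
    made_other_requests = any(p != ROBOTS_PATH for p in paths)
    if disallow_all and made_other_requests:
        return "Ignores-robots.txt"
    return None
```

For example, `ignores_robots_flag(["/robots.txt", "/users/a"], robots_disallow_all({"ROBOTS_DISALLOW_ALL": "true"}))` yields the flag, while a session that only ever requested /robots.txt does not.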