There have been a couple of instances now of IPs being flagged because they're running FediFetcher, for example:
### Overview
Observed Requests: 15
First Seen: 2024-06-27 18:46:51 (UTC)
Last Seen: 2024-08-16 18:05:08 (UTC)
Average number of daily requests: 2.142857142857143
----
### Observed Useragents
- FediFetcher/7.1.2; +<snipped> (https://go.thms.uk/ff)
- Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; <snipped>)
- FediFetcher/7.1.3; +<snipped> (https://go.thms.uk/ff)
----
### Observed Paths
- /.well-known/webfinger
- /users/scrapersnitch/collections/tags
- /users/scrapersnitch/followers
- /users/scrapersnitch
- /users/scrapersnitch/collections/featured
- /users/scrapersnitch/following
- /robots.txt
- /users/scrapersnitch/outbox
For a Summary of path sensitivity see https://projects.bentasker.co.uk/gils_projects/wiki/project-management-only/scraper-snitch-bot/page/Request-Paths.html
----
### Flags
- Fetches-robots.txt
- Ignores-robots.txt
- Does-not-fetch-robots.txt
This is undesirable - FediFetcher's a tool to fetch missing replies etc from toots.
The problem is, I made an adjustment the other day to account for this, but the instance above still got flagged up. Need to look at scoring to see why (I'm guessing it's the score resulting from it ignoring `robots.txt`).
Edit: an overview of FediFetcher has been published on the wiki to try and help admins assess whether it's something that they want to permit.
Activity
17-Aug-24 22:44
assigned to @btasker
17-Aug-24 22:45
changed the description
17-Aug-24 22:50
Probably worth checking the upstream buglist. https://blog.thms.uk/fedifetcher says
It does appear to be ignoring it (although, perhaps the other paths we're seeing aren't requested by the FediFetcher UA - it's a distinct Python script - need to check that too).
FTR, my robots.txt is
18-Aug-24 08:03
OK, so let's look at the flow of requests for this IP:
The masto instance fetches a status:
Then FediFetcher appears and so fetches `robots.txt`
Later in the day there are subsequent requests, but all made with the Masto UA - i.e. FediFetcher has honoured `robots.txt`.
18-Aug-24 08:14
So, FediFetcher honours `robots.txt`, however the scoring looks at whether `robots.txt` is being ignored on a per IP basis:
Which does make some sense.
The user-agent header is very much under the client's control and we wouldn't want detection to miss out on a malicious bot which sends a different UA each time.
Funnily enough, the FediFetcher issue tracker contains a link to an example of where this can happen without it being under the user's control:
As noted in the GH, there is a way to work around it, which FF have obviously used as the logs contain a correct UA.
So, the first problem here is essentially a philosophical one: should `robots.txt` tracking be per IP or per IP and UA?
The latter risks missing things whilst the former leads us, well, here.
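To make the trade-off concrete, here is a minimal Python sketch of the two tracking strategies. It is not the bot's actual code: the function name, the tuple layout and the example IPs (documentation ranges) are all made up for illustration, and it assumes `robots.txt` disallows every other path.

```python
ROBOTS_PATH = "/robots.txt"


def ignoring_robots(requests, per_ua=False):
    """Return the tracking keys that fetched robots.txt and then requested
    another path (assumed, for this sketch, to be disallowed).

    requests: iterable of (ip, user_agent, path) tuples in time order.
    per_ua:   track per (IP, UA) pair instead of per IP.
    """
    fetched = set()
    flagged = set()
    for ip, ua, path in requests:
        key = (ip, ua) if per_ua else ip
        if path == ROBOTS_PATH:
            fetched.add(key)
        elif key in fetched:
            flagged.add(key)
    return flagged


# Hypothetical log: a Mastodon instance with FediFetcher on the same IP,
# plus a bot that changes UA on every request.
MASTO_UA = "Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0)"
FF_UA = "FediFetcher/7.1.3"
sample = [
    ("192.0.2.10", MASTO_UA, "/users/scrapersnitch"),
    ("192.0.2.10", FF_UA, "/robots.txt"),
    ("192.0.2.10", MASTO_UA, "/users/scrapersnitch/outbox"),
    ("198.51.100.7", "bot-a", "/robots.txt"),
    ("198.51.100.7", "bot-b", "/users/scrapersnitch/followers"),
]

per_ip = ignoring_robots(sample)                  # flags both IPs
per_ip_ua = ignoring_robots(sample, per_ua=True)  # flags nothing
```

Per-IP tracking catches the UA-rotating bot but also flags the Mastodon instance; per IP and UA tracking clears the instance but misses the bot entirely.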
18-Aug-24 08:22
We'll figure that out in a bit, first let's check scoring to make sure that that is the issue (and the only issue).
The IP's loglines have been written into `~/sample.log`, so
Score/request
The `robots.txt` is (as expected) neutral.
Every request after that, though, had 40 added onto it because the IP was perceived to be ignoring `robots.txt`.
They picked up another `10` for not having a referrer.
That 10 is normally a harmless score, but combining that with ignoring `robots.txt` gets us into notifiable territory.
None of the requests picked up any other flags.
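As a rough illustration of how those numbers combine, a Python sketch follows. The 40 and 10 are the values described above; the notification threshold of 50 is an assumption (the real threshold isn't stated here, only that 10 alone is harmless and 40 + 10 is notifiable), and the function names are hypothetical.

```python
ROBOTS_IGNORED_PENALTY = 40  # added when the IP is perceived to ignore robots.txt
NO_REFERRER_PENALTY = 10     # added when the request has no referrer
NOTIFY_THRESHOLD = 50        # ASSUMPTION: the real threshold is not stated


def score_request(path, referrer, ip_ignores_robots):
    """Score a single request using only the two penalties seen in the sample."""
    if path == "/robots.txt":
        return 0  # fetching robots.txt is neutral
    score = 0
    if ip_ignores_robots:
        score += ROBOTS_IGNORED_PENALTY
    if not referrer:
        score += NO_REFERRER_PENALTY
    return score


def is_notifiable(score):
    return score >= NOTIFY_THRESHOLD


robots_fetch = score_request("/robots.txt", None, False)
later_request = score_request("/users/scrapersnitch/outbox", None, True)
no_referrer_only = score_request("/users/scrapersnitch/outbox", None, False)
```

Under these assumptions the missing referrer on its own stays below the threshold, and it is only the combination with the `robots.txt` penalty that crosses it.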
18-Aug-24 08:37
A follow-on question, though, is why my previous intervention didn't help.
I added a UA rule which should have biased the scoring to the extent that it didn't fire. Yet, the sample doesn't get tagged with the `ua_allow` flag.
Ahhh, but it wouldn't...
The only request we see from the allowed UA is for `robots.txt`. That's a 0-rated request and never proceeds on to the subsequent processing.
The requests after that all use the UA string
Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; +<redacted>)
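The ordering problem can be sketched in Python as below. This is only an illustration of the early return, not the real pipeline: the rule format, the size of the bias and the function name are assumptions, and only the `ua_allow` flag name comes from the actual config.

```python
ALLOWED_UA_PREFIXES = ("FediFetcher/",)  # hypothetical form of the UA rule
BASE_SCORE = 50   # the 40 + 10 seen on the sample's later requests
ALLOW_BIAS = 50   # ASSUMPTION: how far the rule biases the score


def process(path, user_agent):
    """Return (score, flags) for one request."""
    flags = []
    if path == "/robots.txt":
        # 0-rated: returns before any later rule (including the UA rule) runs
        return 0, flags
    score = BASE_SCORE
    if user_agent.startswith(ALLOWED_UA_PREFIXES):
        flags.append("ua_allow")
        score -= ALLOW_BIAS
    return score, flags


ff_robots = process("/robots.txt", "FediFetcher/7.1.3")
masto_later = process(
    "/users/scrapersnitch/outbox",
    "Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0)",
)
```

The allowed UA only ever reaches the early return, so `ua_allow` is never attached, and the requests that do get scored carry the Mastodon UA string above, which the rule has nothing to match against.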
This UA structure is actually quite unusual. The Glitch UA normally looks more like
Critical bits there are:
Mastodon/
glitch
Neither of these is true in this client's UA string.
To be fair, it's not just that instance, `mastodon.social` is rocking a different format as well
Just to show how unusual that is, this is a count of loglines, grepping for the two different forms
It's even worse for glitch
So, it's probably not too surprising that those requests flagged up as unusual.
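For anyone wanting to reproduce that kind of count, a small Python sketch is below. The Mastodon-first pattern matches the UA string quoted above; the `http.rb`-first pattern reflects the format Mastodon releases before 4.3 have used, written from memory rather than taken from these logs, so treat it (and the example UA strings and hostnames) as assumptions.

```python
import re
from collections import Counter

# Older form (ASSUMPTION): http.rb/5.1.1 (Mastodon/4.2.10; +https://host/)
HTTPRB_FIRST = re.compile(r"^http\.rb/\S+ \(Mastodon/[^;]+; \+[^)]*\)")
# Form seen in this ticket: Mastodon/4.3.0-... (http.rb/5.2.0; +https://host/)
MASTODON_FIRST = re.compile(r"^Mastodon/\S+ \(http\.rb/[^;]+; \+[^)]*\)")


def classify(user_agent):
    """Label a UA string by which product token comes first."""
    if HTTPRB_FIRST.match(user_agent):
        return "http.rb-first"
    if MASTODON_FIRST.match(user_agent):
        return "mastodon-first"
    return "other"


# Hypothetical UA strings standing in for real loglines
uas = [
    "http.rb/5.1.1 (Mastodon/4.2.10; +https://example.social/)",
    "http.rb/5.1.1 (Mastodon/4.2.10; +https://example.org/)",
    "Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; +https://example.net/)",
    "FediFetcher/7.1.3; +https://example.net/ (https://go.thms.uk/ff)",
]
counts = Counter(classify(ua) for ua in uas)
```

A ruleset that treats only the first form as normal will see the second as an anomaly, which fits what happened here.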
18-Aug-24 08:39
I've adjusted the ruleset config to be more permissive of this.
18-Aug-24 09:23
The changes made will help with this specific incident (and ones like it).
I've gone through receipts and toots and removed the associated IPs (there were 2, 3 including the one I manually cleared out a couple of weeks ago).
For reference, the subnets of the affected IPs were:
This does leave us with the one outstanding question:
I think there's a reasonable argument for it continuing to be per IP - we do periodically see stuff that changes UA on every request.
It's certainly got to continue to be considered problematic if something fetches `robots.txt` and ignores the rules in there.
The main problem in this issue was that the subsequent requests were also sufficiently unusual (change in UA structure etc). The `robots.txt` scoring tipped it over the edge, but it's quite possible that we'd have seen notifications fire for some other requests further down the road.
18-Aug-24 09:37
I'm going to close this as fixed.
I may look at getting an instance of FediFetcher up and running to help confirm, though.
18-Aug-24 13:17
changed the description
18-Aug-24 13:56
mentioned in commit sysconfigs/mastodon-botsnitch-logagent-config@20b930ebd6376d64f66486ba4c5e59d0c92bbf5c
Message
fix: adjust ruleset config to better handle unusual UA forms (project-management-only/scraper-snitch-bot#8)