
project-management-only/scraper-snitch-bot#8: Seems to keep picking FediFetcher up



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 17-Aug-24 22:44



Description

There have been a couple of instances now of IPs being flagged because they're running FediFetcher, for example:

### Overview

Observed Requests: 15
First Seen: 2024-06-27 18:46:51 (UTC)
Last Seen:  2024-08-16 18:05:08 (UTC)

Average number of daily requests: 2.142857142857143

----

### Observed Useragents

  - FediFetcher/7.1.2; +<snipped> (https://go.thms.uk/ff)
  - Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; <snipped>)
  - FediFetcher/7.1.3; +<snipped> (https://go.thms.uk/ff)


----

### Observed Paths

  - /.well-known/webfinger
  - /users/scrapersnitch/collections/tags
  - /users/scrapersnitch/followers
  - /users/scrapersnitch
  - /users/scrapersnitch/collections/featured
  - /users/scrapersnitch/following
  - /robots.txt
  - /users/scrapersnitch/outbox

For a Summary of path sensitivity see https://projects.bentasker.co.uk/gils_projects/wiki/project-management-only/scraper-snitch-bot/page/Request-Paths.html


----

### Flags

  - Fetches-robots.txt
  - Ignores-robots.txt
  - Does-not-fetch-robots.txt

This is undesirable - FediFetcher's a tool to fetch missing replies etc. from toots.

The problem is, I made an adjustment the other day to account for this, but the instance above still got flagged up. Need to look at scoring to see why (I'm guessing it's the score resulting from it ignoring robots.txt).

Edit: an overview of FediFetcher has been published on the wiki to try and help admins assess whether it's something that they want to permit.




Activity


assigned to @btasker

changed the description

I'm guessing it's the score resulting from it ignoring robots.txt

Probably worth checking the upstream buglist. https://blog.thms.uk/fedifetcher says

FediFetcher also respects blocks in the robots.txt, and this is the most efficient way if you want to block FediFetcher from your instance (see below).

It does appear to be ignoring it (although perhaps the other paths we're seeing aren't requested by the FediFetcher UA - it's a distinct Python script - need to check that too).

FTR, my robots.txt is

$ curl https://mastodon.bentasker.co.uk/robots.txt

# Robot comrades, you are being watched
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

OK, so let's look at the flow of requests for this IP:

The masto instance fetches a status:

-       -       [16/Aug/2024:09:44:02 +0000]    "GET /users/ben/statuses/112971048637951009 HTTP/1.1"   200     2584    "-"     "Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; +<redacted>)"        "-"     "mastodon.bentasker.co.uk"      CACHE_- 2.423   mikasa  -       "-"     "-"     "-"

Then FediFetcher appears and so fetches robots.txt:

 -       -       [16/Aug/2024:16:35:40 +0000]    "GET /robots.txt HTTP/1.1"      200     122     "-"     "FediFetcher/7.1.3; +<redacted> (https://go.thms.uk/ff)"  "-"     "mastodon.bentasker.co.uk"      CACHE_- 0.000   mikasa  -       "-"     "-"     "-"

Later in the day there are subsequent requests, but all made with the Masto UA - i.e. FediFetcher has honoured robots.txt.

So, FediFetcher honours robots.txt; however, the scoring looks at whether robots.txt is being ignored on a per-IP basis:

        # Note that we've seen this IP request robots.txt
        if o['ip'] not in robotstxt_observed:
            robotstxt_observed.append(o['ip'])

Which does make some sense.

The user-agent header is very much under the client's control and we wouldn't want detection to miss out on a malicious bot which sends a different UA each time.

Funnily enough, the FediFetcher issue tracker contains a link to an example of where this can happen without it being under the user's control:

Unfortunately, if you use Python's "robotparser" function, it uses the default Python user agent, and there's no parameter to change that. If "robotparser"'s attempt to read "robots.txt" is refused (not just URL not found), it then treats all URLs from that site as disallowed.

As noted in the GH issue, there is a way to work around it, which FF has obviously used, as the logs contain a correct UA.
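
For context, the usual workaround (and presumably roughly what FediFetcher does - I haven't checked its source, so treat this as a sketch) is to fetch robots.txt with an explicit User-Agent and hand the body to robotparser via parse(), rather than letting robotparser's read() make the request with Python's default UA:

    import urllib.request
    import urllib.robotparser

    # Fetch robots.txt ourselves so we control the User-Agent header,
    # instead of calling RobotFileParser.read() (which uses Python's default UA)
    req = urllib.request.Request(
        "https://mastodon.bentasker.co.uk/robots.txt",
        headers={"User-Agent": "FediFetcher/7.1.3; (https://go.thms.uk/ff)"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode("utf-8", errors="replace")

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(body.splitlines())

    # With the Disallow: / rule above, this is False for any path
    print(rp.can_fetch("FediFetcher", "https://mastodon.bentasker.co.uk/users/scrapersnitch/outbox"))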

So, the first problem here is essentially a philosophical one: should robots.txt tracking be per IP or per IP and UA?

The latter risks missing things whilst the former leads us, well, here.
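
To make the trade-off concrete, the difference is just the key used for the observed list. A minimal sketch continuing the snippet above (the o['ua'] key is assumed here to mirror the existing o['ip']; this isn't the bot's actual code):

    # Current behaviour: track per IP, regardless of UA
    if o['ip'] not in robotstxt_observed:
        robotstxt_observed.append(o['ip'])

    # Alternative: track per (IP, UA) pair. FediFetcher's fetch would no longer
    # taint the Mastodon UA's requests from the same host, but a bot which
    # rotates its UA on every request would never look like it had fetched
    # robots.txt at all.
    if (o['ip'], o['ua']) not in robotstxt_observed:
        robotstxt_observed.append((o['ip'], o['ua']))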

We'll figure that out in a bit; first, let's check scoring to make sure that that is the issue (and the only issue).

The IP's loglines have been written into ~/sample.log, so

cat ~/sample.log | docker run \
> --rm \
> -i \
> -v $PWD/config.yaml:/config.yaml \
> -e DRY_RUN="Y" \
> registry.bentasker.co.uk/misc/python-mastodon-bot-detection:0.12

Score/request

    /robots.txt          (FF UA)    0
    webfinger            (Mast UA)  50  ignored_robotstxt,noreferrer
    scrapersnitch user   (Mast UA)  50  ignored_robotstxt,noreferrer
    ..etc..

The robots.txt is (as expected) neutral.

Every request after that, though, had 40 added onto it because the IP was perceived to be ignoring robots.txt.

They picked up another 10 for not having a referrer.

That 10 is normally a harmless score, but combining that with ignoring robots.txt gets us into notifiable territory.

None of the requests picked up any other flags.
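
Put numerically (the 40 and 10 come from the dry run above; treating 50 as the point at which notifications fire is an assumption for illustration):

    # Per-request scoring for the post-robots.txt requests, roughly
    IGNORED_ROBOTSTXT = 40  # per-IP penalty for ignoring robots.txt
    NO_REFERRER = 10        # normally harmless on its own

    score = IGNORED_ROBOTSTXT + NO_REFERRER
    print(score)  # 50 - which, per the dry run, is enough to notify on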

A follow-on question, though, is why my previous intervention didn't help.

I added a UA rule which should have biased the scoring to the extent that it didn't fire. Yet, the sample doesn't get tagged with the ua_allow flag.

Ahhh, but it wouldn't...

The only request we see from the allowed UA is for robots.txt. That's a 0-rated request and never proceeds on to the subsequent processing.
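
Paraphrased, the ordering looks something like the below (function names, flag names and values are illustrative, not lifted from the bot):

    def base_score(o):
        # Stand-in for the per-path scoring; the robots.txt fetch comes back 0-rated,
        # anything else gets a nominal non-zero score so it proceeds
        return 0 if o['path'] == '/robots.txt' else 10

    def score_request(o, robotstxt_observed, ua_allow_patterns):
        flags = []
        score = base_score(o)

        # 0-rated requests never reach the later processing, so a ua_allow
        # rule matching the FediFetcher UA is never applied: the only request
        # carrying that UA is the (0-rated) robots.txt fetch
        if score == 0:
            return score, flags

        if o['ip'] in robotstxt_observed:
            score += 40
            flags.append('ignored_robotstxt')

        for pattern in ua_allow_patterns:
            if pattern.search(o['ua']):
                flags.append('ua_allow')
                score -= 100  # illustrative bias

        return score, flags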

The requests after that all use the UA string Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; +<redacted>)

This UA structure is actually quite unusual. The Glitch UA normally looks more like

http.rb/5.2.0 (Mastodon/4.3.0-nightly.2024-07-05-security+glitch; +<redacted>)

Critical bits there are:

  - A bracket before Mastodon/
  - A semi-colon after glitch

Neither of these is true in this client's UA string.

To be fair, it's not just that instance; mastodon.social is rocking a different format as well:

Mastodon/4.3.0-nightly.2024-08-12 (http.rb/5.2.0; +https://mastodon.social/)

Just to show how unusual that is, this is a count of loglines, grepping for the two different forms

$ grep '(Mastodon' sample.log  | wc -l
22915
$ grep '"Mastodon' sample.log  | wc -l
1754

It's even worse for glitch

$ grep 'glitch\.' sample.log  | wc -l
337
$ grep 'glitch;' sample.log  | wc -l
13269

So, it's probably not too surprising that those requests flagged up as unusual.
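
As a rough illustration of why (the patterns here are mine, not the actual ruleset config, and the redacted URL is replaced with a placeholder):

    import re

    # A pattern anchored to the usual Glitch form matches the common UA layout...
    usual_glitch = re.compile(r'^http\.rb/[\d.]+ \(Mastodon/.+\+glitch;')
    # ...but not the layout this client sends, where Mastodon/ leads and
    # "glitch" is followed by a dot rather than a semi-colon
    unusual_glitch = re.compile(r'^Mastodon/.+\+glitch\.\w+ \(http\.rb/[\d.]+;')

    ua = "Mastodon/4.3.0-alpha.5+glitch.0813_115fb0a (http.rb/5.2.0; +https://instance.example/)"
    print(bool(usual_glitch.search(ua)))    # False - a rule expecting the usual form misses it
    print(bool(unusual_glitch.search(ua)))  # True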

I've adjusted the ruleset config to be more permissive of this.

The changes made will help with this specific incident (and ones like it).

I've gone through receipts and toots and removed the associated IPs (there were 2; 3 including the one I manually cleared out a couple of weeks ago).

For reference, the subnets of the affected IPs were:

This does leave us with the one outstanding question:

should robots.txt tracking be per IP or per IP and UA?

I think there's a reasonable argument for it continuing to be per IP - we do periodically see stuff that changes UA on every request.

It's certainly got to continue to be considered problematic if something fetches robots.txt and ignores the rules in there.

The main problem in this issue was that the subsequent requests were also sufficiently unusual (change in UA structure etc). The robots.txt scoring tipped it over the edge, but it's quite possible that we'd have seen notifications fire for some other requests further down the road.

I'm going to close this as fixed.

I may look at getting an instance of FediFetcher up and running to help confirm, though.

changed the description

mentioned in commit sysconfigs/mastodon-botsnitch-logagent-config@20b930ebd6376d64f66486ba4c5e59d0c92bbf5c

Commit: sysconfigs/mastodon-botsnitch-logagent-config@20b930ebd6376d64f66486ba4c5e59d0c92bbf5c
Author: root
Date: 2024-08-18T08:36:21.000+00:00

Message

fix: adjust ruleset config to better handle unusual UA forms (project-management-only/scraper-snitch-bot#8)

+2 -1 (3 lines changed)