project-management-only/scraper-snitch-bot#9: Allow subnet matching



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.15
Created: 08-Jul-25 09:06



Description

We currently get quite a lot of noise as the result of Meta/Facebook's bots.

They connect using IPv6 and we get a different IP each time:

  • 2a03:2880:3ff:72::
  • 2a03:2880:3ff:71::
  • 2a03:2880:11ff:9::
  • 2a03:2880:2ff:7::

et cetera.

It generates quite a lot of noise and there's limited value in blocking individual IPv6 addresses.

What I'd like to be able to do is to mark an entire subnet to be blocked.

Given the current set of detections, that could take one of three forms:

  • Functionality changes so we can advertise the supernet and ignore any other matches within it
  • or (as a short term fix) allow me to allowlist the supernet so there's no further noise (I can then update notes for the existing matches)
  • or (as a complete hack) add an inverse grep to the log parsing to exclude that specific subnet from inputs (though that'll screw regenerations etc)


Activity


whois gives the following information for this block

inet6num:       2a03:2880:300::/40
netname:        EAG
country:        US
admin-c:        RD4299-RIPE
tech-c:         RD4299-RIPE
status:         ASSIGNED
mnt-by:         fb-neteng
mnt-by:         facebook-neteng

Just for confirmation, we don't currently have the ability to exclude via subnet - it looks for exact matches:

        if bot['ip'] in config['exclude_ips']:
            # Allowlisted IP
            print(f"Skipping {bot['ip']}, present in allowlist")
            continue
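
A minimal sketch of what a subnet-aware version of that check could look like, using Python's standard `ipaddress` module. The helper name `is_excluded` is hypothetical (it is not part of the bot), and it assumes the allowlist may hold a mix of single IPs and CIDR strings:

```python
import ipaddress

def is_excluded(ip, exclude_entries):
    """Return True if ip equals, or falls within, any allowlist entry.

    Hypothetical helper: entries may be single IPs or CIDR networks.
    """
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IP: fall back to the existing exact-match behaviour
        return ip in exclude_entries

    for entry in exclude_entries:
        try:
            # A bare IP parses as a /32 or /128 network;
            # strict=False tolerates entries with host bits set
            net = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            # Skip entries that are neither an IP nor a network
            continue
        if addr.version == net.version and addr in net:
            return True
    return False
```

With this, `is_excluded(bot['ip'], config['exclude_ips'])` would behave like the exact-match check for plain IPs while also honouring entries such as `2a03:2880:300::/40`.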

I wonder whether the answer is to have calcFileName do the swap?

def calcFileName(ip):
    ''' Turn the bot's IP into a filesystem and url safe filename
    '''
    # generate the filename
    fname = ip.replace(".", "-").replace(":", "-")

    # Play it safe and assume someone found a way to get an invalid IP into the db
    # strip chars so they can't pull filesystem shenanigans
    fname = re.sub(r'[^a-fA-F0-9\-]+', '', fname)
    return fname

I could have a manual config list of prefixes and have the function override the filename if there's a match.

That would also prevent toots from being sent, because the state file will already exist.

changed the description

If I do ^ we also need to think about what will happen during receipt regeneration.

It looks like it should be OK, but we need to make sure that the attribute ip in the state file is an IP rather than the subnet (otherwise various lookups may fail)

def regenerate():
    ''' Look for IPs that we know about and regenerate information for them

    [misc/python-mastodon-snitch-bot#2](/issue/misc/python-mastodon-snitch-bot/2.html)
    '''

    # Load the main config
    config = loadConfig(CONFIG_FILE)

    conn, cursor = getCursor(config['flightsql']['host'], 
                             config['flightsql']['port'], 
                             config['flightsql']['token'],
                             config['flightsql']['bucket'])
    runtime = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S (UTC)')


    # Get a list of state files
    dir_path = r'{}/state-*.txt'.format(config['state_dir'])
    res = glob.glob(dir_path)

    for statefile in res:
        with open(statefile, "r") as fh:
            try:
                bot = yaml.safe_load(fh)
            except:
                print(f"Invalid YAML in {statefile}")
                continue

        receipt_file = f"{config['receipt_dir']}/{bot['fname']}.txt"

        if bot['ip'] in config['exclude_ips']:
            # Allowlisted IP, remove the report/state
            print(f"Removing {bot['ip']}, present in allowlist")

            if os.path.exists(receipt_file):
                os.remove(receipt_file)

            os.remove(statefile)
            continue


        # Get new receipts
        new = {}
        new['receipts'] = getReceipts(cursor, bot['ip'], bot['queried_period'], config['flightsql']['forever_filter'], config['robots_disallow_all'].upper())
        new['extended_info'] = getIPInfo(bot['ip'])
        bot['last_checked'] = runtime

        merged_bot = mergeReceipts(bot, new)

        # Write the updated receipt
        writeReceiptFile(merged_bot, config['receipt_dir'], bot['fname'], runtime, config['notes_dir'])

        # Write the latest state to the statefile
        with open(statefile, "w") as fh:
            yaml.dump(merged_bot, fh)
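
The concern about the `ip` attribute could be guarded against with a small check before the lookups run. This is only a sketch: the function name `state_has_valid_ip` is hypothetical, and it assumes `bot` is the dict loaded from the state file.

```python
import ipaddress

def state_has_valid_ip(bot):
    """Return True if bot['ip'] is a single IP address rather than a subnet.

    Hypothetical guard: ip_address() rejects CIDR strings such as
    '2a03:2880:300::/40', so a subnet stored by mistake is caught here.
    """
    try:
        ipaddress.ip_address(bot["ip"])
    except (KeyError, ValueError):
        return False
    return True
```

A state file failing this check could be skipped (with a log line) rather than being passed on to `getReceipts` and `getIPInfo`.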

mentioned in commit misc/python-mastodon-snitch-bot@80fce5c549a1ede84a55a8f3754b15d893a91b52

Commit: misc/python-mastodon-snitch-bot@80fce5c549a1ede84a55a8f3754b15d893a91b52
Author: B Tasker
Date: 2025-07-08T10:38:52.000+01:00

Message

feat: allow IP prefixes to be grouped into a single state and receipt file misc/python-mastodon-snitch-bot#7

+17 -3 (20 lines changed)

The basic functionality is implemented. It's possible to provide a list of string prefixes:

grouped_prefixes:
  - 2a03:2880:3

This isn't going to be massively useful to people though. The aim of that config is to block 2a03:2880:300::/40.

I deliberately went with string matching rather than IP parsing to keep the check cheap, but we need a way to communicate to users which subnet they should be blocking.

One option would be to list subnets and check whether the IP is in one of them; the other would be to make the config more verbose:

grouped_prefixes:
  - name: 2a03:2880:300::/40
    pref: 2a03:2880:3

I think it's probably best to do it properly and parse IPs and Networks - otherwise we'll only end up overblocking by accident
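
To illustrate the overblocking risk, here is a small comparison of the two approaches. It is a sketch only: the names `prefix_match` and `network_match` are hypothetical, and the sample addresses are chosen for illustration rather than taken from logs (apart from the first).

```python
import ipaddress

PREFIX = "2a03:2880:3"
NETWORK = ipaddress.ip_network("2a03:2880:300::/40")

def prefix_match(ip):
    """Cheap string comparison, as in the first implementation."""
    return ip.startswith(PREFIX)

def network_match(ip):
    """Proper membership test against the parsed network."""
    return ipaddress.ip_address(ip) in NETWORK

samples = [
    "2a03:2880:3ff:72::",    # inside the /40
    "2a03:2880:3100:1::",    # outside the /40, but shares the string prefix
    "2a03:2880:03ff:72::",   # inside the /40, written with a leading zero
    "2a03:2880:11ff:9::",    # outside both
]

for ip in samples:
    print(f"{ip:24} prefix={prefix_match(ip)!s:5} network={network_match(ip)}")
```

The second sample is the accidental overblock (the string prefix also matches `2a03:2880:3100::/40`), and the third shows that a string prefix can also miss an address that is written in a non-canonical form.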


mentioned in commit misc/python-mastodon-snitch-bot@68fb3994d54ba8bb1ec27a783771f0e7a1056717

Commit: misc/python-mastodon-snitch-bot@68fb3994d54ba8bb1ec27a783771f0e7a1056717
Author: B Tasker
Date: 2025-07-08T10:50:23.000+01:00

Message

feat: group by subnet rather than a string prefix (misc/python-mastodon-snitch-bot#7)

+5 -3 (8 lines changed)

OK, the logic is in place then - what we need to look at now is how to go about exposing information to the user.

The state and receipt filename has a -subnet suffix (e.g. state-2a03-2880-300--40-subnet.txt), but we probably want the receipt contents to note that it's a subnet match.
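
For reference, a sketch of how a filename like that can be derived. This is not the committed code: the function name `subnet_file_name` and the `grouped_subnets` parameter are assumptions, and the sanitising step is a simplified take on what `calcFileName` does.

```python
import ipaddress
import re

def subnet_file_name(ip, grouped_subnets):
    """Return a filesystem-safe name, grouped by subnet where one matches.

    Hypothetical helper: grouped_subnets is a list of CIDR strings.
    """
    addr = ipaddress.ip_address(ip)
    label, suffix = ip, ""
    for cidr in grouped_subnets:
        if addr in ipaddress.ip_network(cidr):
            # Name the file after the subnet rather than the individual IP
            label, suffix = cidr, "-subnet"
            break

    fname = label.replace(".", "-").replace(":", "-")
    # Strip anything that is not a hex digit or a dash (this drops the "/")
    fname = re.sub(r"[^a-fA-F0-9-]+", "", fname)
    return fname + suffix

print(subnet_file_name("2a03:2880:3ff:72::", ["2a03:2880:300::/40"]))
# → 2a03-2880-300--40-subnet
```

Every address in the subnet then maps onto the same state and receipt file, which is what stops repeat toots.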

We should also hide various bits of information

  • The request count will be incorrect (it'll be for whatever IP last triggered it)
  • user_agents will be correct but may be incomplete (if different UAs are used by different IPs within the subnet)

mentioned in commit misc/python-mastodon-snitch-bot@460cee180fe1bd0f2049d53f3980d0804e8bad86

Commit: misc/python-mastodon-snitch-bot@460cee180fe1bd0f2049d53f3980d0804e8bad86
Author: B Tasker
Date: 2025-07-08T11:04:33.000+01:00

Message

feat: receipt files should correctly reflect that the match is for a subnet (misc/python-mastodon-snitch-bot#7)

+26 -15 (41 lines changed)

Receipt files now reflect that they are for a subnet match:

# Suspected Mastodon Scraper: 2a03:2880:300::/40

File first generated: 2025-07-08 10:05:45 (UTC)
File (re)generated: 2025-07-08 10:05:45 (UTC)


### IP Information

Subnet: 2a03:2880:300::/40
rDNS: None
ASN: [32934](https://ipinfo.io/AS32934)

Tor Exit Node: False

----

### Overview

Warning: the following stats are likely to be incorrect because this match is part of a known wider subnet

Observed Requests: 14
First Seen: 2025-07-06 13:40:55 (UTC)
Last Seen:  2025-07-07 05:12:43 (UTC)

Average number of daily requests: 7.0


We should probably adjust the toot text too.


mentioned in commit misc/python-mastodon-snitch-bot@3d890efc69dc1a2378d72709cde737efbd9ca209

Commit: misc/python-mastodon-snitch-bot@3d890efc69dc1a2378d72709cde737efbd9ca209
Author: B Tasker
Date: 2025-07-08T11:14:55.000+01:00

Message

feat: adjust the toot text if the match is for a subnet (misc/python-mastodon-snitch-bot#7)

+6 -3 (9 lines changed)

I think we should probably log a flag too, that way people can look it up on the wiki and understand why the receipt file is different.


mentioned in commit misc/python-mastodon-snitch-bot@ce1f798c1d92cbf98af688067155ae0bd5813df6

Commit: misc/python-mastodon-snitch-bot@ce1f798c1d92cbf98af688067155ae0bd5813df6
Author: B Tasker
Date: 2025-07-08T11:19:54.000+01:00

Message

feat: add flag Subnet-Match (misc/python-mastodon-snitch-bot#7)

+6 -3 (9 lines changed)

assigned to @btasker

Cool, this seems to be working - I've tested against the last 48hrs of events.

Wiki has been updated

Closing, ready to cut a release.

Just for my own reference in future, when adding a subnet to the config, it's also possible to pre-stage some additional notes.

For example, for 2a03:2880:300::/40 I did

nano notes/2a03-2880-300--40-subnet.txt

FTR, other observed Facebook subnets are

  • 2a03:2880:1100::/40
  • 2a03:2880:1300::/40
  • 2a03:2880:2700::/40
  • 2a03:2880:200::/40
  • 2a03:2880:3100::/40
  • 2a03:2880:3200::/40

Realistically, we probably want to group that into 2a03:2880::/32
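
A quick sanity check of that grouping, sketched with the standard `ipaddress` module. The list combines the subnets above with the original `2a03:2880:300::/40`. The variable names are illustrative only.

```python
import ipaddress

observed = [ipaddress.ip_network(n) for n in (
    "2a03:2880:200::/40",
    "2a03:2880:300::/40",
    "2a03:2880:1100::/40",
    "2a03:2880:1300::/40",
    "2a03:2880:2700::/40",
    "2a03:2880:3100::/40",
    "2a03:2880:3200::/40",
)]
supernet = ipaddress.ip_network("2a03:2880::/32")

# Every observed /40 should sit inside the proposed /32
all_covered = all(net.subnet_of(supernet) for net in observed)

# Merging only adjacent networks shows how sparse the observed set is
collapsed = list(ipaddress.collapse_addresses(observed))

print(all_covered)
print([str(net) for net in collapsed])
```

Only the `200::/40` and `300::/40` pair is adjacent, so collapsing the observed networks does not get anywhere near a single block. Grouping into the /32 is therefore a deliberate choice to cover ranges that have not been observed yet.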