
project-management-only/scraper-snitch-bot#10: Should We Keep The Service Running?



Issue Information

Issue Type: issue
Status: opened
Reported By: btasker
Assigned To: btasker

Created: 12-Jan-26 17:38



Description

For a while now, I've been considering whether or not to keep this service up and running.

It's definitely not providing the value that it once did - matches are now quite sporadic and (as a result) false positives make up a bigger proportion of the output.

It's a little under 3 years since I launched the service (actually, by sheer coincidence, it's 3 years to the day since I broke ground on it).

In the years since I launched the bot, the landscape has changed a bit:

  • More scrapers now honour robots.txt
  • Those that don't are now much more surreptitious, with requests originating from a wide range of IPs and UAs¹
  • There are more legitimate services (like FediFetcher) using APIs
  • Increases in IPv6 usage increase the potential for duplicate alerts (though this was partially mitigated in v0.15)
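For illustration, here's a minimal sketch of the kind of prefix-based de-duplication that an IPv6 mitigation like v0.15's implies. The function name and the choice of a /64 prefix are assumptions for the example, not necessarily what the bot actually does:

```python
import ipaddress


def dedupe_key(addr: str) -> str:
    """Return a de-duplication key for an IP address.

    IPv6 clients commonly rotate through addresses within the same /64
    (e.g. via privacy extensions), so collapsing to the prefix avoids
    raising a fresh alert for every address in that range.
    IPv4 addresses are used as-is.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        return str(ipaddress.ip_network(f"{addr}/64", strict=False))
    return addr
```

With this, `2001:db8::1` and `2001:db8::2` both collapse to `2001:db8::/64` and so would only trigger a single alert.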

Sadly, these aren't really signs that things have improved and instead indicate that Scraper Snitch's design is now outdated.


  1. Something, incidentally, that the AI industry have also been caught doing




Activity


Risk

Although it might seem tempting to keep the service online and (continue to) deal with false positives as/when they arrive, doing so brings a number of risks.

False Sense Of Security: I've always been very clear that the bot is not a panacea. However, there's a risk that instance admins currently feel more protected than they actually are.


False Positives / Fedi: Being falsely flagged is potentially problematic for the instances involved. Hopefully admins receiving an alert will check the receipts and not block obvious false positives, but flagged instances have no real recourse (or even awareness of being flagged) and so may be cut off from a portion of the fediverse.


False Positives / Me: The bot operates under the lawful basis of Legitimate Interests.

When originally standing up the service I conducted a Legitimate Interests Assessment to balance the rights of data subjects. That assessment considered the possibility of inadvertently publishing data relating to an individual (whether a developer who'd put their details in a relevant request header, or someone running a scraper from their home IP).

That, to my knowledge, hasn't happened yet. However, an elevated rate of false positives probably also increases the likelihood of that occurring.

So, it'd be prudent to also consider the possible risk that that poses (both to me and the future unfortunate individual).


Usefulness: If the bot is to stick around in any useful capacity, it'll need redesigning and reworking so that it can better detect modern threats.

That's no small task: rewriting the code is the easy bit. I'd first have to come up with a way to (accurately) identify scrapers that are going to significant lengths to mask themselves.

It'll vary by individual, but there's a very good chance that most view AI scrapers as the most significant threat. As noted above, AI companies have gone to huge (and reprehensible) lengths to try and hide their scraping activities.

Although not currently used to collect training data, various AI companies have even launched their own AI Browsers. Presumably, at some point in the future (if/once they've seen sufficient adoption) those browsers' inputs will start being used for training, circumventing the protections put in place by the people who actually own the content.

Realistically, the world has had to move on from "detect and block" to "detect and poison".

changed the description

Benefit

Although the bot is not without its benefits, detections are now relatively rare.

Recent detections are:

  • Today: False positive
  • Jan 4: A Python-based scraper
  • Nov 2: A Python-based scraper
  • Aug 29: A UA-less bot collecting instance details
  • Jul 8: A Meta/Facebook scraper
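Of those, a UA-less bot is the easiest class to catch. As a hypothetical sketch (not the bot's actual detection logic), flagging combined-format access log lines with an empty or missing User-Agent might look like:

```python
import re

# Combined log format: the User-Agent is the final quoted field.
# This regex and the log format itself are assumptions for the example.
LOG_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*"'  # request line
    r' \d+ \S+'                               # status, bytes
    r' "[^"]*"'                               # referrer
    r' "(?P<ua>[^"]*)"$'                      # user-agent
)


def is_ua_less(line: str) -> bool:
    """Flag requests where the User-Agent header is empty or absent."""
    m = LOG_RE.search(line)
    if not m:
        return False
    return m.group("ua") in ("", "-")
```

In practice the real detection would need more context (paths hit, request rate) to avoid flagging the occasional curl-without-UA from a human.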

Prior to Jul 8, detections of Meta scrapers generated a lot of noise (addressed in #9)

Continued detections mean that there is some benefit there.

assigned to @btasker

Having written all of that out, it's quite hard to make a case for keeping the bot online.

It's now only sporadically detecting scrapers and the risk that it potentially poses to others outweighs the tiny benefit that it's delivering.

Although I could, perhaps, make code changes to make it more effective, we're not talking about minor changes: it's an incredibly hard problem to solve and one that I don't really have the time/energy to do justice to (if I'm even capable of doing so).

OK, in which case, we need a plan.


Communication

  • Deprecation notice: I don't think there's any real value in advance notice of deprecation. If the system were generating matches more regularly it'd probably be worth it, but as it's not, I think we can safely just turn it off.
  • Supporting Blog Post: The system was launched via blog post so I should probably advertise the shut-down with similar prominence
  • System toot: the bot user should toot out a link to the above blog post

Tidy Up

  • Bot: The bot and log analyser can be disabled as soon as notifications have gone out
  • Receipts deletion: to give admins time to review matches, receipts shouldn't be deleted immediately. But, they do need to be tidied away - I think we should do that a week after publicising the shutdown (at which point we should return a 410 - Gone for receipts files)
  • InfluxDB deletion: We can delete that data immediately
  • Toots: Tooted alerts are followers-only, so it's probably OK to leave them there for a little longer. However, I'll enable time-based auto-deletion so that they begin to expire out (note: need to double-check that that can be applied retroactively)
  • Privacy Policy Wording: I think it's OK to leave the Privacy Policy as-is for the time being, but should eventually rewrite it to remove the reference to publication
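For the receipts 410s, a web server snippet along these lines would do it (nginx syntax; the `/receipts/` path is an assumption for the example, not necessarily the service's actual layout):

```nginx
# Hypothetical: once the grace week is up, answer 410 Gone for
# anything under the (assumed) receipts path instead of serving files.
location /receipts/ {
    return 410;
}
```

Unlike a 404, the 410 explicitly tells callers (and well-behaved crawlers) that the resource has been deliberately and permanently removed.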

Timeline

All else being good, I can probably start drafting a post tomorrow.