For a while now, I've been considering whether or not to keep this service up and running.
It's definitely not providing the value that it once did - matches are now quite sporadic and (as a result) false positives make up a bigger proportion of the output.
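To make that base-rate effect concrete (the figures below are made up purely for illustration and aren't the bot's real stats): if the background rate of false alarms stays roughly constant while genuine detections tail off, false positives inevitably become a larger share of what gets posted.

```python
# Hypothetical numbers only: as genuine detections fall while the background
# rate of false alarms stays the same, false positives dominate the output.
def false_positive_share(true_detections: int, false_positives: int) -> float:
    """Return false positives as a fraction of all alerts raised."""
    total = true_detections + false_positives
    return false_positives / total if total else 0.0

print(f"{false_positive_share(50, 5):.0%}")  # busier days: ~9% of alerts are noise
print(f"{false_positive_share(5, 5):.0%}")   # now: 50% of alerts are noise
```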
It's a little off 3 years since I launched the service (actually, by sheer coincidence, it's 3 years to the day since I broke ground on it).
In the years since I launched the bot, the landscape has changed a bit: scrapers increasingly ignore robots.txt (something, incidentally, that the AI industry have also been caught doing), and scraping operations generally go to much greater lengths to hide their activity.
Sadly, these aren't really signs that things have improved; instead, they indicate that Scraper Snitch's design is now outdated.
Activity
12-Jan-26 17:38
Risk
Although it might seem tempting to keep the service online and (continue to) deal with false positives as/when they arise, doing so brings a number of risks.
False Sense Of Security: I've always been very clear that the bot is not a panacea. However, there's a risk that instance admins currently feel more protected than they actually are.
False Positives / Fedi: Being falsely flagged is potentially problematic for the instance concerned. Hopefully, admins receiving an alert will check the receipts and not block obvious false positives, but flagged instances have no real recourse (or even awareness of the flag), so they may end up cut off from a portion of the fediverse.
False Positives / Me: The bot operates under the lawful basis of Legitimate Interests.
When originally standing up the service, I conducted a Legitimate Interests Assessment to weigh the service's purpose against the rights of data subjects. That assessment considered the possibility of inadvertently publishing data relating to an individual (whether a developer who'd put their details into a relevant request header, or someone running a scraper from their home IP).
That, to my knowledge, hasn't happened yet. However, an elevated rate of false positives probably also increases the likelihood of it occurring.
So it'd be prudent to also consider the risk that that poses (both to me and to any future unfortunate individual).
Usefulness: If the bot is to stick around in any useful capacity, it'll need redesigning and reworking so that it can better detect modern threats.
That's no small task: rewriting the code is the easy bit. I'd first have to come up with a way to (accurately) identify scrapers that are going to significant lengths to mask themselves.
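To illustrate why that's hard: the sketch below shows generic, signature-style detection (to be clear, this is not Scraper Snitch's actual logic; the user-agent fragments and rate threshold are illustrative assumptions). Checks like these only work when a scraper announces itself or behaves noisily.

```python
# A rough sketch of signature-style detection (NOT the bot's actual logic):
# it works when a scraper announces itself, but fails once the scraper spoofs
# a browser user-agent, rotates IPs and paces its requests.
KNOWN_SCRAPER_UA_FRAGMENTS = (  # illustrative examples only
    "GPTBot", "CCBot", "meta-externalagent", "Bytespider",
)

def looks_like_scraper(user_agent: str, requests_per_minute: float) -> bool:
    """Flag requests that either self-identify or hammer the server."""
    ua = (user_agent or "").lower()
    if any(fragment.lower() in ua for fragment in KNOWN_SCRAPER_UA_FRAGMENTS):
        return True  # self-identified: the easy case
    return requests_per_minute > 120  # arbitrary rate threshold

# A masked scraper sails straight through:
print(looks_like_scraper(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", 2.0,
))  # -> False: indistinguishable from an ordinary visitor
```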
It'll vary by individual but, for most people, there's a very good chance that AI scrapers are viewed as the most significant threat. As noted above, AI companies have gone to huge (and reprehensible) lengths to try and hide their scraping activities.
Although not currently used to collect training data, various AI companies have even launched their own AI browsers. Presumably, at some point in the future (if/once they've seen sufficient adoption), those browsers' inputs will start being used for training, circumventing the protections put in place by the people who actually own the content.
Realistically, the world has had to move on from "detect and block" to "detect and poison".
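As a very rough sketch of what "detect and poison" can look like (purely illustrative, assuming some external classification step decides who gets poisoned; it isn't something the bot currently does):

```python
# A minimal "detect and poison" sketch. Assumes a separate classification step
# (e.g. something like the looks_like_scraper sketch above) has already run.
import random

FILLER_WORDS = ("lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "elit")

def poisoned_page(seed: int, paragraphs: int = 3) -> str:
    """Generate cheap, plausible-looking filler to feed suspected scrapers."""
    rng = random.Random(seed)  # deterministic per-path, so repeat visits match
    paras = (
        " ".join(rng.choices(FILLER_WORDS, k=40)).capitalize() + "."
        for _ in range(paragraphs)
    )
    return "<html><body>" + "".join(f"<p>{p}</p>" for p in paras) + "</body></html>"

def respond(path: str, is_suspected_scraper: bool, real_content: str) -> str:
    """Serve the real page to people, filler to suspected scrapers."""
    if is_suspected_scraper:
        return poisoned_page(seed=hash(path))
    return real_content
```

Dedicated tools exist for this sort of thing; the point of the sketch is just the shape of the approach: suspected scrapers receive plausible-looking junk rather than a block page.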
12-Jan-26 17:38
changed the description
12-Jan-26 17:39
Benefit
Although the bot is not without its benefits, detections are now relatively rare.
Recent detections have been sporadic. Prior to Jul 8, detections of Meta scrapers generated a lot of noise (addressed in #9), but the continued (if occasional) detections since then mean that there is still some benefit there.
12-Jan-26 17:48
assigned to @btasker
12-Jan-26 17:52
Having written all of that out, it's quite hard to make a case for keeping the bot online.
It's now only sporadically detecting scrapers, and the risk that it poses to others outweighs the tiny benefit that it's delivering.
Although I could, perhaps, make code changes to improve its effectiveness, we're not talking about minor tweaks: it's an incredibly hard problem to solve, and one that I don't really have the time/energy to do justice to (if I'm even capable of doing so).
12-Jan-26 18:00
OK, in which case, we need a plan.
Communication
Tidy Up
Return 410 - Gone (for receipts files)
Timeline
All else being good, I can probably start drafting a post tomorrow.