project-management-only/scraper-snitch-bot#5: Save State and regenerate Receipts



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.13
Created: 20-Jan-23 22:38



Description

Currently receipt files don't get regenerated/updated, so the "Last Seen" value is only accurate as of the time the alert fired.

It is possible to manually force a regeneration (by setting DRY_RUN to Y and removing the state files), but

  • That's a bit heavy-handed
  • It requires manual intervention
  • There is info in receipts that's correct at time of generation, and we don't want to recalculate that later

It's the final point that bothers me most: we don't want to incorrectly set flags because something changed in config.

This ticket is to record some of that state so that it can be used when a regeneration is run.



Activity


assigned to @btasker

The best place to start is probably to figure out what we want to record for future reference.

I think, in practice, there are two groups - "track" and "update on change" - where the latter is tracked, but also recalculated with any new items added to the tracked changes.

Update on Change

  • rDNS: might be useful/interesting to check if it changes over time. We want it to be current, but also don't want to lose useful information if a bot author realises a mistake and removes the PTR record
  • ASN: Useful to track history - if it changes, it might act as a prompt to check whether the bot is actually still active
  • Tor Exit node: Should track history
  • Last Seen: this should update automatically, but we should also track state - if we stop seeing requests (and expire old logs out), it'd be helpful to have a record of when it was last seen
  • Average number of daily requests
  • Observed useragents
  • Observed Paths
  • Flags

Track

  • First Seen: Want to make sure the earliest date remains the same

Realistically, "update on change" covers most of the fields. So, it probably makes as much sense to dump the entire receipt object as a state-file, and then worry about the logic above during regeneration.
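
As a rough illustration of the two groups above, the regeneration-time merge might look something like the sketch below. This is not the bot's actual code: the field names (first_seen, last_seen, rdns, asn, tor_exit, useragents, paths, flags, history) and the use of ISO-formatted date strings are assumptions made for the example, as is the choice to union old and new flags so that previously set flags are never dropped.

```python
def merge_state(previous, current):
    """Combine a saved state dict with a freshly calculated receipt dict.

    All key names here are hypothetical; dates are assumed to be ISO
    strings (YYYY-MM-DD), so min()/max() order them correctly.
    """
    if previous is None:
        # First time this bot has been seen: nothing to merge
        return {**current, "history": {}}

    # Start from the fresh values (e.g. average daily requests)
    merged = dict(current)

    # "Track": the earliest first-seen date must never move forward
    merged["first_seen"] = min(previous["first_seen"], current["first_seen"])

    # Last seen should survive old logs being expired
    merged["last_seen"] = max(previous["last_seen"], current["last_seen"])

    # List-like fields accumulate rather than being replaced
    for key in ("useragents", "paths", "flags"):
        combined = set(previous.get(key, [])) | set(current.get(key, []))
        merged[key] = sorted(combined)

    # Scalar fields take the current value, but the old value is kept
    # in history whenever it changes (e.g. a removed PTR record)
    history = {k: list(v) for k, v in previous.get("history", {}).items()}
    for key in ("rdns", "asn", "tor_exit"):
        old, new = previous.get(key), current.get(key)
        if old != new:
            history.setdefault(key, []).append(old)
    merged["history"] = history

    return merged
```

Dumping the whole receipt object means a merge like this can be changed later without having lost any of the underlying data.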

marked this issue as related to misc/python-mastodon-snitch-bot#1

The bot will now keep track of state.

misc/python-mastodon-snitch-bot#2 will track implementation of receipt refreshing/regeneration.

Will also need to come up with a solution for existing receipt files - there are few enough that it might just be a case of populating their state by hand, but we'll cross that bridge when we come to it.

Regeneration is now largely implemented.

ASN history will not currently be updated in the receipts file as it would mean special changes to handle the ipinfo link. Although AS changes are a potential signal, they're not one that most admins are likely to care about, so it seemed safe to defer this.

Regeneration will however record any AS changes in the internal state, so if we later want to reflect these changes in the receipt file we'll be able to show changes between now and then - the information isn't lost, it just isn't displayed in the receipt file.
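
To illustrate the idea of keeping AS changes in state without surfacing them in the receipt, here is a minimal sketch. The asn_history key, the function names and the assumption that the receipt keeps showing the first recorded ASN are all hypothetical, chosen only to show the shape of the approach.

```python
def record_asn(state, observed_asn, seen_on):
    """Append to the (hypothetical) asn_history list only when the ASN changes."""
    history = state.setdefault("asn_history", [])
    if not history or history[-1]["asn"] != observed_asn:
        history.append({"asn": observed_asn, "first_observed": seen_on})
    return state


def receipt_asn(state):
    """Assumption: the receipt keeps displaying the originally recorded ASN."""
    return state["asn_history"][0]["asn"]


state = {}
record_asn(state, "AS64500", "2023-01-20")
record_asn(state, "AS64500", "2023-01-21")  # unchanged, nothing recorded
record_asn(state, "AS64501", "2023-01-22")  # change is kept in state only
```

If the receipt format later grows a way to show AS changes, the full sequence is already available in the state.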

Tracking of all other items is implemented.

The next challenge, though, lies in populating state for all existing bots.

I had hoped that running the bot across a large time period would do the job. Unfortunately, that longer period allows other IPs to cross the reporting threshold, so we've gained about 50% more files.

So, we need to separate out those that have already crossed the threshold, so they can be updated and checked (the first generation date will be wrong, but can be pulled from the original receipts). The additional ones should be checked to see whether they point towards any necessary rule tweaks.

Took a listing of state off the bot's host

ls -1 bot_snitch/state/ | tee flist

Copied it to my test dir

ben@optimus:~/tmp/snitch/state$ mkdir flagged
ben@optimus:~/tmp/snitch/state$ while read -r l; do mv -- "$l" flagged/; done < flist
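
The same split could be sketched in Python, which also makes it easy to see which listed files were missing and which files are new. The function name and return shape are made up for this example, and it assumes the listing contains one filename per line.

```python
import shutil
from pathlib import Path


def split_flagged(state_dir, listing_file, flagged_name="flagged"):
    """Move files named in listing_file into state_dir/flagged_name.

    Returns (moved, missing, remaining): names that were moved, names in
    the listing that weren't found, and files left behind (the new ones).
    """
    state_dir = Path(state_dir)
    listing = Path(listing_file)
    flagged_dir = state_dir / flagged_name
    flagged_dir.mkdir(exist_ok=True)

    # One filename per line, ignoring blank lines
    wanted = {ln.strip() for ln in listing.read_text().splitlines() if ln.strip()}

    moved, missing = [], []
    for name in sorted(wanted):
        src = state_dir / name
        if src.is_file():
            shutil.move(str(src), str(flagged_dir / name))
            moved.append(name)
        else:
            missing.append(name)

    # Whatever is still in state_dir (other than the listing) is new
    remaining = sorted(
        p.name
        for p in state_dir.iterdir()
        if p.is_file() and p.resolve() != listing.resolve()
    )
    return moved, missing, remaining
```

The remaining list is the set of additional bots that only crossed the threshold because of the longer time period.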

Need to work through them now

Files have been manually corrected and deployed.

I think, given the manual work involved, it's best to cut the release now - the longer it's delayed, the more likely new bots will appear and need state manually generating.

A docker image has been built for v0.13 and deployed onto the host.

A wrapper has been created for it and scheduled in cron so that regenerations happen every 12 hours, at 10 past the hour (00:10 and 12:10).

10 */12 * * * /home/ben/docker_files/crons/bot_snitch_regenerate.sh
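
As a quick sanity check of that schedule, the fire times can be worked out by hand: */12 in the hour field matches hours 0 and 12, so the job runs at 00:10 and 12:10. The small sketch below hard-codes that interpretation (it is not a general cron parser) and uses naive datetimes, so it ignores DST changes.

```python
from datetime import datetime, timedelta


def next_runs(now, count=3):
    """Return the next `count` fire times for the cron spec '10 */12 * * *'."""
    runs = []
    # Start from today's 00:10 slot and step forward 12 hours at a time
    candidate = now.replace(hour=0, minute=10, second=0, microsecond=0)
    while len(runs) < count:
        if candidate > now:
            runs.append(candidate)
        candidate += timedelta(hours=12)
    return runs


upcoming = next_runs(datetime(2023, 1, 21, 9, 0))
```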

I've manually triggered it and run a diff of the receipt files to check there's nothing crazy happening.
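
For future runs, a check like this could be scripted. The sketch below compares a copy of the receipts taken before a regeneration with the directory afterwards, using Python's difflib. The directory layout and function names are assumptions for the example, not part of the bot.

```python
import difflib
from pathlib import Path


def _read_lines(path):
    """Return a file's lines, or an empty list if it doesn't exist."""
    return path.read_text().splitlines() if path.is_file() else []


def diff_receipts(before_dir, after_dir):
    """Return {filename: unified diff lines} for receipts that differ."""
    before_dir, after_dir = Path(before_dir), Path(after_dir)
    names = sorted(
        {p.name for p in before_dir.iterdir() if p.is_file()}
        | {p.name for p in after_dir.iterdir() if p.is_file()}
    )
    changes = {}
    for name in names:
        delta = list(
            difflib.unified_diff(
                _read_lines(before_dir / name),
                _read_lines(after_dir / name),
                fromfile=f"before/{name}",
                tofile=f"after/{name}",
                lineterm="",
            )
        )
        if delta:
            changes[name] = delta
    return changes
```

After a routine regeneration, most receipts would be expected to show only small changes such as an updated "Last Seen" line; anything larger is worth a closer look.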

changed title from Save State {-for Receipt Regeneration-} to Save State {+and regenerate Receipts+}

changed the description