Wiki: FediFetcher Information/project-management-only / Scraper Snitch Bot



Instances running the FediFetcher software started getting flagged in issue 8.

After the underlying issue was resolved, this page was created to provide an independent assessment of FediFetcher.


Background

FediFetcher is a Python script that describes itself as follows:

FediFetcher is a tool for Mastodon that automatically fetches missing replies and posts from other fediverse instances, and adds them to your own Mastodon instance.

The author wrote about it on their blog and announced its release on Mastodon.

Source code is shared at https://github.com/nanos/FediFetcher


Described Behaviour

The author posted a detailed explanation in response to questions/criticism.

On inspection, their explanation is borne out by the code (links below are to the most recent commit at time of writing):

At startup, it fetches information about any instances that it has seen before, then moves on to fetching the configured input sources (for example, the local user's lists).

For each configured source:

Depending on what's enabled in config, input sources might be


Behaviour Synopsis

Although it requests posts/toots, FediFetcher itself does not do anything with the content of those toots.

Instead, it discovers the URLs of replies to specific toots and then tells its local Mastodon instance to fetch them.

Mastodon then attempts to fetch the toot as it would any other.

So, any blocks (instance or user level) will still be honoured (because the Mastodon instance will be unable to fetch those toots).
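
The flow described above can be sketched roughly as follows. This is an illustrative sketch, not FediFetcher's actual code: the function names and the `*.example` hostnames are hypothetical, and it assumes the standard Mastodon REST endpoints for a status's context (`/api/v1/statuses/:id/context`) and for search with `resolve=true` (`/api/v2/search`). It only builds URLs and picks reply URLs out of a context response; it makes no network requests.

```python
from urllib.parse import urlencode


def context_url(remote_host: str, status_id: str) -> str:
    """URL on the remote instance listing the ancestors/descendants of a toot."""
    return f"https://{remote_host}/api/v1/statuses/{status_id}/context"


def reply_urls(context: dict) -> list:
    """Collect the URLs of replies (descendants) from a context response.

    Only the URL is kept; the content of each toot is ignored.
    """
    return [s["url"] for s in context.get("descendants", []) if s.get("url")]


def resolve_request_url(local_host: str, toot_url: str) -> str:
    """URL asking the local instance to look up (and so fetch) a remote toot."""
    query = urlencode({"q": toot_url, "resolve": "true"})
    return f"https://{local_host}/api/v2/search?{query}"


if __name__ == "__main__":
    # Hypothetical context response, trimmed to the fields used here.
    sample_context = {
        "ancestors": [],
        "descendants": [
            {"id": "2", "url": "https://carlos.example/@carlos/2", "content": "<p>hi</p>"},
        ],
    }
    print(context_url("alice.example", "1"))
    for url in reply_urls(sample_context):
        print(resolve_request_url("bob.example", url))
```

In this sketch the last step is an ordinary lookup request to the local instance, which is why the instance's own block handling still decides whether the toot is actually fetched.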

Additional notes:


Scenario: Normal Flow

Alice, Bob and Carlos are all on different instances. Bob follows Alice, but does not follow Carlos (who also follows Alice).

When Alice toots something interesting:

At this point:

After FediFetcher has run, though:


Scenario: Blocked User

Alice, Bob and Mallory are all on different instances. Mallory follows Alice. Bob follows Alice and has blocked Mallory.

When Alice toots something interesting:

At this point:

The situation does not change after FediFetcher has run:

It's open to debate, but some might argue that this is, in fact, an improvement. If Bob were reliant on visiting Alice's profile (on Alice's instance), they would see Mallory's reply. With FediFetcher, they will not.


Admins: Restricting FediFetcher

At time of writing, FediFetcher appears to honour both the Allow and Disallow directives in robots.txt.

So, it should be possible to disallow or restrict access:

Disallow everything

User-agent: *
Disallow: /

Disallow only FediFetcher

User-agent: FediFetcher
Disallow: /

Allow FediFetcher access to a specific user's statuses only

User-agent: FediFetcher
Allow: /users/ben/statuses/
Disallow: /
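
One way to sanity-check rules like these before deploying them is to evaluate them locally. The sketch below uses Python's standard `urllib.robotparser`; whether FediFetcher evaluates rules in exactly the same way is an assumption, as are the `mastodon.example` hostname, the `alice` account and the plain `FediFetcher` user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the "specific user only" example above.
ROBOTS_TXT = """\
User-agent: FediFetcher
Allow: /users/ben/statuses/
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://mastodon.example/users/ben/statuses/1",    # covered by Allow
        "https://mastodon.example/users/alice/statuses/1",  # falls through to Disallow
    ):
        print(url, is_allowed(ROBOTS_TXT, "FediFetcher", url))
```

Python's parser applies the first matching rule within a group, so in this sketch the Allow line has to appear before `Disallow: /` for the exception to take effect.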

Users: Restricting FediFetcher

FediFetcher appears to support a selection of profile options:


Admins: Preventing Local Use

Admins who do not want their users to run FediFetcher themselves should be aware that the tool also applies robots.txt to the API requests it makes to the local instance:

$ kubectl logs -f job.batch/fedifetcher-run-1
2024-08-18 14:33:41 UTC: Starting FediFetcher
2024-08-18 14:33:41 UTC: Getting context for home timeline
2024-08-18 14:33:41 UTC: Error getting timeline toots: Querying https://mastodon.bentasker.co.uk/api/v1/timelines/home prohibited by robots.txt
2024-08-18 14:33:41 UTC: Job failed after 0:00:00.158276.
Traceback (most recent call last):
  File "/app/find_posts.py", line 1689, in <module>
    timeline_toots = get_timeline(arguments.server, token, arguments.home_timeline_length)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/find_posts.py", line 362, in get_timeline
    response = get_toots(url, access_token)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/find_posts.py", line 394, in get_toots
    response = get( url, headers={
               ^^^^^^^^^^^^^^^^^^^
  File "/app/find_posts.py", line 1142, in get
    raise Exception(f"Querying {url} prohibited by robots.txt")    
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: Querying https://mastodon.bentasker.co.uk/api/v1/timelines/home prohibited by robots.txt

So, robots.txt can also be used by an instance admin to prevent their users from running their own instances of the tool.
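
The behaviour seen in the log can be approximated with a small guard around each request. This is a sketch of the general pattern only, not FediFetcher's own `get` function: `guarded_get`, `RobotsDisallowed`, the `fetch` callback and the `mastodon.example` hostname are all hypothetical names, and the error text simply mirrors the message in the log above.

```python
from urllib.robotparser import RobotFileParser


class RobotsDisallowed(Exception):
    """Hypothetical error type; the log above shows a plain Exception."""


def guarded_get(url, robots_txt, fetch, user_agent="FediFetcher"):
    """Call fetch(url) only if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        raise RobotsDisallowed(f"Querying {url} prohibited by robots.txt")
    return fetch(url)


if __name__ == "__main__":
    deny_all = "User-agent: *\nDisallow: /\n"
    url = "https://mastodon.example/api/v1/timelines/home"
    try:
        # The fetch callback is never reached because the check fails first.
        guarded_get(url, deny_all, fetch=lambda u: "response")
    except RobotsDisallowed as err:
        print(err)
```

Because the check runs before any request is made, a restrictive robots.txt on the local instance stops the run at its first API call, as in the log.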

Conversely, those admins who aren't concerned about its use should be aware that pre-existing robots.txt rules (specifically, Disallow: *) may also prevent their users from using the tool.

If you want to let users run it to fetch replies for posts in their timeline, but not allow other instances to run it against you, you might do something like:

User-agent: *
Disallow: /

User-agent: FediFetcher
Allow: /api/v1/timelines/home

Note: if everyone did this, the tool wouldn't be much use.


Possible Performance Impacts

FediFetcher's mode of operation means that remote instances may see increased load, especially if FediFetcher is working through a very active thread.

There are 3 associated load profiles

At small scale, this is unlikely to translate to much load/impact. However, load is likely to increase where there are particularly active threads with a lot of replies coming from one instance (whether that's lots of users, or one user of that instance).