Elon Musk has bought Twitter and it's currently seeing quite a few users leave.
I'm not ready to leave (yet), but this whole thing has made me think again about my Twitter feed.
I've been on Twitter since March 2010 and likely have thousands of tweets.
Most (if not all) of those tweets are probably quite low-value, but I'd like the ability to archive them for future reference.
More importantly, I'd be interested to see how my tweeting habits have changed over time - there was a time when I used Twitter quite a lot, so it'd be interesting to chart out trends (tweets per day, %age of re-tweets etc)
Activity
27-Apr-22 07:40
assigned to @btasker
27-Apr-22 07:40
moved from project-management-only/staging#1
27-Apr-22 07:41
The first challenge is going to be actually getting the tweets - apparently the API will only allow you to fetch the latest 3200 tweets for a user.
This project claims to be able to get more, but requires Mono... Might be able to dockerise it though
27-Apr-22 08:00
mentioned in commit utilities/docker-twitter-dump@70c55b997411986d3200896dd21d0f8715431500
Message
Create Dockerfile to build image for project-management-only/staging#1
This makes the tooling available (we can now run `twitter-dump`), but need to create a workflow that allows it to actually be used without having to constantly `exec` into a preconfigured image.

27-Apr-22 10:31
Auth setup on the container works, but unfortunately it looks like the utility no longer works (to be fair, the most recent commit was like 9 months ago and it relies on an undocumented API).
Running the query results in
It's not entirely clear why it's failing - if I `curl` the URL it mentions (with all the boilerplate taken from dev tools) I get valid JSON back.

It looks like the code has to walk pagination though, so a simple curl isn't going to get everything
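For illustration, walking cursor-based pagination generally looks something like the sketch below - the endpoint shape, parameter and key names here are hypothetical, not the real (undocumented) `adaptive.json` schema:

```python
import requests

# Hypothetical example of walking a cursor-paginated endpoint - this just
# illustrates why a single curl won't return the full result set.
def fetch_all(url, headers, params):
    results = []
    cursor = None
    while True:
        page_params = dict(params)
        if cursor:
            page_params["cursor"] = cursor  # assumed parameter name
        resp = requests.get(url, headers=headers, params=page_params)
        resp.raise_for_status()
        body = resp.json()
        results.extend(body.get("tweets", []))  # assumed response key
        cursor = body.get("next_cursor")        # assumed response key
        if not cursor:
            break
    return results
```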
27-Apr-22 11:15
I wondered if perhaps the issue was because `copy as curl` includes `gzip` in `Accept-Encoding`, so I stripped that out of the value copied from `adaptive.json` and it's working this time.

Process:

Once complete (12774 tweets!)

When I was prompted to paste into the terminal, I first pasted into a text editor and stripped `gzip` from the `Accept-Encoding` header in the generated `curl`.

Maybe the first run was bad luck though, I'll have to retry and see whether I can repro
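For reference, that manual edit is simple enough to script - a throwaway helper (not part of `twitter-dump`) that strips `gzip` out of the `Accept-Encoding` header in a copied `curl` command:

```python
import re

def strip_gzip(curl_cmd: str) -> str:
    """Remove gzip from any Accept-Encoding header in a "copy as cURL" string."""
    # Drop 'gzip' (and any trailing comma/space) from the header value
    cmd = re.sub(r"(Accept-Encoding:\s*)gzip,?\s*", r"\1", curl_cmd, flags=re.IGNORECASE)
    # Tidy up a now-empty header, e.g. -H 'Accept-Encoding: '
    return re.sub(r"\s+-H\s+'Accept-Encoding:\s*'", "", cmd)

print(strip_gzip("curl 'https://example.invalid' -H 'Accept-Encoding: gzip, deflate, br'"))
```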
27-Apr-22 16:15
Now have a JSON file with the format
Observations:

- Retweets don't appear to be included
- Links are shortened to `t.co` URLs

Both impact some of the stats I was interested in pulling out.
It looks like a specific search is needed for retweets (https://webapps.stackexchange.com/questions/92616/use-twitter-advanced-search-to-find-retweets-made-by-a-single-account) so we might still be able to extract those.
Rewriting the `t.co` URLs just means requesting the URL and seeing where it redirects to.
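A minimal sketch of that approach, assuming the `requests` library (in practice you'd want caching and some rate limiting on top):

```python
import requests

def resolve_tco(short_url: str, timeout: int = 10) -> str:
    """Follow a t.co shortlink and return the URL it ultimately redirects to."""
    # A HEAD request is usually enough to trigger the redirect without
    # fetching the page body
    resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
    return resp.url

# e.g. resolve_tco("https://t.co/xxxxxxx") -> the original expanded URL
```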
28-Apr-22 07:23

mentioned in commit misc/python_build_tweets_db@479f94e072171c95689e37740986ea2d67c08a83
Message
Create script to read a twitter dump and insert it into InfluxDB for project-management-only/staging#1
28-Apr-22 07:25
We now have basic data in the DB and can query it out
Currently though, there's no link handling and we've not yet handled retweets
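For example, one of the stats from the description - tweets per day - comes out of a simple aggregation. A sketch using the `influxdb-client` Python library; the connection details, bucket name and measurement name are placeholders rather than whatever the script actually uses:

```python
from influxdb_client import InfluxDBClient

# Connection details are placeholders
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# Count tweets per day - bucket/measurement names are assumed
flux = '''
from(bucket: "twitter")
  |> range(start: 2010-03-01T00:00:00Z)
  |> filter(fn: (r) => r._measurement == "tweets" and r._field == "tweet_text")
  |> aggregateWindow(every: 1d, fn: count)
'''

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())
```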
28-Apr-22 08:12
Looks like there's no way to reliably get all retweets for a user.
Using
Gets me a couple of weeks
Using
Gets a long list of tweets back, but about the same number of re-tweets.

Would have been nice to include, but I'm not that bothered, so I'll skip over it for the time being
29-Apr-22 08:14
The script now does a bit of additional identification:
Tags:

- `id`
- `user_id`
- `user_handle`
- `contains_links`
- `has_mentions`
- `has_image`
- `has_swear`
- `has_hashtag`

Fields:

- `url`
- `tweet_text`
- `num_links`
- `num_mentions`
- `mentions`
- `num_images`
- `num_swear`
- `swear_words`
- `matched_swears`
- `num_words`
- `num_hashtags`
- `hashtags`
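As a rough illustration of the resulting schema, a single tweet maps onto a point along these lines - the measurement name (`tweets`) and all of the values are placeholders for illustration; only the tag/field names come from the list above:

```python
from influxdb_client import Point

# Hypothetical example point - measurement name and values are made up
point = (
    Point("tweets")
    .tag("id", "1519000000000000000")
    .tag("user_id", "12345678")
    .tag("user_handle", "example_user")
    .tag("contains_links", "False")
    .tag("has_mentions", "False")
    .tag("has_image", "False")
    .tag("has_swear", "False")
    .tag("has_hashtag", "True")
    .field("url", "https://twitter.com/example_user/status/1519000000000000000")
    .field("tweet_text", "Example tweet #example")
    .field("num_links", 0)
    .field("num_mentions", 0)
    .field("num_images", 0)
    .field("num_swear", 0)
    .field("num_words", 3)
    .field("num_hashtags", 1)
    .field("hashtags", "example")
    # ...plus the remaining fields listed above
    .time("2022-04-27T07:41:00Z")
)
```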
02-May-22 11:18

So, one of the things I'm curious to look at is the rate of derivative swear words - i.e. a base word is found but has been expanded into some other word.
So, I want to do something like the following
However, that fails with
In order to do this, need to use `display()`.

Although, we don't actually need/want the display version, we just want to generate stats based on its content.
So, this Flux

Does the following:

- searches for `searchword` in the field `swear_words`
- sets `direct_use` to `true` if so
- groups results by the `direct_use` value

This allows us to see how often a word is used, and whether it's used on its own or used to form a derivative (i.e. `shit` vs `shithousery`).

With the benefit of hindsight, I could actually have searched out swear words using Flux rather than pre-processing it in Python. I may give that a try in a bit anyway as an exercise.
17-Jun-22 18:37
Closing this out as I've not really played around with it much since, and I'm probably not likely to.
03-Nov-22 16:58
mentioned in commit utilities/docker-twitter-dump@39ca24ff0773df1d87dd21c5ece649de7c31bc22
Message
This was originally built for jira-projects/MISC#5
Given recent developments on Twitter though, polishing it off so that it can be used to archive tweet history.