project jira-projects / Miscellaneous

jira-projects/MISC#5: Mirror twitter into InfluxDB



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 27-Apr-22 07:40



Description

Elon Musk has bought Twitter and it's currently seeing quite a few users leave.

I'm not ready to leave (yet), but this whole thing has made me think again about my Twitter feed.

I've been on Twitter since March 2010 and likely have thousands of tweets.

Most (if not all) of those tweets are probably quite low-value, but I'd like the ability to archive them for future reference.

More importantly, I'd be interested to see how my tweeting habits have changed over time - there was a time when I used Twitter quite a lot, so it'd be interesting to chart out trends (tweets per day, percentage of retweets, etc.)




Activity


assigned to @btasker

The first challenge is going to be actually getting the tweets - apparently the API will only allow you to fetch the latest 3200 tweets for a user.

This project claims to be able to get more, but requires Mono... Might be able to dockerise it though


mentioned in commit utilities/docker-twitter-dump@70c55b997411986d3200896dd21d0f8715431500

Commit: utilities/docker-twitter-dump@70c55b997411986d3200896dd21d0f8715431500
Author: B Tasker
Date: 2022-04-27T08:58:50.000+01:00

Message

Create Dockerfile to build image for project-management-only/staging#1

This makes the tooling available (we can now run twitter-dump), but need to create a workflow that allows it to actually be used without having to constantly exec into a preconfigured image.

+25 -0 (25 lines changed)

Auth setup on the container works, but unfortunately it looks like the utility no longer works (to be fair, the most recent commit was like 9 months ago and it relies on an undocumented API).

Running the query results in

twitterdump@1667b6370a15:/$ twitter-dump search -q "(from:bentasker)" -o bentasker.json
[10:23:55 INF] Requesting https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_composer_source=true&include_ext_alt_text=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&tweet_search_mode=live&count=100&query_source=typed_query&pc=1&spelling_corrections=1&ext=mediaStats,highlightedLabel,cameraMoment&q=(from%3Abentasker)..
Unhandled exception. Newtonsoft.Json.JsonReaderException: Unexpected character encountered while parsing value. Path '', line 0, position 0.
   at Newtonsoft.Json.JsonTextReader.ParseValue()
   at Newtonsoft.Json.JsonReader.ReadAndMoveToContent()
   at Newtonsoft.Json.JsonReader.ReadForType(JsonContract contract, Boolean hasConverter)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
   at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonConvert.DeserializeObject(String value, Type type, JsonSerializerSettings settings)
   at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value, JsonSerializerSettings settings)
   at TwitterDump.Twitter.GetTweets(String url, Dictionary`2 headers, String& nextCursor) in /home/pknopf/git/twitter-dump/src/TwitterDump/Twitter.cs:line 128
   at TwitterDump.Twitter.SearchTweets(String query, Dictionary`2 headers, Nullable`1 pageSize, Nullable`1 maxResults) in /home/pknopf/git/twitter-dump/src/TwitterDump/Twitter.cs:line 60
   at TwitterDump.Program.Search(SearchOptions options) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 65
   at TwitterDump.Program.<>c.<Main>b__0_0(SearchOptions opts) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 32
   at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
   at TwitterDump.Program.Main(String[] args) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 30
Aborted (core dumped)

It's not entirely clear why it's failing - if I curl the URL it mentions (with all the boilerplate taken from dev tools) I get valid JSON back.

It looks like the code has to walk pagination though, so a simple curl isn't going to get everything

I wondered if perhaps the issue was that "Copy as cURL" includes gzip in Accept-Encoding, so I stripped that out of the value copied from the adaptive.json request, and it's working this time.

Process:

# Start a shell in the built container image
docker run -it b2060dd5aa15 sh

# Authenticate - this prompts for a "Copy as cURL" value from dev tools (see note below)
twitter-dump auth

# Dump all tweets matching the search query out to JSON
twitter-dump search -q "(from:bentasker)" -o bentasker.json

Once complete (12,774 tweets!), the dump can be copied out of the container:

ben@optimus:~/tmp$ docker cp 6440e89f87c6:/home/twitterdump/bentasker.json ./

When I was prompted to paste into the terminal, I first pasted into a text editor and stripped

-H 'Accept-Encoding: gzip, deflate, br'

from the generated curl.

Maybe the first run was just bad luck though; I'll have to retry and see whether I can reproduce it.

We now have a JSON file with the following format:

{
  "query": "(from:bentasker)",
  "tweets": [
    {
      "url": "https://twitter.com/bentasker/status/1519252555716513793",
      "id": 1519252555716513793,
      "created_at": "2022-04-27T09:50:10+00:00",
      "full_text": "@neil_neilzone I liked Gentoo - wasn't my starter though (that was Mandrake).\n\nI don't have the time/willingness for a Gentoo install nowadays though.",
      "user_id": 124810735
    },
    ...
  ],
  "users": [
    {
      "id": 124810735,
      "name": "Ben Tasker",
      "screen_name": "bentasker"
    },
    ...
  ]
}

Observations:

  • it doesn't include retweets
  • Links in tweets are all rewritten to t.co URLs

Both impact some of the stats I was interested in pulling out.

It looks like a specific search is needed for retweets (https://webapps.stackexchange.com/questions/92616/use-twitter-advanced-search-to-find-retweets-made-by-a-single-account) so we might still be able to extract those.

Rewriting the t.co URLs just means requesting each URL and seeing where it redirects to.
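A minimal sketch of what that might look like in Python (the helper name and the use of the requests library are my own choice, and the short link below is a placeholder):

import requests

def resolve_tco(url):
    # t.co answers a GET with a redirect, so let requests follow the
    # chain and report where we ended up
    try:
        r = requests.get(url, timeout=10, allow_redirects=True)
        return r.url
    except requests.RequestException:
        # On failure, fall back to the original short link
        return url

print(resolve_tco("https://t.co/example"))  # placeholder short link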


mentioned in commit misc/python_build_tweets_db@479f94e072171c95689e37740986ea2d67c08a83

Commit: misc/python_build_tweets_db@479f94e072171c95689e37740986ea2d67c08a83
Author: B Tasker
Date: 2022-04-27T17:50:44.000+01:00

Message

Create script to read a twitter dump and insert it into InfluxDB for project-management-only/staging#1

+54 -0 (54 lines changed)
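The script itself isn't reproduced here, but its general shape is simple. A rough sketch, assuming the influxdb-client Python package, the dump format above, and placeholder connection details:

import json
from datetime import datetime
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

with open("bentasker.json") as f:
    dump = json.load(f)

# Map user ids to handles so each point can be tagged with the author's handle
users = {u["id"]: u["screen_name"] for u in dump["users"]}

for tweet in dump["tweets"]:
    point = (
        Point("tweets")
        .tag("id", str(tweet["id"]))
        .tag("user_id", str(tweet["user_id"]))
        .tag("user_handle", users.get(tweet["user_id"], "unknown"))
        .field("url", tweet["url"])
        .field("tweet_text", tweet["full_text"])
        .time(datetime.fromisoformat(tweet["created_at"]))
    )
    write_api.write(bucket="mybucket", record=point)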

We now have basic data in the DB and can query it out:

from(bucket: "mybucket")
  |> range(start: 2009-01-12T00:00:00Z)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "tweet_text")
  |> group()
  |> aggregateWindow(every: 1mo, fn: count, createEmpty: true)

[Screenshot: Screenshot_20220428_082236]

Currently though, there's no link handling and we've not yet handled retweets

Looks like there's no way to reliably get all retweets for a user.

Using

from:bentasker include:nativeretweets filter:nativeretweets

Gets me a couple of weeks' worth of retweets.

Using

from:bentasker include:nativeretweets include:retweets

Gets a long list of tweets back, but about the same number of retweets as before.

Would have been nice to include but I'm not that bothered, so I'll skip over it for the time being

The script now does a bit of additional identification (a rough Python sketch of the idea follows the field list below):

Tags

  • Tweet id (tag: id)
  • user id (tag: user_id)
  • user handle (tag: user_handle)
  • Does the tweet contain links (tag: contains_links)
  • Does the tweet contain mentions (tag: has_mentions)
  • Does the tweet contain images (tag: has_image)
  • Does the tweet contain swearing (tag: has_swear)
  • Does the tweet contain hashtags (tag: has_hashtag)

Fields

  • Tweet URL (field: url)
  • Tweet Text (field: tweet_text)
  • How many links (field: num_links)
  • How many mentions (field: num_mentions)
  • CSV who did I mention (field: mentions)
  • Number of images (field: num_images)
  • Number of swear words (field: num_swear)
  • CSV What did I say? (field: swear_words)
  • CSV What did it match? (field: matched_swears)
  • Number of words in tweet (field: num_words)
  • Number of hashtags in tweet (field: num_hashtags)
  • CSV of hashtags in tweet (field: hashtags)
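As an illustration, the classification step might look something like this in Python (the word list and helper name are mine, not the script's, and only a subset of the fields above is shown):

import re

# Hypothetical word list - the real script's list isn't reproduced here
SWEAR_WORDS = ["shit", "crap"]

def classify_tweet(text):
    # Derive the has_*/num_* style attributes described above
    links = re.findall(r"https?://\S+", text)
    mentions = re.findall(r"@(\w+)", text)
    hashtags = re.findall(r"#(\w+)", text)
    words = text.split()

    # A word matches if any base swear word appears within it, which
    # also catches derivatives (e.g. shit -> shithousery)
    swears = [w for w in words if any(s in w.lower() for s in SWEAR_WORDS)]

    return {
        "contains_links": str(bool(links)),
        "num_links": len(links),
        "has_mentions": str(bool(mentions)),
        "num_mentions": len(mentions),
        "mentions": ",".join(mentions),
        "has_hashtag": str(bool(hashtags)),
        "num_hashtags": len(hashtags),
        "hashtags": ",".join(hashtags),
        "has_swear": str(bool(swears)),
        "num_swear": len(swears),
        "swear_words": ",".join(swears),
        "num_words": len(words),
    }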

So, one of the things I'm curious to look at is the rate of derivative swear words - i.e. a base word is found but has been expanded into some other word.

So, I want to do something like the following

import "strings"
searchword = "shit"

from(bucket: "mybucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "swear_words")
  |> filter(fn: (r) => r["has_swear"] == "True")

  // Filter down to our base word
  |> filter(fn: (r) => strings.containsStr(v: r["_value"], substr: searchword))

  // Split out the words used
  |> map(fn: (r) => ({r with words: strings.split(v: r["_value"], t: ",")}))

However, that fails with

runtime error @14:6-14:77: map: map object property "words" is array type which is not supported in a flux table

In order to do this, we need to use display():

import "strings"
searchword = "shit"

from(bucket: "mybucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "swear_words")
  |> filter(fn: (r) => r["has_swear"] == "True")

  // Filter down to our base word
  |> filter(fn: (r) => strings.containsStr(v: r["_value"], substr: searchword))

  // Split out the words used
  |> map(fn: (r) => ({r with words: display(v: strings.split(v: r["_value"], t: ","))}))

Although, we don't actually need/want the display version; we just want to generate stats based on its content.

So, this Flux

import "strings"

searchword = "shit"

from(bucket: "mybucket")
  |> range(start: 2010-01-01T00:00:00Z)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "swear_words")
  |> filter(fn: (r) => r["has_swear"] == "True")

  // Filter down to our base word
  |> filter(fn: (r) => strings.containsStr(v: r["_value"], substr: searchword))

  // Identify whether the exact searchword was used, or whether we've detected a derivative
  |> map(fn: (r) => ({r with 
        direct_use: if contains(value: searchword, set: strings.split(v: r["_value"], t: ","))
        then
           "true"
        else
           "false"
  }))

  // Group on our new dimension
  |> group(columns: ["direct_use"])

  // Count per week
  |> aggregateWindow(every: 1w, fn: count, createEmpty: true)

Does the following

  • Filter records down to those that contain searchword in the field swear_words
  • Explode that CSV out into an array
  • Check whether the array contains a direct match and set direct_use to true if so
  • Group records based on their direct_use value
  • Generate a weekly count of usage for each group

This allows us to see how often a word is used, and whether it's used on its own or used to form a derivative (e.g. shit vs shithousery)

[Screenshot: Screenshot_20220502_121710]

With the benefit of hindsight, I could actually have searched out swear words using Flux rather than pre-processing it in Python. I may give that a try in a bit anyway as an exercise.

Closing this out as I've not really played around with it much since, and I'm probably not likely to.


mentioned in commit utilities/docker-twitter-dump@39ca24ff0773df1d87dd21c5ece649de7c31bc22

Commit: utilities/docker-twitter-dump@39ca24ff0773df1d87dd21c5ece649de7c31bc22
Author: B Tasker
Date: 2022-11-03T16:57:29.000+00:00

Message

This was originally built for jira-projects/MISC#5

Given recent developments on Twitter though, polishing it off so that it can be used to archive tweet history.

+58 -1 (59 lines changed)