#5 Mirror twitter into InfluxDB : jira-projects/MISC#5

btasker Permalink
27-Apr-22 07:40

assigned to @btasker

btasker Permalink
27-Apr-22 07:40

assigned to @btasker

btasker Permalink
27-Apr-22 07:41

The first challenge is going to be actually getting the tweets - apparently the API will only allow you to fetch the latest 3200 tweets for a user.

This project claims to be able to get more, but requires Mono... Might be able to dockerise it though

btasker Permalink
27-Apr-22 08:00

verified

mentioned in commit utilities/docker-twitter-dump@70c55b997411986d3200896dd21d0f8715431500

Commit: utilities/docker-twitter-dump@70c55b997411986d3200896dd21d0f8715431500 
Author: B Tasker                            
                            
Date: 2022-04-27T08:58:50.000+01:00

Message

Create Dockerfile to build image for project-management-only/staging#1

This makes the tooling available (we can now run twitter-dump), but need to create a workflow that allows it to actually be used without having to constantly exec into a preconfigured image.

+25 -0 (25 lines changed)

btasker Permalink
27-Apr-22 10:31

Auth setup on the container works, but unfortunately it looks like the utility no longer works (to be fair, the most recent commit was like 9 months ago and it relies on an undocumented API).

Running the query results in

twitterdump@1667b6370a15:/$ twitter-dump search -q "(from:bentasker)" -o bentasker.json
[10:23:55 INF] Requesting https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_composer_source=true&include_ext_alt_text=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&tweet_search_mode=live&count=100&query_source=typed_query&pc=1&spelling_corrections=1&ext=mediaStats,highlightedLabel,cameraMoment&q=(from%3Abentasker)..
Unhandled exception. Newtonsoft.Json.JsonReaderException: Unexpected character encountered while parsing value: ath '', line 0, position 0.
   at Newtonsoft.Json.JsonTextReader.ParseValue()
   at Newtonsoft.Json.JsonReader.ReadAndMoveToContent()
   at Newtonsoft.Json.JsonReader.ReadForType(JsonContract contract, Boolean hasConverter)
   at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
   at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonSerializer.Deserialize(JsonReader reader, Type objectType)
   at Newtonsoft.Json.JsonConvert.DeserializeObject(String value, Type type, JsonSerializerSettings settings)
   at Newtonsoft.Json.JsonConvert.DeserializeObject[T](String value, JsonSerializerSettings settings)
   at TwitterDump.Twitter.GetTweets(String url, Dictionary`2 headers, String& nextCursor) in /home/pknopf/git/twitter-dump/src/TwitterDump/Twitter.cs:line 128
   at TwitterDump.Twitter.SearchTweets(String query, Dictionary`2 headers, Nullable`1 pageSize, Nullable`1 maxResults) in /home/pknopf/git/twitter-dump/src/TwitterDump/Twitter.cs:line 60
   at TwitterDump.Program.Search(SearchOptions options) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 65
   at TwitterDump.Program.<>c.<Main>b__0_0(SearchOptions opts) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 32
   at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
   at TwitterDump.Program.Main(String[] args) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 30
Aborted (core dumped)

It's not entirely clear why it's failing - if curl the URL it mentions (with all the boilerplate taken from dev tools) I get value JSON back.

It looks like the code has to walk pagination though, so a simple curl isn't going to get everything

btasker Permalink
27-Apr-22 11:15

I wondered if perhaps the issue was because copy as curl includes gzip in Accept-Encoding so I stripped that out of the value copied from adaptive.json and it's working this time.

Process:

docker run -it b2060dd5aa15 sh
twitter-dump auth
# See note below
twitter-dump search -q "(from:bentasker)" -o bentasker.json

Once complete (12774 tweets!)

ben@optimus:~/tmp$ docker cp 6440e89f87c6:/home/twitterdump/bentasker.json ./

When I was prompted to paste into the terminal, I first pasted into a text editor and stripped

-H 'Accept-Encoding: gzip, deflate, br'

from the generated curl.

Maybe the first run was bad luck though, I'll have to retry and see whether I can repro

btasker Permalink
27-Apr-22 16:15

Now have a JSON file with the format

{
  "query": "(from:bentasker)",
  "tweets": [
    {
      "url": "https://twitter.com/bentasker/status/1519252555716513793",
      "id": 1519252555716513793,
      "created_at": "2022-04-27T09:50:10+00:00",
      "full_text": "@neil_neilzone I liked Gentoo - wasn't my starter though (that was Mandrake).\n\nI don't have the time/willingness for a Gentoo install nowadays though.",
      "user_id": 124810735
    },
   ....
  ],
  "users": [
    {
      "id": 124810735,
      "name": "Ben Tasker",
      "screen_name": "bentasker"
    },
    ....
  ]

Observations:

it doesn't include retweets
Links in tweets are all rewritten to t.co URLs

Both impact some of the stats I was interested in pulling out.

It looks like a specific search is needed for retweets (https://webapps.stackexchange.com/questions/92616/use-twitter-advanced-search-to-find-retweets-made-by-a-single-account) so we might still be able to extract those.

Rewriting the t.co URL's just means requesting the URL and seeing where it redirects to

btasker Permalink
28-Apr-22 07:23

verified

mentioned in commit misc/python_build_tweets_db@479f94e072171c95689e37740986ea2d67c08a83

Commit: misc/python_build_tweets_db@479f94e072171c95689e37740986ea2d67c08a83 
Author: B Tasker                            
                            
Date: 2022-04-27T17:50:44.000+01:00

Message

Create script to read a twitter dump and insert it into InfluxDB for project-management-only/staging#1

+54 -0 (54 lines changed)

btasker Permalink
28-Apr-22 07:25

We now have basic data in the DB and can query it out

from(bucket: "mybucket")
  |> range(start: 2009-01-12T00:00:00Z)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "tweet_text")
  |> group()
  |> aggregateWindow(every: 1mo, fn: count, createEmpty: true)

Screenshot_20220428_082236

Currently though, there's no link handling and we've not yet handled retweets

btasker Permalink
28-Apr-22 08:12

Looks like there's no way to reliably get all retweets for a user.

Using

from:bentasker include:nativeretweets filter:nativeretweets

Gets me a couple of weeks

Using

from:bentasker include:nativeretweets include:retweets

Gets a long list of tweets back, but about the same amount of re-tweets.

Would have been nice to include but I'm not that bothered, so I'll skip over it for the time being

btasker Permalink
29-Apr-22 08:14

The script now does a bit of additional identification:

Tags

Tweet id (tag: id)
user id (tag: user_id)
user handle (tag: user_handle)
Does the tweet contain links (tag: contains_links)
Does the tweet contain mentions (tag: has_mentions)
Does the tweet contain images (tag: has_image)
Does the tweet contain swearing (tag: has_swear)
Does the tweet contain hashtags (tag: has_hashtag)

Fields

Tweet URL (field: url)
Tweet Text (field: tweet_text)
How many links (field: num_links)
How many mentions (field: num_mentions)
CSV who did I mention (field: mentions)
Number of images (field: num_images)
Number of swear words (field: num_swear)
CSV What did I say? (field: swear_words)
CSV What did it match? (field: matched_swears)
Number of words in tweet (field: num_words)
Number of hashtags in tweet (field: num_hashtags)
CSV of hashtags in tweet (field: hashtags)

btasker Permalink
02-May-22 11:18

So, one of the things I'm curious to look at is the rate of derivative swear words - i.e. a base word is found but has been expanded into some other word.

So, I want to do something like the following

import "strings"
searchword = "shit"

from(bucket: "mybucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "swear_words")
  |> filter(fn: (r) => r["has_swear"] == "True")

  // Filter down to our base word
  |> filter(fn: (r) => strings.containsStr(v: r["_value"], substr: searchword))

  // Split out the words used
  |> map(fn: (r) => ({r with words: strings.split(v: r["_value"], t: ",")}))

However, that fails with

runtime error @14:6-14:77: map: map object property "words" is array type which is not supported in a flux table

In order to do this, need to use display()

import "strings"
searchword = "shit"

from(bucket: "mybucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "swear_words")
  |> filter(fn: (r) => r["has_swear"] == "True")

  // Filter down to our base word
  |> filter(fn: (r) => strings.containsStr(v: r["_value"], substr: searchword))

  // Split out the words used
  |> map(fn: (r) => ({r with words: display(v: strings.split(v: r["_value"], t: ","))}))

Although, we don't actually need/want the display version, we just want to generate stats based on it's content.

So, this Flux

import "strings"

searchword = "shit"

from(bucket: "mybucket")
  |> range(start: 2010-01-01T00:00:00Z)
  |> filter(fn: (r) => r["_measurement"] == "tweets")
  |> filter(fn: (r) => r["_field"] == "swear_words")
  |> filter(fn: (r) => r["has_swear"] == "True")

  // Filter down to our base word
  |> filter(fn: (r) => strings.containsStr(v: r["_value"], substr: searchword))

  // Identify whether the exact searchword was used, or whether we've detected a derivative
  |> map(fn: (r) => ({r with 
        direct_use: if contains(value: searchword, set: strings.split(v: r["_value"], t: "-"))
        then
           "true"
        else
           "false"
  }))

  // Group on our new dimension
  |> group(columns: ["direct_use"])

  // Count per week
  |> aggregateWindow(every: 1w, fn: count, createEmpty: true)

Does the following

Filter records down to those that contain searchword in the field swear_words
Explode that CSV out into an array
Check whether the array contains a direct match and set direct_use to true if so
Group records based on their direct_use value
Generate a weekly count of usage for each group

This allows us to see how often a word is used, and whether it's used on it's own or used to form a derivative (i.e. shit vs shithousery)

Screenshot_20220502_121710

With the benefit of hindsight, I could actually have searched out swear words using Flux rather than pre-processing it in Python. I may give that a try in a bit anyway as an exercise.

btasker Permalink
17-Jun-22 18:37

Closing this out as I've not really played around with it much since, and I'm probably not likely to.

btasker Permalink
27-Apr-22 07:40

moved from project-management-only/staging#1

btasker Permalink
03-Nov-22 16:58

verified

mentioned in commit utilities/docker-twitter-dump@39ca24ff0773df1d87dd21c5ece649de7c31bc22

Commit: utilities/docker-twitter-dump@39ca24ff0773df1d87dd21c5ece649de7c31bc22 
Author: B Tasker                            
                            
Date: 2022-11-03T16:57:29.000+00:00

Message

This was originally built for jira-projects/MISC#5

Given recent developments on Twitter though, polishing it off so that it can be used to archive tweet history.

+58 -1 (59 lines changed)

jira-projects/MISC#5: Mirror twitter into InfluxDB

Issue Information

Activity