utilities/telegraf-plugins#1: Tor plugin



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: tor-plugin v0.1
Created: 11-May-22 08:06



Description

Given I run a number of Onion services, it'd probably be prudent to monitor the tor daemon.

I've had a quick search around on the net and there doesn't seem to be a tor plugin for telegraf.

However, it should be possible to create an exec plugin, based on Tor's controlspec, that pulls statistics out of the daemon.




Activity


assigned to @btasker

OK, as step 1, let's enable ControlPort on a tor instance.

Generate a password hash

/ $ tor --hash-password SecretPass
16:20F64DD23B8043966023A8797DDE0DE3AC697FD8461C1E7B25FF767D47

Edit torrc to enable the controlport and set the password

ControlPort 9051
HashedControlPassword 16:222D0FF1BE77A55760305E8D3A04304BC68FC37B19069DF43E52FF64E1

We can then netcat in and authenticate

/ $ nc 127.0.0.1 9051
AUTHENTICATE "SecretPass"
250 OK
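
The same exchange can be scripted. The following is only a minimal sketch, not the plugin's actual code: the helper names (connect, read_reply, authenticate) are made up for illustration, and it assumes the password-based AUTHENTICATE shown above.

```python
import socket


def connect(host="127.0.0.1", port=9051):
    """Open a TCP connection to the ControlPort (hypothetical helper)."""
    sock = socket.create_connection((host, port), timeout=5)
    # A binary file object gives us readline() for the CRLF-delimited replies
    return sock, sock.makefile("rwb")


def read_reply(conn):
    """Read one complete reply and return its lines without CRLF.

    A reply ends at a line whose status code is followed by a space
    (e.g. "250 OK"). Lines following a "250+..." line are raw data and
    run until a line containing only ".".
    """
    lines = []
    in_data = False
    while True:
        raw = conn.readline()
        if not raw:
            break  # connection closed by the peer
        line = raw.decode().rstrip("\r\n")
        lines.append(line)
        if in_data:
            if line == ".":
                in_data = False
            continue
        if len(line) >= 4 and line[:3].isdigit():
            if line[3] == "+":
                in_data = True
            elif line[3] == " ":
                break
    return lines


def authenticate(conn, password):
    """Send AUTHENTICATE and report whether the daemon answered 250."""
    # Escape backslashes and quotes so the password stays one quoted string
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    conn.write(f'AUTHENTICATE "{escaped}"\r\n'.encode())
    conn.flush()
    reply = read_reply(conn)
    return bool(reply) and reply[-1].startswith("250")
```

Against a live daemon this would be used as `sock, conn = connect()` followed by `authenticate(conn, "SecretPass")`.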

We can pull

total bytes read (downloaded) and written (uploaded)

GETINFO traffic/read
250-traffic/read=5942577
250 OK
GETINFO traffic/written
250-traffic/written=8463596
250 OK

Daemon uptime

GETINFO uptime
250-uptime=339

Current software version

GETINFO version
250-version=0.4.5.10

Whether tor is currently active

A nonnegative integer: zero if Tor is currently active and building circuits, and nonzero if Tor has gone idle due to lack of use or some similar reason.

GETINFO dormant
250-dormant=0

List of circuits and their status (would need further parsing)

GETINFO circuit-status

List of entry guards and their status

GETINFO entry-guards
250+entry-guards=

Whether self tests against the ORPort worked (will report success if orport not configured)

GETINFO status/reachability-succeeded/or
250-status/reachability-succeeded/or=1

Get state for both ORPort and DirPort checks

GETINFO status/reachability-succeeded 
250-status/reachability-succeeded=OR=1 DIR=1

Get text status of current tor version

GETINFO status/version/current
250-status/version/current=recommended
250 OK

Assessment of network state (up/down)

GETINFO network-liveness
250-network-liveness=up
250 OK
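
As a rough illustration of how those single-line replies could be turned into values, here is a small sketch. The function name parse_getinfo is hypothetical, and it only handles the "250-key=value" form shown above (multi-line "250+" replies such as entry-guards need separate handling).

```python
def parse_getinfo(lines):
    """Extract key/value pairs from single-line GETINFO reply lines."""
    values = {}
    for line in lines:
        # Only "250-key=value" lines carry data; "250 OK" ends the reply
        if not line.startswith("250-"):
            continue
        # Split on the first "=" only, as values may contain "=" themselves
        key, sep, value = line[4:].partition("=")
        if sep:
            values[key] = value
    return values
```

Values are returned as strings, so a reply of "250-traffic/read=5942577" yields the key "traffic/read" with the value "5942577", and conversion to a number is left to the caller.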

So, breaking those down into tags vs fields, I'm inclined to say

tags

  • dormant
  • or_reachability_succeeded
  • dp_reachability_succeeded
  • tor_version_state
  • network_liveness

fields

  • bytes_rx
  • bytes_tx
  • uptime
  • software_version

entry-guards would get broken down into the following fields

  • num_known_entry_guards
  • num_connected_entry_guards
  • num_down_entry_guards
  • num_never_connected_entry_guards
  • num_up_entry_guards
  • num_unusable_entry_guards
  • num_unlisted_entry_guards
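
A sketch of how that breakdown might be counted, assuming each data line of the entry-guards reply has the form "ServerID Status [timestamp]" with a status of up, never-connected, down, unusable or unlisted. The function name count_guards is made up, and num_connected_entry_guards is left out because the assumed reply format has no status of that name.

```python
from collections import Counter

# Statuses assumed to appear as the second token of each guard line
GUARD_STATUSES = ("up", "never-connected", "down", "unusable", "unlisted")


def count_guards(lines):
    """Count entry guards per status from the lines of a GETINFO reply."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Header ("250+entry-guards="), terminator (".") and "250 OK"
        # lines fail this check and are skipped
        if len(parts) >= 2 and parts[1] in GUARD_STATUSES:
            counts[parts[1]] += 1

    fields = {"num_known_entry_guards": sum(counts.values())}
    for status in GUARD_STATUSES:
        name = "num_" + status.replace("-", "_") + "_entry_guards"
        fields[name] = counts[status]
    return fields
```

Statuses that never appear are still reported with a count of 0, which keeps the set of fields stable between collections.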

circuit-status needs further analysis. Section 4.1.1 of the spec details it.

Will look at putting a script together later to connect in and collect these.

Within the plugin, most of the stats to be collected are defined within a list:

stats = [
    #cmd, output_name, type, tag/field
    ["traffic/read", "bytes_rx", "int", "field"],
    ["traffic/written", "bytes_tx", "int", "field"],
    ["uptime", "uptime", "int", "field"],
    ["version", "tor_version", "string", "field"],
    ["dormant", "dormant", "int", "field"],
    ["status/reachability-succeeded/or", "orport_reachability", "int", "field"],
    ["status/reachability-succeeded/dir", "dirport_reachability", "int", "field"],

    ["status/version/current", "version_status", "string", "tag"],
    ["network-liveness", "network_liveness", "string", "tag"]
]

The first entry in each is the command to pass with GETINFO into the controlport. The second is the field/tag name we provide to telegraf.

The third, type, should be one of int, float or string (I guess we should add bool). It's ignored for tags, as they're always strings.

The final entry is whether the value should be treated as a tag or a field.

This covers most of the items listed above - we still need to break down and parse entry-guards
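
To make the role of each column concrete, here is a sketch of how such a list could be applied to values fetched from the controlport. It is not the plugin's code: convert and split_stats are hypothetical names, raw_values is assumed to be a dict of GETINFO key to string value, and counting a missing or unparseable value as one failure is my assumption.

```python
def convert(value, type_name):
    """Cast a raw string according to the type column of the stats list."""
    if type_name == "int":
        return int(value)
    if type_name == "float":
        return float(value)
    return value  # "string" values are passed through unchanged


def split_stats(stats, raw_values):
    """Sort fetched values into tags and fields, counting failures."""
    tags, fields, failures = {}, {}, 0
    for cmd, name, type_name, kind in stats:
        if cmd not in raw_values:
            failures += 1  # nothing was returned for this command
            continue
        if kind == "tag":
            # The type column is ignored here: tags are always strings
            tags[name] = raw_values[cmd]
            continue
        try:
            fields[name] = convert(raw_values[cmd], type_name)
        except ValueError:
            failures += 1  # value did not match the declared type
    return tags, fields, failures
```

Keeping the definitions in data rather than code means adding a new statistic is a one-line change to the list.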

This is now mostly built.

Default configuration is at the top of the plugin and can be overridden via environment variable

import os

# Defaults, each overridable via an environment variable
CONTROL_H = os.getenv("CONTROL_HOST", "127.0.0.1")
CONTROL_P = int(os.getenv("CONTROL_PORT", "9051"))
AUTH = os.getenv("CONTROL_AUTH", "MySecretPass")
MEASUREMENT = os.getenv("MEASUREMENT", "tor")

We return some additional tags if we failed to connect (or authenticate) with the Tor daemon

tor,controlport_connection=failed,failure_type=connection stats_fetch_failures=1i
tor,controlport_connection=failed,failure_type=authentication stats_fetch_failures=1i

Assuming that all is well, though, we return line protocol (LP) like this

tor,controlport_connection=success,version_status=recommended,network_liveness=up stats_fetch_failures=0i,bytes_rx=234889036i,bytes_tx=276329651i,uptime=35188i,tor_version="0.4.5.10",dormant=0i,orport_reachability=1i,dirport_reachability=1i,guards_total=22i,guards_never_connected=22i,guards_unusable=0i,guards_unlisted=0i,guards_up=0i,guards_down=0i
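
For reference, lines of that shape can be assembled from a dict of tags and a dict of fields. This is a sketch only, with hypothetical function names (format_field, escape_tag, build_line). It follows the usual line protocol conventions: integers get an "i" suffix, string fields are double-quoted, and commas, equals signs and spaces in tags are backslash-escaped.

```python
def format_field(value):
    """Render a field value in line protocol syntax."""
    if isinstance(value, bool):  # checked first, as bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def escape_tag(value):
    """Escape the characters that are special in tag keys and values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def build_line(measurement, tags, fields):
    """Join measurement, tags and fields into a single LP line."""
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in tags.items()
    )
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part}"
```

No timestamp is appended, which leaves Telegraf to assign the collection time.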

The next step then is probably to configure this in a telegraf instance and check it all works

The following config can be used

[[inputs.exec]]
  commands = ["/usr/local/bin/tor-daemon.py"]
  data_format = "influx"

Currently, it isn't possible to override env vars from within Telegraf's config. Once support for that is included in a Telegraf release, it'll be possible to do something like

[[inputs.exec]]
  commands = ["/usr/local/bin/tor-daemon.py"]
  data_format = "influx"
  environment = [
    "CONTROL_HOST=127.0.0.1",
    "CONTROL_PORT=9051",
    "CONTROL_AUTH=MySecretPass",
    "MEASUREMENT=tor"
  ]

I now have data appearing in my DB - will look at creating some dashboards once there's a decent amount of data to work with


mentioned in commit github-mirror/telegraf-plugins@fa1995e59596784ef022d7a4cdd24da1051bfa54

Commit: github-mirror/telegraf-plugins@fa1995e59596784ef022d7a4cdd24da1051bfa54
Author: B Tasker
Date: 2022-05-11T19:27:06.000+01:00

Message

Report a counter of how many stats have failed to fetch. See utilities/telegraf-plugins#1

+4 -2 (6 lines changed)

mentioned in commit github-mirror/telegraf-plugins@ef590847215243757dac97389708d423be018ab0

Commit: github-mirror/telegraf-plugins@ef590847215243757dac97389708d423be018ab0
Author: B Tasker
Date: 2022-05-11T19:03:59.000+01:00

Message

Start implementing a telegraf-plugin to monitor tor for utilities/telegraf-plugins#1

This currently collects some simple stats via control port

+121 -0 (121 lines changed)

mentioned in commit github-mirror/telegraf-plugins@2787c195c8c625f2e2b965b0fd80bb4455b80e8b

Commit: github-mirror/telegraf-plugins@2787c195c8c625f2e2b965b0fd80bb4455b80e8b
Author: B Tasker
Date: 2022-05-11T20:09:28.000+01:00

Message

Add file header and README for utilities/telegraf-plugins#1

+112 -0 (112 lines changed)

mentioned in commit github-mirror/telegraf-plugins@2d256804d0b40f2e6887a8e73e9724bcc5419cf0

Commit: github-mirror/telegraf-plugins@2d256804d0b40f2e6887a8e73e9724bcc5419cf0
Author: B Tasker
Date: 2022-05-11T19:22:55.000+01:00

Message

Add ability to add counters based around multiline responses. see utilities/telegraf-plugins#1

+52 -3 (55 lines changed)

OK, starting with the most obvious graph: network throughput

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r._field == "bytes_rx" or r._field == "bytes_tx")
  |> filter(fn: (r) => r.host == v.host)
  |> group(columns: ["host", "_field"])
  |> derivative(unit: 1s, nonNegative: true)
  |> aggregateWindow(every: v.windowPeriod, fn: mean)
  |> map(fn: (r) => ({ r with
      _value: r._value * 8.00
  }))

Screenshot_20220512_094224
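
The arithmetic behind that graph (a cumulative byte counter turned into a bits-per-second rate) can be sketched outside of Flux. This is an illustration only: bits_per_second is a made-up name, and skipping any sample where the counter went backwards (e.g. after a daemon restart) is a simplification, not necessarily how Flux's nonNegative option behaves.

```python
def bits_per_second(samples):
    """Convert (unix_time, byte_counter) samples into bit/s rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        # Ignore out-of-order samples and counter resets
        if elapsed <= 0 or v1 < v0:
            continue
        # bytes per second, multiplied by 8 to give bits per second
        rates.append((v1 - v0) / elapsed * 8)
    return rates
```

For example, a counter that grows by 2500 bytes over 10 seconds works out at 250 bytes/s, or 2000 bit/s.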

Graph to show an overview of guard statuses

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r.host == v.host)
  |> filter(fn: (r) => r._field == "guards_down" or
            r._field == "guards_never_connected" or
            r._field == "guards_total" or
            r._field == "guards_unlisted" or
            r._field == "guards_unusable" or
            r._field == "guards_up")
  |> aggregateWindow(every: v.windowPeriod, fn: max)
  |> keep(columns: ["_time", "host", "_field", "_value"])

Screenshot_20220512_092705

Daemon uptime in minutes

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r.host == v.host)
  |> filter(fn: (r) => r._field == "uptime")
  |> aggregateWindow(every: v.windowPeriod, fn: max)
  |> map(fn: (r) => ({ r with
         _value: float(v: r._value) / 60.0
  }))

Screenshot_20220512_093239

Maximum observed upload rate (in kbit/s)

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r._field == "bytes_tx")
  |> filter(fn: (r) => r.host == v.host)
  |> derivative(unit: 1s, nonNegative: true)
  |> max()
  |> map(fn: (r) => ({ r with 
      _value: (r._value * 8.00) / 1000.00  
  }))

With its counterpart, the highest observed download rate

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r._field == "bytes_rx")
  |> filter(fn: (r) => r.host == v.host)
  |> derivative(unit: 1s, nonNegative: true)
  |> max()
  |> map(fn: (r) => ({ r with 
      _value: (r._value * 8.00) / 1000.00  
  }))

Screenshot_20220512_094452

Kibibytes downloaded

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r._field == "bytes_rx")
  |> filter(fn: (r) => r.host == v.host)
  |> group()
  |> difference()
  |> filter(fn: (r) => r._value > 0)
  |> sum()
  |> map(fn: (r) => ({ r with
    _value: r._value / 1024
  }))

Turning the network liveness result into a hot/cold gauge

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r._field == "bytes_rx")
  |> filter(fn: (r) => r.host == v.host)
  |> last()
  |> map(fn: (r) => ({ 
     host: r.host,
     _value: if r.network_liveness == "up" 
             then
                1
             else
                0    
     ,
     _field: "network_liveness"
  }))

Screenshot_20220512_095717

Doing the same for software version assessment

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "tor")
  |> filter(fn: (r) => r._field == "bytes_rx")
  |> filter(fn: (r) => r.host == v.host)
  |> last()
  |> map(fn: (r) => ({ 
     host: r.host,
     _value: 
             if r.version_status == "recommended" or r.version_status == "new" or r.version_status == "new in series"
             then
                // Good to go
                5
             else if r.version_status == "old"
             then
                // might be an issue in future
                3
             else if r.version_status == "unrecommended" or r.version_status == "obsolete"
             then
                // Uhoh
                1
             else
                // Unknown
                7
      ,
     _field: "version_status"
  }))

Screenshot_20220512_100942

mentioned in issue #2