project Utilities / Python Influxdb Downsample

utilities/python_influxdb_downsample#25: Unexpected results for some datasets



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: vnext
Created: 05-May-23 07:40



Description

Whilst checking some historic graphs, I've noticed an odd skew in some of the data being generated by downsample.

The nature of the skew isn't consistent. For example, speed measurements for our internet connection seem to have skewed downwards by about 50%:

Screenshot_20230505_083521

The job for that is

downsample_mean_15m:
    # Name for the task
    name: "Downsample Various Means"
    influx: home1x

    # Taken from range()
    period: 45

    # Take from aggregate Window
    window: 15

    # taken from in_bucket
    bucket: telegraf/autogen
    measurement:
        - data_quota
        - http_response
        - i2pd
        - nextcloud_quotas
        - pihole_stats
        - internet_speed

    not_fields:
        - result_type
        - status

    # We want to generate each of the aggregates
    # and append the suffixes created in the map() calls
    aggregates:
        mean:

Meanwhile, though, the interface counters for our WAN interface seem to have skewed up

Screenshot_20230505_083652

The job for those is

downsample_interface_stats:
    # Name for the task
    name: "Downsample OpenWRT Interface Stats"
    influx: home1x

    # Taken from range()
    period: 60

    # Take from aggregate Window
    window: 15

    # taken from in_bucket
    bucket: telegraf/autogen
    measurement: interface

    filters:
        - 'r.hostname == "OpenWRT"'

    fields:
        - ifHCInOctets
        - ifHCOutOctets
        - ifOperStatus
        - ifInErrors


    # We want to generate each of the aggregates
    # and append the suffixes created in the map() calls
    aggregates:
        mean:
    output_influx:
        - influx: home2xreal
    output_bucket: telegraf/telegraf_low_granularity

(That gap is because I tried changing the aggregate to max last night).

So in one graph, we're claiming that the connection only manages 25Mb/s, whilst in another we're claiming to have done 112Mib/s.

If I check the raw data, both of these graphs give the expected result: around 50Mbps of internet speed, with WAN usage generally a lot lower.

Screenshot_20230505_083937

Need to look into why




Activity


assigned to @btasker

changed the description

I think the issue with the interface statistics is just that the chosen means of downsampling is incorrect.

The underlying fields are cumulative counters - they show the total number of octets/errors etc since the router's networking stack last restarted.

Taking a mean of that and then later using derivative() is probably going to cause unexpected results.

What we should actually do is have the downsample job apply derivative() so that the downsampled data represents the data rate rather than a snapshot of the raw counter's state. It's quite possible that that alone will be enough to give the correct values.
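To illustrate the idea, here's a minimal sketch (not the job's actual output; the bucket, field and window are taken from the configs above, and the time range is just an example):

// Sketch: downsample the rate rather than the raw counter.
// derivative() converts the cumulative ifHCOutOctets counter into
// octets/second, so the windowed mean is a meaningful data rate.
from(bucket: "telegraf/autogen")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "interface" and r._field == "ifHCOutOctets")
  |> derivative(unit: 1s, nonNegative: true)
  |> aggregateWindow(every: 15m, fn: mean)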


The internet speed check is a little harder - we'll need to look at the underlying raw data. Given the reported value is consistently about half, though, I would guess that some low values (like 0 or 1) are being taken into account and skewing the mean (a single genuine reading averaged with a spurious 0 would halve it).

We don't have derivative() available as an aggregate (it'd really be a pre-processor; raised #26 to track that), so we'll need to use a custom Flux query:

downsample_openwrt_interface_rates:
    # Name for the task
    name: "Downsample OpenWRT Interface rates"
    influx: home1x

    # Taken from range()
    period: 60

    # Take from aggregate Window
    window: 15


    query: >
        from(bucket: "telegraf/autogen")
          |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
          |> filter(fn: (r) => r._measurement == "interface")
          |> filter(fn: (r) => r.hostname == "OpenWRT")
          |> filter(fn: (r) => r._field == "ifHCInOctets" or
               r._field == "ifHCOutOctets" or
               r._field == "ifInErrors"
            )
          |> derivative(nonNegative: true)
          |> map(fn: (r) => ({r with _value: float(v: r._value) * 8.0}))          


    # We want to generate each of the aggregates
    # and append the suffixes created in the map() calls
    aggregates: 
        max:
          as:
           ifHCInOctets: "max_in_bps"
           ifHCOutOctets: "max_out_bps"
           ifInErrors: "max_err_s"
        mean:
          as:
           ifHCInOctets: "mean_in_bps"
           ifHCOutOctets: "mean_out_bps"
           ifInErrors: "mean_err_s"

    output_influx: 
        - influx: home2xreal
    output_bucket: telegraf/telegraf_low_granularity

I've just given that a test run.

Querying with

from(bucket: "telegraf/telegraf_low_granularity")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "interface")
  |> filter(fn: (r) => r._field == "mean_in_bps" or r._field == "mean_out_bps")
  |> filter(fn: (r) => r.hostname == "OpenWRT")
  |> filter(fn: (r) => r.ifDescr == "pppoe-wan")
  |> aggregateWindow(every: v.windowPeriod, fn: mean) 

currently gives a single datapoint: around 3Mbps.

Although that sounds like a realistic figure, it's not the value I'm getting from the raw data.

If we run the query that Downsample is running (but filter for pppoe-wan to reduce noise), we don't get anything near that reading around 09:05:

Screenshot_20230505_091340

Just for avoidance of doubt, if we add

|> aggregateWindow(every: v.windowPeriod, fn: mean)

to the query, we do not see the numbers skew up - they remain in the kbps range.

Going to the downsampled data then (we've gained an additional entry in the meantime):

from(bucket: "telegraf/telegraf_low_granularity")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "interface")
  |> filter(fn: (r) => r["_field"] == "mean_in_bps" or r["_field"] == "mean_out_bps")
  |> filter(fn: (r) => r["ifDescr"] == "pppoe-wan")
  |> keep(columns: ["_time", "_field", "_value"])

We get

Screenshot_20230505_091921

311417 bits per second is not 3Mbps, it's 311Kbps - so why is Grafana reporting that higher value?

Screenshot_20230505_092135

Actually, mean_in_bps looks about right; it's mean_out_bps we need to look at.

In Grafana, mean_out_bps claims 2.9Mib/s. In the raw data above it's 3054988.8 bps (about 3Mbps).

Querying the raw data with

        from(bucket: "telegraf/autogen")
          |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
          |> filter(fn: (r) => r._measurement == "interface")
          |> filter(fn: (r) => r.hostname == "OpenWRT")
          |> filter(fn: (r) => r.ifDescr == "pppoe-wan")
          |> filter(fn: (r) => r._field == "ifHCInOctets" or
               r._field == "ifHCOutOctets" or
               r._field == "ifInErrors"
            )
          |> derivative(nonNegative: true)

doesn't show anywhere near that level at 09:05:

Screenshot_20230505_092701

We see much the same with the most recent datapoint (09:15) too - in the downsampled data the out rate is claimed at 2.4Mbps, but raw data shows it as much lower.

Maybe I'm just tired, but I'm confused - this looks like it has to be a bug, and yet - why's it only affecting one field?

OK, so before we go hunting demons in the code, I'm pulling the raw data that the job should have got for this field (dropping some columns to improve readability)

from(bucket: "telegraf/autogen", host: "http://192.168.3.84:8086")
  |> range(start: 2023-05-05T07:05:26.899945Z, stop: 2023-05-05T08:05:26.899945Z)
  |> filter(fn: (r) => r._measurement == "interface")
  |> filter(fn: (r) => r.hostname == "OpenWRT")
  |> filter(fn: (r) => r.ifDescr == "pppoe-wan")
  |> filter(fn: (r) => r._field == "ifHCOutOctets"
    )
  |> derivative(nonNegative: true)
  |> window(every: 15m)

It's the final window (table 4) that we care about in this case.

Values there are

10328.1
30922.41666
13326.2
29381.13333
15155.81666
26651.53333

So the mean that should be arrived at is (125765.19998 / 6) = 20960.86666633 (so around 21kbps).
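As a cross-check (a sketch only; same filters as the query above, but letting Flux compute the windowed mean instead of doing it by hand):

from(bucket: "telegraf/autogen", host: "http://192.168.3.84:8086")
  |> range(start: 2023-05-05T07:05:26.899945Z, stop: 2023-05-05T08:05:26.899945Z)
  |> filter(fn: (r) => r._measurement == "interface")
  |> filter(fn: (r) => r.hostname == "OpenWRT")
  |> filter(fn: (r) => r.ifDescr == "pppoe-wan")
  |> filter(fn: (r) => r._field == "ifHCOutOctets")
  |> derivative(nonNegative: true)
  // one mean per 15m window, to compare against the hand calculation
  |> aggregateWindow(every: 15m, fn: mean)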

Just for completeness's sake, checking the 08:15 window (it's a full window, so has more values). Rounding out of laziness:

10328
30922
13326
29381
15155
26651
22618
14907
44175
23184
13015
13488
13662
13285
13271
13000

Which when summed comes out at 297374. Dividing that by 15 gives 19824 (again, around 20Kbps).

So the data we're receiving is definitely representative of the values we're expecting to see.

If we look at max_out_bps for 08:15 it's 11923583.2 - yet that value doesn't appear in the list of values above. It's as if it's been taken from an entirely different source. What makes it weirder is that it's a float, so it's not even a raw counter value somehow slipping through.

And again... why isn't it happening for the other field?

I wondered whether the number was (somehow) spilling over from another interface, so I removed the interface filter when querying the full dataset:

from(bucket: "telegraf/autogen", host: "http://192.168.3.84:8086")
  |> range(start: 2023-05-05T07:05:26.899945Z, stop: 2023-05-05T08:15:26.899945Z)
  |> filter(fn: (r) => r._measurement == "interface")
  |> filter(fn: (r) => r.hostname == "OpenWRT")
  |> filter(fn: (r) => r._field == "ifHCOutOctets"
    )
  |> derivative(nonNegative: true)
  |> filter(fn: (r) => r._value == 11923583.2)

Got no results.

If I remove the field filter, I still don't get any results.

This value has to be coming from somewhere

As if that wasn't odd enough, the latest runs don't seem to do it:

Screenshot_20230505_100006

Going to have to keep an eye on it, but it's going to be hard to troubleshoot if it's no longer happening (it's also still quite possible the answer is that I'm confused).

I've added a graph to my Internet Connection dashboard to monitor this.

mentioned in commit sysconfigs/downsample_configs@8f8e707e5c0dfcb0b5f3ac6d49ce1c65657181e8

Commit: sysconfigs/downsample_configs@8f8e707e5c0dfcb0b5f3ac6d49ce1c65657181e8
Author: ben
Date: 2023-05-05T10:03:37.000+01:00
Message: +42 -0 (42 lines changed)

The WAN speed graph is consistently returning correct values now.

So that one can be considered a config issue: calculating the mean of a cumulative counter gives unexpected results.

Need to look at the speed calculations next.

Broadband speed calculation

The last 24hrs downsampled vs raw

Screenshot_20230508_095649

That data's downsampled as part of a broader job, built on the following query

from(bucket: "telegraf/autogen" )
|> range(start: 2023-05-05T08:13:42.821222Z, stop: 2023-05-05T08:58:42.821222Z)
|> filter(fn:(r) =>  r._measurement == "data_quota" or r._measurement == "http_response" or r._measurement == "i2pd" or r._measurement == "nextcloud_quotas" or r._measurement == "pihole_stats" or r._measurement == "internet_speed" )
|> filter(fn:(r) =>  r._field != "result_type" or r._field != "status" )
|> window(every: 15m, createEmpty: true)

(taken from the logged run I did earlier in the ticket, which is why the time ranges are out of date).

So to make troubleshooting easier, we probably want to trim that back to

from(bucket: "telegraf/autogen" )
|> range(start: 2023-05-05T08:13:42.821222Z, stop: 2023-05-05T08:58:42.821222Z)
|> filter(fn:(r) =>  r._measurement == "internet_speed" )
|> filter(fn:(r) =>  r._field != "result_type" or r._field != "status" )
|> window(every: 15m, createEmpty: true)

If I run that in Grafana quickly, we get an interesting set of results:

Screenshot_20230508_100118

So if we take that first column, we've got a single value of 48.7 but two rows. We'd therefore calculate the mean as 48.7/2 and effectively halve it.

But if we switch to using the Influx UI (which draws better tables):

Screenshot_20230508_100515

We can see that it might be a reporting issue.

Oddly though, if we add a mean() call to it...

Screenshot_20230508_100920

So we're probably writing 0 into those empty windows, which can then be factored into the means when graphing later.
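As a rough illustration of the mechanism (a sketch only; #20 covers how the job actually materialises the empty windows, and the fill() here just stands in for the 0s it wrote):

from(bucket: "telegraf/autogen")
  |> range(start: -6h)
  |> filter(fn: (r) => r._measurement == "internet_speed")
  |> filter(fn: (r) => r._field != "result_type" and r._field != "status")
  // createEmpty: true emits 15m windows even when no speedtest ran in them
  |> aggregateWindow(every: 15m, fn: mean, createEmpty: true)
  // stand-in for the 0 values the downsample job wrote for those windows
  |> fill(value: 0.0)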

We can prove that by querying the data out:

Screenshot_20230508_101325

So there's the source of the skew. We can probably blame #20 for that.

There are two possible fixes for the broadband speed reports then:

  • Set window_create_empty to False on the downsample job
  • Adjust the dashboard query to filter out 0 values (sketched below)

The correct answer is probably to do both (as it'll fix graphing of the data we already have, as well as correcting future downsamples).
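A sketch of the dashboard-side filter (the bucket and measurement names here are assumptions, since the speed job's output settings aren't shown above; the relevant part is the _value > 0.0 filter):

from(bucket: "telegraf/telegraf_low_granularity")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "internet_speed")
  // drop the 0 values written while window_create_empty was enabled
  |> filter(fn: (r) => r._value > 0.0)
  |> aggregateWindow(every: v.windowPeriod, fn: mean)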

It affects this job because the sampling interval is roughly the same as the windowing period, so depending on how the offsets line up, some windows end up containing no sample at all.


Update:

Disabling window_create_empty had the desired effect for new incoming data:

Screenshot_20230508_121304

Whilst adding a filter to strip the 0s corrected the historic data:

Screenshot_20230508_121405

mentioned in issue #27

So, as it turns out, these were both configuration issues.

Although they were valid things to investigate, closing as Invalid since they were not code issues.

mentioned in commit sysconfigs/downsample_configs@874f0379a3995e398df0f1f0ba88220886c45e12

Commit: sysconfigs/downsample_configs@874f0379a3995e398df0f1f0ba88220886c45e12
Author: ben
Date: 2023-05-08T12:15:09.000+01:00

Message

Disable creation of empty windows

This was the fix in utilities/python_influxdb_downsample#25

With this enabled, we periodically insert a 0 value for internet speeds, which skews calculation of mean() over time.

+2 -0 (2 lines changed)

mentioned in issue #20