websites/privacy-sensitive-analytics#12: Implement bad-domains reporting and utilities



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: V0.3
Created: 29-Mar-22 07:40



Description

#10 implemented capturing of enhanced information when a bad domain is detected.

We need to develop reporting around that - I currently have a dashboard but it's loosely put together and needs refining.

Commit b1911bf2 implemented a utility for viewing the generated screenshots, but again there's definitely some refinement possible.



Activity


assigned to @btasker

mentioned in commit 58613d40600aeb67b12f7f4f420cb680230aeb3d

Commit: 58613d40600aeb67b12f7f4f420cb680230aeb3d
Author: B Tasker
Date: 2022-04-03T11:20:44.000+01:00

Message

Downsample bad-domains stats ready for reporting in websites/privacy-sensitive-analytics#12

+38 -2 (40 lines changed)

We now capture stats about the number of bad domain requests.

Number of writes originating from a bad domain:

from(bucket: "telegraf/autogen")
  |> range(start: ''' + START + ''')
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_unauth" and r._field == "domain")
  |> map(fn: (r) => ({ _time: r._time, ref: r._value, _value: 1, host: r.host }))
  |> group()
  |> aggregateWindow(every: 15m, fn: sum)

Number of unique bad domains observed in period:

from(bucket: "telegraf/autogen")
  |> range(start: ''' + START + ''')
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_unauth" and r._field == "domain")
  |> map(fn: (r) => ({ _time: r._time, _value: r._value, host: r.host}))
  |> window(every: 15m)
  |> distinct()
  |> count()
  |> map(fn: (r) => ({ _time: r._stop, host: r.host, _value: r._value}))
  |> group()

That one's actually probably less useful than it sounds. In most periods it'll be 1 or 2, and there's no reliable way to roll it up across longer periods (if the next hour's windows add up to 3, were those the same bad domain or different ones?).
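To make the roll-up problem concrete, here's a minimal Python sketch. The hits and domain names are invented purely for illustration (nothing here comes from the real data), and the windowing is a simplified stand-in for what the Flux above does.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)
EPOCH = datetime(1970, 1, 1)

# Hypothetical hits: (timestamp, bad domain). Invented for illustration only.
HITS = [
    (datetime(2022, 4, 3, 10, 1), "bad.example"),
    (datetime(2022, 4, 3, 10, 20), "bad.example"),
    (datetime(2022, 4, 3, 10, 25), "worse.example"),
    (datetime(2022, 4, 3, 10, 50), "bad.example"),
]


def window_start(ts):
    """Truncate a timestamp to the start of its 15 minute window."""
    return ts - ((ts - EPOCH) % WINDOW)


def unique_per_window(hits):
    """Return {window start: number of distinct domains seen in that window}."""
    seen = defaultdict(set)
    for ts, domain in hits:
        seen[window_start(ts)].add(domain)
    return {start: len(domains) for start, domains in seen.items()}


per_window = unique_per_window(HITS)

# All the downsampled data can give us for the hour is the sum of the counts
summed = sum(per_window.values())

# What we actually want is only available from the raw domain names
actual = len({domain for _, domain in HITS})

print("per-window unique counts:", sorted(per_window.values()))
print("sum of per-window counts:", summed)
print("true unique domains in the hour:", actual)
```

Once only the counts are stored, the same domain appearing in several windows is indistinguishable from several different domains, so the summed figure overstates the true number.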

If we want to be able to capture that, we'd need to write the actual domain in as a field or tag value. The problem with that is you either risk runaway cardinality, or accept that the data's not really downsampled (because we'd need to write a point per hit).

Will give that one some more thought.
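As a rough illustration of that trade-off, the Python sketch below compares how many series result from storing the domain as a tag versus as a field. The measurement name `bad_domains`, the host `web1` and the `spam*.example` domains are all hypothetical, and the series key here is a simplification: the point is only that tag values contribute to series identity while field values never do.

```python
def series_key(measurement, tags):
    """Build a simplified series key from the measurement and tag set.

    Field values are deliberately not part of the key.
    """
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_str}" if tag_str else measurement


def as_tag(domain):
    """Domain stored as a tag: every new domain creates a new series."""
    tags = {"host": "web1", "domain": domain}
    fields = {"count": 1}
    return series_key("bad_domains", tags), fields


def as_field(domain):
    """Domain stored as a field: one series, but one point per hit."""
    tags = {"host": "web1"}
    fields = {"domain": domain}
    return series_key("bad_domains", tags), fields


# Hypothetical flood of dodgy domains
domains = [f"spam{i}.example" for i in range(1000)]

tag_series = {as_tag(d)[0] for d in domains}
field_series = {as_field(d)[0] for d in domains}

print("series when domain is a tag:", len(tag_series))
print("series when domain is a field:", len(field_series))
```

With the tag approach the series count grows with every new domain; with the field approach it stays flat, at the cost of having to keep a point for each observation.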

mentioned in commit 4ea7009281ef11d091d4cc099e652a3c8c9b6586

Commit: 4ea7009281ef11d091d4cc099e652a3c8c9b6586
Author: B Tasker
Date: 2022-04-03T11:46:03.000+01:00

Message

Write a list of observed bad domains and associated pages into long term storage for websites/privacy-sensitive-analytics#12

We write these as a field value to ensure that a slew of dodgy domains doesn't impact our cardinality

+34 -0 (34 lines changed)

We now write a list of the domains in, so I've added a graph to the dashboard that shows the number of unique bad domains in the time period. It uses better Flux than the downsampling version above:

from(bucket: "websites/analytics")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_bad_domains" 
                       and r._field == "bad_domain_domains")
  |> aggregateWindow(
      every: v.windowPeriod,
      fn: (column, tables=<-) => tables
          |> distinct()
          |> count(),
      createEmpty: true
  )
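For anyone less familiar with Flux, the following is a rough Python sketch of what that query is intended to compute: a distinct count per window, stamped with the window's stop time, with empty windows still emitted. It assumes each stored point carries a single domain as its value (the issue doesn't spell out the exact field format), and the sample points are made up.

```python
from datetime import datetime, timedelta


def distinct_count_windows(points, start, stop, every):
    """Count distinct values per window, emitting 0 for windows with no points.

    points: iterable of (timestamp, value) pairs
    Returns a list of (window stop time, distinct count) tuples.
    """
    results = []
    window_start = start
    while window_start < stop:
        window_stop = min(window_start + every, stop)
        values = {v for ts, v in points if window_start <= ts < window_stop}
        results.append((window_stop, len(values)))
        window_start = window_stop
    return results


# Hypothetical stored points: (timestamp, domain). Invented for illustration.
POINTS = [
    (datetime(2022, 4, 3, 10, 5), "a.example"),
    (datetime(2022, 4, 3, 10, 10), "a.example"),
    (datetime(2022, 4, 3, 10, 12), "b.example"),
    (datetime(2022, 4, 3, 10, 40), "c.example"),
]

windows = distinct_count_windows(
    POINTS,
    start=datetime(2022, 4, 3, 10, 0),
    stop=datetime(2022, 4, 3, 11, 0),
    every=timedelta(minutes=15),
)

for stop_time, count in windows:
    print(stop_time.isoformat(), count)
```

Because the distinct count is taken over the stored domain names at query time, the window size can follow the dashboard's time range rather than being fixed at write time.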

mentioned in commit 11820d97dd320733791e45cf90836c5471c51efd

Commit: 11820d97dd320733791e45cf90836c5471c51efd
Author: B Tasker
Date: 2022-04-03T12:01:25.000+01:00

Message

Update historic dashboard to give bad domain stats for websites/privacy-sensitive-analytics#12

+249 -15 (264 lines changed)

We have reporting in place.

> Commit b1911bf2 implemented a utility for viewing the generated screenshots, but again there's definitely some refinement possible

I've since opted to remove the screenshot functionality (#14) so there's nothing extra to do here.