
websites/privacy-sensitive-analytics#19: Events API



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: vnext
Created: 14-Aug-22 10:33



Description

I think it'd be useful to add a simple events API.

As an example, I recently (re)added social media share buttons to my site (https://www.bentasker.co.uk/posts/documentation/general/adding-sm-share-icons-to-a-nikola-site-template.html), so it'd be quite useful to log an event when they're clicked to see what kind of usage they get.

Similarly, I added something a little while back to intercept problematic searches and redirect the user to a section of this page (https://www.bentasker.co.uk/posts/blog/general/695-an-analysis-of-search-terms-used-on-bentasker-co-uk.html). Again, it'd be useful to be able to fire/log an event to show that that's happened.




Activity


assigned to @btasker

There are some concerns we'd need to address here around cardinality.

The event type is going to need to be a tag, and each distinct tag value creates additional series. So, obviously we need to be a little careful about the event types that we add.

But, we also need to ensure that a malicious user can't drive up cardinality by spamming nonsense events into the system - the server side should probably maintain a whitelist of accepted events.

As a quick fag packet design though, we probably want to capture:

  • domain
  • page
  • event type
  • session ID (if enabled)

It might also be prudent to capture platform and timezone - partly to maximise the chances of the series key being unique to that user, but also so it's possible to identify whether a given event is more common amongst a certain portion of a userbase.


mentioned in commit 1f8fbcd564952c6211d434c45877ba7706763246

Commit: 1f8fbcd564952c6211d434c45877ba7706763246 
Author: B Tasker
Date: 2022-08-14T11:46:59.000+01:00 

Message

Start building interface for events API (websites/privacy-sensitive-analytics#19)

This introduces a new method in the agent - recordEvent(event_name, note)

The argument note is optional, but if provided will be inserted as a text field.

+57 -0 (57 lines changed)

I've created a function in the agent, usage is as follows

recordEvent("bentest")
recordEvent("bentest2","this will be recorded as a note against the event")

The first results in the following POST request body:

{
"domain":"www.bentasker.co.uk",
"page":"/posts/blog/general/695-an-analysis-of-search-terms-used-on-bentasker-co-uk.html",
"timezone":-60,
"platform":"Linux x86_64",
"sess_id":"07ab-8022",
"event":"bentest",
"note":""
}

The Nginx config has been updated to add support (so that we get valid CORS headers back), but there's currently no processing support for these requests
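
As a rough illustration of how a body like the one above could be assembled, here is a minimal sketch. The commit diff isn't shown here, so the helper name `buildEventPayload`, the `env` object, and the mapping of each field to a browser API are all assumptions rather than the agent's actual code. It does fit the example that `new Date().getTimezoneOffset()` returns -60 for a browser on UTC+1.

```javascript
// Hypothetical helper: assembles the JSON body for an event submission.
// `env` stands in for browser globals so the function can also run outside a page.
function buildEventPayload(eventName, note, env) {
    return {
        domain: env.hostname,          // assumed source: window.location.hostname
        page: env.pathname,            // assumed source: window.location.pathname
        timezone: env.timezoneOffset,  // assumed source: new Date().getTimezoneOffset()
        platform: env.platform,        // assumed source: navigator.platform
        sess_id: env.sessionId,        // assumed to be supplied by the agent
        event: eventName,
        // note is optional; an omitted note is sent as an empty string
        note: note === undefined ? "" : String(note),
    };
}

// Values copied from the example request body above
const body = buildEventPayload("bentest", undefined, {
    hostname: "www.bentasker.co.uk",
    pathname: "/posts/blog/general/695-an-analysis-of-search-terms-used-on-bentasker-co-uk.html",
    timezoneOffset: -60,
    platform: "Linux x86_64",
    sessionId: "07ab-8022",
});
console.log(JSON.stringify(body, null, 1));
```

Passing the environment in as an argument is only a choice made for this sketch, so that it can be exercised without a browser.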


mentioned in commit 4f8bacfc1024b8492d8f26c61438b29911fa3b35

Commit: 4f8bacfc1024b8492d8f26c61438b29911fa3b35 
Author: B Tasker
Date: 2022-08-14T12:04:19.000+01:00 

Message

Implement handling of event submissions for websites/privacy-sensitive-analytics#19

Whilst it works, this isn't currently safe to deploy: it doesn't restrict events to a named set, that'll be added shortly

+70 -20 (90 lines changed)

mentioned in commit ee2e8ef3bcb364197f792db6a50b824a957ed7a3

Commit: ee2e8ef3bcb364197f792db6a50b824a957ed7a3 
Author: B Tasker
Date: 2022-08-14T12:08:13.000+01:00 

Message

Add config var $permitted_events for websites/privacy-sensitive-analytics#19

This config option provides a comma-delimited list of event names which may be recorded via the API.

Event names should be lowercased, as they're submitted onwards in lowercase

+12 -0 (12 lines changed)
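
The check described in that commit message can be sketched roughly as below. This is illustrative only: the real check runs server side and its code isn't shown in this issue, so the function names, and the choice to lowercase the submitted name before comparing it against the list, are assumptions.

```javascript
// Hypothetical: parse a comma-delimited config value into a set of permitted names.
function parsePermittedEvents(configValue) {
    return new Set(
        configValue
            .split(",")
            .map((name) => name.trim().toLowerCase())
            .filter((name) => name.length > 0)
    );
}

// Returns the lowercased event name, or null if the event should be rejected.
function normaliseEvent(rawName, permitted) {
    if (typeof rawName !== "string") {
        return null;
    }
    const name = rawName.trim().toLowerCase();
    return permitted.has(name) ? name : null;
}

const permitted = parsePermittedEvents("searchtermintercepted,socialclick");
console.log(normaliseEvent("searchTermIntercepted", permitted)); // → searchtermintercepted
console.log(normaliseEvent("nonsense-event", permitted));        // → null
```

Rejecting anything outside the set keeps the number of distinct values of the event tag bounded by the length of the configured list, which is the cardinality concern raised earlier.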

This seems to work. I've tested it by adding the following:

if (typeof(recordEvent) == "function"){
    recordEvent("searchTermIntercepted", term);
}
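
The same guard could be wrapped up for reuse, for example on the share buttons. The sketch below is an assumption-laden illustration: `safeRecordEvent` and the `data-network` attribute are hypothetical names, and it assumes the agent exposes `recordEvent` as a global function.

```javascript
// Hypothetical wrapper: records an event only if the agent has loaded.
// `scope` is the object expected to carry recordEvent (window, in a browser).
function safeRecordEvent(scope, eventName, note) {
    if (typeof scope.recordEvent === "function") {
        scope.recordEvent(eventName, note);
        return true;
    }
    return false;
}

// Browser wiring, left commented out because it needs a DOM.
// Assumes each share link carries a hypothetical data-network attribute.
// document.querySelectorAll("a[data-network]").forEach((link) => {
//     link.addEventListener("click", () => {
//         safeRecordEvent(window, "socialClick", link.dataset.network);
//     });
// });
```

If the agent is blocked or fails to load, the wrapper returns false and the click proceeds as normal.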

The events get logged as they should.

A count of events can be graphed out with

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "pf_analytics_event")
  |> filter(fn: (r) => r._field == "counter")
  |> filter(fn: (r) => r.event == "searchtermintercepted")
  |> group(columns: ["domain"])
  |> aggregateWindow(every: v.windowPeriod, fn: sum)

For a slightly more complex one, the following query graphs out the number of clicks per social share button

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "pf_analytics_event")
  |> filter(fn: (r) => r._field == "counter" or r._field == "note")
  |> filter(fn: (r) => r.event == "socialclick")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> rename(columns: {counter: "_value"})
  |> group(columns: ["domain", "note"])
  |> aggregateWindow(every: v.windowPeriod, fn: sum)

Downsampling can be achieved with a script like this

option task = {
    name: "downsample_pfanalytics_events",
    every: 15m,
    offset: 1m,
    concurrency: 1,
}

out_bucket = "websites/analytics"
host="http://192.168.3.84:8086"
token=""

sourcedata = from(bucket: "telegraf/autogen", host: host, token: token)
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "pf_analytics_event")
    |> filter(fn: (r) => r.event != "socialclick")
    |> filter(fn: (r) => r._field == "counter")
    |> drop(columns: ["_start", "_stop", "type", "sess"])
    |> aggregateWindow(every: 15m, fn: sum)
    |> to(bucket: out_bucket, host: host, token: token)

sourcedata2 = from(bucket: "telegraf/autogen", host: host, token: token)
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "pf_analytics_event")
    |> filter(fn: (r) => r.event == "socialclick")
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> rename(columns: {counter: "_value", note : "clicked"})
    |> drop(columns: ["_start", "_stop", "type", "sess", "counter"])
    // Because we pivoted we need to restore the group key so that 
    // the aggregate doesn't strip it
    |> group(columns: ["domain","event","page","platform","timezone","clicked","_measurement"])
    |> aggregateWindow(every: 15m, fn: sum)
    |> set(key: "_field", value: "counter")
    |> to(bucket: out_bucket, host: host, token: token)

The first query handles simple counters, the second handles events where we want the note to become a group key (in this case, which social network share button was clicked).
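
To make the effect of the second query more concrete, the sketch below mimics, for a single window, what pivoting and then grouping on the note achieves: one total per domain and note pair. It is not Flux and not part of the task; the function name and the sample rows (including the note values) are made up for illustration.

```javascript
// Illustration only: sums counters per (domain, note) pair, roughly what
// pivot + group + sum produces within one aggregation window.
function sumByNote(rows) {
    const totals = new Map();
    for (const row of rows) {
        // JSON-encode the pair so that differing values cannot collide
        const key = JSON.stringify([row.domain, row.note]);
        totals.set(key, (totals.get(key) || 0) + row.counter);
    }
    return totals;
}

// Made-up rows shaped like the pivoted output (one row per recorded event)
const rows = [
    { domain: "www.bentasker.co.uk", note: "network-a", counter: 1 },
    { domain: "www.bentasker.co.uk", note: "network-b", counter: 1 },
    { domain: "www.bentasker.co.uk", note: "network-a", counter: 1 },
];

for (const [key, total] of sumByNote(rows)) {
    console.log(key, total);
}
```

Without the note in the group key, all three rows would collapse into a single count for the domain, which is what the first query does for simple counters.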