
websites/privacy-sensitive-analytics#1: Design Privacy Sensitive Analytics system



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: 0.1
Created: 15-Dec-21 16:27



Description

I recently moved www.bentasker.co.uk over from using Joomla to a static site generator (websites/BEN#3).

When doing the move, I didn't implement any analytics hooks and have since decommissioned the Piwik/Matomo system I was using (jira-projects/CDN#8).

However, that's come at the cost of not having a good overview of where traffic is coming from (or even where it's going to - most of my sites are served via CDN and I deliberately do not receive full access logs).

What I'd like is an analytics system which can collect some of this information, without overcollecting.

So, we want to capture:

  • What page they hit
  • Their referrer (if there was one)
  • When they hit
  • Platform (mobile, desktop)



Activity


assigned to @btasker

I do find it interesting to see where users are located, at least at the country level.

We can't geolocate them without calling external services (or submitting the IP as part of the payload/getting direct connections) though.

But we could use some JS to extract the timezone they've got set. That's probably granular enough to satisfy my curiosity, and it helps reduce cardinality.

My current thinking is:

  • Small JS client
  • Lua on Mikasa to parse the input and translate to line protocol
  • Telegraf on Mikasa with InfluxDB listener active

That way writes can be batched off-net but still ultimately written to my local InfluxDB. Having the Lua in the middle helps guard against potentially malicious input, as well as allowing types to be enforced.

mentioned in issue jira-projects/CDN#8

So, as it's simplest, we might have the JS post a JSON payload, something like

{
"page" : "/foo/bar",
"domain" : "www.bentasker.co.uk",
"referrer" : "https://foo.bar/sed",
"utcoffset" : "0",
"platform" : "Linux x86_64"
}
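A rough sketch of the kind of type enforcement the middle layer could apply to a payload like this. It's written in JavaScript purely for illustration (the plan above is to do this in Lua), and the function names, the 256 character limit and the offset bounds are my own assumptions rather than anything settled in this issue.

```javascript
// Hypothetical limit; not specified anywhere in the issue.
const MAX_LEN = 256;

// Coerce a value to a bounded string with control characters removed.
// Anything that is not a string becomes an empty string.
function cleanString(value) {
  if (typeof value !== "string") return "";
  return value.replace(/[\u0000-\u001f\u007f]/g, "").slice(0, MAX_LEN);
}

// Parse an untrusted request body and return only the expected fields,
// each forced to a known type. Returns null if the body is not a JSON object.
function sanitisePayload(body) {
  let raw;
  try {
    raw = JSON.parse(body);
  } catch (e) {
    return null;
  }
  if (raw === null || typeof raw !== "object" || Array.isArray(raw)) {
    return null;
  }

  // Offsets are minutes from UTC; assume anything beyond +/- 14 hours is junk.
  const offset = Number.parseInt(raw.utcoffset, 10);
  const validOffset = Number.isFinite(offset) && Math.abs(offset) <= 840;

  return {
    page: cleanString(raw.page),
    domain: cleanString(raw.domain),
    referrer: cleanString(raw.referrer),
    platform: cleanString(raw.platform),
    utcoffset: validOffset ? offset : 0,
  };
}
```

Unknown keys are dropped rather than rejected, so a client sending extra fields still gets counted without those fields reaching the database.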

Looks like we can get the TZ offset (in minutes from UTC) with

new Date().getTimezoneOffset();

And the platform can be pulled from navigator.platform
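Putting those two calls together with the payload format from earlier, the client side might look something like the sketch below. The function name and the idea of passing the browser objects in as arguments are my own assumptions (it just makes the function easy to exercise outside a browser), and the endpoint in the usage comment is a placeholder.

```javascript
// Build the analytics payload from browser-like objects.
// loc: window.location, doc: document, nav: navigator, now: a Date.
function buildPayload(loc, doc, nav, now) {
  return {
    page: loc.pathname,
    domain: loc.hostname,
    referrer: doc.referrer || "",
    // getTimezoneOffset() returns minutes, positive when local time is
    // behind UTC. Sent as a string to match the example payload.
    utcoffset: String(now.getTimezoneOffset()),
    platform: nav.platform || "",
  };
}

// Browser usage (the "/write" path is a hypothetical placeholder):
//   const payload = buildPayload(window.location, document, navigator, new Date());
//   fetch("/write", { method: "POST", body: JSON.stringify(payload) });
```

Only the path and hostname are read from the location, so query strings and fragments never leave the browser.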


mentioned in commit b2a3bdc00d7b9af8adc7b94cc5e22943db6efeda

Commit: b2a3bdc00d7b9af8adc7b94cc5e22943db6efeda 
Author: B Tasker                            
                            
Date: 2021-12-15T19:37:31.000+00:00 

Message

We have a working PoC for websites/privacy-sensitive-analytics#1

This implements

An agent which collects

  • Source website hostname
  • Source page path
  • Referrer
  • Browser platform
  • Browser timezone
  • Time taken to load page

The agent submits via a simple JSON payload, received via an OpenResty server block and processed by Lua.

Lua processes the JSON and reformats it into InfluxDB line protocol for writing into either Telegraf or InfluxDB
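For reference, a minimal sketch of what the JSON to line protocol step involves. This is JavaScript for illustration only (the commit does this in Lua), and the measurement name "pageview", the function names and the split between tags and fields are assumptions on my part, not what the commit actually uses. It expects the strings to have already had newlines stripped.

```javascript
// Escape a tag value: line protocol requires commas, equals signs and
// spaces to be backslash-escaped.
function escapeTag(value) {
  return String(value).replace(/([,= ])/g, "\\$1");
}

// Escape a string field value: backslashes and double quotes.
function escapeField(value) {
  return String(value).replace(/(["\\])/g, "\\$1");
}

// Convert a payload into one line of line protocol.
// timestampNs is passed as a string so nanosecond precision is not lost.
function toLineProtocol(p, timestampNs) {
  // Low-cardinality values as tags; empty tag values are skipped.
  const tags = [["domain", p.domain], ["platform", p.platform]]
    .filter(([, v]) => v != null && v !== "")
    .map(([k, v]) => `${k}=${escapeTag(v)}`)
    .join(",");
  const tagPart = tags ? "," + tags : "";

  // Higher-cardinality values as fields; the "i" suffix marks an integer.
  const offset = Math.trunc(Number(p.utcoffset)) || 0;
  const fields = [
    `page="${escapeField(p.page)}"`,
    `referrer="${escapeField(p.referrer)}"`,
    `utcoffset=${offset}i`,
  ].join(",");

  return `pageview${tagPart} ${fields} ${timestampNs}`;
}
```

Keeping page and referrer as fields rather than tags avoids creating a new series for every distinct URL, which fits the earlier aim of keeping cardinality down.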

+140 -0 (140 lines changed)

Closing this as done - v0.1 has been released