websites/privacy-sensitive-analytics#6: Handling high concurrency



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: 0.1
Created: 18-Dec-21 18:41



Description

At the moment, we can only handle one analytics write per domain+page, per second.

The reason for this is that our series consists only of these elements:

        measurement,
        "domain=" .. string.lower(post['domain']),
        'type="pageview"',
        'page="' .. post['page'] .. '"'
        }

So, if two users hit www.example.com/foo in the same second, Influx will overwrite the field values of the first point with the second.
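To make the overwrite concrete: InfluxDB identifies a point by its measurement, tag set and timestamp. The sketch below simulates that rule with a plain dictionary rather than a real database. The `write_point` helper and the `platform` field values are hypothetical, and it treats domain and page as the identifying elements, as the description above does.

```python
# Simulated storage, keyed the way InfluxDB identifies a point:
# measurement + tag set + timestamp.
store = {}

def write_point(measurement, tags, fields, timestamp):
    """Hypothetical helper: a second write to the same key overwrites
    any fields of the same name, so only one point survives."""
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    store.setdefault(key, {}).update(fields)
    return key

tags = {"domain": "www.example.com", "page": "/foo"}

# Two different visitors, same page, same second
write_point("pf_analytics_test", tags, {"platform": "Linux"}, 1639852860)
write_point("pf_analytics_test", tags, {"platform": "Windows"}, 1639852860)

print(len(store))            # 1
print(list(store.values()))  # [{'platform': 'Windows'}]
```

Two page views went in, but only one point (carrying the second visitor's values) comes back out.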

That's not likely to be too big a drama with my sites most of the time, but I'd like to have it handled (especially as it's tricky to address - details in comments to follow).



Activity


assigned to @btasker

Normally, you'd include something unique to the point in the series - a sensor ID, the user's subnet etc.

That's not directly possible here, and is quite challenging to address, because we're aiming to collect data in a way that doesn't allow the operator to track users through a session (and certainly not between sessions).

We're also trying to keep cardinality down, and having an id unique to the user obviously runs up against that. That's less of an issue in that it can, at least, be addressed when downsampling.

We can't use a time-derived identifier - both requests would derive the same value.

There are a couple of ways around this:

  • generate a session-based pseudonymous identifier and include that as a tag
  • switch to using ms precision

The former has obvious privacy implications. The latter is only really moving the goalposts, and potentially not by much - we use ngx.now() which uses Nginx's cached time - so two requests could conceivably still get the same timestamp.
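As a rough illustration of why ms precision only narrows the collision window, the sketch below counts how many distinct timestamps survive when the same arrivals are truncated to seconds and to milliseconds. The arrival times are made up and the `surviving_points` helper is hypothetical. It also assumes a perfectly fresh clock, so a cached time source could only make the result worse.

```python
def surviving_points(arrivals_us, precision):
    """Count the distinct timestamps left after truncating arrival times
    (integer microseconds) to the given precision. Hits for the same page
    that share a timestamp collapse into a single point."""
    divisor = {"s": 1_000_000, "ms": 1_000}[precision]
    return len({t // divisor for t in arrivals_us})

# Five hypothetical hits on the same page, in microseconds
arrivals_us = [100_000_400, 100_000_700, 100_250_000, 100_900_000, 101_100_000]

print(surviving_points(arrivals_us, "s"))   # 2
print(surviving_points(arrivals_us, "ms"))  # 4
```

The first two hits land 300 microseconds apart, so they still share a timestamp at ms precision and one of them is lost.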

I suspect, ultimately, the answer will be that we need to implement the identifier, but I'll implement the less privacy invasive approach first.

mentioned in commit b54df01ce488e4ad8ab37e6d37e5ee7454c7cae2

Commit: b54df01ce488e4ad8ab37e6d37e5ee7454c7cae2
Author: B Tasker
Date: 2021-12-18T18:56:33.000+00:00

Message

Switch to using ms to allow for higher concurrency - websites/privacy-sensitive-analytics#6

This reduces the likelihood of two logs for the same page getting the same timestamp (leading to one overwriting the other).

It does not fully remove it

+2 -2 (4 lines changed)

mentioned in commit f2762510fa3e8995f71960135643b53209e641f8

Commit: f2762510fa3e8995f71960135643b53209e641f8
Author: B Tasker
Date: 2021-12-18T19:11:58.000+00:00

Message

Implement the ability to use a session specific psuedo-identifier for websites/privacy-sensitive-analytics#6

This is for use when it's expected there will be more than 1 hit per-page, per millisecond.

It does have a privacy impact (as a user's browsing within a session can be identified) and it does impact cardinality at the storage end

+34 -0 (34 lines changed)

I've implemented both

  • By default, millisecond precision will be used
  • If the JS variable window.analytics_gen_psuedoid is true then a pseudo-id will be included

The latter is off by default, and if enabled does come with some impact:

  • There are privacy implications: a user's browsing can be tracked for as long as the browser tab session persists
  • It leads to increased cardinality at the storage end - any downsampling tasks should absolutely strip the identifier
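To show what stripping the identifier buys at downsampling time, here is a minimal sketch that counts points per window after discarding the session tag. The `downsample` helper, the sample points and the session values are hypothetical. The 900 second window is just 15 minutes expressed in seconds.

```python
from collections import Counter

def downsample(points, window_s=900, drop_tags=("sess_id",)):
    """Count points per (window start, remaining tags), discarding any
    tags named in drop_tags so they cannot inflate cardinality."""
    counts = Counter()
    for timestamp, tags in points:
        kept = tuple(sorted((k, v) for k, v in tags.items() if k not in drop_tags))
        window_start = timestamp - (timestamp % window_s)
        counts[(window_start, kept)] += 1
    return counts

# Three hits from three different sessions: three series in the raw data
points = [
    (1000, {"domain": "www.example.com", "sess_id": "a1"}),
    (1000, {"domain": "www.example.com", "sess_id": "b2"}),
    (1200, {"domain": "www.example.com", "sess_id": "c3"}),
]

raw_series = {tuple(sorted(tags.items())) for _, tags in points}
result = downsample(points)

print(len(raw_series))  # 3
print(len(result))      # 1
```

The three raw series collapse into one downsampled series with a count of 3, and nothing session-specific survives.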

The following Flux will query out page views by platform, drop the session identifier, and transform the platform into a tag:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test")
  |> filter(fn: (r) => r._field == "platform")
  |> map(fn: (r) => ({ r with _field: r._value}))
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: count)
  |> map(fn: (r) => ({r with platform: r._field, _field: "viewcount"}))  

However, I'm running InfluxDB 1.8, where Flux doesn't support to() - 2.x is needed for that.

It's not the end of the world though - I had been toying with the idea of having a Python script that runs periodically and generates/mails a report, so that could quite easily run some Flux to query aggregates out for the report and then write them back in.
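As a rough idea of the "write them in" half of such a script, the sketch below turns one aggregate row into an InfluxDB line protocol string. Everything here is an assumption rather than the eventual implementation: the `to_line` and `escape` helpers, the `pf_analytics_downsampled` measurement name and the sample row are hypothetical, and it assumes the write uses second precision. Querying the aggregates and posting the lines are left out.

```python
def escape(value):
    """Escape the characters line protocol treats as special in
    tag keys and tag values: commas, equals signs and spaces."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")

def to_line(measurement, tags, field, count, timestamp_s):
    """Build one line protocol entry with a single integer field.
    The measurement and field names are assumed to need no escaping."""
    tag_part = "".join(
        f",{escape(k)}={escape(v)}" for k, v in sorted(tags.items())
    )
    # The trailing "i" marks the field value as an integer
    return f"{measurement}{tag_part} {field}={count}i {timestamp_s}"

line = to_line(
    "pf_analytics_downsampled",  # hypothetical measurement name
    {"platform": "Linux x86_64", "domain": "www.example.com"},
    "viewcount",
    3,
    1639852200,
)
print(line)
```

Because the session identifier never makes it into the tag set, the points written back carry only the low-cardinality tags.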

mentioned in issue #18