websites/privacy-sensitive-analytics#18: Image endpoint



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: vnext
Created: 03-Jul-22 06:52



Description

A lot of analytics systems have a non-javascript fallback built around a tracking pixel.

The idea is that if JavaScript is blocked/unavailable, the browser instead fetches an image and the server collects request metadata (IP, user-agent, etc.).

I'd like to add something similar but with a much, much narrower scope of collection.

A PFA implementation of this should only collect the referring domain and (where possible) page. In effect, it shouldn't be much more than a hitcounter.



Issue Links

Activity


assigned to @btasker

One of the reasons that I'm interested in building this is so that I can see what percentage of visitors have the main analytics blocked (and how that ratio varies between clearnet, tor and i2p users).

It'll also help identify whether PFA is an effective means for delivering the protections that I also use it for - for example, would doFakeOnionThing() protect more users if it was delivered via other routes separately from analytics?

To implement this, we probably need to come up with a server-side means of increasing cardinality without impacting upon privacy.

If we only tag with domain and page then our series key will be

$measurement_name,domain=www.bentasker.co.uk,page=/,counter

If there are multiple simultaneous views with the same series key, the writes will effectively upsert one another and we'll end up with a count of 1 rather than 4 (or however many there actually were).

This could be mitigated a little by writing the user-agent (or something derived from it) in as a tag

$measurement_name,domain=www.bentasker.co.uk,page=/,browser=firefox,platform=win32,counter

But there are a couple of issues with this approach

  1. By blocking the analytics agent/endpoint, the user has signalled that they don't want data about them collected. It seems wrong to disregard that to solve a technical challenge on our side
  2. It only mitigates the issue a little - if there were simultaneous views by users using the same browser/platform we run into the same thing

We could similarly mitigate by using some or part of the user's subnet, but that runs up against point 1 too (and, if anything, more severely).

In #6 we adjusted the agent to generate a UUID for use as a session identifier to address a similar issue relating to high concurrency.

I think that's the best approach to follow - it'll mean extremely high cardinality in the raw data, but that identifier can be stripped when the data is downsampled.
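To make the collision concrete, here is a minimal Python sketch of the behaviour described above. The in-memory store, the `hits` measurement name and the `sess` values are illustrative assumptions rather than anything from the PFA codebase; the store only models "same series key and timestamp means the last write wins".

```python
# Toy model of a time-series store: a point is identified by its series key
# (measurement + tag set + field) together with its timestamp.
def write(store, measurement, tags, field, value, timestamp):
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    series_key = f"{measurement},{tag_part},{field}"
    # Writing to an existing (series key, timestamp) pair replaces the value
    store[(series_key, timestamp)] = value

ts = 1656831112  # four views that all land on the same timestamp
base_tags = {"domain": "www.bentasker.co.uk", "page": "/"}

# Tagged with domain and page only: every write targets the same point
without_id = {}
for _ in range(4):
    write(without_id, "hits", base_tags, "counter", 1, ts)

# Tagged with an extra per-view identifier (hypothetical values)
with_id = {}
for n in range(4):
    write(with_id, "hits", dict(base_tags, sess=f"view-{n}"), "counter", 1, ts)

collided_total = sum(without_id.values())
distinct_total = sum(with_id.values())
print(collided_total, distinct_total)  # 1 4
```

The extra tag is what drives the cardinality up, which is why it only needs to survive until the data is downsampled.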

Commit f2762510 implemented the agent-side logic, but we can't use that here - we need the calculation to be entirely server side.

verified

mentioned in commit 54971bd8bf662bd1b4f639e40c72f7562214980e

Commit: 54971bd8bf662bd1b4f639e40c72f7562214980e 
Author: B Tasker
Date: 2022-07-03T12:11:52.000+01:00 

Message

Implement support for a hit-count pixel (websites/privacy-sensitive-analytics#18)

This implementation takes information from the Referer header in order to record a hit count for a scheme + domain + page tuple.
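As a rough illustration of reducing a Referer header to a scheme + domain + page tuple, the following Python sketch uses the standard library's `urlsplit`. The function name `referer_to_tags` is hypothetical, and discarding the query string and fragment is an assumption made here to keep the collection narrow; the actual implementation lives in the Nginx/Lua code and may differ.

```python
from urllib.parse import urlsplit

def referer_to_tags(referer):
    """Reduce a Referer header to (scheme, domain, page), or None if unusable."""
    if not referer:
        return None
    parts = urlsplit(referer)
    if not parts.scheme or not parts.hostname:
        return None
    # Assumption: query string and fragment are deliberately thrown away
    return (parts.scheme, parts.hostname, parts.path or "/")

full = referer_to_tags("https://www.bentasker.co.uk/posts/example.html?utm=x#top")
bare = referer_to_tags("https://www.bentasker.co.uk")
missing = referer_to_tags("")
print(full)
print(bare)
print(missing)
```

A request with no usable Referer yields `None`, in which case there is nothing meaningful to count against.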

To ensure hits don't overwrite one another, an identifier is created using only Nginx's information about the request:

local sess_id_components = {
        ngx.var.connection,
        ngx.var.connection_requests,
        ngx.var.pid
}

Nothing in the identifier identifies the user themselves, and the identifier is not guaranteed to be globally unique (nor does it need to be)
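The commit message lists the components but not how they are combined. The Python sketch below assumes they are joined and hashed into a short opaque string; the function name `make_sess_id`, the separator and the truncation length are all illustrative choices, not the commit's actual logic.

```python
import hashlib

def make_sess_id(connection, connection_requests, pid):
    """Combine server-side request counters into a short opaque identifier.

    connection:          Nginx's serial number for the connection
    connection_requests: number of requests made over that connection so far
    pid:                 PID of the worker process handling the request
    """
    joined = "-".join(str(c) for c in (connection, connection_requests, pid))
    # Hashing and truncating are assumptions made for this sketch
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

first = make_sess_id(1042, 1, 3371)
second = make_sess_id(1042, 2, 3371)   # next request on the same connection
other = make_sess_id(1043, 1, 3371)    # a different connection
print(first, second, other)
```

None of the inputs come from the client, so the identifier only needs to differ between requests that arrive close together.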

+131 -10 (141 lines changed)
verified

mentioned in commit a75eb112c47ed5f3b2d1c562c41ba47695cb478c

Commit: a75eb112c47ed5f3b2d1c562c41ba47695cb478c 
Author: B Tasker
Date: 2022-07-03T12:22:38.000+01:00 

Message

Have the counter return the expected gif (websites/privacy-sensitive-analytics#18)

It turns out that Nginx has a built in module (empty_gif) so we use that to return the image

+11 -2 (13 lines changed)
verified

mentioned in commit e3886600451b144116f744ec8143ae7b6d6ff258

Commit: e3886600451b144116f744ec8143ae7b6d6ff258 
Author: B Tasker
Date: 2022-07-03T12:27:39.000+01:00 

Message

Enable writes into the upstream for websites/privacy-sensitive-analytics#18

+1 -1 (2 lines changed)
verified

mentioned in commit d55f6330040e135885ad6c9b065ec9818028a80d

Commit: d55f6330040e135885ad6c9b065ec9818028a80d 
Author: B Tasker
Date: 2022-07-03T12:25:23.000+01:00 

Message

Move to using rewrite_by_lua rather than header_filter_by_lua (websites/privacy-sensitive-analytics#18)

One of the APIs that resty.http relies on isn't available within the context of header_filter_by_lua, so we need to use the other method instead

+1 -1 (2 lines changed)

OK, so the server side logic is built, and I've got a copy of it currently active.

There are two methods by which this can be enabled/deployed within a site.

The first is to just direct link it

<img src="https://pfanalytics.bentasker.co.uk/count.gif">

The second is to update a site's config to serve it from a local path and proxy it through

location /count.gif {

    proxy_pass https://pfanalytics.bentasker.co.uk;
    proxy_set_header Referer $http_referer;
    proxy_set_header Host pfanalytics.bentasker.co.uk;
}

And then embed with

<img src="/count.gif">

Whilst more complex, this approach means you can return sensible Referrer-Policy headers without knackering the efficiency of the site's analytics.

The counter will be triggered whether the user uses the JS agent or not, so if we're interested in how many users didn't have JS we need to do some maths.

The following Flux query will subtract the JS agent derived count from that calculated by the hit counter

hc = from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_pixel")
  |> filter(fn: (r) => r.domain == v.domain)
  |> filter(fn: (r) => r._field == "count")
  |> group()
  |> count()
  // Inject a fake timestamp for use when joining
  |> map(fn: (r) => ({r with _time: 1234}))


js = from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test")
  |> filter(fn: (r) => r.domain == v.domain)
  |> filter(fn: (r) => r._field == "response_time")
  |> filter(fn: (r) => r.action != "\"ready\"")
  |> group()
  |> count()
  // Inject a fake timestamp for use when joining
  |> map(fn: (r) => ({r with _time: 1234}))


join(tables: {js: js, hc: hc}, on: ["_time"])
  // Calculate the delta
  |> map(fn: (r) => ({_value: r._value_hc - r._value_js}))

The figure tells us how many hit the hitcounter but didn't write stats in.
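The same arithmetic, extended to the percentage mentioned earlier in this issue, can be sketched in Python. The function name `non_js_share` and the input figures are made up for illustration and are not real measurements.

```python
def non_js_share(hitcounter_total, js_agent_total):
    """Return (delta, percentage of pixel hits with no JS-agent write)."""
    delta = hitcounter_total - js_agent_total
    if hitcounter_total <= 0:
        # No pixel hits at all, so a percentage would be meaningless
        return delta, None
    return delta, 100.0 * delta / hitcounter_total

# Illustrative figures only
delta, pct = non_js_share(200, 150)
print(delta, pct)  # 50 25.0
```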

It's not perfect - if there are no JS derived stats, then no results will be returned instead of a number. We need join.tables for that, but I'm running InfluxDB 1.8.10 and that predates that Flux package.
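The limitation comes from inner-join semantics: a row only appears if both sides have a matching key. The Python sketch below contrasts that with an outer-style join that defaults the missing side to zero. The dictionaries stand in for the two Flux streams, keyed on the fake `_time` value used in the query, and the function names are hypothetical.

```python
def inner_delta(hc, js):
    """Inner join: keys missing from either side produce no row at all."""
    return {k: hc[k] - js[k] for k in hc if k in js}

def outer_delta(hc, js):
    """Outer join: a missing side is treated as a count of zero."""
    return {k: hc.get(k, 0) - js.get(k, 0) for k in sorted(set(hc) | set(js))}

hc = {1234: 4}   # hit counter saw four views
js = {}          # no JS-derived stats in the range

no_rows = inner_delta(hc, js)
with_default = outer_delta(hc, js)
print(no_rows)       # {}
print(with_default)  # {1234: 4}
```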

Most of the time though, there are going to be at least a few users with JS enabled, so it's not an issue I expect to run into regularly.

The data can be downsampled with the following Flux task

option task = {
    name: "downsample_hitcounter",
    every: 15m,
    offset: 1m,
    concurrency: 1,
}


out_bucket = "websites/analytics"
host="http://192.168.3.84:8086"
token=""

sourcedata = from(bucket: "telegraf/autogen", host: host, token: token)
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "pf_analytics_test_pixel")
    |> drop(columns: ["sess"])
    |> aggregateWindow(every: 15m, fn: sum)
    |> map(fn: (r) => ({ r with
        _field: "hitcount",
        _measurement: "pf_analytics_pixel"
    }))
    |> drop(columns: ["_start", "_stop", "type"])
    |> to(bucket: out_bucket, host: host, token: token)
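For readers less familiar with Flux, the effect of the task can be approximated in Python: discard the `sess` tag, bucket points into 15-minute windows and sum what remains. This is a sketch of the idea rather than a port of the task; the `downsample` function and the sample points are illustrative, and it labels each window by its start time for simplicity.

```python
from collections import defaultdict

WINDOW = 15 * 60  # seconds, matching the task's 15m interval

def downsample(points):
    """points: iterable of (unix_ts, tags, value) -> {(window_start, tags): sum}."""
    totals = defaultdict(int)
    for ts, tags, value in points:
        # Strip the high-cardinality identifier before grouping
        kept = tuple(sorted((k, v) for k, v in tags.items() if k != "sess"))
        window_start = ts - (ts % WINDOW)
        totals[(window_start, kept)] += value
    return dict(totals)

base = 1656849600  # aligned to a 15-minute boundary
tags = {"domain": "www.bentasker.co.uk", "page": "/"}
points = [
    (base + 5, dict(tags, sess="a1"), 1),
    (base + 5, dict(tags, sess="b2"), 1),    # same second, different identifier
    (base + 600, dict(tags, sess="c3"), 1),
    (base + 900, dict(tags, sess="d4"), 1),  # falls into the next window
]

result = downsample(points)
key_tags = tuple(sorted(tags.items()))
print(result[(base, key_tags)], result[(base + 900, key_tags)])  # 3 1
```

Once the identifier is gone, the output has one series per domain and page, so cardinality in the downsampled bucket stays low.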

I've created a Wiki page describing this functionality.

mentioned in issue jira-projects/CDN#22

mentioned in issue #21

marked this issue as related to #21