A lot of analytics systems have a non-JavaScript fallback built around a tracking pixel.
The idea is that if JavaScript is blocked or unavailable, the browser instead fetches an image and the server collects request metadata (IP, user-agent, etc.)
I'd like to add something similar but with a much, much narrower scope of collection.
A PFA implementation of this should only collect the referring domain and (where possible) page. In effect, it shouldn't be much more than a hitcounter.
#21 | Disable the image endpoint |
Activity
03-Jul-22 06:52
assigned to @btasker
03-Jul-22 06:57
One of the reasons that I'm interested in building this is so that I can see what percentage of visitors have the main analytics blocked (and how that ratio varies between clearnet, tor and i2p users).
It'll also help identify whether PFA is an effective means for delivering the protections that I also use it for - for example, would `doFakeOnionThing()` protect more users if it was delivered via other routes separately from analytics?
03-Jul-22 10:30
To implement this, we probably need to come up with a server-side means of increasing cardinality without impacting upon privacy.
If we only tag with domain and page then our series key will be
If there are multiple simultaneous views with the same series key, the writes will effectively upsert one another and we'll end up with a count of 1 rather than 4 (or however many there actually were).
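The overwrite behaviour can be illustrated with a small Python model. This is only a sketch, not InfluxDB itself: it relies on the fact that a point is identified by its measurement, tag set and timestamp, and the measurement name `pf_hitcounter` and the tag values are made up for illustration.

```python
from typing import Dict, Tuple

# (measurement, sorted tag pairs, timestamp) -> fields
Store = Dict[Tuple[str, Tuple[Tuple[str, str], ...], int], Dict[str, int]]


def write_point(store: Store, measurement: str, tags: Dict[str, str],
                fields: Dict[str, int], timestamp: int) -> None:
    """Write a point; one with the same series key and timestamp is overwritten."""
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    store.setdefault(key, {}).update(fields)


store: Store = {}
tags = {"domain": "example.invalid", "page": "/index.html"}  # hypothetical values

# Four views of the same page arriving with the same timestamp
for _ in range(4):
    write_point(store, "pf_hitcounter", tags, {"hits": 1}, timestamp=1656845400)

total_hits = sum(point["hits"] for point in store.values())
print(total_hits)  # 1, not 4
```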
This could be mitigated a little by writing the user-agent (or something derived from it) in as a tag
But there are a couple of issues with this approach
We could similarly mitigate by using some or part of the user's subnet, but that runs up against point 1 too (and, if anything, more severely).
In #6 we adjusted the agent to generate a UUID for use as a session identifier to address a similar issue relating to high concurrency.
I think that's the best approach to follow - it'll mean extremely high cardinality in the raw data, but that identifier can be stripped when the data is downsampled.
Commit f2762510 implemented the agent-side logic, but we can't use that here - we need the calculation to be entirely server side.
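A rough Python sketch of the server-side idea: each hit gets an identifier built from values that say nothing about the visitor, and that identifier is written as an extra tag so simultaneous hits no longer share a series key. The inputs used here (a process-local counter and the receive time) and the tag name `hit_id` are assumptions for illustration, not necessarily what the real implementation uses.

```python
import itertools
import time
from typing import Dict

_request_counter = itertools.count(1)  # process-local, carries no user information


def make_hit_id(now_ns: int) -> str:
    """Build an identifier from server-side state only (receive time + a counter)."""
    return f"{now_ns:x}-{next(_request_counter):x}"


def build_tags(domain: str, page: str, now_ns: int) -> Dict[str, str]:
    """Tag set for one hit; 'hit_id' is a hypothetical tag name."""
    return {"domain": domain, "page": page, "hit_id": make_hit_id(now_ns)}


now_ns = time.time_ns()
# Four simultaneous views of the same page get four distinct tag sets
tag_sets = [build_tags("example.invalid", "/index.html", now_ns) for _ in range(4)]
distinct_ids = {tags["hit_id"] for tags in tag_sets}
print(len(distinct_ids))  # 4
```

The identifier only needs to differ between hits that would otherwise collide, so it does not have to be globally unique.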
03-Jul-22 11:14
mentioned in commit 54971bd8bf662bd1b4f639e40c72f7562214980e
Message
Implement support for a hit-count pixel (websites/privacy-sensitive-analytics#18)
This implementation takes information from the Referer header in order to record a hit count for a scheme + domain + page tuple.
To ensure hits don't overwrite one another, an identifier is created using only Nginx's information about the request:
Nothing in the identifier identifies the user themselves, and the identifier is not guaranteed to be globally unique (nor does it need to be)
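For illustration, reducing a Referer header to a scheme + domain + page tuple might look like the Python sketch below. The function and type names are hypothetical, and dropping the query string, fragment and port is an assumption about what "page" should mean here rather than something taken from the commit.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class HitKey(NamedTuple):
    scheme: str
    domain: str
    page: str


def hit_key_from_referer(referer: Optional[str]) -> Optional[HitKey]:
    """Reduce a Referer header to scheme + domain + page, dropping query and fragment."""
    if not referer:
        return None
    parts = urlsplit(referer)
    if not parts.scheme or not parts.hostname:
        return None  # not an absolute URL, so nothing usable to record
    return HitKey(parts.scheme, parts.hostname, parts.path or "/")


print(hit_key_from_referer("https://example.invalid/posts/a.html?utm=x#top"))
# HitKey(scheme='https', domain='example.invalid', page='/posts/a.html')
```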
03-Jul-22 11:23
mentioned in commit a75eb112c47ed5f3b2d1c562c41ba47695cb478c
Message
Have the counter return the expected gif (websites/privacy-sensitive-analytics#18)
It turns out that Nginx has a built-in module (`empty_gif`) so we use that to return the image
03-Jul-22 11:27
mentioned in commit e3886600451b144116f744ec8143ae7b6d6ff258
Message
Enable writes into the upstream for websites/privacy-sensitive-analytics#18
03-Jul-22 11:27
mentioned in commit d55f6330040e135885ad6c9b065ec9818028a80d
Message
Move to using `rewrite_by_lua` rather than `header_filter_by_lua` (websites/privacy-sensitive-analytics#18)
One of the APIs that `resty.http` relies on isn't available within the context of `header_filter_by_lua` so we need to use the other method instead
03-Jul-22 11:32
OK, so the server side logic is built, and I've got a copy of it currently active.
There are two methods by which this can be enabled/deployed within a site.
The first is to just direct link it
The second is to update a site's config to serve it from a local path and proxy it through
And then embed with
Whilst more complex, this approach means you can return sensible `Referrer-Policy` headers without knackering the efficiency of the site's analytics.
03-Jul-22 12:25
The counter will be triggered whether the user uses the JS agent or not, so if we're interested in how many users didn't have JS we need to do some maths.
The following Flux query will subtract the JS agent derived count from that calculated by the hit counter
The figure tells us how many hit the hitcounter but didn't write stats in.
It's not perfect - if there are no JS derived stats, then `no results` will be returned instead of a number. We need `join.tables` for that, but I'm running InfluxDB 1.8.10 and that predates that Flux package.
Most of the time though, there are going to be at least a few users with JS enabled, so it's not an issue I expect to run into regularly.
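The arithmetic itself, including the case where one side has no JS-derived stats, can be sketched in Python. This is not the Flux query: the counts are invented and the function name is hypothetical. It only shows the subtraction with a missing JS count treated as zero.

```python
from typing import Dict, Tuple

PageKey = Tuple[str, str]  # (domain, page)


def hits_without_js(pixel_hits: Dict[PageKey, int],
                    js_hits: Dict[PageKey, int]) -> Dict[PageKey, int]:
    """Pixel hits minus JS-agent hits per page; a missing JS count is treated as 0."""
    return {key: count - js_hits.get(key, 0) for key, count in pixel_hits.items()}


pixel_hits = {("example.invalid", "/"): 10, ("example.invalid", "/about"): 3}
js_hits = {("example.invalid", "/"): 7}  # no JS-derived stats for /about

result = hits_without_js(pixel_hits, js_hits)
print(result)
# {('example.invalid', '/'): 3, ('example.invalid', '/about'): 3}
```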
03-Jul-22 12:59
The data can be downsampled with the following Flux task
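As a language-neutral illustration of what the downsampling needs to achieve (not the Flux task itself), the Python sketch below sums hits per time window, domain and page, and discards the per-hit identifier along the way. The field names, including `hit_id`, and the 900 second window are assumptions.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

Bucket = Tuple[int, str, str]  # (window start, domain, page)


def downsample(points: Iterable[Dict[str, Any]], window_s: int) -> Dict[Bucket, int]:
    """Sum hits per window/domain/page; the per-hit identifier is not carried over."""
    totals: Counter = Counter()
    for point in points:
        ts = int(point["time"])
        window_start = ts - (ts % window_s)
        totals[(window_start, point["domain"], point["page"])] += int(point["hits"])
    return dict(totals)


points = [
    {"time": 0, "domain": "example.invalid", "page": "/", "hit_id": "a", "hits": 1},
    {"time": 10, "domain": "example.invalid", "page": "/", "hit_id": "b", "hits": 1},
    {"time": 899, "domain": "example.invalid", "page": "/", "hit_id": "c", "hits": 1},
    {"time": 900, "domain": "example.invalid", "page": "/", "hit_id": "d", "hits": 1},
]

rollup = downsample(points, window_s=900)
print(rollup)
# {(0, 'example.invalid', '/'): 3, (900, 'example.invalid', '/'): 1}
```

Because the identifier is dropped in the rollup, the high cardinality is confined to the raw data.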
04-Jul-22 12:01
I've created a Wiki page describing this functionality.
05-Jul-22 08:05
mentioned in issue jira-projects/CDN#22
17-Dec-23 10:48
mentioned in issue #21
17-Dec-23 10:51
marked this issue as related to #21