At the moment, we can only handle 1 analytics write per domain+page, per second.
The reason for this is that our series only consists of these elements:
{
    measurement,
    "domain=" .. string.lower(post['domain']),
    'type="pageview"',
    'page="' .. post['page'] .. '"'
}
So, if two users hit www.example.com/foo in the same second, Influx will overwrite the field values of the first point with those of the second.
That's not likely to be too big a drama with my sites most of the time, but I'd like to have it handled (especially as it's tricky to address - details in comments to follow).
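To make the overwrite behaviour concrete, here is a minimal sketch in Python. It is a toy model, not the InfluxDB client API: it assumes a point is identified by measurement, tag set and timestamp, and it treats domain and page as the identifying elements, as described above. The measurement name `pageviews` and the `referrer` field are hypothetical.

```python
# Toy model of InfluxDB point identity (NOT the real client API).
# Assumption: a point is identified by measurement + tag set + timestamp.

def write_point(store, measurement, tags, fields, timestamp):
    """Write a point; fields of an existing point with the same key are overwritten."""
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    store.setdefault(key, {}).update(fields)


store = {}
# Hypothetical identifying elements for a pageview
tags = {"domain": "www.example.com", "page": "/foo"}

# Two different visitors hit the same page within the same second
write_point(store, "pageviews", tags, {"referrer": "https://a.example/"}, 1639853460)
write_point(store, "pageviews", tags, {"referrer": "https://b.example/"}, 1639853460)

print(len(store))                            # 1
print(list(store.values())[0]["referrer"])   # https://b.example/
```

Only one point survives, and it carries the second visitor's field values, so the first pageview is silently lost.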
Activity
18-Dec-21 18:41
assigned to @btasker
18-Dec-21 18:51
Normally, you'd include something unique to the point in the series - a sensor ID, the user's subnet etc.
That's not directly possible here, and is quite challenging to address, because we're aiming to collect data in a way that doesn't allow the operator to track users through a session (and certainly not between sessions).
We're also trying to keep cardinality down, and having an id unique to the user obviously runs up against that. That's less of an issue in that it can, at least, be addressed when downsampling.
We can't use a time derived identifier - both requests would derive the same.
There are a couple of ways around this: include an identifier in the series, or switch the timestamps to ms precision.
The former has obvious privacy implications. The latter is only really moving the goalposts, and potentially not by much - we use ngx.now(), which uses Nginx's cached time - so two requests could conceivably still get the same timestamp.
18-Dec-21 18:52
I suspect, ultimately, the answer will be that we need to implement the identifier, but I'll implement the less privacy invasive approach first.
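A rough sketch of what the less invasive option buys us, in Python rather than the Lua used server-side. It assumes the clock returns float epoch seconds with a millisecond fraction (as ngx.now() does); the helper name `to_precision` and the sample times are hypothetical.

```python
def to_precision(epoch_seconds, precision):
    """Convert float epoch seconds to an integer timestamp at 's' or 'ms' precision."""
    multiplier = {"s": 1, "ms": 1000}[precision]
    return int(epoch_seconds * multiplier)


# Two hypothetical requests arriving 250ms apart
first, second = 1639853460.125, 1639853460.375

print(to_precision(first, "s") == to_precision(second, "s"))    # True  (collision)
print(to_precision(first, "ms") == to_precision(second, "ms"))  # False (distinct)

# If both requests read the same cached clock value, ms precision does not help
cached = 1639853460.125
print(to_precision(cached, "ms") == to_precision(cached, "ms"))  # True (still collides)
```

The last case is the reason this only reduces the likelihood of a collision rather than removing it.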
18-Dec-21 19:13
mentioned in commit b54df01ce488e4ad8ab37e6d37e5ee7454c7cae2
Message
Switch to using ms to allow for higher concurrency - websites/privacy-sensitive-analytics#6
This reduces the likelihood of two logs for the same page getting the same timestamp (leading to one overwriting the other).
It does not fully remove it
18-Dec-21 19:13
mentioned in commit f2762510fa3e8995f71960135643b53209e641f8
Message
Implement the ability to use a session specific psuedo-identifier for websites/privacy-sensitive-analytics#6
This is for use when it's expected there will be more than 1 hit per-page, per millisecond.
It does have a privacy impact (as a user's browsing within a session can be identified) and it does impact cardinality at the storage end
18-Dec-21 19:15
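I've implemented both: timestamps now use ms precision, and if window.analytics_gen_psuedoid is true then a pseudo-id will be included.
The latter is off by default, and if enabled does come with some impact: a user's browsing within a session can be identified, and cardinality increases at the storage end.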
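As a rough illustration of the cardinality side of that trade-off, here is a Python sketch. It is not the committed implementation: the id format (`secrets.token_hex`), the `sess` tag name and the `series_key` helper are all assumptions made for the example.

```python
import secrets


def new_pseudo_id():
    """Generate a random, session-scoped identifier (hypothetical format)."""
    return secrets.token_hex(8)


def series_key(domain, page, pseudo_id=None):
    """Build a simplified series key, optionally including the pseudo-id."""
    parts = ["domain=" + domain.lower(), "page=" + page]
    if pseudo_id is not None:
        parts.append("sess=" + pseudo_id)  # hypothetical tag name
    return ",".join(parts)


# Three hypothetical sessions viewing the same page
sessions = [new_pseudo_id() for _ in range(3)]

without_id = {series_key("www.example.com", "/foo") for _ in sessions}
with_id = {series_key("www.example.com", "/foo", s) for s in sessions}

# One shared series without the id; one series per session with it
print(len(without_id), len(with_id))
```

Each session gets its own series, which avoids the overwrite but means series count grows with the number of sessions until the data is downsampled.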
18-Dec-21 20:07
The following flux will query out referring domains and transform the domain into a tag
But, I'm running Influx 1.8, and Flux doesn't support to() there - need 2.x for that.
It's not the end of the world though - I had been toying with the idea of having a Python script that runs periodically and generates/mails a report, so that could quite easily run some Flux to query aggregates out for the report and then write them in.
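A minimal sketch of the aggregation half of such a script, assuming the referrer URLs have already been queried out into a list. The querying and writing are deliberately left out, and the measurement name `pageview_referrers`, the `referrer_domain` tag and the `hits` field are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_domain(referrer):
    """Return the lowercased hostname of a referrer URL, or 'none' if there isn't one."""
    host = urlsplit(referrer).hostname if referrer else None
    return host or "none"


def aggregate_referrers(referrers):
    """Count hits per referring domain."""
    return Counter(referrer_domain(r) for r in referrers)


def to_line_protocol(measurement, counts, timestamp):
    """Render counts as line protocol with the domain as a tag.

    Hostnames need no escaping here; arbitrary tag values would.
    """
    return [
        f"{measurement},referrer_domain={domain} hits={count}i {timestamp}"
        for domain, count in sorted(counts.items())
    ]


rows = [
    "https://news.example.com/a",
    "https://News.Example.com/b",
    "",
    "https://search.example.org/?q=x",
]
lines = to_line_protocol("pageview_referrers", aggregate_referrers(rows), 1639857600)
for line in lines:
    print(line)
```

The resulting lines could then be written back by whatever client the report script ends up using.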
03-Jul-22 10:30
mentioned in issue #18