
websites/privacy-sensitive-analytics#3: Bad domain list



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: 0.1
Created: 18-Dec-21 09:49



Description

Currently, the Lua doesn't apply any filtering on the submitting domain - so requests could come in claiming to relate to "foobar", and "foobar" would end up being a filter.

Should add a way to whitelist domains on the Lua side (preferably through Nginx config) - if requests come in for some other domain, I'd like those recorded separately (so there's a way to investigate them, without tripping up the core analytics).




Activity


assigned to @btasker

verified

mentioned in commit 4a2a0ebdea1f01ecde09877e574ff9adcdae4012

Commit: 4a2a0ebdea1f01ecde09877e574ff9adcdae4012 
Author: B Tasker                            
                            
Date: 2021-12-18T10:52:33.000+00:00 

Message

Only write stats for whitelisted domains for websites/privacy-sensitive-analytics#3

+49 -1 (50 lines changed)
verified

mentioned in commit 48259ea8c40e516598cb10787b7bf1f3176e14c3

Commit: 48259ea8c40e516598cb10787b7bf1f3176e14c3 
Author: B Tasker                            
                            
Date: 2021-12-18T10:56:51.000+00:00 

Message

Update the example Nginx config to include the settings for websites/privacy-sensitive-analytics#3

+8 -1 (9 lines changed)
verified

mentioned in commit 24a62aad616c729cf317d7eb007282aa11c66597

Commit: 24a62aad616c729cf317d7eb007282aa11c66597 
Author: B Tasker                            
                            
Date: 2021-12-18T10:54:32.000+00:00 

Message

Add support for a list of domains to skip for websites/privacy-sensitive-analytics#3

The underlying idea being that there may be domains that we don't want to waste our time writing into the unauthorised measurement - just reject them upfront

+6 -0 (6 lines changed)

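This change requires that a couple of variables be set in the Nginx config:

# Comma-separated list of domains that stats may be written for
set $permitted_domains '';

# Comma-separated list of domains to reject upfront (nothing is written for them)
set $skip_domains '';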

If a domain isn't in permitted_domains and isn't in skip_domains then a record will be written into a separate measurement ($measurement_unauth) to allow visibility for further investigation.

If a domain is in skip_domains that won't happen - the idea being that if there's a repeat offender we might want to suppress them.
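
The three outcomes described above can be sketched as follows. This is a minimal Python illustration of the decision logic only, not the Lua from the commits: the function names, the lowercasing and whitespace trimming, and the choice to check permitted_domains before skip_domains are all assumptions made for the example.

```python
def parse_domain_list(raw):
    """Split a comma-separated setting into a set of domain names.

    Trimming whitespace and lowercasing are assumptions for this sketch.
    """
    return {d.strip().lower() for d in raw.split(",") if d.strip()}


def classify_domain(domain, permitted_domains, skip_domains):
    """Return 'main', 'skip' or 'unauth' for a submitted domain."""
    domain = domain.strip().lower()
    if domain in parse_domain_list(permitted_domains):
        return "main"    # stats go to the normal measurement
    if domain in parse_domain_list(skip_domains):
        return "skip"    # rejected upfront, nothing is written
    return "unauth"      # recorded in the separate unauthorised measurement


if __name__ == "__main__":
    permitted = "www.example.com, example.com"
    skipped = "spam.example"
    for d in ("www.example.com", "spam.example", "foobar"):
        print(d, "->", classify_domain(d, permitted, skipped))
```

With both settings left empty, as in the example config, every domain falls through to the unauthorised measurement in this sketch.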

To mitigate the impact of this junk traffic on cardinality, there are very few tags in the _unauth writes - the expectation is that reports will pivot as necessary to pull out the information.
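
As a rough illustration of why this keeps cardinality down: in InfluxDB line protocol, tags are indexed and each new tag value creates a new series, whereas field values do not. The Python sketch below builds a line with the submitted domain as a string field and no tags at all. The helper names are hypothetical and the real writes may carry other fields or a small number of tags; only the "domain" field name is taken from the query used in this issue.

```python
def escape_field_string(value):
    """Escape a string field value for line protocol (backslash and double quote)."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def unauth_line(measurement, domain, timestamp_ns):
    """Build a line-protocol record with the domain as a field, not a tag.

    Assumes the measurement name needs no escaping (no commas or spaces).
    """
    return f'{measurement} domain="{escape_field_string(domain)}" {timestamp_ns}'


if __name__ == "__main__":
    print(unauth_line("pf_analytics_test_unauth", "foobar", 1639821153000000000))
```

Because the junk value only ever appears as a field, an attacker submitting thousands of made-up domains adds points but not series.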

The following Flux query lists the bad domains and how many page views each one resulted in:

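from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  // Unauthorised writes carry the submitted domain as a field
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_unauth" and r._field == "domain")
  // One row per page view, keyed on the domain
  |> map(fn: (r) => ({ ref: r._value, views: 1 }))
  |> group(columns: ["ref"])
  |> sum(column: "views")
  // Merge the per-domain tables so the sort orders across domains
  |> group()
  |> sort(columns: ["views"], desc: true)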
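
For checking the result outside InfluxDB, the aggregation amounts to counting records per domain and ordering by count. A small Python sketch, assuming the unauthorised records have already been exported as dictionaries with a "domain" key (the export format and function name are hypothetical):

```python
from collections import Counter


def views_per_domain(records):
    """Count page views per submitted domain, most-viewed first."""
    counts = Counter(r["domain"] for r in records)
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"domain": "foobar"},
        {"domain": "spam.example"},
        {"domain": "foobar"},
    ]
    for domain, views in views_per_domain(sample):
        print(domain, views)
```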