#10 Bad Domain investigative reporting : websites/privacy-sensitive-analytics#10

Issue Type: issue

Status: closed

Reported By: btasker

Assigned To: btasker

Project: Websites / Privacy Sensitive Analytics

Milestone: 0.2

Created: 28-Mar-22 09:33

Labels: New Feature, Resolution::Fixed/Done

Description

Had, err, fun over the weekend (jira-projects/CDN#17).

I'd like to make adjustments to the agent so that we collect some additional information when a bad domain is detected.

Not just because of that issue (it's a serious edge-case), but also because I quite regularly see videos.bentasker.co.uk appear in the bad domains list, and I've no idea why. So, I'd like to update the agent to try and tell me a bit more in both cases.

The aim is to identify the source of the traffic, not the user/visitor.

Currently we capture:

time
page path
domain
platform
referrer (if available)
referrer domain (if available)
response time
timezone

I'd like to expand that data-set for bad domains:

Page title
User-agent
Page HTML (maybe?)
Referrer

I've added Referrer into the second list because I think, for a bad domain, we should change the way we handle it. Ordinarily, we blank it if the referrer is on the same domain:

    var referrer_domain = '';
    var referrer = document.referrer;
    if (referrer.startsWith(document.location.protocol + "//" + document.location.hostname)){
        // In-site navigation, blank the referrer - we're not looking to stalk users around the site
        referrer = '';
    }

But, where the site has been mirrored (for example, that Cellebrite issue) there's a good chance everything will be within a single domain. So, it'd be useful to remove this restriction for bad domains.
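
A minimal sketch of how that could look, assuming a flag saying whether the current domain is a bad one is available when the referrer is processed. The `filter_referrer` helper and the `is_bad_domain` parameter are hypothetical names for illustration, not taken from the agent:

```javascript
// Hypothetical helper: decide which referrer value to report.
// is_bad_domain is assumed to be worked out elsewhere by the agent.
function filter_referrer(referrer, protocol, hostname, is_bad_domain) {
    if (is_bad_domain) {
        // A mirrored site will likely keep everything on one domain,
        // so the in-site referrer is kept for investigation
        return referrer;
    }
    if (referrer.startsWith(protocol + "//" + hostname)) {
        // In-site navigation on a legitimate domain, blank the referrer
        return '';
    }
    return referrer;
}
```

In the browser this would be called along the lines of `filter_referrer(document.referrer, document.location.protocol, document.location.hostname, is_bad_domain)`.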

Activity

btasker
28-Mar-22 09:33

assigned to @btasker

btasker
28-Mar-22 09:40

I've got mixed feelings about capturing the Page HTML - it's one of those things that's a waste of bandwidth/storage right up until it isn't.

It would have been interesting in jira-projects/CDN#17 to see what they were viewing. Similarly, it'd be useful to see what a clickjacker is serving to others, and it might help me figure out why videos.bentasker.co.uk keeps showing up in the list.

We can grab it with document.documentElement.outerHTML, though we might want to think about compressing it (the challenge then is decompressing it to make it usable in reporting).
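
As a rough sketch of the collection step, something like the following could gather the extra fields. The helper name, the `user_agent`, `page_html` and `html_truncated` field names, and the `max_html_length` cap are all assumptions for illustration; `doc` and `nav` stand in for the browser's `document` and `navigator`:

```javascript
// Hypothetical helper: gather the extra fields for a bad-domain report.
// doc and nav are passed in so the function can be exercised outside a browser.
function collect_bad_domain_details(doc, nav, max_html_length) {
    var html = doc.documentElement.outerHTML;
    return {
        page_title: doc.title,
        user_agent: nav.userAgent,
        // Cap the markup so a very large page cannot produce an unbounded report
        page_html: html.substring(0, max_html_length),
        html_truncated: html.length > max_html_length
    };
}
```

Capping the length is only one way of limiting the bandwidth/storage cost; it doesn't replace compression, it just bounds the worst case.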

btasker
28-Mar-22 10:01

If we do capture page HTML, I'd like to be able to configure/define exclusions to this.

For example, we know there will be lots of onion.ws and onion.ly in there (because Tor2Web). There's precious little benefit in sending page HTML for most of those.
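
One way such exclusions might work is a simple suffix match against a configured list. This is only a sketch: the `html_exclusions` list and the `should_send_html` helper are hypothetical names, not existing agent configuration:

```javascript
// Hypothetical configuration: domain suffixes for which page HTML is not sent
var html_exclusions = ["onion.ws", "onion.ly"];

// Returns false if hostname is an excluded domain or a subdomain of one
function should_send_html(hostname, exclusions) {
    var host = hostname.toLowerCase();
    for (var i = 0; i < exclusions.length; i++) {
        var suffix = exclusions[i].toLowerCase();
        // Require a dot boundary so "notonion.ws" is not caught by "onion.ws"
        if (host === suffix || host.endsWith("." + suffix)) {
            return false;
        }
    }
    return true;
}
```

With this, a Tor2Web hostname ending in `.onion.ws` would still be reported as a bad domain, just without the page HTML attached.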

btasker
28-Mar-22 10:10

As a low effort solution for compression, I guess we use this - https://github.com/pieroxy/lz-string/blob/master/libs/lz-string.js - and then base64 it (although btoa doesn't seem to like its output).
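
The likely reason btoa objects is that it only accepts characters in the 0-255 range, while a compressor that packs its output into 16-bit characters will produce values above that. A sketch of one workaround is to split each UTF-16 code unit into two bytes before encoding. The helper names below are hypothetical; lz-string also provides its own `compressToBase64` method, which may avoid the problem altogether:

```javascript
// Split every UTF-16 code unit into two bytes (high byte first) so that
// btoa() only ever sees character values between 0 and 255.
function utf16_to_base64(str) {
    var binary = '';
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        binary += String.fromCharCode(code >> 8, code & 0xff);
    }
    return btoa(binary);
}

// Reverse of utf16_to_base64: rebuild each code unit from its two bytes
function base64_to_utf16(b64) {
    var binary = atob(b64);
    var out = '';
    for (var i = 0; i < binary.length; i += 2) {
        out += String.fromCharCode((binary.charCodeAt(i) << 8) | binary.charCodeAt(i + 1));
    }
    return out;
}
```

Working on code units rather than going through UTF-8 means unpaired surrogates in the compressed string survive the round trip, at the cost of doubling the size before base64 is applied.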

btasker
28-Mar-22 12:49

verified

mentioned in commit 09a4797baae12c6b6be95c5955f494967841d552

Commit: 09a4797baae12c6b6be95c5955f494967841d552
Author: B Tasker
Date: 2022-03-28T13:48:34.000+01:00

Message

Implement collection of additional information when a bad domain is detected - see websites/privacy-sensitive-analytics#10

+171 -6 (177 lines changed)

btasker
28-Mar-22 14:17

At this point, collection works but we end up trying to write fairly massive blobs into the DB.

I decided it's better to write the files to disk on Mikasa and then simply push a URL where the image can be viewed.

root@mikasa:/etc/nginx/domains.d# mkdir /mnt/images
root@mikasa:/etc/nginx/domains.d# chown www-data:www-data /mnt/images/

btasker
28-Mar-22 14:21

OK, this has tested OK, so I think we roll a release and then look at improving reporting in the next version.

btasker
28-Mar-22 14:31

verified

mentioned in commit ae06788c235a481d6daf180f9815bf20096eee26

Commit: ae06788c235a481d6daf180f9815bf20096eee26
Author: B Tasker
Date: 2022-03-28T15:16:56.000+01:00

Message

Write screenshot to disk rather than trying to push it upstream. See websites/privacy-sensitive-analytics#10

+13 -1 (14 lines changed)

btasker
28-Mar-22 14:53

For now, the data can be queried out with:

import "strings"

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_enhanced_unauth")
  |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> filter(fn: (r) => strings.containsStr(v: r.domain, substr: v.domain_name))
  |> keep(columns: ["_time","page","domain","page_title","referrer","timezone","screenshot"])
  |> group()
  |> sort()

#15	Bad Domains reporting should handle file:// scheme
#14	Remove screen capture ability
