project Websites / Privacy Sensitive Analytics

websites/privacy-sensitive-analytics#10: Bad Domain investigative reporting



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: 0.2
Created: 28-Mar-22 09:33



Description

Had, err, fun over the weekend (jira-projects/CDN#17).

I'd like to make adjustments to the agent so that we collect some additional information when a bad domain is detected.

Not just because of that issue (it's a serious edge-case), but also because I quite regularly see videos.bentasker.co.uk appear in the bad domains list, and I've no idea why. So, I'd like to update the agent to try and tell me a bit more in both cases.

The aim is to identify the source of the traffic, not the user/visitor.

Currently we capture:

  • time
  • page path
  • domain
  • platform
  • referrer (if available)
  • referrer domain (if available)
  • response time
  • timezone

I'd like to expand that data-set for bad domains:

  • Page title
  • User-agent
  • Page HTML (maybe?)
  • Referrer

I've added Referrer to the second list because I think, for a bad domain, we should change the way we handle it. Ordinarily, we blank it if the referrer is on the same domain:

    var referrer_domain = '';
    var referrer = document.referrer;
    if (referrer.startsWith(document.location.protocol + "//" + document.location.hostname)){
        // In-site navigation, blank the referrer - we're not looking to stalk users around the site
        referrer = '';
    }

But where the site has been mirrored (for example, that Cellebrite issue), there's a good chance everything will be within a single domain. So, it'd be useful to remove this restriction for bad domains.
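A minimal sketch of how that could look (expected_domains and is_bad_domain are illustrative names, not part of the current agent):

    // Illustrative only: treat any hostname not in our expected list as a bad domain
    var expected_domains = ["www.bentasker.co.uk", "videos.bentasker.co.uk"];
    var is_bad_domain = (expected_domains.indexOf(document.location.hostname) === -1);

    var referrer = document.referrer;
    if (!is_bad_domain && referrer.startsWith(document.location.protocol + "//" + document.location.hostname)){
        // In-site navigation on a known domain - blank the referrer as before
        referrer = '';
    }
    // For bad domains, the referrer is kept even when it's same-domain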



Issue Links


Activity


assigned to @btasker

I've got mixed feelings about capturing the Page HTML - it's one of those things that's a waste of bandwidth/storage right up until it isn't.

It would have been interesting in jira-projects/CDN#17 to see what they were viewing; similarly, it'd be useful to see what a clickjacker is serving to others, and it might help me figure out why videos.bentasker.co.uk keeps showing up in the list.

We can grab it with document.documentElement.outerHTML, though we might want to think about compressing it (the challenge there being decompressing it again to make it usable in reporting).

If we do capture page HTML, I'd like to be able to configure/define exclusions to this.

For example, we know there will be lots of onion.ws and onion.ly in there (because Tor2Web). There's precious little benefit in sending page HTML for most of those.
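Something along these lines would do it; the names here (html_capture_exclusions, shouldCaptureHTML) are hypothetical:

    // Hypothetical exclusion config: hostname suffixes we never send page HTML for
    var html_capture_exclusions = ["onion.ws", "onion.ly"];

    function shouldCaptureHTML(hostname) {
        // Skip capture when the hostname ends with an excluded suffix
        for (var i = 0; i < html_capture_exclusions.length; i++) {
            if (hostname.endsWith(html_capture_exclusions[i])) {
                return false;
            }
        }
        return true;
    }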

As a low-effort solution for compression, I guess we use this - https://github.com/pieroxy/lz-string/blob/master/libs/lz-string.js - and then base64 it (although btoa doesn't seem to like its output).
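It turns out lz-string also ships a compressToBase64() helper, which sidesteps btoa entirely (compress() packs its output into UTF-16 code units that btoa can't encode). A rough sketch, assuming lz-string.js is loaded on the page:

    // LZString.compressToBase64() handles the base64 step itself, avoiding
    // btoa's InvalidCharacterError on compress()'s UTF-16 output
    var html = document.documentElement.outerHTML;
    var compressed = LZString.compressToBase64(html);

    // The reporting side can reverse it with:
    // var original = LZString.decompressFromBase64(compressed);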


mentioned in commit 09a4797baae12c6b6be95c5955f494967841d552

Commit: 09a4797baae12c6b6be95c5955f494967841d552
Author: B Tasker
Date: 2022-03-28T13:48:34.000+01:00

Message

Implement collection of additional information when a bad domain is detected - see websites/privacy-sensitive-analytics#10

+171 -6 (177 lines changed)

At this point, collection works but we end up trying to write fairly massive blobs into the DB.

I decided it's better to write the files to disk on Mikasa and just push a URL at which the image can be viewed.

root@mikasa:/etc/nginx/domains.d# mkdir /mnt/images
root@mikasa:/etc/nginx/domains.d# chown www-data:www-data /mnt/images/

This has tested OK, so I think we roll a release and then look at improving reporting in the next version.


mentioned in commit ae06788c235a481d6daf180f9815bf20096eee26

Commit: ae06788c235a481d6daf180f9815bf20096eee26
Author: B Tasker
Date: 2022-03-28T15:16:56.000+01:00

Message

Write screenshot to disk rather than trying to push it upstream. See websites/privacy-sensitive-analytics#10

+13 -1 (14 lines changed)

For now, the data can be queried out with:

import "strings"

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_enhanced_unauth")
  |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> filter(fn: (r) => strings.containsStr(v: r.domain, substr: v.domain_name))
  |> keep(columns: ["_time","page","domain","page_title","referrer","timezone","screenshot"])
  |> group()
  |> sort(columns: ["_time"])
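(v.timeRangeStart comes from the dashboard time picker; v.domain_name is assumed here to be a custom dashboard variable holding the domain to filter on.)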

mentioned in issue #12

mentioned in issue #14

mentioned in issue #15

marked this issue as related to #15

marked this issue as related to #14