PAS-3: Allow configuration of "interesting" Referrers



Issue Information

Issue Type: New Feature
Priority: Major
Status: Open

Reported By: Ben Tasker
Assigned To: Ben Tasker
Project: PCAP Analysis Script (PAS)
Resolution: Unresolved
Affects Version: 0.1
Target version: 0.1
Labels: Referrer, RequestPath

Created: 2015-11-22 09:31:15
Time Spent Working
Estimated: 90 minutes
Remaining: 40 minutes
Logged: 50 minutes


Description
There should be a means to configure regexes to match potentially interesting paths/hosts.

For example, if you wanted to extract which subreddits a user has visited on Reddit, you'd probably want to configure something like the following (to be matched against Referer strings, because Reddit is served over HTTPS and so its request paths aren't visible in the capture):
https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)


Ideally, the regex will be used against any paths identified (i.e. port 80 GETs as well as Referer strings).
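As a rough illustration of the idea (a sketch in Python rather than the script's own shell, with a made-up non-matching URL and a hypothetical helper name `subreddits`), the pattern from the description can be run over a list of observed strings and the subreddit pulled out of the second capture group:

```python
import re

# Pattern copied from the description; group 2 holds the subreddit name.
SUBREDDIT_RE = re.compile(r"https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)")

def subreddits(urls):
    """Return the subreddit names found in an iterable of URL strings."""
    found = []
    for url in urls:
        match = SUBREDDIT_RE.search(url)
        if match:
            found.append(match.group(2))
    return found

# Sample input: one Referer value that should match and one URL that should not.
sample = [
    "https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/",
    "http://example.com/index.html",
]
print(subreddits(sample))
```

Because this pattern requires the https:// prefix, it would only ever match Referer values; request paths taken from port 80 traffic would need a pattern without that prefix.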



Activity


This is something that we'll likely want to be able to configure per-run. As it could be quite long, I don't feel comfortable turning it into a command line argument.

I think the way around it is to have a hardcoded default, with a means to override it in a way that doesn't make doing git pull more difficult.

The easiest way is probably to have a configuration file (excluded in .gitignore) which, if present, will be used to override the hardcoded value. That also opens up the possibility of doing the same with other elements/classes of traffic.
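A minimal sketch of that override approach, written in Python purely for illustration (the script itself is shell, and the default pattern, the file name `interesting.cfg` and the function `load_pattern` are all assumptions rather than what was committed):

```python
import os
import tempfile

# Hypothetical hardcoded default; the script's real default is not quoted in this issue.
DEFAULT_PATTERN = r"^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/r\/([^\/]*)"

def load_pattern(config_path, default=DEFAULT_PATTERN):
    """Return the first non-empty, non-comment line of config_path.

    Falls back to the default when the file is absent or has no usable line.
    """
    if not os.path.isfile(config_path):
        return default
    with open(config_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                return line
    return default

with tempfile.TemporaryDirectory() as tmp:
    override = os.path.join(tmp, "interesting.cfg")  # hypothetical file name
    # No override file yet, so the hardcoded default applies.
    print(load_pattern(override) == DEFAULT_PATTERN)
    with open(override, "w", encoding="utf-8") as handle:
        handle.write("# local override\n^www\\.example\\.com\n")
    # The override file now wins over the default.
    print(load_pattern(override))
```

Keeping the override in a file that git ignores means local patterns never show up as modifications to tracked files, so a pull can't conflict with them.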
Although I'll probably raise a separate FR for it, it might also be interesting to do something similar for cookies. As shown here - https://www.bentasker.co.uk/documentation/security/313-ipb-nothing-to-hide-and-nothing-to-fear-but-you-can-still-nob-off - when LinkedIn set their Google Analytics cookies, they store the referring domain and path, which opens the possibility that we can also extract paths the user has visited prior to packet captures commencing.

If we look at the full string for that particular cookie:
__utmz=23068709.1445974378.1.1.utmcsr=mail.google.com|utmccn=(referral)|utmcmd=referral|utmcct=/mail/u/0/;

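We can see when the cookie was set too, as the second dot-separated field is a Unix timestamp (the first, 23068709, is the domain hash and isn't part of the time):
ben@milleniumfalcon:/tmp$ date -d @1445974378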
Tue Oct 27 19:32:58 GMT 2015

Although LinkedIn are a specific culprit, we can extract this from anything using Google Analytics over HTTP.
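To make the cookie idea a bit more concrete, here's a sketch (Python, for illustration only; `parse_utmz` is a hypothetical helper, and the field layout is my understanding of the classic Google Analytics __utmz format) that pulls the set time, referring domain and path out of the value quoted above:

```python
from datetime import datetime, timezone

def parse_utmz(value):
    """Split a __utmz cookie value into its set time and campaign fields."""
    # Assumed layout:
    # <domain hash>.<set time>.<session count>.<campaign count>.<campaign data>
    # maxsplit=4 keeps dots inside the campaign data (e.g. mail.google.com) intact.
    _domain_hash, set_time, _sessions, _campaigns, campaign = value.split(".", 4)
    fields = dict(item.split("=", 1) for item in campaign.split("|"))
    return {
        "set_at": datetime.fromtimestamp(int(set_time), tz=timezone.utc),
        "source": fields.get("utmcsr"),   # referring domain
        "medium": fields.get("utmcmd"),
        "path": fields.get("utmcct"),     # referring path
    }

# Cookie value from the comment above, without the name and trailing semicolon.
cookie = ("23068709.1445974378.1.1.utmcsr=mail.google.com|utmccn=(referral)"
          "|utmcmd=referral|utmcct=/mail/u/0/")
info = parse_utmz(cookie)
print(info["set_at"].isoformat(), info["source"], info["path"])
```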
Commit b1e8539 implements extraction of interesting paths and referers into the tempdir. At the moment, though, it introduces a lot of noise.

Using a (broad) search of
(www|np|m|i)\.reddit\.com\/r\/([^\/]*)
matches far too much, so each pattern is now anchored to the start of the string:
^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/r\/([^\/]*)|^((https:\/\/|http:\/\/)?)www\.google\.|^((https:\/\/|http:\/\/)?)www\.bbc\.co\.uk

While that's manageable, it's going to make for a pretty big string once you start adding multiple regexes to it.

I guess it'll do as a starting point though
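One way to keep the big string manageable is to maintain the patterns as a list and only join them at the point of use. A sketch, in Python for illustration (the names `PREFIX`, `PATTERNS`, `combine` and `is_interesting` are hypothetical, and the patterns are modelled on the ones in this comment rather than copied from the script):

```python
import re

# Optional scheme, anchored to the start of the string.
PREFIX = r"^((https:\/\/|http:\/\/)?)"

# One entry per site of interest; adding a site means adding a line.
PATTERNS = [
    PREFIX + r"(www|np|m|i)\.reddit\.com\/r\/([^\/]*)",
    PREFIX + r"www\.google\.",
    PREFIX + r"www\.bbc\.co\.uk",
]

def combine(patterns):
    """Join individual patterns into a single alternation."""
    return "|".join(patterns)

COMBINED = re.compile(combine(PATTERNS))

def is_interesting(url):
    """True if the URL matches any configured pattern at its start."""
    return COMBINED.search(url) is not None

print(is_interesting("https://www.reddit.com/r/awww/comments/3u2s90/"))
print(is_interesting("http://example.com/?u=www.google.com"))
```

Because every alternative carries its own ^ anchor, a hostname that only appears later in the string (for example in a query string) doesn't count as a match, which is where most of the noise from an unanchored search would come from.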

Repo: PCAPAnalyseandReport
Commit: 869220d0297b64c289b4ea2d4eddebaba2a74518
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 15:52:49 2015 +0000
Commit Message: Started implementing extraction of interesting referers/paths for PAS-3



Modified (-)(+)
-------
.gitignore
PCAP_Analysis.sh




Webhook User-Agent

GitHub-Hookshot/333881f




Repo: PCAPAnalyseandReport
Commit: 072be929ede9c1031659732f3eb9236dfd4d6cec
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 15:57:53 2015 +0000
Commit Message: Forced matches to be at beginning of string. See PAS-3



Modified (-)(+)
-------
PCAP_Analysis.sh






Repo: PCAPAnalyseandReport
Commit: a5c3d5da0dcee0f46081f6b9e4003684278401e7
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 15:58:34 2015 +0000
Commit Message: Updated reddit matches to include user profiles. See PAS-3



Modified (-)(+)
-------
PCAP_Analysis.sh





I've added t.co links as an interesting referer as it's a possible route to identifying which Twitter accounts a person follows (where the news is either niche or new enough), and may lead to identifying the person's Twitter handle if you're willing to put in the time to cross-compare follower lists between multiple accounts.

It's entirely reliant on the t.co link redirecting to an HTTP site (as t.co itself is HTTPS), though as far too much of the web is still HTTP, that's not a huge limitation.
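As an illustration of what that looks like in practice (Python sketch; the pattern is an assumption written in the style of the other defaults, not the one actually committed, and `tco_links` plus the sample hosts are made up), each plain-HTTP request carrying a t.co Referer gives us a link/destination pair:

```python
import re

# Assumed pattern; the one added to the script is not quoted in this issue.
TCO_RE = re.compile(r"^((https:\/\/|http:\/\/)?)t\.co\/([^\/?#]+)")

def tco_links(requests):
    """Yield (t.co link, destination host) for requests referred by t.co.

    `requests` is an iterable of (host, referer) pairs taken from plain-HTTP
    traffic; referer may be None when the header was absent.
    """
    for host, referer in requests:
        match = TCO_RE.search(referer or "")
        if match:
            yield match.group(0), host

# Made-up observations for the example.
sample = [
    ("news.example.com", "https://t.co/AbC123"),
    ("example.org", "https://www.google.com/"),
    ("example.net", None),
]
print(list(tco_links(sample)))
```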

Commit 8dad133 refers


Repo: PCAPAnalyseandReport
Commit: 8dad133c2b198cdc2e71ff11624cf3998ed4c25c
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:06:12 2015 +0000
Commit Message: Added t.co links to the default matching patterns. See PAS-3



Modified (-)(+)
-------
PCAP_Analysis.sh





btasker changed timespent from '0 minutes' to '30 minutes'

Repo: PCAPAnalyseandReport
Commit: adab7e93f025adf7dcd4483f98ddcc72f3663e5f
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:42:59 2015 +0000
Commit Message: Documented config override method implemented in PAS-3



Added (+)
-------
Docs/OverridingConfiguration.md





I've implemented generation of two reports, interestingdomains.csv and interestingdomains-full.csv (see commit 87b90f0).

The first contains a list of unique matches (i.e. contains only the exact match to whichever regex it matched), the other contains extended detail.

To use the example given in the documentation (https://github.com/bentasker/PCAPAnalyseandReport/blob/master/Docs/OverridingConfiguration.md)

If we have a pattern of
^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/(r|u)\/([^\/]*)


And the observed traffic contains an HTTP Referer header containing https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/

Then we'll have the following contents

- interestingdomains.csv
https://www.reddit.com/r/awww


- interestingdomains-full.csv
https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/       HTTP Referer


The possible values for the second column, at time of writing are

- HTTP Referer
- HTTP Request

The latter means we observed someone performing a request (GET, POST or whatever) against that URL, rather than inferring the visit from a Referer header.
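The relationship between the two reports can be sketched like this (Python, for illustration only; `build_reports` is a hypothetical helper, and writing the rows out to the CSV files is left out because the script's exact delimiter isn't stated here):

```python
import re

# Pattern from the documentation example above.
PATTERN = re.compile(
    r"^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/(r|u)\/([^\/]*)"
)

def build_reports(observations):
    """Return (unique_matches, full_rows) for (url, source) observations.

    unique_matches holds only the text the regex matched, de-duplicated in
    first-seen order; full_rows keeps the whole URL plus where it was seen.
    """
    unique, full = [], []
    for url, source in observations:
        match = PATTERN.search(url)
        if not match:
            continue
        if match.group(0) not in unique:
            unique.append(match.group(0))
        full.append((url, source))
    return unique, full

observations = [
    ("https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/", "HTTP Referer"),
    ("http://example.com/", "HTTP Request"),  # made-up URL, not interesting, so dropped
]
unique, full = build_reports(observations)
print(unique)
print(full)
```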

When we get around to examining Cookies as described above, we can simply add a third possible value to that column.

Will need to update the Reports documentation to reflect the change

Repo: PCAPAnalyseandReport
Commit: 87b90f056863149e747f38a99c493f05bd8801ea
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:49:24 2015 +0000
Commit Message: Started building output report for PAS-3



Modified (-)(+)
-------
PCAP_Analysis.sh





Given it's possible to override the config now, I've removed the regexes that were introduced solely to aid testing. Commit 948a0fd refers.
btasker changed timespent from '30 minutes' to '50 minutes'

Repo: PCAPAnalyseandReport
Commit: 948a0fd0ac557f44550183f06fb20c237a8cf903
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:53:15 2015 +0000
Commit Message: Removed regex's for domains used only for testing. See PAS-3



Modified (-)(+)
-------
PCAP_Analysis.sh





Assuming the current test-run completes OK, I think this is pretty much implemented. I'll raise a new FR later for the cookie parsing/identification, as that's likely to be a bit more involved and I don't want to flood this issue with irrelevant updates.
btasker changed labels from 'Referrer' to 'Referrer RequestPath'

Repo: PCAPAnalyseandReport
Commit: b7e36ffb25e248ae4c0c8a0fa525843eff99262c
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Nov 27 00:26:56 2015 +0000
Commit Message: Updated documentation for PAS-3 and PAS-18



Modified (-)(+)
-------
Docs/OverridingConfiguration.md
Docs/Reports.md





Work log


Ben Tasker
2015-11-26 16:12:18

Time Spent: 30 minutes
Log Entry: Implementing and testing

Ben Tasker
2015-11-26 16:54:25

Time Spent: 20 minutes
Log Entry: Tweaking, documenting and re-testing