##########################################################################################

       PAS-3: Allow configuration of "interesting" Referrers

##########################################################################################


Issue Type: New Feature 
-----------------------------------------------------------------------------------------

Issue Information
====================

Priority: Major					Status:      Open
Resolution:  Unresolved
Project: PCAP Analysis Script (PAS)


Reported By: btasker					
Assigned To: btasker

Components: 
	- HTTP 
	- SSL/TLS 
	- Data Correlation and Extraction 

Affected Versions:		
	- 0.1  			


Targeted for fix in version: 
	- 0.1

Labels: Referrer, RequestPath, 

Time Estimate: 40 minutes
Time Logged:   50 minutes


-----------------------------------------------------------------------------------------

Issue Description
==================

There should be a means to configure regexes to match potentially interesting
paths/hosts.

For example, if you wanted to extract which subreddits a user has visited on Reddit,
you'd probably want to configure something like the following (to be matched against
referer strings, as Reddit is HTTPS):

-- BEGIN SNIPPET --

https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)

 -- END SNIPPET --

Ideally, the regex will be used against any paths identified (i.e. port 80 GETs as well
as referer strings).
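
As a rough illustration of the idea (not the script's actual implementation), the sketch
below applies the pattern above to a list of observed URLs using `grep -oE`. The function
name `match_interesting` and the sample URLs are made up for the example, and the forward
slashes are left unescaped because `grep -E` does not need them escaped.

```shell
# Hypothetical pattern, equivalent to the one above.
INTERESTING_RE='https://(www|np|m|i)\.reddit\.com/r/([^/]*)'

# Read URLs on stdin and print only the portion of each line that matches.
match_interesting() {
    grep -oE "$INTERESTING_RE"
}

printf '%s\n' \
    'https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/' \
    'https://www.example.com/index.html' \
    | match_interesting
```

Only the first URL matches, so this prints `https://www.reddit.com/r/awww`.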

-----------------------------------------------------------------------------------------

Issue Relations
================
	
	- relates to PAS-18: Extract interesting paths from Cookies
	- relates to PAS-23: Allow per directory override of configuration


-----------------------------------------------------------------------------------------

Activity
==========

	
-----------------------------------------------------------------------------------------
2015-11-26 13:22:11              btasker
-----------------------------------------------------------------------------------------

This is something that we'll likely want to be able to configure per-run. As it could be
quite long, I don't feel comfortable turning it into a command line argument.

I think the way around it is to have a hardcoded default, with a means to override it in
a way that doesn't make doing _git pull_ more difficult.

The easiest way is probably to have a configuration file (excluded in .gitignore) which, if
present, will be used to override the hardcoded value. This also opens up the possibility of
doing the same with other elements/classes of traffic.
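
A minimal sketch of that override approach, assuming the configuration file is itself a
shell fragment that gets sourced. The variable name `INTERESTING_PATHS`, the filename
`config.sh` and the helper `load_config` are placeholders for illustration, not
necessarily what the script uses.

```shell
# Hardcoded default, used when no override file is present.
INTERESTING_PATHS='https://(www|np|m|i)\.reddit\.com/r/([^/]*)'

# Hypothetical filename; it would be listed in .gitignore so that local
# edits never conflict with a git pull.
CONFIG_FILE="${CONFIG_FILE:-./config.sh}"

# If the override file exists, source it so that any variables it sets
# replace the defaults above.
load_config() {
    if [ -f "$1" ]; then
        . "$1"
    fi
}

load_config "$CONFIG_FILE"
echo "Using pattern: $INTERESTING_PATHS"
```

Because the file is sourced, any other default defined before the `load_config` call
could be overridden in the same way.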
	
-----------------------------------------------------------------------------------------
2015-11-26 14:21:39              btasker
-----------------------------------------------------------------------------------------

Although I'll probably raise a separate FR for it, it might also be interesting to do
something similar for cookies. As shown here -
https://www.bentasker.co.uk/documentation/security/313-ipb-nothing-to-hide-and-nothing-to-fear-but-you-can-still-nob-off
- when LinkedIn set their Google Analytics cookies, they store the referring domain and
path, which opens up the possibility of also extracting paths the user visited before
the packet capture commenced.

If we look at the full string for that particular cookie:

-- BEGIN SNIPPET --

__utmz=23068709.1445974378.1.1.utmcsr=mail.google.com|utmccn=(referral)|utmcmd=referral|utmcct=/mail/u/0/;

 -- END SNIPPET --
We can also see when the cookie was set, as the second dot-separated field is a Unix
timestamp:

-- BEGIN SNIPPET --

ben@milleniumfalcon:/tmp$ date -d@1445974378
Tue Oct 27 19:32:58 GMT 2015

 -- END SNIPPET --
Although LinkedIn are a specific culprit, we can extract this from anything using Google
Analytics over HTTP.
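
As a sketch of how that extraction might look, the function below splits a `__utmz`
value into its timestamp, referring host (`utmcsr`) and path (`utmcct`). It assumes the
layout seen in the cookie above (domain hash, timestamp, two counters, then
pipe-separated campaign data); the function name `parse_utmz` and the comma-separated
output are choices made purely for this example.

```shell
# Print "timestamp,host/path" for a __utmz cookie string.
parse_utmz() {
    value=${1#__utmz=}      # drop the cookie name
    value=${value%;}        # drop a trailing semicolon, if any

    # Dot-separated fields: domain hash, timestamp, two counters, campaign data.
    timestamp=$(printf '%s\n' "$value" | cut -d. -f2)
    campaign=$(printf '%s\n' "$value" | cut -d. -f5-)

    # Campaign data is a pipe-separated list of key=value pairs.
    ref_host=$(printf '%s\n' "$campaign" | tr '|' '\n' | sed -n 's/^utmcsr=//p')
    ref_path=$(printf '%s\n' "$campaign" | tr '|' '\n' | sed -n 's/^utmcct=//p')

    printf '%s,%s%s\n' "$timestamp" "$ref_host" "$ref_path"
}

parse_utmz '__utmz=23068709.1445974378.1.1.utmcsr=mail.google.com|utmccn=(referral)|utmcmd=referral|utmcct=/mail/u/0/;'
```

For the cookie above this prints `1445974378,mail.google.com/mail/u/0/`.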
	
-----------------------------------------------------------------------------------------
2015-11-26 15:56:38              btasker
-----------------------------------------------------------------------------------------

Commit _b1e8539_ implements extraction of interesting paths and referers into the tempdir.
At the moment though it introduces a _lot_ of noise.

Using a (broad) search of

-- BEGIN SNIPPET --

(www|np|m|i)\.reddit\.com\/r\/([^\/]*)|www\.google\.|www\.bbc\.co\.uk

 -- END SNIPPET --

Gives a lot of unexpected results, because some of the requests have www.bbc.co.uk encoded
into the request URI:

-- BEGIN SNIPPET --

edigitalsurvey.com/l.php?id=INS-642345567&v=7038&x=1280&y=1024&d=24&c=null&ck=1&p=%2Fnews%2Fbusiness-32916968&ref=https%3A%2F%2Fwww.google.co.uk%2F&fu=http%3A%2F%2Fwww.bbc.co.uk%2Fnews%2Fbusiness-32916968&xdm=edr&xdm_o=http%3A%2F%2Fwww.bbc.co.uk&xdm_c=edr0

 -- END SNIPPET --

It'd be simple enough to change the filter so that the match has to occur at the
beginning of the string, but that does make the filters more complex: the string built
for request paths doesn't include the scheme (i.e. http/https), whilst referer strings
do, so we'd need to use something like

-- BEGIN SNIPPET --

^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/r\/([^\/]*)|^((https:\/\/|http:\/\/)?)www\.google\.|^((https:\/\/|http:\/\/)?)www\.bbc\.co\.uk

 -- END SNIPPET --
Which, while manageable, is going to make for a pretty big string once you start adding
multiple regexes to it.

I guess it'll do as a starting point though.
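
One way to keep the combined string manageable would be to store the host patterns
separately and generate the anchored alternation from them. This is only a sketch: the
names `SCHEME_PREFIX` and `build_regex` are hypothetical, `https?://` is used as a
shorter equivalent of the scheme group above, and the survey URL is a shortened version
of the one shown earlier.

```shell
# Optional scheme prefix shared by every pattern, so the same regex matches
# both request paths (no scheme) and referer strings (with scheme).
SCHEME_PREFIX='^(https?://)?'

# Join the patterns given as arguments into one alternation, anchoring each
# one at the start of the string.
build_regex() {
    combined=''
    for pattern in "$@"; do
        combined="${combined:+$combined|}${SCHEME_PREFIX}${pattern}"
    done
    printf '%s\n' "$combined"
}

REGEX=$(build_regex \
    '(www|np|m|i)\.reddit\.com/r/([^/]*)' \
    'www\.google\.' \
    'www\.bbc\.co\.uk')

# The survey URL only contains the interesting hosts mid-string, so it is skipped.
printf '%s\n' \
    'www.bbc.co.uk/news/business-32916968' \
    'edigitalsurvey.com/l.php?ref=https%3A%2F%2Fwww.google.co.uk%2F&fu=http%3A%2F%2Fwww.bbc.co.uk' \
    | grep -oE "$REGEX"
```

This prints `www.bbc.co.uk` for the first line only. Adding another site is then a
matter of appending one short pattern rather than editing the full string.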
	
-----------------------------------------------------------------------------------------
2015-11-26 15:57:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: 869220d0297b64c289b4ea2d4eddebaba2a74518
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 15:52:49 2015 +0000
Commit Message: Started implementing extraction of interesting referers/paths for PAS-3


Modified (-)(+)
-------
.gitignore
PCAP_Analysis.sh


-- END QUOTE --

*Webhook User-Agent*


-- BEGIN SNIPPET --

GitHub-Hookshot/333881f
 -- END SNIPPET --

https://github.com/bentasker/PCAPAnalyseandReport/commit/869220d0297b64c289b4ea2d4eddebaba2a74518


-----------------------------------------------------------------------------------------
2015-11-26 15:59:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: 072be929ede9c1031659732f3eb9236dfd4d6cec
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 15:57:53 2015 +0000
Commit Message: Forced matches to be at beginning of string. See PAS-3


Modified (-)(+)
-------
PCAP_Analysis.sh


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/072be929ede9c1031659732f3eb9236dfd4d6cec


-----------------------------------------------------------------------------------------
2015-11-26 15:59:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: a5c3d5da0dcee0f46081f6b9e4003684278401e7
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 15:58:34 2015 +0000
Commit Message: Updated reddit matches to include user profiles. See PAS-3


Modified (-)(+)
-------
PCAP_Analysis.sh


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/a5c3d5da0dcee0f46081f6b9e4003684278401e7


-----------------------------------------------------------------------------------------
2015-11-26 16:06:34              btasker
-----------------------------------------------------------------------------------------

I've added t.co links as an interesting referer, as it's a possible route to identifying
which Twitter accounts a person follows (where the news is either niche or new enough),
and may lead to identifying the person's Twitter handle if you're willing to put in the
time to cross-compare follower lists between multiple accounts.

It's entirely reliant on the t.co link redirecting to an HTTP site (as t.co itself is
HTTPS), though as far too much of the web is still HTTP, that's not a huge limitation.

Commit _8dad133_ refers


-----------------------------------------------------------------------------------------
2015-11-26 16:07:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: 8dad133c2b198cdc2e71ff11624cf3998ed4c25c
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:06:12 2015 +0000
Commit Message: Added t.co links to the default matching patterns. See PAS-3


Modified (-)(+)
-------
PCAP_Analysis.sh


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/8dad133c2b198cdc2e71ff11624cf3998ed4c25c


-----------------------------------------------------------------------------------------
2015-11-26 16:12:19              
-----------------------------------------------------------------------------------------

btasker changed timespent from '0 minutes' to '30 minutes'
	
-----------------------------------------------------------------------------------------
2015-11-26 16:43:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: adab7e93f025adf7dcd4483f98ddcc72f3663e5f
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:42:59 2015 +0000
Commit Message: Documented config override method implemented in PAS-3


Added (+)
-------
Docs/OverridingConfiguration.md


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/adab7e93f025adf7dcd4483f98ddcc72f3663e5f


-----------------------------------------------------------------------------------------
2015-11-26 16:49:40              btasker
-----------------------------------------------------------------------------------------

I've implemented generation of two reports, _interestingdomains.csv_ and
_interestingdomains-full.csv_ (see Commit _87b90f0_).

The first contains a list of unique matches (i.e. only the exact text matched by
whichever regex it matched); the other contains extended detail.

To use the example given in the documentation
(https://github.com/bentasker/PCAPAnalyseandReport/blob/master/Docs/OverridingConfiguration.md)

If we have a pattern of 

-- BEGIN SNIPPET --

^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/(r|u)\/([^\/]*)

 -- END SNIPPET --

And the observed traffic contains an HTTP Referer header containing
https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/

Then we'll have the following contents:

- interestingdomains.csv

-- BEGIN SNIPPET --

https://www.reddit.com/r/awww

 -- END SNIPPET --

- interestingdomains-full.csv

-- BEGIN SNIPPET --

https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/       HTTP Referer

 -- END SNIPPET --

The possible values for the second column, at the time of writing, are:

- HTTP Referer
- HTTP Request

Where the latter is us observing someone performing a GET/POST/whatever against that URL,
rather than inferring it from a Referer header.

When we get around to examining cookies as described above, we can simply add a third
possible value to that column.

Will need to update the Reports documentation to reflect the change.
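
The relationship between the two reports can be sketched as below. This is an
illustration rather than the code from the commit: the helpers `record_url` and
`finalise_reports`, the intermediate `.tmp` file and the tab separator are all
assumptions made for the example.

```shell
# Pattern from the example above, with slashes unescaped for grep -E.
REGEX='^(https?://)?(www|np|m|i)\.reddit\.com/(r|u)/([^/]*)'

# Hypothetical helper: record one observed URL and where it was seen.
#   $1 = URL, $2 = source ("HTTP Referer" or "HTTP Request"), $3 = output directory
record_url() {
    match=$(printf '%s\n' "$1" | grep -oE "$REGEX") || return 0   # not interesting, skip
    printf '%s\n' "$match" >> "$3/interestingdomains.tmp"
    printf '%s\t%s\n' "$1" "$2" >> "$3/interestingdomains-full.csv"
}

# Collapse the collected matches into a list of unique values.
finalise_reports() {
    sort -u "$1/interestingdomains.tmp" > "$1/interestingdomains.csv"
    rm -f "$1/interestingdomains.tmp"
}

outdir=$(mktemp -d)
record_url 'https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/' 'HTTP Referer' "$outdir"
record_url 'www.reddit.com/r/awww/' 'HTTP Request' "$outdir"
record_url 'www.example.com/' 'HTTP Request' "$outdir"
finalise_reports "$outdir"
cat "$outdir/interestingdomains.csv"
```

Here the short report ends up with two entries (the same subreddit seen with and
without a scheme), while the full report keeps one line per interesting URL along with
its source.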
	
-----------------------------------------------------------------------------------------
2015-11-26 16:51:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: 87b90f056863149e747f38a99c493f05bd8801ea
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:49:24 2015 +0000
Commit Message: Started building output report for PAS-3


Modified (-)(+)
-------
PCAP_Analysis.sh


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/87b90f056863149e747f38a99c493f05bd8801ea


-----------------------------------------------------------------------------------------
2015-11-26 16:53:28              btasker
-----------------------------------------------------------------------------------------

Given it's possible to override the config now, I've removed the regexes that were
introduced solely to aid testing. Commit _948a0fd_ refers.
	
-----------------------------------------------------------------------------------------
2015-11-26 16:54:25              
-----------------------------------------------------------------------------------------

btasker changed timespent from '30 minutes' to '50 minutes'
	
-----------------------------------------------------------------------------------------
2015-11-26 16:55:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: 948a0fd0ac557f44550183f06fb20c237a8cf903
Author: Ben Tasker <github@<Domain Hidden>>

Date: Thu Nov 26 16:53:15 2015 +0000
Commit Message: Removed regex's for domains used only for testing. See PAS-3


Modified (-)(+)
-------
PCAP_Analysis.sh


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/948a0fd0ac557f44550183f06fb20c237a8cf903


-----------------------------------------------------------------------------------------
2015-11-26 16:56:40              btasker
-----------------------------------------------------------------------------------------

Assuming the current test-run completes OK, I think this is pretty much implemented. I'll
raise a new FR later for the cookie parsing/identification, as that's likely to be a bit
more involved and I don't want to flood this issue with irrelevant updates.
	
-----------------------------------------------------------------------------------------
2015-11-26 16:56:57              
-----------------------------------------------------------------------------------------

btasker changed labels from 'Referrer' to 'Referrer RequestPath'
	
-----------------------------------------------------------------------------------------
2015-11-27 00:27:41              git
-----------------------------------------------------------------------------------------


-- BEGIN QUOTE --


Repo: PCAPAnalyseandReport
Commit: b7e36ffb25e248ae4c0c8a0fa525843eff99262c
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Nov 27 00:26:56 2015 +0000
Commit Message: Updated documentation for PAS-3 and PAS-18


Modified (-)(+)
-------
Docs/OverridingConfiguration.md
Docs/Reports.md


-- END QUOTE --

https://github.com/bentasker/PCAPAnalyseandReport/commit/b7e36ffb25e248ae4c0c8a0fa525843eff99262c


-----------------------------------------------------------------------------------------

Worklog
========

 
-----------------------------------------------------------------------------------------
2015-11-26 16:12:18              btasker

30 minutes
-----------------------------------------------------------------------------------------

Implementing and testing
 
-----------------------------------------------------------------------------------------
2015-11-26 16:54:25              btasker

20 minutes
-----------------------------------------------------------------------------------------

Tweaking, documenting and re-testing