##########################################################################################
PAS-3: Allow configuration of "interesting" Referrers
##########################################################################################

Issue Type: New Feature
-----------------------------------------------------------------------------------------

Issue Information
====================

Priority: Major
Status: Open
Resolution: Unresolved
Project: PCAP Analysis Script (PAS)
Reported By: btasker
Assigned To: btasker

Components:
- HTTP
- SSL/TLS
- Data Correlation and Extraction

Affected Versions:
- 0.1

Targeted for fix in version:
- 0.1

Labels: Referrer, RequestPath

Time Estimate: 40 minutes
Time Logged: 50 minutes
-----------------------------------------------------------------------------------------

Issue Description
==================

There should be a means to configure regexes to match potentially interesting paths/hosts.

For example, if you wanted to extract which subreddits a user has visited on Reddit, you'd probably want to configure something like the following (to be matched against referer strings, as Reddit is HTTPS)

-- BEGIN SNIPPET --
https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)
-- END SNIPPET --

Ideally, the regex will be used against any paths identified (i.e. port 80 GETs as well as referer strings).

-----------------------------------------------------------------------------------------

Issue Relations
================

- relates to PAS-18: Extract interesting paths from Cookies
- relates to PAS-23: Allow per directory override of configuration

-----------------------------------------------------------------------------------------

Activity
==========

-----------------------------------------------------------------------------------------
2015-11-26 13:22:11 btasker
-----------------------------------------------------------------------------------------

This is something that we'll likely want to be able to configure per-run.
As it could be quite long, I don't feel comfortable turning it into a command line argument.

I think the way around it is to have a hardcoded default, with a means to override it in a way that doesn't make doing _git pull_ more difficult.

The easiest way is probably to have a configuration file (excluded in .gitignore) which, if present, will be used to override the hardcoded value.

This also opens up the possibility of doing the same with other elements/classes of traffic.

-----------------------------------------------------------------------------------------
2015-11-26 14:21:39 btasker
-----------------------------------------------------------------------------------------

Although I'll probably raise a separate FR for it, it might also be interesting to do something similar for cookies.

As shown here - https://www.bentasker.co.uk/documentation/security/313-ipb-nothing-to-hide-and-nothing-to-fear-but-you-can-still-nob-off - when LinkedIn set their Google Analytics cookies, they store the referring domain and path, which opens the possibility that we can also extract paths the user has visited prior to packet captures commencing.

If we look at the full string for that particular cookie

-- BEGIN SNIPPET --
__utmz=23068709.1445974378.1.1.utmcsr=mail.google.com|utmccn=(referral)|utmcmd=referral|utmcct=/mail/u/0/;
-- END SNIPPET --

We can see when the cookie was set too (the second dot-separated field is a Unix timestamp)

-- BEGIN SNIPPET --
ben@milleniumfalcon:/tmp$ date -d@1445974378.23068709
Tue Oct 27 19:32:58 GMT 2015
-- END SNIPPET --

Although LinkedIn are a specific culprit, we can extract this from anything using Google Analytics over HTTP

-----------------------------------------------------------------------------------------
2015-11-26 15:56:38 btasker
-----------------------------------------------------------------------------------------

Commit _b1e8539_ implements extraction of interesting paths and referers into the tempdir.

At the moment though it introduces a _lot_ of noise.
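
As a rough illustration of how that kind of noise can arise, here is a minimal Python sketch. It is not the script's own implementation: the `BROAD` and `ANCHORED` constants, the `interesting` helper and the sample strings are all made up for illustration. An unanchored search reports a hit wherever an interesting host appears in the string, including inside a URL-encoded query parameter, whereas requiring the match at the start of the string does not.

```python
import re

# Hypothetical pattern, modelled on the kind of broad search discussed in this ticket.
BROAD = r"(www|np|m|i)\.reddit\.com/r/([^/]*)|www\.google\.|www\.bbc\.co\.uk"

# Same alternatives, but only allowed at the start of the string. The scheme
# is optional so the pattern can be applied to strings with or without one.
ANCHORED = r"^(?:https?://)?(?:" + BROAD + r")"


def interesting(candidates, pattern):
    """Return the candidates in which the pattern matches (using re.search)."""
    regex = re.compile(pattern)
    return [c for c in candidates if regex.search(c)]


# Made-up sample data: a referer, a plain request path, and a third-party
# request that merely carries an interesting host URL-encoded in its query.
samples = [
    "https://www.reddit.com/r/linux/comments/abc123/",
    "www.bbc.co.uk/news/business-32916968",
    "tracker.example/l.php?fu=http%3A%2F%2Fwww.bbc.co.uk%2Fnews",
]

print(interesting(samples, BROAD))     # all three, including the tracker hit
print(interesting(samples, ANCHORED))  # only the first two
```
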
Using a (broad) search of

-- BEGIN SNIPPET --
(www|np|m|i)\.reddit\.com\/r\/([^\/]*)|www\.google\.|www\.bbc\.co\.uk
-- END SNIPPET --

gives a lot of unexpected results, because some of the requests have www.bbc.co.uk encoded into the request URI:

-- BEGIN SNIPPET --
edigitalsurvey.com/l.php?id=INS-642345567&v=7038&x=1280&y=1024&d=24&c=null&ck=1&p=%2Fnews%2Fbusiness-32916968&ref=https%3A%2F%2Fwww.google.co.uk%2F&fu=http%3A%2F%2Fwww.bbc.co.uk%2Fnews%2Fbusiness-32916968&xdm=edr&xdm_o=http%3A%2F%2Fwww.bbc.co.uk&xdm_c=edr0
-- END SNIPPET --

It'd be simple enough to change the filter so that the match has to occur at the beginning of the string, but that does make the filters more complex: the string built for request paths doesn't include the scheme (i.e. http/https), whilst referer strings do, so we'd need to use something like

-- BEGIN SNIPPET --
^((https:\/\/|http:\/\/)?)(www|np|m|i)\.reddit\.com\/r\/([^\/]*)|^((https:\/\/|http:\/\/)?)www\.google\.|^((https:\/\/|http:\/\/)?)www\.bbc\.co\.uk
-- END SNIPPET --

Which, while manageable, is going to make for a pretty big string once you start adding multiple regexes to it.

I guess it'll do as a starting point though

-----------------------------------------------------------------------------------------
2015-11-26 15:57:41 git
-----------------------------------------------------------------------------------------

-- BEGIN QUOTE --
Repo: PCAPAnalyseandReport
Commit: 869220d0297b64c289b4ea2d4eddebaba2a74518
Author: Ben Tasker