There should be a means of configuring regexes to match potentially interesting paths/hosts.
For example, if you wanted to extract which subreddits a user has visited on Reddit, you'd probably want to configure something like the following (to be matched against referer strings, as Reddit is HTTPS):
https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)
Ideally, the regex will be used against any paths identified (i.e. port 80 GETs as well as referer strings).
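As a rough illustration of the idea (not the project's actual implementation), the sketch below applies the Reddit pattern from above to a mixed list of observed strings. The function name `find_interesting` and the sample values are hypothetical.

```python
import re

# Pattern copied from the ticket; the escaped slashes are harmless in Python.
SUBREDDIT_RE = re.compile(r"https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)")


def find_interesting(candidates):
    """Return (matched_text, subreddit) pairs for every candidate that matches.

    `candidates` is assumed to be any iterable of strings, e.g. referer
    headers and request paths gathered elsewhere.
    """
    results = []
    for value in candidates:
        match = SUBREDDIT_RE.search(value)
        if match:
            results.append((match.group(0), match.group(2)))
    return results


# Hypothetical sample data: one referer string and one plain request path.
observed = [
    "https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/",
    "www.example.com/index.html",
]
hits = find_interesting(observed)
```

Only the referer string matches here, because the pattern requires the `https://` scheme, which request paths don't carry.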
Activity
2015-11-26 13:22:11
I think the way around it is to have a hardcoded default, with a means to override it in a way that doesn't make doing git pull more difficult.
The easiest way is probably to have a configuration file (excluded in .gitignore) which, if present, will be used to override the hardcoded value. That also opens up the possibility of doing the same with other elements/classes of traffic.
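A minimal sketch of that override approach, assuming a JSON file holding a list of patterns. The file name `local_patterns.json`, the JSON format and the function name `load_patterns` are all assumptions made for illustration; the project may well use a different format.

```python
import json
import os

# Hardcoded default, shipped with the repository.
DEFAULT_PATTERNS = [r"https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)"]


def load_patterns(override_path="local_patterns.json"):
    """Return patterns from the override file if it exists, else the defaults.

    The override file is assumed to be listed in .gitignore, so local edits
    never conflict with upstream changes on `git pull`.
    """
    if os.path.isfile(override_path):
        with open(override_path, encoding="utf-8") as handle:
            return json.load(handle)
    # Return a copy so callers cannot mutate the module-level default.
    return list(DEFAULT_PATTERNS)
```

Because the defaults live in tracked code and the overrides live in an untracked file, updating the tool never touches a user's local patterns.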
2015-11-26 14:21:39
If we look at the full string for that particular cookie
We can see when the cookie was set too
Although LinkedIn are a specific culprit, we can extract this from anything using Google Analytics over HTTP.
2015-11-26 15:56:38
Using a (broad) search of
Gives a lot of unexpected results, because some of the requests have www.bbc.co.uk encoded into the request URI:
It'd be simple enough to change the filter so it has to occur at the beginning of the string, but that does make the filters more complex, as the string built for request paths doesn't include the scheme (i.e. http/https), whilst referer strings do, so we'd need to use something like
Which, while manageable, is going to make for a pretty big string once you start adding multiple regexes to it.
I guess it'll do as a starting point though
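To illustrate the anchoring problem, here is a hedged sketch that prefixes an optional scheme and a start-of-string anchor onto a list of host patterns. The helper `build_filter`, the exact shape of the combined expression and the sample strings are assumptions, not the filter actually committed.

```python
import re


def build_filter(host_patterns):
    """Combine host patterns into one regex anchored at the start of the string.

    The scheme is optional because request-path strings omit it while
    referer strings include it.
    """
    joined = "|".join(host_patterns)
    return re.compile(r"^(https?:\/\/)?(" + joined + ")")


def matching(pattern, values):
    """Return the values that the compiled pattern matches."""
    return [value for value in values if pattern.search(value)]


# Hypothetical observations: a request path, a referer, and a third-party
# request that merely has the BBC hostname encoded in its URI.
samples = [
    "www.bbc.co.uk/news",
    "http://www.bbc.co.uk/news",
    "tracker.example.com/log?u=www.bbc.co.uk%2Fnews",
]

broad = re.compile(r"www\.bbc\.co\.uk")
anchored = build_filter([r"www\.bbc\.co\.uk"])
```

The broad pattern matches all three samples, whereas the anchored one skips the third-party request. Joining many patterns with `|` is what makes the final string grow quickly.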
2015-11-26 15:57:41
Webhook User-Agent
View Commit
2015-11-26 15:59:41
2015-11-26 16:06:34
It's entirely reliant on the t.co link redirecting to an HTTP site (as t.co is HTTPS), though as far too much of the web is still HTTP, that's not a huge limitation.
Commit 8dad133 refers
2015-11-26 16:07:41
2015-11-26 16:12:19
2015-11-26 16:43:41
2015-11-26 16:49:40
The first contains a list of unique matches (i.e. contains only the exact match to whichever regex it matched), the other contains extended detail.
To use the example given in the documentation (https://github.com/bentasker/PCAPAnalyseandReport/blob/master/Docs/OverridingConfiguration.md)
If we have a pattern of
And the observed traffic contains an HTTP referer header containing https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/
Then we'll have the following contents
- interestingdomains.csv
- interestingdomains-full.csv
The possible values for the second column, at the time of writing, are
- HTTP Referer
- HTTP Request
Where the latter is us observing someone performing a GET/POST/whatever against that URL, rather than a match based on a Referer header.
When we get around to examining Cookies as described above, we can simply add a third possible value to that column.
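The two-report idea can be sketched roughly as follows. This is an illustration only: the function name `write_reports` and the column layout of the full report (match, source, observed value) are assumptions, with just the second column's meaning taken from the description above.

```python
import csv
import io
import re


def write_reports(observations, patterns, unique_out, full_out):
    """Write a unique-match report and an extended-detail report.

    `observations` is an iterable of (source, value) pairs, where source is
    a label such as "HTTP Referer" or "HTTP Request".
    """
    compiled = [re.compile(pattern) for pattern in patterns]
    seen = set()
    unique_writer = csv.writer(unique_out)
    full_writer = csv.writer(full_out)
    for source, value in observations:
        for regex in compiled:
            match = regex.search(value)
            if not match:
                continue
            matched = match.group(0)
            # The unique report lists each exact match only once.
            if matched not in seen:
                seen.add(matched)
                unique_writer.writerow([matched])
            # The full report records every hit along with where it was seen.
            full_writer.writerow([matched, source, value])
            break  # the first matching pattern wins


# In-memory buffers stand in for interestingdomains.csv and
# interestingdomains-full.csv.
unique_buf, full_buf = io.StringIO(), io.StringIO()
write_reports(
    [("HTTP Referer", "https://www.reddit.com/r/awww/comments/3u2s90/no_i_didnt_drink_it/")],
    [r"https:\/\/(www|np|m|i)\.reddit\.com\/r\/([^\/]*)"],
    unique_buf,
    full_buf,
)
```

Adding a cookie-derived source later would only mean passing observations with a third label; the writer itself would not need to change.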
Will need to update the Reports documentation to reflect the change
2015-11-26 16:51:41
2015-11-26 16:53:28
2015-11-26 16:54:25
2015-11-26 16:55:41
2015-11-26 16:56:40
2015-11-26 16:56:57
2015-11-27 00:27:41