MISC-2: Make bentasker.co.uk Available as Tor Hidden Service



Issue Information

Issue Type: New Feature
 
Priority: Minor
Status: Closed

Reported By: Ben Tasker
Assigned To: Ben Tasker
Project: Miscellaneous (MISC)
Resolution: Done (2015-05-22 15:57:59)
Affects Version: Bentasker.co.uk via Tor
Target version: Bentasker.co.uk via Tor
Labels: bentasker.co.uk, Design, onion, Tor, Website

Created: 2015-05-16 19:04:29


Description
Want to give some thought to whether it's a good idea to also make the site available as a Tor hidden service (HS).

I don't want the Tor client running on the main server for testing, but it could be run on the dev server with an NGinx reverse proxy set up and then moved across if/once it goes live.

That would also allow for tor specific tweaks (like flat out denying any attempt to access administration pages - I generally connect to those via VPN anyway).

I don't need the anonymity protection of a HS for bentasker.co.uk, but it's possible that there may be people who'd rather read via a HS than over the clearnet - this is also, very much, a test-in-principle for another site with a similar set up.

Need to assess the risks, design the setup and test well before making the address publicly available.

If anything, bentasker.co.uk should present a few more challenges than the site this will eventually be targeted at.


Issue Links

Tor-Talk Thread

Subtasks

MISC-3: Adjust the cdnforjoomla plugin to optionally translate to a .onion
MISC-4: Build custom error pages
MISC-5: Design the method of working with the anti-abuse scripts
MISC-6: Review 3rd party resources/code called from within the site

Activity


One thing to definitely think about (at the very least, it'll need to be mitigated) - if a tor2web node gets indexed by Google, it might result in a duplicate content penalty. So might need to make sure the HS presents a different robots.txt to keep search engines off it.

Also need to configure how Admin tools will behave - if a user repeatedly tries to compromise the front-end, it's GERA's IP that will be blocked.

Will also need to make sure all URLs within the site are relative (they should be) so that people don't get redirected to the clearnet.
Have started a thread on tor-talk laying out the issues I can foresee to see if anyone can think of anything else (https://lists.torproject.org/pipermail/tor-talk/2015-May/037816.html).

Some very good responses so far, addressing things I hadn't thought of (and mitigating some of those I had in ways I hadn't thought of, or didn't know were possible).

The plan, at this point, is as follows (one comment per section to try and keep it readable)
1. Main site uses HTTPS
The Tor client will forward port 80 to an HTTP reverse proxy (listening only on localhost) which will then proxy onto the main site via HTTPS.
In doing so, it'll make a couple of changes when going upstream

- Host header will be changed (obviously)
- Insert a header to denote the source is the .onion (more on that in a bit)
- Certain content might be served from disk rather than proxied upstream (more on that in a bit)

Technically, because the hop between Tor and the NGinx RP is plain HTTP (SSL is only used from the RP to the origin), you could capture comms between the two, but if you've got the ability to tcpdump the loopback adapter there are plenty of other attacks you could launch (like stealing the HS private key).
2. Duplicate Content Penalty
I had originally thought the best way to address the tor2web issue was going to be to serve a customised robots.txt on the .onion.

Still going to do that. However, tor2web also includes a header identifying the connection as tor2web (see http://comments.gmane.org/gmane.network.tor.user/34292 ) so we can block (with a useful message) based on that - not only does it prevent Google from indexing the site at a different URL, but it gives the opportunity to tell a genuine user that they can access direct via HTTPS or the .onion (reducing the risk of MITM)
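As a rough illustration of the customised robots.txt idea, the .onion reverse proxy could answer the request itself rather than passing it upstream. This is only a sketch: the deny-all policy is an assumption about what "customised" would mean here, and the block would sit inside the .onion server block.

```nginx
# Sketch only: serve a deny-all robots.txt directly from the .onion
# reverse proxy so crawlers arriving via tor2web are told to stay away.
location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}
```

Dropping a static robots.txt into the proxy's document root and checking disk before proxying would achieve the same thing.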
3. Anti-Abuse Scripts

This one is a little more complex (and getting it just right may branch into a sub-task at some point).

Need to be sure that when the Application level protections are repeatedly triggered via the .onion, the resulting ban doesn't adversely affect innocent users who are also accessing via the .onion.

I'm not too keen to make the protections more permissive, as it doesn't address the root issue, just makes it harder to trip, and weakens security in the process.

The method used by Facebook is to tell the origin that the source IP of the client is within the DHCP broadcast network (to ensure it's not routable and won't be in use elsewhere in the network). When protections trip, they've got a real-enough IP to block, meaning the protections themselves don't need to be tinkered with.

So, I could drop a 'unique' IP into X-Forwarded-For (or use a different header) for each request.

If the same IP is used for any requests within a given connection, the protections can at least effectively invalidate that HTTP keep-alive session.

The downside is that disconnecting the TCP session and starting a new one (or just not using keep-alive) would be all an attacker would need to do to circumvent the ban. But, then, the whole point is that the protections should be good enough to block exploit attempts whether it's the first request made or the millionth.

It's not particularly hard to circumvent IP based bans on the WWW either, so I'm going to roll with it and then re-review later I think.
4. Tweaks to Content

Will need to make sure that all URLs are relative, and re-write those that are not.

In particular, absolute URLs are currently used for static content (as certain static content is served from a subdomain to allow the browser to parallelize a bit). Those URLs will need to be rewritten.

I think, again, I'm going to follow Facebook's approach on this one - I'll rewrite to a subdomain of the .onion

So, taking an existing flow
Visitor -> www.bentasker.co.uk -> plugin -> static.bentasker.co.uk

Simply need to adjust the plugin so that if the source is using a .onion (denoted by the NGinx tweak noted above), the flow becomes
Visitor -> foo.onion -> plugin -> static.foo.onion

Essentially, all we want to do is to rewrite the scheme (from https to http), domain name and the TLD
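For comparison, a similar rewrite could be approximated at the reverse proxy with nginx's sub_filter module instead of in the plugin. That is a different technique from the plugin change planned here, shown only to make the intended transformation concrete. The hostnames are the placeholder values used above, it assumes nginx was built with the sub module (it isn't part of the default build), and using more than one sub_filter directive assumes nginx 1.9.4 or later.

```nginx
# Sketch only (not the plugin-based approach): rewrite absolute URLs
# in proxied HTML on the .onion reverse proxy.
location / {
    proxy_pass        https://www.bentasker.co.uk;
    proxy_set_header  Host www.bentasker.co.uk;

    # sub_filter cannot rewrite compressed bodies, so ask upstream not to compress
    proxy_set_header  Accept-Encoding "";

    # Swap the scheme, domain name and TLD in one go
    sub_filter       'https://static.bentasker.co.uk' 'http://static.foo.onion';
    sub_filter       'https://www.bentasker.co.uk'    'http://foo.onion';
    sub_filter_once  off;
}
```

Doing it in the plugin keeps the rewrite next to the code that generates the static URLs in the first place, which is why that's the route being taken.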

Similarly, need to make sure that there's little to nothing that actually depends on Javascript being functional - it should be assumed that .onion visitors are going to have Javascript disabled (though that's generally been the assumption on the www. side anyway)
5. CORS et al
I'll obviously need to review anything I've got in place in the CORS sense to make sure the new domain (foo.onion) is permitted so that browser protections don't kick in and cause broken rendering.

There shouldn't be much to check/change, but it needs doing
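If any of the hosts do send CORS headers, one way to allow both the www and .onion origins is to echo back only recognised Origin values. A minimal sketch, assuming nginx is where the header gets added; foo.onion is the placeholder name used above and the variable name is made up for illustration.

```nginx
# http context: map the request's Origin header to an allowed value (or nothing)
map $http_origin $cors_allow_origin {
    default                          "";
    "https://www.bentasker.co.uk"    $http_origin;
    "http://foo.onion"               $http_origin;
}

# server or location context on the host serving the shared resources
add_header Access-Control-Allow-Origin $cors_allow_origin;
add_header Vary Origin;
```

nginx leaves out an add_header whose value is empty, so requests from any other origin simply get no Access-Control-Allow-Origin header.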
6. Visitors coming via a Tor Exit
The idea of redirecting anyone coming from a Tor Exit to the .onion had been mooted - but it's been pointed out that it may well be wise to try to avoid unexpected behaviour with Tor visitors.

Although I'm not currently looking at having to disable any public functionality for the .onion, there's a possibility that I may need to do so once I get into it. So, it could be that implementing such a redirect would mean taking the visitor to a site that doesn't contain the functionality they want (but would have done had they been permitted to use the exit).

Seems best to revisit this once everything else is set up.
7. HTTPS .onion in the longer term?
The plan, from the outset, has been to offer the .onion via port 80, to avoid certificate warnings. In the longer term, though, there may be value in looking at the option of also offering HTTPS.

Apparently Mozilla have announced that they plan to gate new features to only work on HTTPS connections ( https://blog.mozilla.org/security/2015/04/30/deprecating-non-secure-http/ ). Obviously whether that affects Tor users will depend on how exactly Mozilla go about doing that (i.e. whether it's something that can be easily reverted/tested in TBB) as well as which features end up unavailable.

Using HTTPS would also allow Content Security Policy ( CSP ) to be used, so theoretically any link-clicks could be reported (using POST) to an .onion endpoint to help identify any URLs that haven't been successfully rewritten in consideration 4 above.
8. Origin is a Caching Reverse-Proxy
This won't be an issue on the site that this will eventually be deployed on, but is on www.bentasker.co.uk so seems worth addressing.

In consideration 4 we'll be rewriting links depending on whether the visitor originated from the .onion or the www. What we don't want, then, is for responses to be cached within the same namespace.

If the page is cached when someone visits via www then the .onion visitor will go out of an exit - which whilst not terrible, somewhat undermines the efforts here.

But - if the page is cached when someone visits via .onion, the site will completely break for a visitor on the www (as they won't be able to resolve the .onion)

It's only certain pages that are cached, and there's still some value in doing so, so the simple solution here is to update the cache key to include an indicator of whether it's sourced from the .onion or not (so that www and .onion become two distinct cacheable entities).
9. Privacy - Generic
There are a number of resources on the site which may/will be undesirable when accessing via .onion:

- Google Ads
- Google Analytics
- Social Media Sharing buttons

I'm highlighting these in particular because they share information with a third party.

The SM buttons have actually been disabled by default for some time (the buttons displayed are just images, clicking them enables them and then you click to tweet/like/whatever). They'll still work the same way afterwards.

The site has had a 'Block Google Analytics' function in its sidebar for years - it does rely on Javascript, but then if Javascript is disabled, the Analytics functionality won't be firing either.

Adsense, I'm a little torn about. I don't particularly like having the ads up, but in the case of www.bentasker.co.uk they help keep the site live. I've trialled removing them in the past, and had to put them back.

For most users, traffic to the relevant 3rd party services will likely be via an exit node anyway so the concern is slightly less. Where it's slightly more important is where visitors have a gateway running on their LAN specifically so that they have transparent access to .onions (meaning their connection to Google won't route over Tor).

You could argue that it's a risk they take, but I'd prefer to do what I can to mitigate it a little - need to give this one some thought.
10. Analytics will show a lot of bad traffic from a single source
From the outset, I've not been too concerned about this, but it's interesting to note that Facebook's experience has been that it wasn't quite as bad as expected.
11. Sessions/Cookies won't be transferable between www and .onion
My initial thoughts on this were that this is actually a good thing, and no-one's said anything to the contrary, so recording simply for posterity - no action needed.
Given that others have given quite a lot of input/advice, only seems fair that tracking on this is publicly available. Have moved into the MISC project.
Another thing to think about is that sessions may well break (which is OK if you're just reading, but critical if you're trying to log in) if in the future I configure cookies to include the secure flag. I don't know quite how browsers will handle it if they receive a cookie with that flag over a plain HTTP connection but my suspicion is they'll probably honour the flag (might test later).

Not currently an issue, as the session cookie doesn't use that flag
ben@milleniumfalcon:~$ curl -i https://www.bentasker.co.uk 2> /dev/null | head -n 15 
HTTP/1.1 200 OK
Server: nginx/1.0.15
Date: Tue, 19 May 2015 13:02:24 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: ddac25e9a3295649e43faba6f767ac23=2s44oqp88rd5gv0m8ej48b2ea6; path=/; HttpOnly
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Last-Modified: Tue, 19 May 2015 13:01:43 GMT
Cache-Control: no-cache
Pragma: no-cache
Authorisation: Basic aWlpaWk6aHR0cHM6Ly93d3cuYmVudGFza2VyLmNvLnVrL2ZvbGxvd3RoZXRyYWls
X-Clacks-Overhead: GNU Terry Pratchett


In fact, there are a few 'best practices' for HTTPS that I can't use (or at least will need to account for at the Tor RP end):

- secure flag in cookies
- Strict Transport Security

Probably some others as well.
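At the Tor reverse proxy, accounting for those two might look something like the following. It's a sketch rather than tested config: proxy_cookie_flags only exists in nginx 1.19.3 and later, so it assumes a much newer nginx than the one shown in the response headers above.

```nginx
# Sketch only: inside the location that proxies to the HTTPS origin
location @proxyme {
    proxy_set_header Host www.bentasker.co.uk;
    proxy_pass       https://www.bentasker.co.uk;

    # Don't pass HSTS on to clients of the plain-HTTP .onion
    proxy_hide_header Strict-Transport-Security;

    # Strip the Secure flag from every upstream cookie (nginx >= 1.19.3)
    proxy_cookie_flags ~ nosecure;
}
```

Neither is needed yet, as the origin doesn't currently send either, but it would let the origin adopt them without breaking sessions on the .onion.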
So, at the reverse proxy level, the following configuration should do the trick
# See MISC-2
server {
    listen       localhost:80;
    server_name  ;   # TODO/TBD
    root /usr/share/nginx/onions/bentaskercouk;


    # We check disk first so I can override things like robots.txt if wanted
    location / {
       try_files $uri $uri/ @proxyme;
    }

    # We want to make sure the homepage is always proxied
    location = / {
        try_files /homepage @proxyme;
    }

    # 404's are handled by the back-end but
    # redirect server error pages to a local file
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/onions/bentaskercouk-errors;
    }


    # Proxy to the back-end
    location @proxyme {

        # Set a header so the back-end knows we're coming via the .onion
        proxy_set_header X-IM-AN-ONION 1;

        # Make sure the host header is correct
        proxy_set_header Host www.bentasker.co.uk;

        # Send the request
        proxy_pass   https://www.bentasker.co.uk;

        # TODO
        # Do we want to cache rather than sending every request upstream?
        # Probably not, but revisit later
    }


    # Don't even bother proxying these, just deny
    location ~ /\.ht {
        deny  all;
    }

    location ~ /administrator {
        deny  all;
    }

    # We'll come back and put a sane error message here later
    if ($http_x_tor2web){
        return 405;
    } 
}

Running a quick test having chucked an example hostname into the server block
[root ~]# GET -Ssed -H "Host: foobar.test" http://127.0.0.1/
GET http://127.0.0.1/ --> 200 OK
Cache-Control: no-cache
Connection: close
Date: Tue, 19 May 2015 13:15:43 GMT
Pragma: no-cache
Server: nginx
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Last-Modified: Tue, 19 May 2015 13:01:43 GMT
Authorisation: Basic aWlpaWk6aHR0cHM6Ly93d3cuYmVudGFza2VyLmNvLnVrL2ZvbGxvd3RoZXRyYWls
Client-Date: Tue, 19 May 2015 13:15:43 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
Content-Base: https://www.bentasker.co.uk/


[root ~]# GET -Ssed -H "Host: foobar.test" http://127.0.0.1/administrator
GET http://127.0.0.1/administrator --> 403 Forbidden
Connection: close
Date: Tue, 19 May 2015 13:16:01 GMT
Server: nginx
Content-Length: 162
Content-Type: text/html
Client-Date: Tue, 19 May 2015 13:16:01 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
Title: 403 Forbidden


[root ~]# GET -Ssed -H "Host: foobar.test" http://127.0.0.1/administrator/index.php
GET http://127.0.0.1/administrator/index.php --> 403 Forbidden
Connection: close
Date: Tue, 19 May 2015 13:16:05 GMT
Server: nginx
Content-Length: 162
Content-Type: text/html
Client-Date: Tue, 19 May 2015 13:16:05 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
Title: 403 Forbidden

Checking local overrides
[root ~]# echo "onion only" > /usr/share/nginx/onions/bentaskercouk/robots.txt

[root ~]# GET -Sse -H "Host: foobar.test" http://127.0.0.1/robots.txt
GET http://127.0.0.1/robots.txt --> 200 OK
Connection: close
Date: Tue, 19 May 2015 13:19:41 GMT
Accept-Ranges: bytes
Server: nginx
Content-Length: 11
Content-Type: text/plain
Last-Modified: Tue, 19 May 2015 13:19:26 GMT
Client-Date: Tue, 19 May 2015 13:19:41 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1

onion only

Looks good to me. Finally, checking a tor2web type request
[root ~]# GET -Ssed -H "X-Tor2Web: yup" -H "Host: foobar.test" http://127.0.0.1/
GET http://127.0.0.1/ --> 405 Not Allowed
Connection: close
Date: Tue, 19 May 2015 13:20:21 GMT
Server: nginx
Content-Length: 166
Content-Type: text/html
Client-Date: Tue, 19 May 2015 12:20:21 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
Title: 405 Not Allowed
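For the "sane error message", one option is to hand the 405 to a named location that serves a static page. A sketch only: tor2web.html is a hypothetical file that would need creating in the errors directory.

```nginx
# Sketch only: inside the .onion server block
error_page 405 @tor2web;

location @tor2web {
    root /usr/share/nginx/onions/bentaskercouk-errors;
    # Serve the explanatory page; fall back to a bare 405 if it's missing
    try_files /tor2web.html =405;
}

if ($http_x_tor2web) {
    return 405;
}
```

A named location is used because an ordinary URI in error_page would be run through the server-level if again and be refused a second time.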
Change on the Origin to address number 8 above is fairly straightforward.

Not using the onion indicator header directly in the cache key, because an attacker could then hit the same page on the www over and over, specifying a different value in that header each time, in order to try and exhaust the space available to the cache.
if ($http_x_im_an_onion){
      set $onionaccess ':true'; # Make sure it won't clash with an existing slug
}

# Disabled
# proxy_cache_key "$scheme$host$request_uri";
proxy_cache_key "$scheme$host$request_uri$onionaccess";

The two sources now have different keys. For the onion site, the key for the homepage would be
httpswww.bentasker.co.uk/:true
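The same two-state variable could also be built with map rather than if. A sketch, with one behavioural difference worth noting: if treats a value of "0" as false, whereas this map treats any non-empty value as true. Either way the key can only ever take two forms, so the cache-exhaustion concern is still covered.

```nginx
# http context: collapse whatever the header contains into one of two values
map $http_x_im_an_onion $onionaccess {
    default  ":true";   # header present with any non-empty value
    ""       "";        # header absent or empty
}

# server/location context, same key as before
proxy_cache_key "$scheme$host$request_uri$onionaccess";
```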
So, we can now get requests from a .onion through to the origin. A number of the static resources will now be served from the static subdomains though, so requests for those will still go over the www.

If we take the Facebook approach and continue to treat static content as a subdomain, all should work fine. Nothing special needs to be done to make sure those requests hit the same HS as the bare .onion address would (I've just tested), so at the reverse proxy we'll just need a new server block to handle the domain name and proxy on (that one can definitely be configured to cache).

Before that, though, it's probably worth addressing the plugin which performs that re-write for static content, which may or may not be trivial (can't remember when I last looked at that codebase).
Slightly more complex to adjust than I'd hoped - raising a subtask for it to avoid these comments getting too noisy with the specifics - see MISC-3
MISC-3 is complete, and has led me on to a slight tweak to the Nginx changes made above. As well as adjusting the cache-key, I also need to set a request header to send upstream.

Relying on the header sent by the Tor Reverse proxy is a bad idea (partly because I've just documented what it is :) ), and some charitable soul could come along and hit the www. with requests containing that header so that my cache contains lots of incorrectly re-written URLs.

The name of that header essentially needs to be kept a secret to prevent that. Not ideal, but it's the simplest fix. So the NGinx changes on the origin now become the following (NGinx won't let us use proxy_set_header inside the if block, so the header is set from the variable instead, which is empty for non-onion requests; NGinx doesn't pass a header upstream when its value is empty)

if ($http_x_im_an_onion){
      set $onionaccess ':true'; # Make sure it won't clash with an existing slug

}

proxy_set_header X-1234-its-an-onion $onionaccess;
# Disabled
# proxy_cache_key "$scheme$host$request_uri";
proxy_cache_key "$scheme$host$request_uri$onionaccess";
So, setting up for the static domains,

Cache defined in nginx.conf
proxy_cache_path  /usr/share/nginx/cache levels=1:2 keys_zone=my-cache:8m max_size=1000m inactive=600m;

New server block created for the subdomain
server {
    listen       localhost:80;
    server_name  static.foo.onion;   # TODO/TBD
    root /usr/share/nginx/onions/bentaskercouk;

    # 404's are handled by the back-end but
    # redirect server error pages to a local file
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/onions/bentaskercouk-errors;
    }


    # Proxy to the back-end
    location / {
        # No need to set Im-an-onion as there's no dynamic processing upstream

        # Make sure the host header is correct
        proxy_set_header Host static1.bentasker.co.uk;

        # We do some caching so we're not forever having to do SSL handshakes with upstream
        proxy_cache my-cache;
        proxy_cache_valid  200 302  7d;
        proxy_cache_valid  404      5m;
        proxy_ignore_headers X-Accel-Expires Expires Cache-Control Set-Cookie;
        proxy_cache_key "$scheme$host$request_uri";
        add_header X-Cache-Status $upstream_cache_status;

        # Send the request
        proxy_pass   https://static1.bentasker.co.uk;
    }


    # Don't even bother proxying these, just deny - already handled upstream, but why waste a request?
    location ~ /\.ht {
        deny  all;
    }

    location ~ /administrator {
        deny  all;
    }


}

In theory, now, we should be able to browse via the .onion without having any static resources load over the www (though there may still be some links within the content itself as that's not been checked yet).
Almost there, but unfortunately, I missed something - Joomla sets the base href within the head section, so all relative links are prefixed by that value in browser - so not only does static content get retrieved from the www, but clicking any relative link in the page will take you off the .onion.

So that'll need to be overridden
At the top of the Joomla template I've added the following (with the obviously wrong values changed)
// PHP exposes request headers in $_SERVER with an HTTP_ prefix
if (isset($_SERVER['HTTP_X_1234_ITS_AN_ONION']) && $_SERVER['HTTP_X_1234_ITS_AN_ONION'] == ':true'){
        $this->base=str_replace("https://www.bentasker.co.uk","http://foo.onion",$this->base);
}

Which resolves the issue. It might be better to look at creating a small plugin to do much the same thing so that it can be managed from the back-end, but the effect is the same.
Clicking round a few pages on the site, seems to be working OK. Will look at creating a script to spider it a little later (for some reason, wget is refusing to do so - though might be PEBKAC)
Raised MISC-4 as a reminder to configure custom error pages where the error is generated at the reverse proxy rather than coming from upstream.
Given the earlier note:
"Relying on the header sent by the Tor Reverse proxy is a bad idea (partly because I've just documented what it is), and some charitable soul could come along and hit the www. with requests containing that header so that my cache contains lots of incorrectly re-written URLs. The name of that header essentially needs to be kept a secret to prevent that, not ideal, but it's the simplest fix."


Needed to update the back-end code for the "Your Data" page - https://www.bentasker.co.uk/your-stored-data - to ensure that it doesn't disclose the header name/value.
I think at the technical level, it's just the error pages to set up now. There's still the philosophical discussion to be had in MISC-6 about which 3rd party scripts (if any) should be permitted on the .onion
I think I've now tested everything I can reasonably test, so it's time to set it live. The generated HS address is 6zdgh5a5e6zpchdz.onion
Added a link 'Browse via Tor' to the Privacy Options sidebar, and made sure it only displays when not accessing the .onion.
Have also configured a description.json to keep ahmia.fi happy - https://ahmia.fi/documentation/descriptionProposal/