MISC-18: Alternative Load balancing design

Issue Information

Issue Type: Improvement
Priority: Major
Status: Open

Reported By: Ben Tasker
Assigned To: Ben Tasker
Project: Miscellaneous (MISC)
Resolution: Unresolved
Affects Version: TorCDN
Target version: TorCDN

Created: 2016-01-17 15:26:37
Time Spent Working
Estimated: 240 minutes
Remaining: 225 minutes
Logged: 15 minutes

The current load balancing model is contingent on the race condition in Hidden Service descriptor publishing.

There's no mechanism on the edge itself to balance load; requests will simply go to whichever edge device most recently published its descriptor to whichever dirauth the user's client contacts.

Although it's not complete yet, interim results from MISC-17 suggest that load may not be spread across the edge quite as hoped.

Both edge devices have seen some requests, but the load has primarily been taken by one device.

Although it needs testing, there's no reason to think this would be any different if one edge device reaches saturation, which could have a potentially serious impact on delivery.

An alternative delivery model might be to have a setup like the following

- Site embed is foo.onion/something.js
- foo.onion/something.js leads to a 302 to bar.onion/something.js or another.onion/something.js

Where a proportion of the edge would answer to bar.onion and another proportion would answer to another.onion. Obviously you could use more than two descriptors if the edge were big enough. Theoretically, all edges could support all HS descriptors, but I suspect we'd then run into the same issue we're trying to work around at the moment.

The obvious issue with this is that it introduces the time required to set up an additional circuit into the mix, so we need to test what the performance impact is from a client's perspective.

If it's negligible, then some kind of mechanism where the initial point of contact (foo.onion) knows the rough load of the edge would allow it to intelligently decide which descriptor to use for the next request it receives. That said, spray and pray (picking a descriptor at random) would probably also give some benefit when compared to the current model.
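As a rough illustration of the two selection strategies mentioned above, the following is a minimal shell sketch. It assumes the initial point of contact can get hold of a list of "descriptor load" pairs from somewhere; that input format, the load figures and the function names pick_random and pick_least_loaded are all hypothetical and not part of the current setup.

```shell
#!/bin/sh
# Hypothetical input: one "descriptor load" pair per line on stdin, e.g.
#   bar.onion 0.90
#   another.onion 0.40

# "Spray and pray": ignore the load column and pick any descriptor at random.
pick_random() {
  cut -d' ' -f1 | shuf -n1
}

# Load-aware: print the descriptor with the lowest reported load.
pick_least_loaded() {
  awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; best = $1 }
       END { if (NR > 0) print best }'
}
```

For example, `printf 'bar.onion 0.90\nanother.onion 0.40\n' | pick_least_loaded` prints another.onion. How the load figures would actually be collected from the edge is left open here.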

The initial point of contact would also need to be available on multiple edge devices to ensure it's redundant. In principle, it could be available on all edges, though there's a risk that saturation might then impact foo.onion too.

The aim of this issue is to test HTTP redirection based balancing and see what the cost of using that method is.


Depending on whether the module is enabled, it may be possible to use the perl_set directive to generate a descriptor selection.
Unfortunately, neither edge has been built with --with-http_perl_module:

~# nginx -V
nginx version: nginx/1.9.9
built by gcc 4.7.2 (Debian 4.7.2-5) 
built with OpenSSL 1.0.1e 11 Feb 2013
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_stub_status_module --with-http_auth_request_module --with-threads --with-stream --with-stream_ssl_module --with-http_slice_module --with-mail --with-mail_ssl_module --with-file-aio --with-http_v2_module --with-cc-opt='-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-z,relro -Wl,--as-needed' --with-ipv6

As the main aim is to gauge the impact of redirects, it's not the end of the world, but I'd have preferred to have built a model device selection mechanism to test against.
As boring as it is, the following config is probably sufficient for this set of tests
server {
  server_name foo.onion;
  return 302 http://f5jayrbaz7nmtyyr.onion$request_uri?;
}
btasker changed status from 'Open' to 'In Progress'
Generated a new PK by adding the following to torrc on Edge-1:
HiddenServiceDir /var/lib/tor/btaskerstreamingtest-redir/
HiddenServicePort 80

Which gives us a descriptor of 52umrndqq5rf2o4v.onion

Configured in Nginx:
server {
  listen localhost:80;
  root /usr/share/nginx/empty;
  server_name 52umrndqq5rf2o4v.onion;
  return 302 http://f5jayrbaz7nmtyyr.onion$request_uri?;
}

ben@milleniumfalcon:~$ curl -vvv -l http://52umrndqq5rf2o4v.onion/noexist/test.foo
* Hostname was NOT found in DNS cache
*   Trying
* Connected to 52umrndqq5rf2o4v.onion ( port 80 (#0)
> GET /noexist/test.foo HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 52umrndqq5rf2o4v.onion
> Accept: */*
< HTTP/1.1 302 Moved Temporarily
* Server nginx is not blacklisted
< Server: nginx
< Date: Mon, 25 Jan 2016 11:16:45 GMT
< Content-Type: text/html
< Content-Length: 154
< Connection: keep-alive
< Location: http://f5jayrbaz7nmtyyr.onion/noexist/test.foo?
<head><title>302 Found</title></head>
<body bgcolor="white">
<center><h1>302 Found</h1></center>
* Connection #0 to host 52umrndqq5rf2o4v.onion left intact

Looks good, so just need to look at setting the test script running
btasker changed status from 'In Progress' to 'Open'
btasker changed timespent from '0 minutes' to '13 minutes'
Client script triggered
ben@milleniumfalcon:~$ SERIAL=0
while [ $SERIAL -lt 500000 ]; do
  # Randomly request either an html or a gif file
  select=`shuf -i1-2 -n1`
  if [ $select == 2 ]; then extension="html"; else extension="gif"; fi
  # Randomly pick one of the 2000 test files
  number=`shuf -i1-2000 -n1`
  curl -H "X-Downstream: Serial-G$SERIAL" -sL \
    -w "G${SERIAL},%{http_code},\"%{url_effective}\",%{time_total},%{time_namelookup},%{time_connect},%{time_redirect},%{time_starttransfer},%{size_download},%{size_request},%{num_redirects},%{speed_download}\\n" \
    -o /dev/null "http://52umrndqq5rf2o4v.onion/qrcodes/image-${number}.${extension}" >> metricsG.csv
  SERIAL=$(( $SERIAL + 1 ))
done
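Once the run has produced some data, something like the following could give a first look at the cost of the redirect. This is only a sketch: the function name summarise_metrics is made up for illustration, and it assumes the column order of the curl -w format above (field 4 is time_total, field 7 is time_redirect) and that the logged URLs contain no commas, which holds for the test URLs used here.

```shell
#!/bin/sh
# Summarise a metrics CSV written by the curl -w format above.
# Assumed column order: field 4 = time_total, field 7 = time_redirect.
summarise_metrics() {
  awk -F, '
    { n++; total += $4; redirect += $7 }
    END {
      if (n == 0) { print "requests=0"; exit }
      printf "requests=%d mean_total=%.3f mean_redirect=%.3f\n", n, total / n, redirect / n
    }' "$1"
}
```

Running `summarise_metrics metricsG.csv` prints the request count along with the mean total time and mean redirect time in seconds. Comparing mean_redirect against mean_total gives a rough idea of how much of each request is spent following the 302.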
btasker changed timespent from '13 minutes' to '15 minutes'

Work log

Ben Tasker
2016-01-25 11:17:39

Time Spent: 13 minutes
Log Entry: Configuring and testing servers

Ben Tasker
2016-01-25 11:21:06

Time Spent: 2 minutes
Log Entry: Triggering client requests