MISC-12: Optimising Video Delivery for Tor / Building a Tor based CDN



Issue Information

Issue Type: New Feature
Priority: Major
Status: Open

Reported By: Ben Tasker
Assigned To: Ben Tasker
Project: Miscellaneous (MISC)
Resolution: Unresolved
Affects Version: TorCDN
Target version: TorCDN
Labels: CDN, Experiments, Tests, Tor, Video

Created: 2015-12-14 12:32:38
Time Spent Working
Estimated: 1320 minutes
Remaining: 805 minutes
Logged: 942 minutes


Description
Want to run some experiments into possible setups for efficiently delivering streaming video via Tor Hidden Services. For what I've got in mind, it needs to be ABR and I hate smooth streaming so we'll go with HLS.

The aim is to build a tiered system with a single origin.

- node 1 - nginx caching reverse proxy - Hidden service 1
- node 2 - nginx caching reverse proxy - Hidden service 1
- node 3 - nginx caching reverse proxy - Hidden service 2
-- origin - Hidden service 3

Both node 1 and node 2 advertise the same hidden service, essentially using the descriptor publishing race condition to build the edge of a small tiered CDN.

Hidden service 2 then proxies onto the origin. The idea being that it should be possible to easily introduce an additional cache at that level of the hierarchy to further protect the origin, again, by using the race condition.

Once set up, this needs testing against both VoD and Linear content.

This is, in part, an expansion of MISC-2 in that CDN like behaviour is already being used for some static content on http://6zdgh5a5e6zpchdz.onion/

The ultimate aim isn't actually to deliver video, but to gauge the feasibility of building an efficient, fully Tor based CDN to aid in scalability and fault resistance (without needing to maintain multiple origins). I've specifically chosen streaming video as a starting point because delivery is time sensitive and issues are easily observable.

The aim is to start with full HD video at 60fps and then once delivery of that has been improved, work down to delivering something more realistic (so, either 720p at 24fps or 480p at 24fps and small static files, e.g. images/CSS).


Issue Links

Big Buck Bunny homepage
Design (Projectsstatic)
Test Summary Metrics (Projectsstatic)
Test Number Index (Projectsstatic)
Output Index (projectsstatic)

Subtasks

MISC-13: Run tests against 480p HLS Stream
MISC-14: Pull Stats from Collected Log files
MISC-15: Theoretical: Productisation of a CDN as a service
MISC-17: Image Tests

Activity


Have started designing the tests in extnotes.

A copy of Big Buck Bunny is currently being transcoded into a multiple bitrate HLS stream (using HLS Stream Creator).
btasker changed timespent from '0 minutes' to '40 minutes'
As well as the bits in the design, there are some additional things to consider.

Further Considerations

Assuming reliable delivery to multiple clients can be sustained by splitting the edge, further consideration is needed regarding the impact on the Tor network itself.

At time of writing, relays advertise about 150Gbps of bandwidth within the Tor network (according to https://metrics.torproject.org/bandwidth.html). That capacity could easily be saturated by widespread delivery of even low bandwidth video.
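
As a rough back-of-envelope illustration of that saturation risk, the sketch below estimates an upper bound on concurrent viewers. It rests on several assumptions that are not measurements: every delivered byte crosses 6 relay hops, all advertised bandwidth is usable for video, and Tor cell overhead and all other traffic are ignored. The bitrates and the function name are purely illustrative.

```python
ADVERTISED_GBPS = 150   # figure quoted above from metrics.torproject.org
HOPS_PER_CIRCUIT = 6    # assumption: a client <-> hidden service path crosses 6 relays


def max_concurrent_viewers(stream_mbps, advertised_gbps=ADVERTISED_GBPS,
                           hops=HOPS_PER_CIRCUIT):
    """Upper bound on viewers if each stream consumes stream_mbps on every hop."""
    capacity_mbps = advertised_gbps * 1000
    return int(capacity_mbps // (stream_mbps * hops))


for mbps in (1, 3, 8):  # illustrative bitrates, not the bitrates of the test stream
    print(f"{mbps} Mb/s stream: at most {max_concurrent_viewers(mbps)} viewers")
```

Even under these generous assumptions, a 1 Mb/s stream tops out at 25,000 simultaneous viewers across the whole network.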

A number of possible solutions come to mind:

- Make every node within the CDN a middle relay to give some bandwidth back
- Access the midtier and/or the origin via Clearnet

Both have potential privacy implications.

At first thought, the first solution risks making individual CDN nodes identifiable. If a relay goes down (something that's publicly recorded) at the same time as a hidden service (testable but not publicly recorded), over time, you can tie the two together and have the IP of the system responsible for hosting the hidden service.

IOW, with

- Relay 1 - foo.onion

If you can see that foo.onion goes down whenever Relay 1 does, there's a good chance it's hosted on the system running Relay 1.
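
To make that correlation concrete, here is a minimal sketch of the comparison an observer could run. The function name and all of the timestamps are hypothetical; it simply measures what share of failed probes against an onion fall inside a relay's publicly recorded downtime windows.

```python
from datetime import datetime, timedelta


def fraction_of_outages_matched(relay_down, onion_down_probes):
    """Share of failed onion probes that fall inside a relay downtime window.

    relay_down: list of (start, end) datetimes taken from public relay records.
    onion_down_probes: datetimes at which a probe found the onion unreachable.
    """
    if not onion_down_probes:
        return 0.0
    matched = sum(
        1 for probe in onion_down_probes
        if any(start <= probe <= end for start, end in relay_down)
    )
    return matched / len(onion_down_probes)


# Hypothetical observations, purely for illustration
t0 = datetime(2015, 12, 14, 12, 0)
relay_down = [(t0, t0 + timedelta(minutes=30)),
              (t0 + timedelta(hours=5), t0 + timedelta(hours=6))]
probes = [t0 + timedelta(minutes=10),
          t0 + timedelta(hours=5, minutes=20),
          t0 + timedelta(hours=9)]
print(fraction_of_outages_matched(relay_down, probes))
```

A ratio that stays close to 1 over many outages is what would tie the relay and the hidden service together.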

Thinking about it though, the situation here is slightly different. There will be (at least) two edge caches advertising the same descriptor. So if we take the following topology

- Relay 1 / Edge 1 - foo.onion
- Relay 2 / Edge 2 - foo.onion

If Relay 1 goes down, foo.onion will remain available. There may be a short period between descriptors being published where that isn't the case though? Would need to measure.

I guess you'd probably want to run slightly more or slightly fewer relays than CDN nodes though. If you've got a bunch of 24 relays all with the same "MyFamily" (as should be the case given the common operator) and your CDN is made up of 24 nodes, the two correlate a bit too closely.

More to the point, if your CDN is built on a bunch of systems with 10Gbps NICs, you might want to not have a relay on some of those, and instead offer some relays with 1Gbps NICs to reduce the likelihood of commonality.

Will come back to thinking about this later; it's easy to tie yourself in knots with various permutations.
The second solution, however, should definitely have a marked impact:

With every tier of the CDN being a hidden service, a 1MB file coming from the origin will need to transit a circuit (of 6 hops) 3 times in order to reach the ultimate client.

That can be addressed by making the caching as efficient as possible so that the upstream path is rarely used, but that still doesn't address the question of what happens if a huge number of users start streaming video at the same time.
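
The effect of cache efficiency on relay load can be sketched as below. This is an illustrative model only: it assumes every tier is reached over a 6 hop circuit, that a miss transfers the whole file, and that revalidations cost nothing. The function name and the hit ratios are made up for the example.

```python
HOPS = 6  # hops per hidden-service circuit, as stated above


def relayed_mb(size_mb, edge_hit, midtier_hit, hops=HOPS):
    """Expected MB carried by relays to serve one request for a size_mb file.

    The client <-> edge circuit is always used; the edge <-> midtier circuit
    only on an edge miss; the midtier <-> origin circuit only when both miss.
    """
    edge_miss = 1.0 - edge_hit
    midtier_miss = 1.0 - midtier_hit
    circuits = 1.0 + edge_miss + edge_miss * midtier_miss
    return size_mb * hops * circuits


print(relayed_mb(1, edge_hit=0.0, midtier_hit=0.0))  # everything cold: 3 circuits
print(relayed_mb(1, edge_hit=1.0, midtier_hit=1.0))  # served from the edge: 1 circuit
print(relayed_mb(1, edge_hit=0.9, midtier_hit=0.5))  # illustrative hit ratios
```

With cold caches a 1MB file costs 18MB of relay traffic, and even a perfect edge cache still costs 6MB, which is the floor that caching alone can't remove.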

Even if everything is served directly from the caches on the edge, that's still potentially a large saturation of Tor's bandwidth, so ideally you'd want to be giving some back (taking us back to Solution 1).

Might be that the best bet is a combination of the two (or something else), but the suitability of either solution will also depend on whether the CDN operator wishes to remain anonymous, or whether the aim is simply to protect client connections?

An assessment of the options needs factoring into the final writeup though really.
btasker changed timespent from '40 minutes' to '85 minutes'
btasker changed timespent from '85 minutes' to '100 minutes'
btasker changed status from 'Open' to 'In Progress'
btasker changed status from 'In Progress' to 'Open'
btasker changed timespent from '100 minutes' to '160 minutes'
btasker changed status from 'Open' to 'In Progress'
Origin and mid-tier are now configured.

One interesting thing to note: testing so far has been via a transparent Tor client, with all players using that same client. When attempting to stream three copies of the stream, the client started giving connection refused errors and needed to be restarted before streaming could resume (the client on the midtier was fine though).

Will be interesting to see whether the same thing occurs when spreading requests across the edge. For the main tests, each player should have a dedicated tor client though.

URLs so far

- Origin: https://streamingtest.bentasker.co.uk
- Midtier: http://cix7cricsvweeu6k.onion:8091/

Will look at building the edge tomorrow
btasker changed status from 'In Progress' to 'Open'
btasker changed timespent from '160 minutes' to '195 minutes'
btasker changed status from 'Open' to 'In Progress'
First edge node is built and can correctly connect to the mid-tier via its Hidden Service descriptor.

The descriptor to use for accessing via the edge is http://f5jayrbaz7nmtyyr.onion


For some reason nginx was ignoring the resolver directive, so I had to set Tor's DNS port to localhost:53 and then update resolv.conf to direct all queries through there (I could also have transparently redirected in iptables, but it seemed better to be explicit).
btasker changed status from 'In Progress' to 'Open'
btasker changed timespent from '195 minutes' to '241 minutes'
To help make sure the stats I'm grabbing are appropriate (and that the log split works) I've done a few test plays with just the one edge node and compared to stats generated yesterday.
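
For reference, a minimal sketch of how per-tier figures of this kind (mean, max, min and request count) could be produced. It is not the script actually used here, and it assumes that each log line ends with the request time in seconds; the sample lines are invented.

```python
import statistics


def summarise(log_lines):
    """Return (mean, max, min, request_count) for the request times in log_lines.

    Assumes the request time in seconds is the last whitespace-separated field.
    Returns None when there are no requests, rather than dividing by zero.
    """
    times = [float(line.split()[-1]) for line in log_lines if line.strip()]
    if not times:
        return None
    return statistics.mean(times), max(times), min(times), len(times)


# Hypothetical log lines, not taken from the real test logs
sample = [
    "GET /stream/seg_001.ts 200 0.234",
    "GET /stream/seg_002.ts 200 1.500",
    "GET /stream/seg_003.ts 200 0.766",
]
mean, longest, shortest, count = summarise(sample)
print(f"{mean:.8f},{longest:.8f},{shortest:.8f},{count}")
```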

I've tweaked the output format of the stats, and added a request count in the later stats. The stats are specifically for test plays using the JW Player Stream tester page - http://demo.jwplayer.com/stream-tester/ - with the bandwidth setting set to force the player to use the 1Mb/s stream.

Direct to Origin: 

    Fields: Mean       Max   Min
    Origin: 0.00117829 4.834 0.000


Direct to Midtier:

    Fields:  Mean       Max   Min
    Origin:  0.00125514 2.868 0.036
    Midtier: 0.00143373 5.911 0.000


One Edge node Live (test1)

    Fields:  Mean        Max         Min         Req count
    Origin:  0.00060664  0.75500000  0.02400000  211
    Midtier: 0.00087204  0.88300000  0.09800000  211
    Edge1:   0.01962736  9.90600000  0.23400000  212




Test 2 (One edge node live, cache warm from previous playout). 

    Fields:  Mean        Max         Min         Req count
    Origin:  0.07100000  0.07100000  0.07100000  1
    Midtier: 0.12300000  0.12300000  0.12300000  1
    Edge1:   0.00000472  3.47200000  0.00000000  212


Given that the segments are an average of 2 seconds long (see comments on HLS-5 for why it varies), the observed delivery time of 9 seconds risks the playback session breaking as the buffer underruns.
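
The underrun risk can be illustrated with a very simplified player model. The sketch assumes segments are fetched one at a time, that playback starts with a full buffer, and that the player idles whenever the buffer is full. The function name and the buffer sizes are hypothetical, and apart from the 9.906 second maximum observed in test1 the fetch times are invented.

```python
def count_stalls(segment_seconds, fetch_times, max_buffer):
    """Count playback stalls for sequentially fetched segments.

    segment_seconds: media duration of each segment.
    fetch_times: seconds taken to deliver each segment, in order.
    max_buffer: most media (in seconds) the player will hold.
    """
    buffered = max_buffer  # assumption: playback starts with a full buffer
    stalls = 0
    for fetch_time in fetch_times:
        buffered = min(buffered, max_buffer)  # player idles until there is room
        buffered -= fetch_time                # playback drains the buffer during the fetch
        if buffered < 0:
            stalls += 1                       # buffer ran dry before the segment arrived
            buffered = 0.0
        buffered += segment_seconds           # the fetched segment becomes available
    return stalls


fetch_times = [0.5, 0.5, 9.906, 0.5]
print(count_stalls(2, fetch_times, max_buffer=6))    # three 2 second segments buffered
print(count_stalls(10, fetch_times, max_buffer=30))  # three 10 second segments buffered
```

With three 2 second segments buffered, the single slow delivery causes one stall. With three 10 second segments buffered it causes none, although reusing the same fetch times for larger segments is itself a simplification, since they would take longer to deliver.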

But, based on the stats, it looks like those long durations were the result of difficulties in getting the content to the client, rather than delays introduced while trying to acquire it from upstream (unless the delay came from taking a while to establish an upstream connection).

Will look at getting the other edge node online now so we can see how/if requests balance across them. I'd expect that an individual player would probably use the same edge node for the duration of the playout session (at least, for something as short as Big Buck Bunny) but maybe that's not going to be the case.
Second edge node is online. The configuration had to be tweaked slightly as it was already using a local unbound install, so the Tor client has been configured as a forward zone for onions.
btasker changed timespent from '241 minutes' to '277 minutes'
Ran another test to check that Edge2 is actually capable of delivering. For this, the midtier was warm (from the previous playout) and Edge2 was completely cold.

Test 3 (Edge 2 cold, mid-tier warm)

    Fields:  Mean        Max         Min            Req count
    Origin:  0.00000000  0.00000000  0.00000000     101
    Midtier: 0.00723113  4.68100000  0.01100000     212
    Edge1:   -nan        0.00000000  9999.00000000  0
    Edge2:   0.01183491  5.75600000  0.57500000     212

Dispositions

    Midtier CACHE_HIT: 111
    Midtier CACHE_REVALIDATED: 101
    Edge 2 CACHE_MISS: 212


So, it looks like a single player will always go to the same edge cache, at least for a short playback session.
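
A minimal sketch of how disposition counts like those above could be tallied from a tier's log. It assumes each log line carries a token such as CACHE_HIT, CACHE_MISS or CACHE_REVALIDATED, as the labels above suggest; the log layout and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches tokens such as CACHE_HIT, CACHE_MISS or CACHE_REVALIDATED
DISPOSITION = re.compile(r"\bCACHE_[A-Z]+\b")


def count_dispositions(log_lines):
    """Tally the cache disposition recorded on each log line."""
    counts = Counter()
    for line in log_lines:
        match = DISPOSITION.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Hypothetical log lines, not taken from the real test logs
sample = [
    "GET /stream/seg_001.ts 200 CACHE_HIT 0.011",
    "GET /stream/seg_002.ts 200 CACHE_MISS 0.575",
    "GET /stream/seg_003.ts 200 CACHE_HIT 0.012",
]
for disposition, total in sorted(count_dispositions(sample).items()):
    print(f"{disposition}: {total}")
```

Running the same tally against each edge's log is also a quick way to confirm which edge actually served a given player.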

I'm happy the infrastructure seems to be working, so can start the tests laid out in the design document.
btasker changed timespent from '277 minutes' to '292 minutes'
With a single player, some of the tests in the design document are moot now that we know a single session won't ordinarily balance out across the edge.

So removing the following:

- 2 edge caches online, proxying to origin (cold cache)
- 2 edge caches online, proxying to origin (warm cache)
- 2 edge caches online, mid-tier online, proxying to origin (caching)

from the initial set of tests leaves us with:

- Direct to origin
- 1 edge cache online, proxying to origin (cold cache)
- 1 edge cache online, proxying to origin (warm cache)
- edge cache online, mid-tier online, proxying to origin (caching)
- Multiple players, multiple VoD streams (with overlap between players)
- Multiple players, multiple VoD streams, limited cache space (to force LRUing)

We've essentially already run some of those in testing the setup, but to keep things simple I'll just repeat those tests.

From here on out the bandwidth setting in JW Player will be set to auto so that we can see how often it moves between the available options.
Test8 has had a slightly unexpected outcome in that the log line extraction includes more requests than were seen at the edge. Will need to review when putting the final stats together, but I can't see anything immediately wrong with the test itself
The issue with test8's log was simply an error in the log extraction command, so it had included requests from a previous test. They've been spliced out now.
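
One way to guard against that kind of overlap is to extract by an explicit time window. The issue doesn't say how the extraction was actually done, so this is only a sketch: it assumes nginx's default bracketed timestamp format, and the function name, times and log lines are hypothetical.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"\[([^\]]+)\]")
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 17/Dec/2015:13:45:51 +0000


def lines_in_window(log_lines, start, end):
    """Yield only the lines whose timestamp falls within [start, end]."""
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), TIME_FORMAT)
        if start <= stamp <= end:
            yield line


# Hypothetical log lines: the first belongs to an earlier test run
sample = [
    '127.0.0.1 - - [17/Dec/2015:13:40:00 +0000] "GET /seg_001.ts HTTP/1.1" 200 1024',
    '127.0.0.1 - - [17/Dec/2015:13:50:00 +0000] "GET /seg_002.ts HTTP/1.1" 200 1024',
]
start = datetime.strptime("17/Dec/2015:13:45:00 +0000", TIME_FORMAT)
end = datetime.strptime("17/Dec/2015:14:30:00 +0000", TIME_FORMAT)
kept = list(lines_in_window(sample, start, end))
print(len(kept))
```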
That's the first round of single-player tests finished. Test numbering is:

Tests during Setup

- test1 - One Edge node live (cold)
- test2 - One Edge node live (warm)
- test3 - One edge node live (cold), midtier (warm)
- test4 - Missed a number here.....


Formal tests

- test5 - Direct to origin (via HTTP)
- test6 - Direct to midtier (cold)
- test7 - Direct to midtier (warm)
- test8 - To edge (cold), midtier (warm)
- test9 - To edge (warm), midtier (warm)
- test10 - To edge (cold), midtier (cold)

The multi-client/multi-player tests are next.

The unsurprising observation so far is that when things go well, HLS playout is fine via an .onion, but if issues are experienced there isn't a lot of elbow room in a 2 second fragment to avoid it impacting on playback.
The single stream, multi-player tests are (for now) complete

- test11 - Two players, Edge 2 cache warmish (from previous playout)
- test12 - Two players, Edge 2 (warm)
- test13 - One player, Edge 1 (cold), midtier (cold)
- test14 - Two players, Edge 1 cache warmish (from previous playout)
- test15 - Two players, Edge 1 (warm)
- test16 - Two players, Edge 1 (warm)

Test 16 was essentially a repeat of test 15, because there were far more revalidations than expected in 15. Likely a result of the artificial warm taking longer to run than expected.

Delivery is still a little shaky at times, and the higher bandwidth stream still isn't really being utilised. So, I don't see any point in proceeding with the linear tests at the moment (as the stream will be unwatchable).

I'm going to move onto using an increased segment size (10 seconds) and will run the same tests to see what improvement, if any, it gives. It might be that a midway point (4 or 6 seconds) yields more benefit though.
btasker changed timespent from '292 minutes' to '472 minutes'
The single player tests with 10 second fragments are complete

- Test 17 (Direct to origin)
- Test 18 (Direct to midtier, cold cache)
- Test 19 (Direct to midtier, warm cache)
- Test 20 (Cold edge, warm midtier)
- Test 21 (Warm edge, warm midtier)
- Test 22 (Cold edge, cold midtier)
- Test 23 (Warm edge, warm midtier)
- Test 24 (Artificially warmed edge/midtier)

Test 21 saw some serious delivery issues which appear (based on a quick glance) to have been caused by the circuit to the client collapsing. So Test 23 is essentially a repeat of 21.

Will move onto the multi-player tests in a while
btasker changed timespent from '472 minutes' to '622 minutes'
The multiplayer single stream tests against 10 second fragments are done

- Test 25 - Two players, cold edge
- Test 26 - Two players, warm edge

So that's all the 1080p VoD tests done.

The unsurprising conclusion so far is that Full HD delivery through tiered Tor Hidden Services is possible, but gives an unpredictable playback experience. But then, that's largely the case on the clearnet too.

I need to re-transcode the 720p stream, so I'll run the next set of tests against a 480p copy. Given this issue is already quite long, I'll raise a subtask for any subsequent steps required.
btasker changed timespent from '622 minutes' to '667 minutes'
btasker added 'CDN Experiments Tests Tor Video' to labels
All video tests are complete, and the log entries have been pulled into a MySQL database ready for further analysis. It's not been the most structured/disciplined testing I've ever done, but I think I've largely got the information I need - at least based on watching the video as it streamed.
btasker changed Project from 'BenTasker.co.uk' to 'Miscellaneous'
btasker changed Key from 'BEN-600' to 'MISC-12'
btasker added 'TorCDN' to Version
btasker added 'TorCDN' to Fix Version
Have moved this into the publicly facing MISC project so that I can start putting some of the output of MISC-14 into pubnotes

Work log


Ben Tasker
2015-12-14 13:15:56

Time Spent: 40 minutes
Log Entry: Initial Design

Ben Tasker
2015-12-14 14:02:18

Time Spent: 45 minutes
Log Entry: Updating design, and going off on a tangent.....

Ben Tasker
2015-12-14 14:28:44

Time Spent: 15 minutes
Log Entry: Designing NGinx config

Ben Tasker
2015-12-15 16:21:52

Time Spent: 60 minutes
Log Entry: Setting up origin and testing http/https delivery

Ben Tasker
2015-12-16 11:18:45

Time Spent: 35 minutes
Log Entry: Setting up and testing midtier

Ben Tasker
2015-12-16 12:06:05

Time Spent: 46 minutes
Log Entry: Building edge node

Ben Tasker
2015-12-16 13:55:42

Time Spent: 36 minutes
Log Entry: Building second edge node

Ben Tasker
2015-12-16 14:39:23

Time Spent: 15 minutes
Log Entry: Testing Edge2

Ben Tasker
2015-12-17 13:45:51

Time Spent: 180 minutes
Log Entry: Running Single stream, multiplayer tests.

Clearly massively underestimated the time needed for this project....

Ben Tasker
2015-12-18 12:48:22

Time Spent: 150 minutes
Log Entry: Running single client, 60fps 10sec segment tests

Ben Tasker
2015-12-18 16:05:56

Time Spent: 45 minutes
Log Entry: Running multiple player, single stream tests against 10s segments