##########################################################################################
BHAYSTACKG-2: Current Test configuration exhausted File Descriptors on remote Web Server
##########################################################################################

Issue Type: Bug
-----------------------------------------------------------------------------------------

Issue Information
====================

Priority: Major
Status: Closed
Resolution: Fixed (2014-12-09 13:49:35)
Project: BASH Haystack Generator (BHAYSTACKG)
Reported By: btasker
Assigned To: btasker

Components:
- Testing

Affected Versions:
- 0.1

Targeted for fix in version:
- 0.1

Time Estimate: 0 minutes
Time Logged: 0 minutes

-----------------------------------------------------------------------------------------

Issue Description
==================

The test configuration did make a noticeable difference to the shape of the traffic graphs. However, this was undermined when, at roughly 23:30 last night, the remote server hit its FD limit and stopped responding to requests (though it continued to accept them).

The result is that the traffic graphs (measured client-side) now show plenty of egress but no measurable ingress (and of course requests for other things aren't being satisfied either).

The configuration used is as follows:

*Cron*

-- BEGIN SNIPPET --
*/15 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/5 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null
-- END SNIPPET --

*test1 config*

-- BEGIN SNIPPET --
LOCKFILE='/tmp/haystackfile.lock'

# Endpoint configuration
REMOTE_FILE='http://server.test/haystackfile.img'
HTTP_HOSTNAME='haystack.test'

# Data Downloads
MAX_CHUNK_SIZE=6553600        # Maximum chunk size to request (bytes)
MAX_REQUESTS=100              # Maximum number of requests per session
RANDOMIZE_CHUNKSIZE="y"       # use the same size chunk-size throughout a session
USER_AGENT='HaystackGen V0.1'
MAX_DELAY=5                   # maximum delay between requests

# Data uploads
SEND_DATA="r"                 # Send random data upstream?
MAX_UPSTREAM="655360"         # bytes
RANDOMIZE_US_CHUNKSIZE="y"    # use the same size upstream per session
-- END SNIPPET --

*test1-trickle config*

-- BEGIN SNIPPET --
LOCKFILE='/tmp/haystackfile.lock'

# Endpoint configuration
REMOTE_FILE='http://server.test/haystackfile.img'
HTTP_HOSTNAME='haystack.test'

# Data Downloads
MAX_CHUNK_SIZE=393216         # Maximum chunk size to request (bytes)
MAX_REQUESTS=1200             # Maximum number of requests per session
RANDOMIZE_CHUNKSIZE="y"       # use the same size chunk-size throughout a session
USER_AGENT='HaystackGen V0.1'
MAX_DELAY=3

# Data uploads
SEND_DATA="r"                 # Send random data upstream?
MAX_UPSTREAM="4096"           # bytes
RANDOMIZE_US_CHUNKSIZE="n"
-- END SNIPPET --

Obviously a server-side fix would be to increase the FD limits, but I want to see whether it's possible to achieve a config which has the desired effect on the traffic graphs without needing to change the default FD limits.

For reference, on the test server:

-- BEGIN SNIPPET --
$ ulimit -Hn
1024
$ ulimit -Sn
1024
-- END SNIPPET --

Endpoint is running NGinx.
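If it does come to raising the limit server-side, the change would presumably look something like the sketch below. This is illustrative only: the user name, file paths and values are assumptions rather than what's actually configured on the test box.

-- BEGIN SNIPPET --
# /etc/security/limits.conf - raise the per-user FD ceiling (user name and value illustrative)
www-data soft nofile 10000
www-data hard nofile 10000

# nginx.conf - let the workers actually make use of the higher ceiling
worker_rlimit_nofile 10000;

events {
    worker_connections 4096;
}
-- END SNIPPET --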
-----------------------------------------------------------------------------------------

Issue Relations
================

- relates to BHAYSTACKG-3: Analyse desirable traffic behaviour

-----------------------------------------------------------------------------------------
Activity
==========

-----------------------------------------------------------------------------------------
2014-12-08 11:23:11 btasker
-----------------------------------------------------------------------------------------

The script was deliberately designed so that scheduled runs could run into each other, but I suspect I've made a miscalculation on the appropriate interval.

Looking at the test client, there are 41 client instances running and requesting data:

-- BEGIN SNIPPET --
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
41
-- END SNIPPET --

The vast majority of these are the trickle script. Some are probably stuck as a result of the issues with the remote server, so have killed the lot:

-- BEGIN SNIPPET --
[root@testclient ben]# for i in `ps aux | grep "BASH_Haystack_generator" | awk '{print $2}'`;do kill $i; done
-- END SNIPPET --

Will check back in a bit to see what the count is.

-----------------------------------------------------------------------------------------
2014-12-08 12:24:20 btasker
-----------------------------------------------------------------------------------------

Definitely building up:

-- BEGIN SNIPPET --
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
16
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
10
-- END SNIPPET --

Will need to do some fine-tuning, I guess.

-----------------------------------------------------------------------------------------
2014-12-08 13:49:29
-----------------------------------------------------------------------------------------
btasker added '0.1' to Version

-----------------------------------------------------------------------------------------
2014-12-08 13:49:29
-----------------------------------------------------------------------------------------
btasker added '0.1' to Fix Version

-----------------------------------------------------------------------------------------
2014-12-08 13:52:46 btasker
-----------------------------------------------------------------------------------------

Have raised BHAYSTACKG-3 to do some more analysis on desirable traffic patterns/behaviour.
In the meantime, as the generator is obviously able to run for longer than I expected:

-- BEGIN SNIPPET --
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
28
-- END SNIPPET --

Am reducing the run frequencies in crontab:

-- BEGIN SNIPPET --
*/25 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/15 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null
-- END SNIPPET --

Suspect there'll still be some overlap, but it should at least reduce the effect of runaway build-up.

-----------------------------------------------------------------------------------------
2014-12-08 13:55:39 btasker
-----------------------------------------------------------------------------------------

Worth noting that when NGinx hit its FD limit, the log entry

-- BEGIN SNIPPET --
[alert] 13735#0: accept() failed (24: Too many open files)
-- END SNIPPET --

was logged so many times that the error log hit 7GB.

-----------------------------------------------------------------------------------------
2014-12-08 14:39:44 btasker
-----------------------------------------------------------------------------------------

The scheduling change has reduced the build-up of overlapping runs, but the traffic levels on the tunnel are now very low. From some of the work done in BHAYSTACKG-2, I think the best way forward is to adjust the script to have different 'profiles' so that it can ape the behaviour of some legitimate traffic patterns.

-----------------------------------------------------------------------------------------
2014-12-08 15:28:28 btasker
-----------------------------------------------------------------------------------------

Just had a repeat event. Interestingly, NGinx's error log went from 0 bytes to 7GB in 11 seconds (and there weren't that many requests during that time - only 8 requests logged in the access log). It's possible the issue isn't what was first thought.

On the off-chance it's an NGinx-specific issue, have disabled sending additional POST data upstream to see whether it makes a difference (seems unlikely though).

-----------------------------------------------------------------------------------------
2014-12-08 23:19:46 btasker
-----------------------------------------------------------------------------------------

Haven't seen a repeat yet, though there've also been significantly fewer runs overlapping. Have configured a second test server and am also running against that to see whether the issue can be reproduced.

-----------------------------------------------------------------------------------------
2014-12-09 09:56:57 btasker
-----------------------------------------------------------------------------------------

No repeat event on either server overnight. Have re-enabled POST (i.e. set SEND_UPSTREAM to 'r') in the client for one server to see whether we then see the issue again. Have also re-increased the frequency in crontab to increase the likelihood of reproducing the issue:

-- BEGIN SNIPPET --
*/15 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/5 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null
-- END SNIPPET --

If/when the issue re-occurs, will check NGinx's FD usage to confirm for definite whether it's an issue of hitting file descriptor limits.
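For reference, the sort of check that would involve is sketched below; the pgrep lookup and running it as root on the server are assumptions rather than the exact commands that will be used.

-- BEGIN SNIPPET --
# Compare NGinx's open-FD count against its limit (PID discovery is illustrative)
NGINX_PID=$(pgrep -o -x nginx)
ls /proc/$NGINX_PID/fd | wc -l
grep "Max open files" /proc/$NGINX_PID/limits
-- END SNIPPET --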
-----------------------------------------------------------------------------------------
2014-12-09 10:58:34 btasker
-----------------------------------------------------------------------------------------

Definitely a FD issue:

-- BEGIN SNIPPET --
root@test:/etc/security# lsof -p 32118 | wc -l
1042
root@test:/etc/security# cat /proc/32118/limits | grep "Max open files"
Max open files            1024                 1024                 files
-- END SNIPPET --

Looking at the traffic patterns generated by running the script less frequently, the result is pretty undesirable: it's possible to roughly identify when the script is likely to be running, so the increased frequency of runs is currently required.

Until the generated traffic patterns get a bit smarter (BHAYSTACKG-3), the only real solution is to up the FD limit (1024 is really low anyway) so that an appropriate frequency of checks can be maintained. Have set the FD limit to 10,000 on the test server; will check back in a while to see what number is actually used.

-----------------------------------------------------------------------------------------
2014-12-09 13:17:14 btasker
-----------------------------------------------------------------------------------------

Although there are only 6 sessions currently running, NGinx seems to have a lot of sockets sat at CLOSE_WAIT:

-- BEGIN SNIPPET --
root@tora:/etc/nginx# lsof -p 884 | grep CLOSE_WAIT | wc -l
849
-- END SNIPPET --

It's also using a lot of file descriptors for the haystack file:

-- BEGIN SNIPPET --
root@tora:/etc/nginx# lsof -p 884 | grep haystackfile | wc -l
2042
-- END SNIPPET --

After waiting a little bit, NGinx has also started complaining that it's run out of worker connections:

-- BEGIN SNIPPET --
2014/12/09 11:44:08 [alert] 884#0: 1024 worker_connections are not enough
2014/12/09 11:44:08 [alert] 884#0: 1024 worker_connections are not enough
2014/12/09 11:44:09 [alert] 884#0: 1024 worker_connections are not enough
-- END SNIPPET --

I'm going to go out on a limb and guess that because NGinx thinks it's using a persistent connection (the default for HTTP/1.1 being keep-alive) it's not closing the connection, and for some reason the FIN isn't doing it (which would be a bug in NGinx if that's the case).

Adding a 'Connection: close' header to the request, but disabling all runs first so NGinx can be restarted to make counting the number of CLOSE_WAITs a little easier.
-- BEGIN SNIPPET --
[root@testclient Haystack_configs]# touch /tmp/haystackfile.lock
[root@testclient Haystack_configs]# for i in `ps aux | grep Haystack | awk '{print $2}'`; do kill $i; done
-- END SNIPPET --

On the server:

-- BEGIN SNIPPET --
[root@test ~]# service nginx restart
[root@test ~]# lsof -p 7494 | grep "CLOSE_WAIT" | wc -l
0
-- END SNIPPET --

Updated the relevant lines in the placeRequest function:

-- BEGIN SNIPPET --
if [ "$UPSTREAM" == 'n' ]
then
    # Place the request
    curl -H "Connection: close" -H "User-agent: $USER_AGENT" -H "Host: $HOST" -H "Range: bytes=$START-$END" "$URL" > /dev/null
else
    curl -X POST -d "$DATA" -H "Connection: close" -H "User-agent: $USER_AGENT" -H "Host: $HOST" -H "Range: bytes=$START-$END" "$URL" > /dev/null
fi
-- END SNIPPET --

and removed the lockfile:

-- BEGIN SNIPPET --
[root@testclient Haystack_configs]# rm -f /tmp/haystackfile.lock
-- END SNIPPET --

Next scheduled cronjob should kick in in about 4 minutes.

-----------------------------------------------------------------------------------------
2014-12-09 13:27:04 btasker
-----------------------------------------------------------------------------------------

The scheduled cron has fired, and I've also manually triggered some runs:

-- BEGIN SNIPPET --
[root@testclient ~]# ps aux | grep configs/test | wc -l
5
-- END SNIPPET --

The server isn't currently showing any sockets in CLOSE_WAIT:

-- BEGIN SNIPPET --
[root@test:~]# lsof -p 7494 | grep "CLOSE_WAIT" | wc -l
0
-- END SNIPPET --

There are runs occurring that are using POST and others using GET, so it does look like a bug in NGinx, though the version in the repos appears to be quite old:

-- BEGIN SNIPPET --
[root@test:~]# nginx -v
nginx version: nginx/0.7.67
-- END SNIPPET --

Will look at compiling a newer version of NGinx in a while to re-test, but it seems best to pre-empt this bug within the client script anyway by adding a _Connection_ header.

-----------------------------------------------------------------------------------------
2014-12-09 13:40:15 btasker
-----------------------------------------------------------------------------------------

Note: Have moved the codebase over to GitHub so that the commit notifications link to a publicly accessible repo.

-----------------------------------------------------------------------------------------
2014-12-09 13:43:09 git
-----------------------------------------------------------------------------------------

-- BEGIN QUOTE --
Repo: BASH-Haystack-Generator
Commit: aca5d7a8212c45ff25dc1e99497e119329bb9ee5
Author: Ben Tasker