##########################################################################################
BHAYSTACKG-2: Current Test configuration exhausted File Descriptors on remote Web Server
##########################################################################################

Issue Type: Bug
-----------------------------------------------------------------------------------------

Issue Information
====================

Priority: Major
Status: Closed
Resolution: Fixed (2014-12-09 13:49:35)
Project: BASH Haystack Generator (BHAYSTACKG)
Reported By: btasker
Assigned To: btasker

Components:
- Testing

Affected Versions:
- 0.1

Targeted for fix in version:
- 0.1

Time Estimate: 0 minutes
Time Logged: 0 minutes

-----------------------------------------------------------------------------------------

Issue Description
==================

The test configuration did make a noticeable difference to the shape of the traffic graphs. However, this was undermined when, at roughly 23:30 last night, the remote server hit its FD limit and stopped responding to requests (though it continued to accept them).

The result is that the traffic graphs (measured client-side) now show plenty of egress but no measurable ingress (and of course requests for other things aren't being satisfied either).

The configuration used is as follows:

*Cron*

-- BEGIN SNIPPET --
*/15 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/5 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null
-- END SNIPPET --

*test1 config*

-- BEGIN SNIPPET --
LOCKFILE='/tmp/haystackfile.lock'

# Endpoint configuration
REMOTE_FILE='http://server.test/haystackfile.img'
HTTP_HOSTNAME='haystack.test'

# Data Downloads
MAX_CHUNK_SIZE=6553600        # Maximum chunk size to request (bytes)
MAX_REQUESTS=100              # Maximum number of requests per session
RANDOMIZE_CHUNKSIZE="y"       # use the same size chunk-size throughout a session
USER_AGENT='HaystackGen V0.1'
MAX_DELAY=5                   # maximum delay between requests

# Data uploads
SEND_DATA="r"                 # Send random data upstream?
MAX_UPSTREAM="655360"         # bytes
RANDOMIZE_US_CHUNKSIZE="y"    # use the same size upstream per session
-- END SNIPPET --

*test1-trickle config*

-- BEGIN SNIPPET --
LOCKFILE='/tmp/haystackfile.lock'

# Endpoint configuration
REMOTE_FILE='http://server.test/haystackfile.img'
HTTP_HOSTNAME='haystack.test'

# Data Downloads
MAX_CHUNK_SIZE=393216         # Maximum chunk size to request (bytes)
MAX_REQUESTS=1200             # Maximum number of requests per session
RANDOMIZE_CHUNKSIZE="y"       # use the same size chunk-size throughout a session
USER_AGENT='HaystackGen V0.1'
MAX_DELAY=3

# Data uploads
SEND_DATA="r"                 # Send random data upstream?
MAX_UPSTREAM="4096"           # bytes
RANDOMIZE_US_CHUNKSIZE="n"
-- END SNIPPET --

Obviously a server-side fix would be to increase the FD limits, but I want to see whether it's possible to achieve a config which has the desired effect on the traffic graphs without needing to change the default FD limits.

For reference, on the test server:

-- BEGIN SNIPPET --
$ ulimit -Hn
1024
$ ulimit -Sn
1024
-- END SNIPPET --

Endpoint is running NGinx.
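If it does come to raising the limit server-side, the change would presumably look something like the sketch below. This is illustrative only: the user name, file paths and values are assumptions rather than what's actually configured on the test box.

-- BEGIN SNIPPET --
# /etc/security/limits.conf - raise the per-user FD ceiling (user name and value illustrative)
www-data soft nofile 10000
www-data hard nofile 10000

# nginx.conf - let the workers actually make use of the higher ceiling
worker_rlimit_nofile 10000;

events {
    worker_connections 4096;
}
-- END SNIPPET --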
-----------------------------------------------------------------------------------------

Issue Relations
================

- relates to BHAYSTACKG-3: Analyse desirable traffic behaviour

-----------------------------------------------------------------------------------------
Activity
==========

-----------------------------------------------------------------------------------------
2014-12-08 11:23:11 btasker
-----------------------------------------------------------------------------------------

The script was deliberately designed so that scheduled runs could run into each other, but I suspect I've made a miscalculation on the appropriate interval.

Looking at the test client, there are 41 client instances running and requesting data:

-- BEGIN SNIPPET --
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
41
-- END SNIPPET --

The vast majority of these are the trickle script. Some are probably stuck as a result of the issues with the remote server, so have killed the lot:

-- BEGIN SNIPPET --
[root@testclient ben]# for i in `ps aux | grep "BASH_Haystack_generator" | awk '{print $2}'`;do kill $i; done
-- END SNIPPET --

Will check back in a bit to see what the count is.

-----------------------------------------------------------------------------------------
2014-12-08 12:24:20 btasker
-----------------------------------------------------------------------------------------

Definitely building up:

-- BEGIN SNIPPET --
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
16
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
10
-- END SNIPPET --

Will need to do some fine-tuning, I guess.

-----------------------------------------------------------------------------------------
2014-12-08 13:49:29
-----------------------------------------------------------------------------------------
btasker added '0.1' to Version

-----------------------------------------------------------------------------------------
2014-12-08 13:49:29
-----------------------------------------------------------------------------------------
btasker added '0.1' to Fix Version

-----------------------------------------------------------------------------------------
2014-12-08 13:52:46 btasker
-----------------------------------------------------------------------------------------

Have raised BHAYSTACKG-3 to do some more analysis on desirable traffic patterns/behaviour.
In the meantime, as the generator is obviously able to run for longer than I expected:

-- BEGIN SNIPPET --
[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
28
-- END SNIPPET --

Am reducing the run frequencies in crontab:

-- BEGIN SNIPPET --
*/25 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/15 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null
-- END SNIPPET --

Suspect there'll still be some overlap, but it should at least reduce the effect of runaway build-up.

-----------------------------------------------------------------------------------------
2014-12-08 13:55:39 btasker
-----------------------------------------------------------------------------------------

Worth noting that when NGinx hit its FD limit, the log entry

-- BEGIN SNIPPET --
[alert] 13735#0: accept() failed (24: Too many open files)
-- END SNIPPET --

was logged so many times that the error log hit 7GB.

-----------------------------------------------------------------------------------------
2014-12-08 14:39:44 btasker
-----------------------------------------------------------------------------------------

The scheduling change has reduced the build-up of overlapping runs, but the traffic levels on the tunnel are now very low. From some of the work done in BHAYSTACKG-2, I think the best way forward is to adjust the script to have different 'profiles' so that it can ape the behaviour of some legitimate traffic patterns.

-----------------------------------------------------------------------------------------
2014-12-08 15:28:28 btasker
-----------------------------------------------------------------------------------------

Just had a repeat event. Interestingly, NGinx's error log went from 0 bytes to 7GB in 11 seconds (and there weren't that many requests during that time - only 8 requests logged in the access log). It's possible the issue isn't what was first thought.

On the off-chance it's an NGinx-specific issue, have disabled sending additional POST data upstream to see whether it makes a difference (seems unlikely though).

-----------------------------------------------------------------------------------------
2014-12-08 23:19:46 btasker
-----------------------------------------------------------------------------------------

Haven't seen a repeat yet, though there've also been significantly fewer runs overlapping. Have configured a second test server and am also running against that to see whether the issue can be reproduced.

-----------------------------------------------------------------------------------------
2014-12-09 09:56:57 btasker
-----------------------------------------------------------------------------------------

No repeat event on either server overnight. Have re-enabled POST (i.e. set SEND_UPSTREAM to 'r') in the client for one server to see whether we then see the issue again. Have also re-increased the frequency in crontab to increase the likelihood of reproducing the issue:

-- BEGIN SNIPPET --
*/15 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/5 * * * * ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null
-- END SNIPPET --

If/when the issue re-occurs, will check NGinx's FD usage to confirm for definite whether it's an issue of hitting file descriptor limits.
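For reference, the sort of check that would involve is sketched below; the pgrep lookup and running it as root on the server are assumptions rather than the exact commands that will be used.

-- BEGIN SNIPPET --
# Compare NGinx's open-FD count against its limit (PID discovery is illustrative)
NGINX_PID=$(pgrep -o -x nginx)
ls /proc/$NGINX_PID/fd | wc -l
grep "Max open files" /proc/$NGINX_PID/limits
-- END SNIPPET --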
-----------------------------------------------------------------------------------------
2014-12-09 10:58:34 btasker
-----------------------------------------------------------------------------------------

Definitely a FD issue:

-- BEGIN SNIPPET --
root@test:/etc/security# lsof -p 32118 | wc -l
1042
root@test:/etc/security# cat /proc/32118/limits | grep "Max open files"
Max open files            1024                 1024                 files
-- END SNIPPET --

Looking at the traffic patterns generated by running the script less frequently, the result is pretty undesirable: it's possible to roughly identify when the script is likely to be running, so the increased frequency of runs is currently required.

Until the generated traffic patterns get a bit smarter (BHAYSTACKG-3), the only real solution is to up the FD limit (1024 is really low anyway) so that an appropriate frequency of checks can be maintained. Have set the FD limit to 10,000 on the test server; will check back in a while to see what number is actually used.

-----------------------------------------------------------------------------------------
2014-12-09 13:17:14 btasker
-----------------------------------------------------------------------------------------

Although there are only 6 sessions currently running, NGinx seems to have a lot of sockets sat at CLOSE_WAIT:

-- BEGIN SNIPPET --
root@tora:/etc/nginx# lsof -p 884 | grep CLOSE_WAIT | wc -l
849
-- END SNIPPET --

It's also using a lot of file descriptors for the haystack file:

-- BEGIN SNIPPET --
root@tora:/etc/nginx# lsof -p 884 | grep haystackfile | wc -l
2042
-- END SNIPPET --

After waiting a little bit, NGinx has also started complaining that it's run out of worker connections:

-- BEGIN SNIPPET --
2014/12/09 11:44:08 [alert] 884#0: 1024 worker_connections are not enough
2014/12/09 11:44:08 [alert] 884#0: 1024 worker_connections are not enough
2014/12/09 11:44:09 [alert] 884#0: 1024 worker_connections are not enough
-- END SNIPPET --

I'm going to go out on a limb and guess that because NGinx thinks it's using a persistent connection (the default for HTTP/1.1 being keep-alive) it's not closing the connection, and for some reason the FIN isn't doing it (which would be a bug in NGinx if that's the case).

Adding a 'Connection: close' header to the request, but disabling all runs first so NGinx can be restarted to make counting the number of CLOSE_WAITs a little easier.
-- BEGIN SNIPPET --
[root@testclient Haystack_configs]# touch /tmp/haystackfile.lock
[root@testclient Haystack_configs]# for i in `ps aux | grep Haystack | awk '{print $2}'`; do kill $i; done
-- END SNIPPET --

On the server:

-- BEGIN SNIPPET --
[root@test ~]# service nginx restart
[root@test ~]# lsof -p 7494 | grep "CLOSE_WAIT" | wc -l
0
-- END SNIPPET --

Updated the relevant lines in the placeRequest function:

-- BEGIN SNIPPET --
if [ "$UPSTREAM" == 'n' ]
then
    # Place the request
    curl -H "Connection: close" -H "User-agent: $USER_AGENT" -H "Host: $HOST" -H "Range: bytes=$START-$END" "$URL" > /dev/null
else
    curl -X POST -d "$DATA" -H "Connection: close" -H "User-agent: $USER_AGENT" -H "Host: $HOST" -H "Range: bytes=$START-$END" "$URL" > /dev/null
fi
-- END SNIPPET --

and removed the lockfile:

-- BEGIN SNIPPET --
[root@testclient Haystack_configs]# rm -f /tmp/haystackfile.lock
-- END SNIPPET --

Next scheduled cronjob should kick in in about 4 minutes.

-----------------------------------------------------------------------------------------
2014-12-09 13:27:04 btasker
-----------------------------------------------------------------------------------------

The scheduled cron has fired, and I've also manually triggered some runs:

-- BEGIN SNIPPET --
[root@testclient ~]# ps aux | grep configs/test | wc -l
5
-- END SNIPPET --

The server isn't currently showing any sockets in CLOSE_WAIT:

-- BEGIN SNIPPET --
[root@test:~]# lsof -p 7494 | grep "CLOSE_WAIT" | wc -l
0
-- END SNIPPET --

There are runs occurring that are using POST and others using GET, so it does look like a bug in NGinx, though the version in the repos appears to be quite old:

-- BEGIN SNIPPET --
[root@test:~]# nginx -v
nginx version: nginx/0.7.67
-- END SNIPPET --

Will look at compiling a newer version of NGinx in a while to re-test, but it seems best to pre-empt this bug within the client script anyway by adding a _Connection_ header.

-----------------------------------------------------------------------------------------
2014-12-09 13:40:15 btasker
-----------------------------------------------------------------------------------------

Note: Have moved the codebase over to GitHub so that the commit notifications link to a publicly accessible repo.

-----------------------------------------------------------------------------------------
2014-12-09 13:43:09 git
-----------------------------------------------------------------------------------------

-- BEGIN QUOTE --
Repo: BASH-Haystack-Generator
Commit: aca5d7a8212c45ff25dc1e99497e119329bb9ee5
Author: Ben Tasker