BHAYSTACKG-2: Current Test configuration exhausted File Descriptors on remote Web Server



Issue Information

Issue Type: Bug
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: BASH Haystack Generator (BHAYSTACKG)
Resolution: Fixed (2014-12-09 13:49:35)
Affects Version: 0.1,
Target version: 0.1,
Components: Testing ,

Created: 2014-12-08 11:02:23
Time Spent Working


Description
The test configuration did make a noticeable difference to the shape of traffic graphs.

However this was undermined as at roughly 2330 last night the remote server hit it's FD limit and stopped responding to requests (though it continued to accept requests).

The result was that the traffic graphs (measured client side) now show plenty of egress but no measurable ingress (and of course requests for other things aren't being satisfied either).

The configuration used is as follows

Cron

*/15 * * * *    ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/5 * * * *     ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null


*test1 config

LOCKFILE='/tmp/haystackfile.lock'

# Endpoint configuration
REMOTE_FILE='http://server.test/haystackfile.img'
HTTP_HOSTNAME='haystack.test'

# Data Downloads
MAX_CHUNK_SIZE=6553600 # Maximum chunk size to request (bytes)
MAX_REQUESTS=100 # Maximum number of requests per session
RANDOMIZE_CHUNKSIZE="y" # use the same size chunk-size throughout a session
USER_AGENT='HaystackGen V0.1'
MAX_DELAY=5 # maximum delay between requests

# Data uploads
SEND_DATA="r" # Send random data upstream?
MAX_UPSTREAM="655360" # bytes
RANDOMIZE_US_CHUNKSIZE="y" # use the same size upstream per session



test1-trickle config

LOCKFILE='/tmp/haystackfile.lock'

# Endpoint configuration
REMOTE_FILE='http://server.test/haystackfile.img'
HTTP_HOSTNAME='haystack.test'

# Data Downloads
MAX_CHUNK_SIZE=393216 # Maximum chunk size to request (bytes)
MAX_REQUESTS=1200 # Maximum number of requests per session
RANDOMIZE_CHUNKSIZE="y" # use the same size chunk-size throughout a session
USER_AGENT='HaystackGen V0.1'
MAX_DELAY=3 

# Data uploads
SEND_DATA="r" # Send random data upstream?
MAX_UPSTREAM="4096" # bytes
RANDOMIZE_US_CHUNKSIZE="n"


Obviously a server-side fix would be to increase FD limits, but want to see whether it's possible to achieve a config which has the desired effect on the traffic graphs without needing to change the default FD limits.

For reference, on the test server

$ ulimit -Hn
1024
$ ulimit -Sn
1024


Endpoint is running NGinx.


Issue Links

Toggle State Changes

Activity


The script was deliberately designed so that scheduled runs could run into each other, but I suspect I've made a miscalculation on the appropriate interval. Looking at the test client, there are 41 client instances running and requesting data

root@testclient ben]# ps aux # for i in `ps aux | grep "BASH_Haystack_generator" | awk '{print $2}'`;do kill $i; done


will check back in a bit to see what the count is
Definitely building up

root@testclient ben]# ps aux # ps aux | grep "BASH_Haystack_generator" | wc -l
10


Will need to do some fine-tuning I guess
btasker added '0.1' to Version
btasker added '0.1' to Fix Version
Have raised BHAYSTACKG-3 to do some more analysis on desirable traffic patterns/behaviour.

In the meantime as the generator is obviously able to run for longer than I expected

[root@testclient ben]# ps aux | grep "BASH_Haystack_generator" | wc -l
28


Am reducing the run frequencies in crontab

*/25 * * * *    ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/15 * * * *     ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null


Suspect there'll still be some overlap, but it should at least reduce the effect of runaway build up
Worth noting that when NGinx hit it's FD limit, the log entry

[alert] 13735#0: accept() failed (24: Too many open files)


Was logged so many times that the error log hit 7GB
The scheduling change has reduced the build-up of overlapping runs, but the traffic levels on the tunnel are now very low.

From some of the work done in BHAYSTACKG-2, I think the best way forward is to adjust the script to have different 'profiles' so that it can ape the behaviour of some legitimate traffic patterns.
Just had a repeat event, interestingly, NGinx's error log went from 0 bytes to 7GB in 11 seconds (and there weren't that many requests during that time - 8 requests logged in access log).

It's possible the issue isn't what was first thought. On the offchance it's an NGinx specific issue, have disabled sending additional POST data upstream to see if it makes a difference (seems unlikely though)
Haven't seen a repeat yet, though there've also been significantly fewer runs overlapping.

Have configured a second test server and am also running against that to see whether the issue can be reproduced.
No repeat event on either server overnight.

Have re-enabled POST (i.e. set SEND_UPSTREAM to 'r') in the client for one server and see whether we then see the issue again. Have also re-increased the frequency in crontab to increase the likelihood of reproducing the issue

*/15 * * * *    ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1 > /dev/null
*/5 * * * *     ~/BASH_Haystack_generator/client-side/request_generator.sh -c test1-trickle > /dev/null


If/when the issue re-occurs, will check NGinx's FD usage to confirm for definite whether it's an issue of hitting file descriptor limits.
Definitely a FD issue

root@test:/etc/security# lsof -p 32118 | wc -l
1042

root@test:/etc/security# cat /proc/32118/limits | grep "Max open files"
Max open files            1024                 1024                 files 


Looking at the traffic patterns generated by running the script less frequently; the result is pretty undesirable as it's possible to roughly identify when the script is likely to be running, so the increased frequency of runs is currently required.

Until the generated traffic patterns get a bit smarter (BHAYSTACKG-3), the only real solution is to up the FD limit (1024 is really low anyway) so that an appropriate frequency of checks can be maintained.

Have set the FD limit to 10,000 on the test server, will check back in a while to see what number is actually used

Although there are only currently 6 sessions running, NGinx seems to have a lot of sockets sat at CLOSE_WAIT

root@tora:/etc/nginx# lsof -p 884 | grep CLOSE_WAIT | wc -l
849


It's also using a lot of file descriptors for the haystack file

root@tora:/etc/nginx# lsof -p 884 | grep haystackfile | wc -l
2042


After waiting a little bit, NGinx has also started complaining it's run out of workers

2014/12/09 11:44:08 


On the server

[root@test ~'>alert] 884#0: 1024 worker_connections are not enough
2014/12/09 11:44:08 [alert] 884#0: 1024 worker_connections are not enough
2014/12/09 11:44:09 [alert] 884#0: 1024 worker_connections are not enough


I'm going to go out on a limb and guess that because NGinx thinks its using a persistent connection (the default for HTTP 1.1 being keep-alive) it's not closing the connection, and for some reason the FIN isn't doing it (which would be a bug in NGinx if that's the case).

Adding a 'Connection: close' header to the response, but disabling all runs first so NGinx can be restarted to make counting the number of CLOSE_WAITs a little easier.


[root@testclient Haystack_configs]# touch /tmp/haystackfile.lock
[root@testclient Haystack_configs]# for i in `ps aux # service nginx restart
root@test ~]# lsof -p 7494 
        then
                # Place the request
                curl -H "Connection: close" -H "User-agent: $USER_AGENT" -H "Host: $HOST" -H "Range: bytes=$START-$END" "$URL" > /dev/null
        else
                curl -X POST -d "$DATA" -H "Connection: close" -H "User-agent: $USER_AGENT" -H "Host: $HOST" -H "Range: bytes=$START-$END" "$URL" > /dev/null
        fi


and removed the lockfile

[root@testclient Haystack_configs]# rm -f /tmp/haystackfile.lock


Next scheduled cronjob should kick in in about 4 minutes
The scheduled cron has fired, and I've also manually triggered some runs



[root@test:~'>root@testclient ~]# ps aux # lsof -p 7494 | grep "CLOSE_WAIT" | wc -l
0


There are runs occuring that are using POST and also others using GET, so it looks like it's a bug in NGinx, though the version in the repos appears to be quite old

[root@test:~]# nginx -v
nginx version: nginx/0.7.67


Will look at compiling a newer version of NGinx in a while to re-test, but it seems best to pre-empt this bug within the client script anyway by adding a Connection header.
Note: Have moved the codebase over to GitHub so that the commit notifications link to a publicly accessible repo

Repo: BASH-Haystack-Generator
Commit: aca5d7a8212c45ff25dc1e99497e119329bb9ee5
Author: Ben Tasker <github@<Domain Hidden>>

Date: Tue Dec 09 13:40:21 2014 +0000
Commit Message: Disabled keep-alive connections to work around an NGinx bug. See BHAYSTACKG-2



Modified (-)(+)
-------
client-side/request_generator.sh




Webhook User-Agent

GitHub-Hookshot/dd9ba0b


View Commit

Marking this issue as fixed as the root cause has been identified and worked around.
btasker changed status from 'Open' to 'Resolved'
btasker added 'Fixed' to resolution
btasker changed status from 'Resolved' to 'Closed'
Just to doublecheck - checked the current FD usage and it's now staying nice and low (so the default 1024 FD limit would also be acceptable)

root@test:~# lsof -p 7494 | wc -l
29