MISC-34: FFMPEG Segfault on Clustered Pis writing to remote NFS Share



Issue Information

Issue Type: Bug
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: Miscellaneous (MISC)
Resolution: Fixed (2019-07-28 14:08:50)

Created: 2019-07-28 11:46:38
Time Spent Working


Description
I recently built a cluster of HLS Muxing PXE Booted Raspberry Pi 3's.

They write to a remote NFS share, exported from a server in a datacentre and mounted over VPN (RTT averages around 150ms).

The root filesystem of the Pi's (and the tmpdir they use) is an NFS mount from a local NFS server (i.e. sub-millisecond RTTs).

I discovered that ffmpeg has periodically been generating a Segmentation Fault.

It's not at all consistent, even across the same file, and there's next to nothing in the logs except for the fault being generated
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: ffmpeg command found.... continuing
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Bitrate options: -b:v 585k
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Generating HLS segments for bitrate 585k - this may take some time
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Bitrate options: -b:v 1171k
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Generating HLS segments for bitrate 1171k - this may take some time
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: All transcoding processes started, awaiting completion
Jul 26 22:38:26 raspberrypi hlsmuxer[3612]: /home/pi/HLS-Stream-Creator/HLS-Stream-Creator.sh: line 123:  4073 Segmentation fault      $FFMPEG $FFMPEG_INPUT_FLAGS -i "$infile" $PASSVAR -y -vcodec "$VIDEO_CODEC" -acodec "$AUDIO_CODEC" -threads "$NUMTHREADS" -map 0 -flags 
Jul 26 22:38:26 raspberrypi hlsmuxer[3612]: Encoding for bitrate 1171k completed
Jul 26 22:40:55 raspberrypi hlsmuxer[3612]: /home/pi/HLS-Stream-Creator/HLS-Stream-Creator.sh: line 123:  4071 Segmentation fault      $FFMPEG $FFMPEG_INPUT_FLAGS -i "$infile" $PASSVAR -y -vcodec "$VIDEO_CODEC" -acodec "$AUDIO_CODEC" -threads "$NUMTHREADS" -map 0 -flags 
Jul 26 22:40:55 raspberrypi hlsmuxer[3612]: Encoding for bitrate 585k completed


There's no sign in dmesg or anywhere else of any issues occurring (oom-killer certainly hasn't run - and that gives a different logline anyway).

On the NFS server itself, there are lines like the following in syslog
Jul 27 12:53:19 Ikaros kernel: [82334558.711911] NFSD: client 10.16.0.6 testing state ID with incorrect client ID


However, their timings do not coincide (even nearly) to any of the logged segfaults.

Initial thoughts were that the cause may be:

- Contention on the hardware encoder (h264_omx)
- Some transient NFS fault/inavailability


Toggle State Changes

Activity


Created a test run with 10 iterations in order to try and reproduce the issue under various conditions.

First run was a straight repro with no changes
TESTRUN=1
echo '"Test run","Successful","Start time","End Time","Run Time"' | tee tmpdir/testrun-${TESTRUN}.csv 
for i in {1..10}
do

starttime=$( date +'%s' )
# Trigger the encode
./HLS-Stream-Creator/HLS-Stream-Creator.sh -i tmpdir/manualtest.mp4 -o remote/videos/test -b 1171,585 -p manifest -t media -s 5 

endtime=$( date +'%s' )

# See how many manifests contain EXT-X-ENDLIST 
success=$(grep EXT-X-ENDLIST remote/videos/test/*m3u8 | wc -l)

runtime=$(( $endtime - $starttime ))

# Write to a CSV Log
echo "$i,$success,$starttime,$endtime,$runtime" | tee -a tmpdir/testrun-${TESTRUN}.csv

done


Tests were planned and triggered with the following conditions:

- Trigger another test run *without h264_omx enabled - see if we get segfaults
- Trigger another test run with segfaults passing the -f flag to HLS-Stream-Creator so that we only mux one bitrate at a time
- Trigger another test run with segfaults passing the -f and setting NUMTHREADS to 1 so that we only mux one bitrate at a time, with just one thread
- Trigger another test run with ffmpeg in verbose mode to see if we catch anything useful before a segfault
- Resolve the NFS Client ID collision and re-test

Leaving the Client ID collision to last was a deliberate choice as it would likely involve rebooting the Pi, and I didn't want to chance that a reboot would clear some other condition and make it seem like the collision was the cause.

Putting FFMPEG into verbose mode was also deliberately left until later due to the sheer volume of output that would need to be read over - particularly as segfault patterns were inconsistent
Each of the first 3 test runs still resulted in segfaults, so the cause was not contention on the hardware encoder, or something else failing to deal with multiple threads/processes.

ffmpeg was then put into verbose mode, and when a segfault occurred the following was caught
[segment @ 0x1bf6650] Opening 'remote/videos/test/media_1171_00011.ts' for writing
[segment @ 0x1bf6650] Failed to open segment 'remote/videos/test/media_1171_00011.ts'
av_interleaved_write_frame(): Input/output error
./HLS-Stream-Creator/HLS-Stream-Creator.sh: line 123: 20398 Segmentation fault      $FFMPEG $FFMPEG_INPUT_FLAGS -i "$infile" $PASSVAR -y -vcodec "$VIDEO_CODEC" -acodec "$AUDIO_CODEC" -threads "$NUMTHREADS" -map 0 -flags -global_header -f segment -segment_list "$playlist_name" -segment_time "$SEGLENGTH" -segment_format mpeg_ts $resolution $bitrate $FFMPEG_ADDITIONAL $FFMPEG_FLAGS "$OUTPUT_DIRECTORY/$output_name"
Bitrate options: -b:v 585k
Generating HLS segments for bitrate 585k - this may take some time
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'tmpdir/manualtest.mp4':


So, the segfault is the result of av_interleaved_write_frame() failing to sanely handle an I/O error on the filesystem.

That meant that NFS was the underlying cause, but was it a product of the collisions, the latency or something else?

To avoid having to reboot the pi being used for testing, the other 2 were powered off (to remove any potential for conflict), and a test run triggered.

No segfaults.

On the NFS server used for the Pi's root directories, the other 2 Pi's hostnames were changed (by editing etc/hostname).

The reason for this being that, on Debian, the value of nfs4_unique_id appears to be derived from the hostname - https://lists.debian.org/debian-kernel/2016/03/msg00016.html. Due to an issue when provisioning the Pi's all had the same hostname and so were sending the same ID to the NFS server, hence:
Jul 27 12:53:19 Ikaros kernel: [82334558.711911] NFSD: client 10.16.0.6 testing state ID with incorrect client ID


The other 2 Pis were brought back up, and shares mounted, then another testrun was triggered, still no segfaults.

A final run was triggered with the other Pi's also being given encoding work (to ensure there were a high level of reads/writes to the share), no segfaults


So, despite the NFS server's entries in syslog not coinciding with the segfaults, it looks like the accidental duplication of NFS ID's was the underlying cause.
btasker changed status from 'Open' to 'Resolved'
btasker added 'Fixed' to resolution
btasker changed status from 'Resolved' to 'Closed'