I recently built a cluster of HLS-muxing, PXE-booted Raspberry Pi 3s.
They write to a remote NFS share, exported from a server in a datacentre and mounted over VPN (RTT averages around 150ms).
The root filesystem of the Pis (and the tmpdir they use) is an NFS mount from a local NFS server (i.e. sub-millisecond RTTs).
I discovered that ffmpeg has periodically been generating a segmentation fault. It's not at all consistent, even across the same file, and there's next to nothing in the logs except for the fault itself:
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: ffmpeg command found.... continuing
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Bitrate options: -b:v 585k
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Generating HLS segments for bitrate 585k - this may take some time
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Bitrate options: -b:v 1171k
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: Generating HLS segments for bitrate 1171k - this may take some time
Jul 26 22:32:03 raspberrypi hlsmuxer[3612]: All transcoding processes started, awaiting completion
Jul 26 22:38:26 raspberrypi hlsmuxer[3612]: /home/pi/HLS-Stream-Creator/HLS-Stream-Creator.sh: line 123: 4073 Segmentation fault $FFMPEG $FFMPEG_INPUT_FLAGS -i "$infile" $PASSVAR -y -vcodec "$VIDEO_CODEC" -acodec "$AUDIO_CODEC" -threads "$NUMTHREADS" -map 0 -flags
Jul 26 22:38:26 raspberrypi hlsmuxer[3612]: Encoding for bitrate 1171k completed
Jul 26 22:40:55 raspberrypi hlsmuxer[3612]: /home/pi/HLS-Stream-Creator/HLS-Stream-Creator.sh: line 123: 4071 Segmentation fault $FFMPEG $FFMPEG_INPUT_FLAGS -i "$infile" $PASSVAR -y -vcodec "$VIDEO_CODEC" -acodec "$AUDIO_CODEC" -threads "$NUMTHREADS" -map 0 -flags
Jul 26 22:40:55 raspberrypi hlsmuxer[3612]: Encoding for bitrate 585k completed
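There's little to go on there: the script only sees the shell's job-control message. A minimal sketch of catching the fault explicitly from the exit status (the wrapper function is mine, not HLS-Stream-Creator's actual code):

```shell
#!/bin/bash
# Illustrative wrapper (not HLS-Stream-Creator's real code): run a
# command and log explicitly if the kernel killed it with SIGSEGV.
run_and_check() {
    "$@"
    local status=$?
    # A process killed by signal N exits with status 128+N;
    # SIGSEGV is signal 11, so a segfault surfaces as 139.
    if [ "$status" -eq 139 ]; then
        echo "Segmentation fault while running: $*" >&2
    fi
    return "$status"
}
```

Logging the status per invocation at least makes the crashes easier to correlate with events logged elsewhere (such as the NFS server's syslog lines below).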
There's no sign in dmesg, or anywhere else, of any issues occurring (the oom-killer certainly hasn't run - and that generates a different log line anyway).
On the NFS server itself, there are lines like the following in syslog:
Jul 27 12:53:19 Ikaros kernel: [82334558.711911] NFSD: client 10.16.0.6 testing state ID with incorrect client ID
However, their timings do not coincide (even nearly) with any of the logged segfaults.
Initial thoughts were that the cause may be:
- Contention on the hardware encoder (h264_omx)
- Some transient NFS fault/unavailability
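The encoder theory is the cheapest to test in isolation: force software encoding and see whether the faults persist. A hypothetical invocation - filenames and flags simplified, not the script's real command line:

```shell
# Hypothetical test: swap the hardware encoder (h264_omx) for libx264 to
# rule out contention on the Pi's single OMX encoding instance.
# Input/output filenames are placeholders.
ffmpeg -i input.mp4 -vcodec libx264 -acodec aac -b:v 585k \
       -hls_time 10 -hls_list_size 0 output.m3u8
```

If the segfaults continue under libx264, the OMX encoder is unlikely to be the culprit.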
Activity
2019-07-28 11:50:50
The first run was a straight repro with no changes.
Further tests were planned, to be triggered under the following conditions:
- Trigger another test run *without
- Trigger another test run with segfaults passing the
- Trigger another test run with segfaults passing the
- Trigger another test run with
- Resolve the NFS Client ID collision and re-test
Leaving the Client ID collision to last was a deliberate choice as it would likely involve rebooting the Pi, and I didn't want to chance that a reboot would clear some other condition and make it seem like the collision was the cause.
Putting ffmpeg into verbose mode was also deliberately left until later, due to the sheer volume of output that would need to be read through - particularly as the segfault patterns were inconsistent.
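Another option deliberately held in reserve: let the next crash leave a core file and pull a backtrace from it. A sketch, assuming systemd-coredump isn't installed (the pattern path here is arbitrary):

```shell
# Sketch: capture a core file from the next segfault for inspection.
ulimit -c unlimited
# %e expands to the executable name, %p to the PID - see core(5).
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
# After the next crash, load the dump into gdb for a backtrace:
#   gdb "$(command -v ffmpeg)" /tmp/core.ffmpeg.<pid>
```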
2019-07-28 11:55:50
So, the segfault was the result of something NFS-related. That meant NFS was the underlying cause - but was it a product of the collisions, the latency, or something else?
2019-07-28 12:02:37
No segfaults.
On the NFS server hosting the Pis' root directories, the other 2 Pis' hostnames were changed within their root filesystems.
The reason for this being that, on Debian, the hostname feeds into the NFSv4 client ID, so clients sharing a hostname will collide.
The other 2 Pis were brought back up and the shares mounted, then another test run was triggered - still no segfaults.
A final run was triggered with the other Pis also being given encoding work (to ensure a high level of reads/writes to the share) - no segfaults.
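In hindsight, this failure mode is cheap to screen for: since the NFSv4 client ID is derived from the hostname, every client needs a unique one, and a PXE image that ships the same hostname to every Pi guarantees a clash. A hypothetical helper to spot duplicates in a list of collected hostnames:

```shell
# Hypothetical helper: print any hostname that more than one client is
# using. Feed it the hostnames gathered from each Pi (e.g. via ssh).
find_duplicate_hostnames() {
    printf '%s\n' "$@" | sort | uniq -d
}

# A boot image shipping the default hostname to two Pis shows up as:
find_duplicate_hostnames raspberrypi raspberrypi pi-encoder-1
# prints "raspberrypi"
```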