mp4mux: interleave audio and video in fragments, and reduce interleave
Submitted by Richard Mitic
Link to original bug (#734413)
Description
The following pipeline will produce a multiplexed video/audio mp4 file with 1-second fragments.
gst-launch-1.0 mp4mux name=mux fragment-duration=1000 ! filesink location=out.mp4 \
  videotestsrc num-buffers=1500 ! "video/x-raw,framerate=25/1" ! x264enc tune=zerolatency ! mux. \
  audiotestsrc num-buffers=2812 ! "audio/x-raw,rate=48000" ! voaacenc ! mux.
Currently, each media stream is muxed as a separate movie fragment; i.e., for two streams A and V, mp4mux produces A1 V1 A2 V2 A3 V3, etc. However, there is a discrepancy between the lengths of each associated audio and video fragment, because only an integer number of audio or video frames can be packed into a fragment.
In this example, the video fragments contain 25 frames, equaling exactly 1 second, but the audio fragments contain 46 frames, equaling about 0.9813333 seconds. The accumulated error is corrected by emitting two audio fragments in a row whenever enough data is buffered. The result is that AN and VN will not necessarily contain media starting at the same time; the maximum misalignment is the length of one fragment.
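The numbers above follow from the frame sizes involved. A minimal Python sketch of the arithmetic, assuming voaacenc produces standard 1024-sample AAC frames at 48000 Hz:

```python
# Timing discrepancy described above: how many whole AAC frames fit
# into a 1-second fragment, and how far short of 1 second that falls.
SAMPLE_RATE = 48000        # from "audio/x-raw,rate=48000"
SAMPLES_PER_FRAME = 1024   # standard AAC frame size (assumption)
FRAGMENT_DURATION = 1.0    # seconds, from fragment-duration=1000 (ms)

frame_duration = SAMPLES_PER_FRAME / SAMPLE_RATE            # ~0.0213333 s
frames_per_fragment = int(FRAGMENT_DURATION // frame_duration)
fragment_length = frames_per_fragment * frame_duration       # actual audio fragment length
drift_per_fragment = FRAGMENT_DURATION - fragment_length     # shortfall accumulated per fragment

print(frames_per_fragment)            # 46
print(round(fragment_length, 7))      # 0.9813333
print(round(drift_per_fragment, 7))   # 0.0186667
```

After roughly 53 fragments the accumulated shortfall exceeds one full fragment, which is when the current code inserts the extra audio fragment.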
To fix this, the muxer could vary the number of audio frames in each fragment so that the fragment boundary lands as close as possible in time to the video fragment boundary. Ideally, the audio and video media data would also be placed in the same 'mdat' box, with the associated 'moof' box containing one track fragment for each stream. Simply aligning the movie fragment boundaries to the nearest frame would be acceptable, though.
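One possible shape for the proposed boundary alignment, sketched in Python with hypothetical helper names (assumes 1024-sample AAC frames and exact 1-second video fragments; not the actual mp4mux code):

```python
# For each fragment, choose the audio frame count whose end time is
# nearest to the next video fragment boundary, so the audio boundary
# never drifts more than half a frame from the video boundary.
SAMPLE_RATE = 48000
SAMPLES_PER_FRAME = 1024   # assumed AAC frame size
VIDEO_FRAGMENT = 1.0       # video fragments are exactly 1 s (25 frames @ 25 fps)

def audio_frames_for_fragment(fragment_index, samples_written):
    """Frames to put in this fragment so audio tracks the video boundary."""
    target_samples = (fragment_index + 1) * VIDEO_FRAGMENT * SAMPLE_RATE
    # Total frames that should have been written by this boundary,
    # rounded to the nearest whole-frame position.
    frames_total = round(target_samples / SAMPLES_PER_FRAME)
    frames_already = samples_written // SAMPLES_PER_FRAME
    return frames_total - frames_already

written = 0
sizes = []
for i in range(5):
    n = audio_frames_for_fragment(i, written)
    sizes.append(n)
    written += n * SAMPLES_PER_FRAME
print(sizes)
```

With these parameters the fragment sizes come out as mostly 47 frames with an occasional 46, instead of a fixed 46 plus a periodic double fragment, keeping every boundary within half an audio frame (~10.7 ms) of the video boundary.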