  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • A description of use cases is missing. Is this relevant for desktop media players? I guess there's more interest on the embedded side - what kind of use cases do they have?

    • I'll add that in. It could be relevant in any case where the hardware exposes this functionality. This is more likely to be the case on embedded SoCs than on the desktop, but a number of modern laptop chipsets also have the capability; it's just not exposed on Linux.

      Edited by Arun Raghavan
  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Tanu Kaskinen
    • Currently the list of supported compressed formats comes from the user - is some other mechanism (like querying the hardware) expected for the new formats?

    • As before, you can get formats via the introspection API from a sink. In GStreamer, we actually create a "probe" stream, query sink info on whatever sink we are routed to, and then use that information. Having an API to do this could be an improvement, but has some challenges (such as: how do we know which sink will be selected?).

    • (I don't seem to get notifications for new comments in this snippet, so please ping me in IRC when you update something.)

      I actually meant how a sink knows what formats it supports. Currently the user has to configure the formats manually, but autodetection would be much nicer, if there's some API for that.
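
      For reference, the client-side introspection query Arun describes above is a small amount of code. A minimal sketch (mainloop setup and error handling omitted; the sink name is a placeholder):

      ```c
      #include <stdint.h>
      #include <stdio.h>
      #include <pulse/pulseaudio.h>

      /* Called once per matching sink; i->formats lists whatever the sink
       * currently advertises (today that is user-configured for ALSA sinks). */
      static void sink_info_cb(pa_context *c, const pa_sink_info *i, int eol, void *userdata) {
          char buf[PA_FORMAT_INFO_SNPRINT_MAX];

          if (eol || !i)
              return;

          for (uint8_t n = 0; n < i->n_formats; n++)
              printf("sink %s supports: %s\n", i->name,
                     pa_format_info_snprint(buf, sizeof(buf), i->formats[n]));
      }

      /* To be called once the context is ready. */
      static void query_formats(pa_context *c) {
          pa_operation *o = pa_context_get_sink_info_by_name(
              c, "alsa_output.hdmi-stereo" /* placeholder sink name */,
              sink_info_cb, NULL);
          if (o)
              pa_operation_unref(o);
      }
      ```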

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • How is the sink expected to schedule writes? Is it going to use a timer (in which case accurate frame duration information is crucial)?

    • It's unlikely to be timer-based. The ALSA compress offload API more or less just expects you to dump buffers in, as I understand it.
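
      For context, the write path with the ALSA compress offload API (via tinycompress) looks roughly like the sketch below: buffers of encoded data are simply written in, and compress_write() blocks when the device's fragments are full, so no timer is involved. The codec parameters are illustrative, and the interpretation of sample_rate (assumed to be plain Hz here) should be checked against the kernel headers.

      ```c
      #include <sound/compress_params.h>
      #include <tinycompress/tinycompress.h>

      /* Sketch: open an MP3 compress-offload device and keep dumping encoded
       * buffers into it.  fill() is a stand-in for "get the next encoded chunk". */
      static int play_mp3(unsigned int card, unsigned int device,
                          int (*fill)(void *buf, unsigned int size)) {
          struct snd_codec codec = {
              .id = SND_AUDIOCODEC_MP3,
              .ch_in = 2,
              .ch_out = 2,
              .sample_rate = 44100,   /* assumption: plain Hz */
              .bit_rate = 320000,
          };
          struct compr_config config = { .codec = &codec };
          struct compress *c;
          char buf[4096];
          int n;

          c = compress_open(card, device, COMPRESS_IN, &config);
          if (!c || !is_compress_ready(c))
              return -1;

          /* Prime the device with one buffer, then start the stream. */
          if ((n = fill(buf, sizeof(buf))) > 0) {
              compress_write(c, buf, n);
              compress_start(c);
          }

          /* Keep writing; compress_write() blocks until a fragment frees up. */
          while ((n = fill(buf, sizeof(buf))) > 0)
              if (compress_write(c, buf, n) < 0)
                  break;

          compress_drain(c);
          compress_close(c);
          return 0;
      }
      ```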

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • Sinks in IDLE state

      It should be straightforward to have sinks that consume compressed data not render any data unless the sink state is RUNNING. Since no sinks currently take this approach (we render silence for all the PCM use-cases), we need to confirm that this is a sufficient condition.

      Maybe a new rendering function is needed that assumes exactly one stream and doesn't generate silence on underruns? Or a new "no silence" flag for the existing rendering functions? A separate function seems more appropriate to me.

    • That's a good point, a separate render function seems to make more sense, indeed.
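
      A very rough sketch of the shape such a function could take (the name, the return convention and the pa_sink_input_peek() usage here are assumptions, not existing pulsecore API):

      ```c
      /* Hypothetical pulsecore/sink.c addition: render from the single compressed
       * stream attached to the sink, and report an underrun instead of
       * synthesising silence. */
      bool pa_sink_render_compressed(pa_sink *s, size_t length, pa_memchunk *result) {
          pa_sink_input *i;
          void *state = NULL;

          pa_sink_assert_ref(s);
          pa_assert(result);

          /* Exactly one stream is expected in the compressed case. */
          if (!(i = pa_hashmap_iterate(s->thread_info.inputs, &state, NULL)))
              return false;               /* nothing attached, write nothing */

          /* Assumed semantics: a negative return means no data is available. */
          if (pa_sink_input_peek(i, length, result, NULL) < 0)
              return false;               /* underrun: caller skips the write */

          return true;
      }
      ```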

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • A buffer must contain an integral number of codec frames. For fine-grained rendering and a simple client API, it may make sense to have only one frame per buffer, although this might have a cost in terms of the number of messages sent during playback.

      My feeling is that yes, we should be able to separate each frame, but I don't think it complicates the API too much if the write function accepts multiple frames in one go. Hmm... or maybe it does make things a bit too complicated, since clients would have to deal with dynamically sized arrays, ideally without doing dynamic allocations. Maybe we can start with a simple single-frame write function and add a more complicated multi-frame version later if a need arises.

      It would also be possible to have a function for queuing one frame at a time in the client code and then issuing a write as a separate step. That would probably involve more copying of data, but it would still avoid sending commands too frequently over the protocol.

    • Yup, as you say, we can figure out a way to batch/optimise once we have the basics in place.
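
      As a sketch of what the simple single-frame client call could look like, modelled loosely on the existing pa_stream_write() (the name and signature below are purely illustrative, not an agreed API):

      ```c
      #include <stdint.h>
      #include <pulse/pulseaudio.h>

      /* Illustrative only: write exactly one codec frame, tagged with its decoded
       * duration so the server can account for latency.  A later multi-frame
       * variant could take arrays of pointers/sizes/durations instead. */
      int pa_stream_write_compressed_frame(pa_stream *s,
                                           const void *frame, size_t nbytes,
                                           uint64_t duration_ns);

      /* Usage: one 1152-sample MP3 frame at 44.1 kHz lasts about 26122449 ns. */
      static void push_frame(pa_stream *s, const void *frame, size_t nbytes) {
          pa_stream_write_compressed_frame(s, frame, nbytes, 26122449);
      }
      ```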

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • In practical terms, we add a duration field, in nanoseconds, to the pa_memchunk structure. This decision is based on two factors:

      • We choose the time domain for the duration representation (rather than samples/frames/etc.) as that is the most direct, and does not need an understanding of the underlying format at all.
      • We expect nanoseconds to be sufficient granularity to represent a chunk of data at any reasonable sample rate without having to worry about the impact of conversion or approximation errors.

      Nanoseconds seem like overkill, but maybe better safe than sorry.

    • That was my thinking, especially to deal with non-terminating frame durations and accumulating rounding errors.
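
      Concretely, the proposal amounts to something like the following; the existing pa_memchunk fields are as in pulsecore/memchunk.h, and the new field name is illustrative:

      ```c
      /* pulsecore/memchunk.h, with the proposed extension */
      typedef struct pa_memchunk {
          pa_memblock *memblock;
          size_t index, length;
          uint64_t duration;   /* duration of the contained audio in nanoseconds,
                                * 0 if unknown or not applicable (plain PCM) */
      } pa_memchunk;

      /* Why nanoseconds: one 1152-sample MP3 frame at 44100 Hz lasts
       * 1152/44100 s = 26122448.9795... ns, which never comes out even; storing
       * it at nanosecond granularity keeps the per-chunk rounding error below
       * 1 ns, so the accumulated error stays negligible over a long stream. */
      ```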

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • The least disruptive way to extend buffer-level metadata to add timestamps and durations appears to be via the pstream descriptor.

      A drawback of this approach is the addition of one (and later two) 64-bit fields to the protocol, potentially increasing protocol overhead. There does not appear to be a viable alternative, so if this becomes problematic, optimisations for sending the data itself might need to be explored. Examples could include a smaller duration field (as the absolute duration value is unlikely to span the entire 64 bits).

      Can you elaborate? What's the other 64-bit field in addition to the frame duration?

      Can we avoid the additional fields with non-compressed streams? We'll anyway have to support old clients, so the fields can't be entirely mandatory.

      No comment on whether pstream is the right place for modifications. I don't remember how that stuff works.

      Unsigned 32-bit values would set the maximum frame length to about 4 seconds, assuming nanosecond granularity. That seems likely to be enough, but not so comfortably large that we can be sure nobody will ever need support for longer frames.

    • The two values are duration and timestamp (I'm using the same infrastructure and API as they impact the same code paths).

      I'm going to get the initial version out with those 2 64-bit fields, and we can pare it down if it looks necessary.
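
      For context, the pstream descriptor is a fixed-size array of 32-bit words sent ahead of each memblock; the two new 64-bit values would be carried as hi/lo word pairs along these lines (the existing field list is from memory and the new enum names are hypothetical):

      ```c
      /* pulsecore/pstream.c: per-packet descriptor words (sketch) */
      enum {
          PA_PSTREAM_DESCRIPTOR_LENGTH,
          PA_PSTREAM_DESCRIPTOR_CHANNEL,
          PA_PSTREAM_DESCRIPTOR_OFFSET_HI,
          PA_PSTREAM_DESCRIPTOR_OFFSET_LO,
          PA_PSTREAM_DESCRIPTOR_FLAGS,
          /* proposed additions: two 64-bit values split into 32-bit halves */
          PA_PSTREAM_DESCRIPTOR_TIMESTAMP_HI,
          PA_PSTREAM_DESCRIPTOR_TIMESTAMP_LO,
          PA_PSTREAM_DESCRIPTOR_DURATION_HI,
          PA_PSTREAM_DESCRIPTOR_DURATION_LO,
          PA_PSTREAM_DESCRIPTOR_MAX
      };

      /* If the duration were squeezed into a single 32-bit word:
       * UINT32_MAX ns = 4294967295 ns ~= 4.29 s per buffer, which is where the
       * "about 4 seconds" ceiling mentioned above comes from. */
      ```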

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Tanu Kaskinen
    • An important repercussion of this decision is that clients cannot just dump compressed data to PulseAudio. They must be able to understand the underlying frame structure to be able to provide buffer durations. It might make sense to relax this limitation if the client does not care about stream latency (which is mainly useful for A/V sync).

      Maybe the clients can just provide dummy duration data. Won't work if the sink uses timer-based scheduling, though.

    • Duration can be 0, even, in that case. As I said above, timer-based scheduling is unlikely to be the way to go for compress offload anyway.

    • Now that I think about this again, providing dummy duration data may not be so useful after all. If we require the buffers to correspond to frames in the compressed stream, then the application has to have some understanding of the format details anyway, and if it's able to split the stream into frames, it's probably able to provide the duration data too.

      If the ALSA API doesn't require the input data to be split at frame boundaries, then we might very well allow applications to supply arbitrarily sized buffers. It would only be a recommendation to supply accurate duration data, so that A/V sync can be achieved, and sending multiple frames in one buffer would be allowed too, with the caveat that it has adverse effects on latency.
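
      To make the point concrete, any parser that can find frame boundaries already has what it needs to compute durations, e.g.:

      ```c
      #include <stdint.h>

      /* Duration of one codec frame in nanoseconds, from values the frame header
       * already provides: samples per frame (1152 for MPEG-1 Layer III, typically
       * 1024 for AAC) and the sample rate. */
      static uint64_t frame_duration_ns(uint32_t samples_per_frame, uint32_t sample_rate) {
          return ((uint64_t) samples_per_frame * 1000000000ULL) / sample_rate;
      }

      /* frame_duration_ns(1152, 44100) == 26122448 ns, i.e. ~26.1 ms per frame */
      ```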

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • While it is not desirable to extend PulseAudio's built-in tools (mainly pacat) to be media-aware, it should be possible to allow for it to consume a raw compressed stream along with associated duration data, either in-band or in some sort of side-car format. This is unlikely to be useful for anything other than limited testing.

      I have trouble seeing how this is going to work. Sure, we can define a file format for supplying the duration data, but that scheme seems so cumbersome that nobody would actually use it. If pacat is going to have support for arbitrary compressed formats, I think it should do the necessary parsing for splitting the input into frames and extracting the duration data. Maybe it could utilize GStreamer for this in a general way (in which case it should be possible to disable the feature at configure time).

    • I'll try to think about this. Using GStreamer seems to imply that you might as well just use the GStreamer tools, but perhaps there are ways to make things simpler.
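
      If the GStreamer route were taken, the parsing side is fairly small: a parser element already emits frame-aligned buffers with the duration filled in, which pacat could then push through the compressed write path. A sketch for MP3 (assumes gst_init() has been called; the pipeline is just an example):

      ```c
      #include <gst/gst.h>
      #include <gst/app/gstappsink.h>

      /* Sketch: use GStreamer only to split an MP3 file into frames; each pulled
       * buffer is one (or a few) frames with GST_BUFFER_DURATION set in ns. */
      static void dump_frames(const char *path) {
          gchar *desc = g_strdup_printf(
              "filesrc location=%s ! mpegaudioparse ! appsink name=sink", path);
          GstElement *pipeline = gst_parse_launch(desc, NULL);
          GstElement *sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");

          gst_element_set_state(pipeline, GST_STATE_PLAYING);

          while (!gst_app_sink_is_eos(GST_APP_SINK(sink))) {
              GstSample *sample = gst_app_sink_pull_sample(GST_APP_SINK(sink));
              GstBuffer *buf;
              GstMapInfo map;

              if (!sample)
                  break;

              buf = gst_sample_get_buffer(sample);
              gst_buffer_map(buf, &map, GST_MAP_READ);
              /* map.data/map.size is the compressed frame,
               * GST_BUFFER_DURATION(buf) its duration in nanoseconds. */
              g_print("frame: %" G_GSIZE_FORMAT " bytes, %" GST_TIME_FORMAT "\n",
                      map.size, GST_TIME_ARGS(GST_BUFFER_DURATION(buf)));
              gst_buffer_unmap(buf, &map);
              gst_sample_unref(sample);
          }

          gst_element_set_state(pipeline, GST_STATE_NULL);
          gst_object_unref(sink);
          gst_object_unref(pipeline);
          g_free(desc);
      }
      ```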

  • Tanu Kaskinen
    @tanuk started a thread
    Last updated by Arun Raghavan
    • For compressed streams, the client-side expectation would be that a stream volume change would be reflected on the stream. The simplest way to accomplish this would likely be for the sink to always reflect the current stream volume, and for that to also always be the sink volume. Some thought needs to go into the implications of this choice.

      So the use case is providing per-stream volume control in a media player UI or pavucontrol? Or is there some other need for supporting stream volume? If we restore the old sink volume after the stream is finished playing, I don't see a problem with controlling the sink volume through the stream volume API, since the stream has exclusive access anyway. Or are there plans for supporting hardware mixing? However, I think we should start without any stream volume support - that can be added later.

    • Agreed, this will come as a separate step that we can reason about independently.
