xdg-pip: Add new protocol

changed the description

Should it be part of the xdg-shell protocol?

IMO, it should be a separate extension, using the interfaces (xdg_surface) and semantics (configure/ack_configure) from xdg-shell.

"excluded from listing in the taskbar or docks" → from the discussion at #44, it seems like this doesn't need this new protocol addition and is just a KDE bug. If it's not relevant please remove it from the commit message?

Is picture-in-picture the only thing which can benefit from these new window semantics? Or are there other use-cases as well? Picture-in-picture seems like a pretty narrow use-case.

IMO, it should be a separate extension, using the interfaces (xdg_surface) and semantics (configure/ack_configure) from xdg-shell.

Yup, I'd prefer this as well, so that we can supersede it later with something else as needed.

xdg_pip is built on top of xdg_surface because it gives us support for client-side drop shadows out of the box.

+1

added 1 commit

a2cf1b4a - xdg-pip: Add new protocol

changed the description

Just to be pedantic, I don't think this is "picture-in-picture", any more than having any two surfaces visible is "picture-in-picture"...

It seems to just be an overlay window as a midway between xdg-shell and layer-shell, and alternative uses (either intentional or not) would likely be:

1. Ongoing call and desktop sharing control mini-windows
2. HUDs (e.g. GPU/game stats and what-not)
3. Weird custom notifications, progress indicators, etc.

There are most likely others. I suspect the first one would end up being the primary consumer of this feature.

(Note that a compositor could allow the user to easily "pin" a window to make anything an overlay without this protocol, e.g. to keep a video player on top, placed in the corner.)

[1]: IMHO, using subsurfaces or layer-shell would be a better match. If the desktop sharing controls should be attached to the window, use subsurfaces; if they need to be displayed elsewhere, use layer-shell, e.g. to place them centered.

[2]: Similar to the sharing controls, it seems better to use subsurfaces and layer-shell here, e.g. game stats: subsurface; HUD (as a desktop shell concept): layer-shell.

[3]: Ideally, notifications and similar things should go through the desktop shell, i.e. layer-shell would shine here.

layer-shell is not intended for any of these use-cases (note that by custom notification I meant "built-in" rather than proper dbus notifications - notification daemons should of course use layer shell), and I was of course implying that the content was separated and hovering in a corner as a permanent overlay, so subsurfaces are not an option.

Desktop sharing controls and call indicators/controls often behave like this on other platforms (you generally hide the "main" call window when sharing your desktop). Similar to the "bubble" you get on smartphones for calls/screen sharing.

and I was of course implying that the content was separated and hovering in a corner as a permanent overlay, so subsurfaces are not an option.

If an app does so, it asks for inconsistency problems. This doesn't look so different from what the desktop shell would provide.

(Note that a compositor could allow the user to easily "pin" a window to make anything an overlay without this protocol, e.g. to keep a video player on top, placed in the corner.)

Can you elaborate on this?

(Note that a compositor could allow the user to easily "pin" a window to make anything an overlay without this protocol, e.g. to keep a video player on top, placed in the corner.)

The point of this protocol is not having the user manually need to use general window management primitives to implement some kind of behavior similar to what PiP usually intends to provide.

There are most likely others. I suspect the first one would end up being the primary consumer of this feature.

It's likely that the first and primary user of this interface would be Firefox's PiP feature that currently maps to a regular toplevel window under Wayland.

@zzag I don't think any of the use cases should rely on the layer shell. My understanding of layer shell is that it allows building a desktop environment by letting panels, notification daemons, etc. lay out their surfaces correctly; it should never target "regular" applications.

@zzag If an app does so, it asks for inconsistency problems.

I don't understand what you mean here. Having an application separate some content - be it a video player section, controls or notifications - from its main window and making it hover in the corner is exactly what Firefox wishes to do, so why would it lead to consistency problems if other applications do the same?

@jadahl The point of this protocol is not having the user manually need to use general window management primitives to implement some kind of behavior similar to what PiP usually intends to provide.

Yeah, I was just pointing out that there are definitely other use cases, and that I at least personally find some of those stronger and/or more important. "Just" getting a regular video player window to stick seems like more of a normal WM task from a user perspective, especially as a user might want to do this for any number of applications and content - not just videos in Firefox and other special-cased content. On the other hand, a user definitely wouldn't want to mess with WM controls for every screen sharing and call indicator that shows up that should be hovering in the corner.

@kennylevinsen I mean that you shouldn't try to reinvent the wheel by rolling out your own notification system instead of relying on the desktop environment services. Perhaps I'm too focused on a concrete detail.

Yeah, I was just pointing out that there are definitely other use cases, and that I at least personally find some of those stronger and/or more important. ...

Yea, I don't think this protocol should make any assumptions about the content itself, be it video or a broken xeyes clone.

Maybe we should call it "overlay window", or maybe "pinned window" (protocol name adjusted to fit)?

I know Firefox named their feature "picture-in-picture" because their use resembles having a monitor or TV show multiple input sources simultaneously, but the other reasonable use cases do not bear any PiP resemblance at all. A Wayland compositor is (almost) always showing multiple "input sources" simultaneously, and so it's only the overlay/pinned behavior that is special to this protocol/role.

"picture-in-picture" was primarily chosen because it's a well recognizable name, all major web browsers (edit: and one mobile operating system) have already done the branding work. "pinned" is a good name, but it makes pinning explicit, imho it's more future proof to omit window placement. with "overlay", my concern is that people may misunderstand it as a replacement for _NET_WM_STATE_ABOVE

all major web browsers have already done the branding work.

We luckily do not need branding work for wayland protocol names. :)

"pinned" is a good name, but it makes pinning explicit, imho it's more future proof to omit window placement

"Pinned" could also just mean in respect to Z order. Note that the term "picture-in-picture" strongly implies fixed window positioning in one of the four output corners.

with "overlay", my concern is that people may misunderstand it as a replacement for _NET_WM_STATE_ABOVE

I don't think that would be a misunderstanding. The motivation for the development of this protocol is strictly to let Firefox do under Wayland what it uses _NET_WM_STATE_ABOVE for under X11, so "replacement for _NET_WM_STATE_ABOVE" seems to be an accurate description (even if we end up having _NET_WM_STATE_SKIP_TASKBAR behavior baked into it as well).

This protocol lets any app force any arbitrary window in front of all others. That's a horrible situation IMO. The whole point of a PiP protocol was to restrict what can be done with the surface so that it makes evil use cases impossible (like custom notifications, clickable ads, …).

We probably should just follow what Android and iOS do: the surfaces receive no ordinary input but declare what kind of actions it wants (https://github.com/flatpak/xdg-desktop-portal/issues/612#issuecomment-968127614).

This protocol lets any app force any arbitrary window in front of all others. That's a horrible situation IMO. The whole point of a PiP protocol was to restrict what can be done with the surface so that it makes evil use cases impossible (like custom notifications, clickable ads, …).

One idea has been to do as with keyboard shortcut inhibition, where one has to grant permission the first time an application attempts it. The extra dialog nag might limit the awkward usage to where it's actually wanted. I don't think it'll be used for custom notifications, since there are zero guarantees of it being placed in any way a notification would want to be placed.

Annoying dialogs, ads, etc. are harder indeed.
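
For illustration only, the "grant permission the first time" idea could be sketched as a small compositor-side policy. This is a toy model in Python, not anything from the proposed protocol: `PipPermissionPolicy`, `may_create_pip`, `ask_user` and the app id are all hypothetical names.

```python
class PipPermissionPolicy:
    """Toy compositor policy: prompt once per app, then remember the answer."""

    def __init__(self, ask_user):
        # ask_user is a hypothetical callable(app_id) -> bool that would
        # show the permission dialog in a real compositor.
        self._ask_user = ask_user
        self._decisions = {}

    def may_create_pip(self, app_id):
        # Only prompt the first time a given app asks for a PiP surface.
        if app_id not in self._decisions:
            self._decisions[app_id] = self._ask_user(app_id)
        return self._decisions[app_id]


prompts = []


def ask_user(app_id):
    # Stand-in for a real dialog: record the prompt, allow one example app.
    prompts.append(app_id)
    return app_id == "org.example.Browser"


policy = PipPermissionPolicy(ask_user)
first = policy.may_create_pip("org.example.Browser")
second = policy.may_create_pip("org.example.Browser")
```

The second call is answered from the stored decision, so the user is only nagged once per application.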

Limiting to certain input is an interesting idea, would one declare regions of the surface that correspond to what kind of event? Or how does e.g. the pause button in the Firefox PiP window result in the pause action?

Maybe we should call it "overlay window", or maybe "pinned window" (protocol name adjusted to fit)?

We should probably avoid codifying per-compositor policy into the name itself. Picture-in-picture is nicely vague, but one still understands what it means, since it's already an existing concept.

the term "picture-in-picture" strongly implies fixed window positioning in one of the four output corners

meh, not really in practice. To my knowledge only iOS forces PiP views to stick to corners. Android lets you move them freely, and of course you can move them freely on Windows/X11 where browsers just use existing primitives.

Annoying dialogs, ads etc

This would be a good problem to have! ;) Currently we do not have a precedent for Wayland clients being annoying like this (heck, I haven't seen that on X11 either). But we have a problem of restrictiveness causing people to revert to X11 and declare Wayland "perpetually not ready".

IMO we don't need to restrict annoyance so much. But we can and should restrict "truly evil" things — PiP views should be limited in size and transparency by compositor policy, and they can't globally position themselves anyway.
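
As a rough sketch of what "limited in size by compositor policy" might look like, the function below scales a requested PiP size down to a maximum fraction of the output. The function name and the 25% default are assumptions made up for this example; transparency limits are not modelled.

```python
def clamp_pip_size(width, height, output_width, output_height, max_fraction=0.25):
    """Shrink a requested PiP size to fit the policy limit, keeping aspect ratio.

    max_fraction is a hypothetical policy knob: the largest share of each
    output dimension a PiP view may cover.
    """
    max_w = output_width * max_fraction
    max_h = output_height * max_fraction
    # Never scale up; only scale down when a dimension exceeds its limit.
    scale = min(1.0, max_w / width, max_h / height)
    return int(width * scale), int(height * scale)


too_big = clamp_pip_size(1920, 1080, 1920, 1080)
small_enough = clamp_pip_size(320, 180, 1920, 1080)
```

A request that already fits is returned unchanged, so well-behaved clients never notice the policy.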

just follow what Android and iOS do: the surfaces receive no ordinary input but declare what kind of actions it wants

This would be really hard. Cross-platform software, used to what X11/Windows/macOS allow, absolutely expects normal pointer&touch input on these windows on desktop… and many compositor developers certainly do not expect having to suddenly implement overlay-UI-drawing for these actions! Not everyone is GNOME Shell — e.g. in Wayfire we do not have an in-compositor UI framework and we don't really want one.

Cross-platform software, used to what X11/Windows/macOS allow, absolutely expects normal pointer&touch input on these windows on desktop

What existing software expects and what is good don't necessarily overlap. Wayland is neither only for the desktop nor is making existing software work in all aspects one of its goals. It should enable use cases, not existing (bad) implementations of use cases.

and many compositor developers certainly do not expect having to suddenly implement overlay-UI-drawing for these actions!

It doesn't have to be the compositor's UI; the PiP content doesn't even have to be a Wayland surface. With xdg portals one could, for example, mandate PiP windows to be content streams instead, where the portal creates the actual surface and provides fitting controls.

That said, I think the surface for abuse is pretty small (pun not intended). It might be a good idea to revise the whole thing later on but this protocol doesn't prevent that from happening.

This would be a good problem to have! ;)

I do. Steam for example.

But we have a problem of restrictiveness causing people to revert to X11 and declare Wayland "perpetually not ready".

When the whole system is "restrictive" why should we make an exception for PiP?

Cross-platform software, used to what X11/Windows/macOS allow, absolutely expects normal pointer&touch input on these windows on desktop

I mean, yeah? There is no other way to implement it on those platforms so obviously they do it that way. It doesn't mean they can't implement their use cases properly with the restricted protocol.

and many compositor developers certainly do not expect having to suddenly implement overlay-UI-drawing for these actions!

I'm sure the wlroots people can figure out how to outsource that to a system client with layer shell.

Limiting to certain input is an interesting idea, would one declare regions of the surface that correspond to what kind of event? Or how does e.g. the pause button in the Firefox PiP window result in the pause action?

If the client is playing a video it can for example request Play, Pause, Stop actions. The compositor is then responsible for putting fitting UI on top of the PiP surface and emitting events for the actions back to the client.
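
To make that round trip concrete, here is a minimal Python sketch of a surface that gets no ordinary input and only declares actions. It is a toy model under stated assumptions, not protocol text: `PipSurface`, `request_action`, `requested_actions`, `activate` and the action strings are hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical action names; the thread only mentions Play, Pause and Stop.
KNOWN_ACTIONS = {"play", "pause", "stop"}


class PipSurface:
    """Toy model of a PiP surface that receives no ordinary input."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[], None]] = {}

    def request_action(self, name: str, handler: Callable[[], None]) -> None:
        # Client side: declare an action the compositor should expose.
        if name not in KNOWN_ACTIONS:
            raise ValueError(f"unknown action: {name}")
        self._handlers[name] = handler

    def requested_actions(self) -> List[str]:
        # Compositor side: which controls to draw on top of the surface.
        return sorted(self._handlers)

    def activate(self, name: str) -> None:
        # Compositor side: the user triggered an overlay control, so the
        # matching event is delivered back to the client.
        self._handlers[name]()


log: List[str] = []
surface = PipSurface()
surface.request_action("play", lambda: log.append("play"))
surface.request_action("pause", lambda: log.append("pause"))
surface.activate("pause")
```

The client never sees pointer or touch events here; it only learns that "pause" was activated, however the compositor chose to present that control.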

As it was said above, that can be really difficult for some compositors, which is a major blocker for adopting this protocol.

Annoying dialogs, ads, etc. are harder indeed.

Almost every surface role can be abused by rogue applications. An xdg_toplevel can be mapped every N seconds to show whatever the client wants. Also, this looks like a hypothetical problem rather than something that people face regularly.

As it was said above, that can be really difficult for some compositors, which is a major blocker for adopting this protocol.

Implementing anything can be difficult for some compositors. If you see an actual issue with wlroots implementing this I would love to know.

Almost every surface role can be abused by rogue applications.

Sure, some things can be abused but that happens because we don't know how to prevent it from happening yet while also giving clients the features they need. PiP is different because we know how to stop clients from abusing it. All your arguments against designing something that can't be abused are that it's harder to implement than something that can be abused. IMO that's a very weak argument.

While clickable ad/malware windows are bad, that's rather an extreme case. And even with server-side picture-in-picture controls, one can easily defeat those limitations by just using another surface role protocol. So, we end up with a protocol that requires a load of complexity in the compositor for too little gain imho.

This would be a good problem to have! ;) Currently we do not have a precedent for Wayland clients being annoying like this

@swick Isn't there malware on Windows and Android out there that allows you to do that already? Android actually no longer allows apps to take focus at any time. That is the main reason KDE Connect's send-to-device feature can no longer open the content on the phone and instead shows a notification to do that. KDE Connect can register itself as a companion app to work around this (it is one, after all), but that's a different story.

We probably should just follow what Android and iOS do: the surfaces receive no ordinary input but declare what kind of actions it wants (https://github.com/flatpak/xdg-desktop-portal/issues/612#issuecomment-968127614).

I would prefer this implementation, as this is what all Apple systems (iOS, iPadOS, macOS) and Android do. This also means that while apps can have a PiP window flexible enough to adapt to their content, they can't take control of your system.

Allowing xdg-pip to just be a normal surface with the added ability of being always on top would mean apps will eventually misappropriate it.

For example: Microsoft Teams for some weird reason thinks that it needs to be the center of all attention and forces the sign-in window above everything using X features. This is something I would want to do away with. Also, Steam.

Cross-platform software, used to what X11/Windows/macOS allow, absolutely expects normal pointer&touch input on these windows on desktop

macOS does now have an actual PIP mode that works more like other systems (see link above). Microsoft has AppWindowPresentationKind to set the window type but allows interaction at present. That may change in future though. I can't think of any X11 app (or any other app, really) that does something with PIP that requires interactive control. Windows would be the only other platform that allows PIP surfaces to be interactive and, by the looks of it, even they may be working to change that.

and many compositor developers certainly do not expect having to suddenly implement overlay-UI-drawing for these actions!

@Zamundaaa Either showing controls on window border or having it part of a library like wlroots may make sense.

One idea has been to do as with keyboard shortcut inhibition, where one has to grant permission the first time an application attempts it.

@jadahl This could still mean an app may misappropriate this permission. Once again, Microsoft Teams would be a great example.

I do. Steam for example.

Oof! I would love for Steam to no longer be able to do this.

macOS does now have an actual PIP mode that works more like other systems (see link above).

In that link, the implementation details of pip mode were described from the perspective of a user of the native app framework. Can you confirm that no magic happens in the corresponding app framework? e.g. drawing controls client-side and redirecting input to the overlay with media controls.

Neither Steam nor Microsoft Teams can use this protocol to show their sign in/up forms as the placement of pip surfaces is subject to compositor policies, e.g. pip can be placed in a screen corner, which is arguably not a good spot for a sign up form.

From the Android PiP support page:

Users cannot interact with your app's UI elements when it's in PiP mode and the details of small UI elements may be difficult to see. Video playback activities with minimal UI provide the best user experience.

If your app needs to provide custom actions for PiP, see Add controls in this document. Remove other UI elements before your activity enters PiP and restore them when your activity becomes fullscreen again

So I'm pretty sure that at least on Android the PiP surface receives no input and the system draws an overlay and handles input.

@zzag According to https://developer.apple.com/documentation/avkit/adopting_picture_in_picture_in_a_custom_player you can start PiP from a custom player, but I couldn't find anything that allows you to customize the player in any way. The framework will automagically use the controls that are declared for other use cases (system controls and state management, etc.).

I have not seen any app that uses custom controls of any sort either. The player will provide its own overlay, which passes controls to your app.

@swick Yes, that would be correct from my experience on Android. PiP mode will NOT allow the app to dictate placement or size or to receive input directly. You can, however, read the current size of the PiP to reorder contents. The content can be drawn by the app, though, and that's how maps and other non-media apps show content in PiP. Apple goes a step further and has built PiP into their AVKit framework, allowing only media content and calls to be placed in PiP.

I agree that making PiP receive no input seems like a good idea after all, especially if portable applications already have to deal with this limitation on other platforms. The fact that there are applications that actively make intrusive dialogs using X11, which would be potentially (and at least partially) implementable with PiP plus input, is probably a good enough reason not to give them the chance. I tend to agree that the query-nag I mentioned is a bad solution to the problem.

However, there are some things that need consideration if this is the path we take.

Looking at the Android documentation, it seems expected that "media controls", i.e. play/pause/seek, are a special case, and on iOS, media controls and phone controls. How would one deal with this? Associating with an MPRIS2 instance? Some dumbed-down "event" mechanism (e.g. action('play'), action('pause') events)? The other thing Android seems to provide is a list of arbitrary actions, where each action is accompanied by an icon. Passing icon file paths seems like a bad idea, so to achieve this one would have to have ways to provide overlay surfaces (one per icon, I imagine) that would then be laid out by the compositor.

Maybe a first step in a version 1 could be to either just provide special media and phone controls or nothing at all (only "dismiss").

Then there is another PiP feature that seems somewhat hard to support as is - transitioning between PiP and toplevel, i.e. make it possible to do animations similar to maximize/unmaximize animations. It'd also mean having to distinguish between "dismiss" and "un-pip". Probably not something to bother with initially anyhow, other than thinking of how to not make it unimplementable.

At which granularity the controls can be requested is a good question. Do we want something high level like "seekable video", "video", "phone call", or do we want clients to request individual controls and let the compositor figure out how to combine them usefully? If we even want to let clients set the icons of custom controls at some point, the second approach seems more suitable.

Requesting the controls could be done with requests before the PiP surface role is set and then the PiP "subclass" just sends events back when the controls are activated.

There are a few other hairy bits: we probably don't want to allow any alpha in there (i.e. interpret everything as opaque), the PiP surface can't have drop shadows and does not handle resizing.

I imagine one would want to overlay certain sets of elements depending what they are (when possible), e.g. ⏪ ⏯ ⏩ in that order, or allow certain interactions for hanging up (swipe?).
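
A quick sketch of how a compositor might impose such an order on individually requested controls. The control names and the canonical order are assumptions invented for the example; nothing here is from the proposed protocol.

```python
# Hypothetical control names in the order the compositor wants to show them
# (rewind, play/pause, fast-forward).
CANONICAL_ORDER = ["rewind", "play_pause", "fast_forward"]


def layout_controls(requested):
    """Return the requested controls in the order the overlay should show them.

    Controls the compositor knows about are sorted into its canonical order;
    anything else keeps the client's order and is appended at the end.
    """
    known = [name for name in CANONICAL_ORDER if name in requested]
    extra = [name for name in requested if name not in CANONICAL_ORDER]
    return known + extra


ordered = layout_controls(["fast_forward", "hang_up", "rewind", "play_pause"])
```

The client only says which controls it wants; where they end up (or whether something like "hang_up" becomes a swipe instead of a button) stays compositor policy.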

Requesting the controls could be done with requests before the PiP surface role is set and then the PiP "subclass" just sends events back when the controls are activated.

Yea, just as xdg_toplevel state, set things up before an initial commit.

we probably don't want to allow any alpha in there (i.e. interpret everything as opaque),

How so? What makes these more problematic than regular transparent toplevels?

I imagine one would want to overlay certain sets of elements depending what they are (when possible), e.g. ⏪ ⏯ ⏩ in that order, or allow certain interactions for hanging up (swipe?).

Uh, I have not thought about gestures, yet, that's a really good point.

How so? What makes these more problematic than regular transparent toplevels?

I think my argument was backwards: the surface does not handle resize and repositioning so it doesn't draw a border which is the main reason for the alpha channel, right?

The main reason for the alpha channel was that Wayland "by default" assumes clients draw the whole window content, including any drop shadow. An unresizable window still has the shadow border even though it doesn't support interactive resize.

An alpha channel would also be needed for e.g. rounded corners. See e.g. a GNOME mockup that has a PiP window in it: https://gitlab.gnome.org/Teams/Design/os-mockups/-/raw/b8914b8b3092db0f713621b50fa46f7f210e9651/mobile-shell/tiling/tiling-edge-cases.png

Right, but the PiP window can be resized, just not by the client. The compositor needs something to let the user activate the resize and if we let the PiP surface draw decorations (and also not draw decorations) there is nothing to do so.

Rounded corners and drop shadows could be implemented in the compositor which guarantees that there is something to resize. It also guarantees a certain look which is compatible with the control overlay. OTOH it requires the compositor to do even more stuff and the damage clients can do with it is not huge.

Right, but the PiP window can be resized, just not by the client. The compositor needs something to let the user activate the resize and if we let the PiP surface draw decorations (and also not draw decorations) there is nothing to do so.

If the PiP window is uninteractive, we can probably assume that whatever is outside of the window geometry can be used for initiating interactive resizes. Better yet if we document that as possible compositor behavior.
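
A minimal sketch of that hit test, assuming surface-local coordinates and a window geometry rectangle that sits inside the surface (as with xdg_surface window geometry). `Rect`, `resize_edges` and the edge strings are hypothetical names for this example.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def resize_edges(surface: Rect, geometry: Rect, px: int, py: int) -> Optional[str]:
    """Return the edge(s) a press at (px, py) would resize from, or None.

    Only the part of the surface outside the window geometry (e.g. the
    shadow margin) starts a resize; the content itself does not.
    """
    if not surface.contains(px, py) or geometry.contains(px, py):
        return None
    vertical = ""
    if py < geometry.y:
        vertical = "top"
    elif py >= geometry.y + geometry.height:
        vertical = "bottom"
    horizontal = ""
    if px < geometry.x:
        horizontal = "left"
    elif px >= geometry.x + geometry.width:
        horizontal = "right"
    return "-".join(part for part in (vertical, horizontal) if part)


surface = Rect(0, 0, 120, 100)    # whole surface, including a 10px margin
geometry = Rect(10, 10, 100, 80)  # window geometry inside the surface
corner = resize_edges(surface, geometry, 5, 5)
inside = resize_edges(surface, geometry, 50, 50)
```

Since the surface takes no input of its own, the compositor is free to claim the whole margin for this.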

Rounded corners and drop shadows could be implemented in the compositor which guarantees that there is something to resize. It also guarantees a certain look which is compatible with the control overlay. OTOH it requires the compositor to do even more stuff and the damage clients can do with it is not huge.

I don't think the compositor should try to "round" the client's corners; that's the client's job, not the compositor's, as the compositor doesn't know if or how it should do it to make it look right. It can add a drop shadow in theory, but only if the client can only provide fully opaque rectangular content.

Rounded corners and drop shadows could be implemented in the compositor which guarantees that there is something to resize. It also guarantees a certain look which is compatible with the control overlay. OTOH it requires the compositor to do even more stuff and the damage clients can do with it is not huge.

Rounding corners on the compositor side makes things such as direct scanout in overlay planes impossible. Also, with my KWin hat on, we're against rounding corners on the compositor side; see also https://mail.kde.org/pipermail/kwin/2021-June/005237.html (edit: in other words, we also think that corners have to be rounded by the client)

Rounding corners on the compositor side makes things such as direct scanout in overlay planes impossible.

From my understanding it's actually the opposite: direct-scanout (in a zero-copy sense) with rounded corners for video is actually only possible when done by the compositor, at least as long as video decoders and display controllers mainly/only support alpha-less formats like NV12 / P010. That would work by putting the video on a plane behind the compositor/render plane (i.e. on an underlay plane, see 1.) and punching an alpha-hole into the latter, with bespoke rounded corners.
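
To illustrate the "alpha hole" part, here is a toy per-pixel mask for the compositor plane: fully transparent inside a rounded rectangle (so the video plane underneath shows through), opaque elsewhere. It is a sketch with hypothetical names, it assumes `radius` is at most half of the smaller side, and a real compositor would do this in a shader with antialiasing.

```python
def hole_alpha(px, py, x, y, width, height, radius):
    """Alpha of the compositor plane at pixel (px, py).

    Returns 0.0 inside the rounded hole at (x, y, width, height) and 1.0
    everywhere else. Assumes radius <= min(width, height) / 2.
    """
    if not (x <= px < x + width and y <= py < y + height):
        return 1.0
    # Sample at the pixel centre.
    sx, sy = px + 0.5, py + 0.5
    # Nearest point of the inner rectangle (the hole shrunk by radius);
    # the offset from it is non-zero only near the rounded corners/edges.
    cx = min(max(sx, x + radius), x + width - radius)
    cy = min(max(sy, y + radius), y + height - radius)
    dx, dy = sx - cx, sy - cy
    return 0.0 if dx * dx + dy * dy <= radius * radius else 1.0


corner_pixel = hole_alpha(0, 0, 0, 0, 100, 60, 10)
centre_pixel = hole_alpha(50, 30, 0, 0, 100, 60, 10)
```

The very corner pixel stays opaque (the compositor's own content covers the video's square corner) while the middle of the hole is transparent, which is what gives the alpha-less video buffer its rounded look without a copy.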

If rounded corners are left to clients, they'd be forced to make a copy - unless we want to introduce a "rounded-corners" or "alpha mask" protocol.

From what I know, the distinction between primary, overlay and underlay planes is increasingly vanishing in favor of just treating them as planes. Depending on hardware/kms-drivers a plane might have a fixed z-position and support only certain formats - but otherwise there don't need to be further differences. I.e. on a lot of hardware there's no strict reason for compositors to use the primary plane for their own buffers. Edit: a pip-video could thus be put on the primary plane (it doesn't need to cover the whole screen) and the compositor would use an overlay plane for its content. This approach should work on e.g. any semi-recent Intel/AMD system AFAIK (for AMD there are some patches in the works to solve some issues with cursors).

added 1 commit

c3cfd2f7 - xdg-pip: Add new protocol

added 1 commit

dcceb95a - xdg-pip: Add new protocol

added 1 commit

882f306b - xdg-pip: Add new protocol

added 1 commit

672c8e61 - xdg-pip: Add new protocol

marked this merge request as ready

changed the description

Can somebody add relevant labels? It looks like only wayland-protocols developers can do that.
