
Draft: impl-core: check serial number

P V requested to merge pvir/pipewire:serial-check into master

Wireplumber together with Zoom desktop sound share appears to be hitting the race condition in pw_registry_destroy, so the serial check really seems needed.


The debug message below shows this is really hit in practice; it happens sometimes when starting Zoom desktop share (with https://gitlab.freedesktop.org/pvir/wireplumber/-/commits/fix-policy-target-3/, which fixes other race conditions on the Wireplumber side):

[D][09196.561491] pw.core      | [     impl-core.c:  135 registry_destroy()] registry 0x612000005d40: client late for global id:101 serial:115 recv-serial:113

Wireplumber's global handling is written so that registry_global_destroy invalidates the ID immediately. The problem here is that Wireplumber saw the registry_global events indicating that global:101 had been reused only after trying to destroy it, so the destroy hit the wrong object. Here, id:101 is the null-sink node created moments before (debug log added there):

[I][09196.509095] mod.null-sink | [module-null-sink:   72 module_null_sink_proxy_bound()] proxy 0x60c0000310c0 bound, id:101, module index:536870915

Another example from logs without this patch:

Global ID 108 belongs, at one point in time, to the module-null-sink node. Wireplumber destroys it even though it is trying to destroy a different stream.

$ grep -E '((registry_global|wp_global_bind|registry_global_remove|wp_global_proxy_request_destroy).*WpCore:0x625000001010)|add node|remove node'
...
D 13:44:23.179078    wp-global-proxy ../lib/wp/global-proxy.c:324:wp_global_proxy_request_destroy: <WpCore:0x625000001010> destroy global:108

## pw_registry_destroy(108)

D 13:44:23.182236        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:106 type:WpPort
D 13:44:23.182390        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:112 type:WpPort
D 13:44:23.182526        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:108 type:WpNode
D 13:44:23.182773 script/policy-node policy-node.lua:933:chunk: <WpNode:108:0x6250014f1a10> remove node: combine_output.sink-536870915.zoomrecord (108)

## Wireplumber thought it was destroying combine_output.sink-536870915.zoomrecord (108)

D 13:44:23.186913        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:113 type:WpClient
D 13:44:23.207452        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:113 perm:0x1c8 type:PipeWire:Interface:Client/3 -> WpClient
D 13:44:23.207710        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:108 perm:0x1c8 type:PipeWire:Interface:Node/3 -> WpNode
D 13:44:23.236084        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:108 type:WpNode

## But the destroy apparently hit global:108, which at that time was zoomrecord (108),
## because WP was processing events late

D 13:44:23.238804        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:112 perm:0x1c8 type:PipeWire:Interface:Client/3 -> WpClient
D 13:44:23.240171        wp-registry ../lib/wp/object-manager.c:1384:wp_global_bind: <WpCore:0x625000001010> bind global:113
D 13:44:23.240248        wp-registry ../lib/wp/object-manager.c:1384:wp_global_bind: <WpCore:0x625000001010> bind global:112
D 13:44:23.248678        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:108 perm:0x1c8 type:PipeWire:Interface:Node/3 -> WpNode
D 13:44:23.250488        wp-registry ../lib/wp/object-manager.c:1384:wp_global_bind: <WpCore:0x625000001010> bind global:108
D 13:44:23.257925        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:113 type:WpClient
D 13:44:23.259575 script/policy-node policy-node.lua:923:chunk: <WpNode:108:0x6250014f1b10> add node: zoomcombine (108)
...

## Note that Wireplumber never sees the zoomrecord (108) node,
## because it was destroyed before Wireplumber had time to process it

Meanwhile, on the pipewire-pulse side:

$ grep -E 'module-null-sink|LOAD_MODULE|UNLOAD_MODULE' ../pw-pulse.log
...
[I][06118.255883] mod.protocol-pulse | [  pulse-server.c: 4881 do_load_module()] [ZOOM VoiceEngine Loopback] LOAD_MODULE name:module-null-sink argument:sink_name=zoomrecord sink_properties=device.description=zoom_recording rate=48000
[I][06118.256001] mod.protocol-pulse | [        module.c:   88 module_load()] load module index:536870915 name:module-null-sink
[I][06118.261212] mod.null-sink | [module-null-sink:   72 module_null_sink_proxy_bound()] proxy 0x60c000163600 bound, id:108, module index:536870915
[I][06118.268546] mod.protocol-pulse | [  pulse-server.c: 4726 handle_module_loaded()] [ZOOM VoiceEngine Loopback] loaded module index:536870915 name:module-null-sink tag:168
[I][06118.269622] mod.protocol-pulse | [  pulse-server.c: 4881 do_load_module()] [ZOOM VoiceEngine Loopback] LOAD_MODULE name:module-combine-sink argument:sink_name=zoomcombine  sink_properties=device.description=zoom_combine_device slaves=zoomrecord rate=48000
[I][06118.312815] mod.null-sink | [module-null-sink:   48 module_null_sink_proxy_removed()] proxy 0x60c000163600 removed, module index:536870915
[I][06118.312837] mod.null-sink | [module-null-sink:   58 module_null_sink_proxy_destroy()] proxy 0x60c000163600 destroy, module index:536870915
[I][06118.324183] mod.protocol-pulse | [        module.c:  123 module_unload()] unload module index:536870915 name:module-null-sink
[W][06120.274620] mod.protocol-pulse | [         reply.c:   73 reply_error()] client 0x6130000ae080 [ZOOM VoiceEngine Loopback]: ERROR command:51 (LOAD_MODULE) tag:169 error:25 (I/O-virhe)
...

The reason module_null_sink_proxy_removed was hit is that Wireplumber destroyed the node on the server side.


This MR is a draft, because the check appears to cause issues elsewhere in Wireplumber: some session items take 30 s to activate with it, when the following occurs:

[D][09028.228415] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:112 serial:165 recv-serial:164
[D][09028.228459] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:91 serial:166 recv-serial:164
[D][09028.228509] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:96 serial:167 recv-serial:164
[D][09028.228535] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:97 serial:168 recv-serial:164

These globals apparently were port IDs Wireplumber was trying to bind.

However, the race condition in Wireplumber is fairly hard to hit and requires many tries to trigger (run https://gitlab.freedesktop.org/pvir/wireplumber/-/commits/fix-policy-target-3 and open/close Zoom desktop sound share repeatedly)...

This should be fixed now.

Edited by P V
