
Draft: impl-core: check serial number

P V requested to merge pvir/pipewire:serial-check into master

Wireplumber together with Zoom desktop sound share appears to be hitting the race condition in pw_registry_destroy, so the serial check really seems needed.


The debug message below shows this is really hit in practice; it happens sometimes when starting Zoom desktop share (with https://gitlab.freedesktop.org/pvir/wireplumber/-/commits/fix-policy-target-3/, which fixes other race conditions on the Wireplumber side):

[D][09196.561491] pw.core      | [     impl-core.c:  135 registry_destroy()] registry 0x612000005d40: client late for global id:101 serial:115 recv-serial:113

Wireplumber's global handling is written so that registry_global_destroy invalidates the ID immediately. The problem here is that Wireplumber saw the registry_global events indicating that global:101 had been reused only after trying to destroy it, so the destroy hit the wrong object. Here, id:101 is the null-sink node created moments before (debug log added there):

[I][09196.509095] mod.null-sink | [module-null-sink:   72 module_null_sink_proxy_bound()] proxy 0x60c0000310c0 bound, id:101, module index:536870915

Another example from logs without this patch:

Global ID 108 belongs, at one point in time, to the module-null-sink node. Wireplumber destroys it even though it is trying to destroy a different stream.

$ grep -E '((registry_global|wp_global_bind|registry_global_remove|wp_global_proxy_request_destroy).*WpCore:0x625000001010)|add node|remove node'
...
D 13:44:23.179078    wp-global-proxy ../lib/wp/global-proxy.c:324:wp_global_proxy_request_destroy: <WpCore:0x625000001010> destroy global:108

## pw_registry_destroy(108)

D 13:44:23.182236        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:106 type:WpPort
D 13:44:23.182390        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:112 type:WpPort
D 13:44:23.182526        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:108 type:WpNode
D 13:44:23.182773 script/policy-node policy-node.lua:933:chunk: <WpNode:108:0x6250014f1a10> remove node: combine_output.sink-536870915.zoomrecord (108)

## Wireplumber thought it was destroying combine_output.sink-536870915.zoomrecord (108)

D 13:44:23.186913        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:113 type:WpClient
D 13:44:23.207452        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:113 perm:0x1c8 type:PipeWire:Interface:Client/3 -> WpClient
D 13:44:23.207710        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:108 perm:0x1c8 type:PipeWire:Interface:Node/3 -> WpNode
D 13:44:23.236084        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:108 type:WpNode

## But the destroy apparently hit global:108, which at that time was zoomrecord (108),
## because WP was processing events late

D 13:44:23.238804        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:112 perm:0x1c8 type:PipeWire:Interface:Client/3 -> WpClient
D 13:44:23.240171        wp-registry ../lib/wp/object-manager.c:1384:wp_global_bind: <WpCore:0x625000001010> bind global:113
D 13:44:23.240248        wp-registry ../lib/wp/object-manager.c:1384:wp_global_bind: <WpCore:0x625000001010> bind global:112
D 13:44:23.248678        wp-registry ../lib/wp/object-manager.c:908:registry_global: <WpCore:0x625000001010> global:108 perm:0x1c8 type:PipeWire:Interface:Node/3 -> WpNode
D 13:44:23.250488        wp-registry ../lib/wp/object-manager.c:1384:wp_global_bind: <WpCore:0x625000001010> bind global:108
D 13:44:23.257925        wp-registry ../lib/wp/object-manager.c:940:registry_global_remove: <WpCore:0x625000001010> global removed:113 type:WpClient
D 13:44:23.259575 script/policy-node policy-node.lua:923:chunk: <WpNode:108:0x6250014f1b10> add node: zoomcombine (108)
...

## Note that Wireplumber never sees the zoomrecord (108) node,
## because it was destroyed before Wireplumber had time to process it

Meanwhile, on the pipewire-pulse side:

$ grep -E 'module-null-sink|LOAD_MODULE|UNLOAD_MODULE' ../pw-pulse.log
...
[I][06118.255883] mod.protocol-pulse | [  pulse-server.c: 4881 do_load_module()] [ZOOM VoiceEngine Loopback] LOAD_MODULE name:module-null-sink argument:sink_name=zoomrecord sink_properties=device.description=zoom_recording rate=48000
[I][06118.256001] mod.protocol-pulse | [        module.c:   88 module_load()] load module index:536870915 name:module-null-sink
[I][06118.261212] mod.null-sink | [module-null-sink:   72 module_null_sink_proxy_bound()] proxy 0x60c000163600 bound, id:108, module index:536870915
[I][06118.268546] mod.protocol-pulse | [  pulse-server.c: 4726 handle_module_loaded()] [ZOOM VoiceEngine Loopback] loaded module index:536870915 name:module-null-sink tag:168
[I][06118.269622] mod.protocol-pulse | [  pulse-server.c: 4881 do_load_module()] [ZOOM VoiceEngine Loopback] LOAD_MODULE name:module-combine-sink argument:sink_name=zoomcombine  sink_properties=device.description=zoom_combine_device slaves=zoomrecord rate=48000
[I][06118.312815] mod.null-sink | [module-null-sink:   48 module_null_sink_proxy_removed()] proxy 0x60c000163600 removed, module index:536870915
[I][06118.312837] mod.null-sink | [module-null-sink:   58 module_null_sink_proxy_destroy()] proxy 0x60c000163600 destroy, module index:536870915
[I][06118.324183] mod.protocol-pulse | [        module.c:  123 module_unload()] unload module index:536870915 name:module-null-sink
[W][06120.274620] mod.protocol-pulse | [         reply.c:   73 reply_error()] client 0x6130000ae080 [ZOOM VoiceEngine Loopback]: ERROR command:51 (LOAD_MODULE) tag:169 error:25 (I/O-virhe)
...

The reason module_null_sink_proxy_removed was hit is that Wireplumber destroyed the node on the server side.


This MR is a draft, because the check appears to cause issues elsewhere in Wireplumber: some session items take 30 s to activate with it, when the following occurs:

[D][09028.228415] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:112 serial:165 recv-serial:164
[D][09028.228459] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:91 serial:166 recv-serial:164
[D][09028.228509] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:96 serial:167 recv-serial:164
[D][09028.228535] pw.core      | [     impl-core.c:   82 registry_bind()] registry 0x6120000043c0: client late for global id:97 serial:168 recv-serial:164

These globals apparently were port IDs Wireplumber was trying to bind.

However, the race condition in Wireplumber is fairly hard to hit and requires many tries to trigger (run https://gitlab.freedesktop.org/pvir/wireplumber/-/commits/fix-policy-target-3 and open/close Zoom desktop sound share repeatedly)...

This should be fixed now.

Edited by P V
