Conditionally merging sink-inputs
I am trying to capture the output of the Web Speech API's `speechSynthesis.speak()` in the Chrome browser.
This presents an interesting challenge for several reasons:

- Chrome fetches audio from a remote service and outputs Google voices via an HTML `<audio>` element in an extension background page, with `application.icon_name = "google-chrome"` and `media.name = "Playback"` per `pactl list sink-inputs`; that application name is listed only while the audio is actually being output;
- When a voice other than a Google voice is used, the audio goes through `speech-dispatcher` and is not routed through an extension background page, with `media.name = "playback"` and `application.name = "speech-dispatcher-espeak-ng"` per `pactl list sink-inputs`;
- Chrome does not support capturing monitor devices with `navigator.mediaDevices.getUserMedia({audio: true})`.
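To check which of these property values PulseAudio reports for the current streams, the values can be extracted from `pactl list sink-inputs`; a minimal sketch, where the `get_prop` helper and the sample output are my own illustration (real output contains many more properties):

```shell
#!/bin/sh
# get_prop KEY: print the value of `KEY = "value"` lines on stdin,
# e.g.  pactl list sink-inputs | get_prop application.icon_name
# (dots in KEY act as regex wildcards here; acceptable for a sketch).
get_prop() {
  sed -n "s/^[[:space:]]*$1 = \"\(.*\)\"\$/\1/p"
}

# Illustrative sample of `pactl list sink-inputs` output (an assumption,
# trimmed to the properties mentioned above).
sample='Sink Input #60
	Properties:
		application.icon_name = "google-chrome"
		media.name = "Playback"
Sink Input #61
	Properties:
		application.name = "speech-dispatcher-espeak-ng"
		media.name = "playback"'

printf '%s\n' "$sample" | get_prop application.icon_name   # google-chrome
printf '%s\n' "$sample" | get_prop application.name        # speech-dispatcher-espeak-ng
```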
So, to capture the output of `espeak-ng` voices via `speech-dispatcher`, I use:
pactl load-module module-combine-sink \
sink_name=Web_Speech_Sink slaves=$(pacmd list-sinks | grep -A1 "* index" | grep -oP "<\K[^ >]+") \
sink_properties=device.description="Web_Speech_Stream" \
format=s16le \
channels=1 \
rate=22050
pactl move-sink-input $(pacmd list-sink-inputs | tac | perl -E'undef$/;$_=<>;/speech-dispatcher-espeak-ng.*?index: (\d+)\n/s;say $1') Web_Speech_Sink
pactl load-module module-remap-source \
master=Web_Speech_Sink.monitor \
source_name=Web_Speech_Monitor \
source_properties=device.description=Web_Speech_Output
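The perl one-liner above can also be written with awk, without reversing the output with `tac`: remember the most recent `index:` line and print it when the `speech-dispatcher-espeak-ng` stream is reached. An equivalent sketch (the sample `pacmd list-sink-inputs` output below is an assumption):

```shell
#!/bin/sh
# find_espeak_input: print the sink-input index of the espeak-ng stream.
# Normal use:
#   pactl move-sink-input "$(pacmd list-sink-inputs | find_espeak_input)" Web_Speech_Sink
find_espeak_input() {
  awk '/index: /{idx=$2} /speech-dispatcher-espeak-ng/{print idx; exit}'
}

# Illustrative sample of `pacmd list-sink-inputs` output (an assumption).
sample='    index: 60
	application.name = "Playback"
    index: 61
	application.name = "speech-dispatcher-espeak-ng"'

printf '%s\n' "$sample" | find_espeak_input   # 61
```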
which provides a virtual microphone whose input is actually the `speech-dispatcher` output, so that I can capture audio Chrome interprets as microphone input, i.e.,
var permission = await navigator.permissions.query({name: 'microphone'});
var devices = await navigator.mediaDevices.enumerateDevices();
console.log(permission, devices);
var device = devices.find(({label}) => label === 'Web_Speech_Output');
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: false,
    autoGainControl: false,
    noiseSuppression: false,
    deviceId: {exact: device.deviceId}
  }
})
.then(stream => {
  const recorder = new MediaRecorder(stream);
  recorder.onstop = () => recorder.stream.getAudioTracks()[0].stop();
  recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
  const synth = speechSynthesis;
  const u = new SpeechSynthesisUtterance('test');
  const voice = speechSynthesis.getVoices().find(({name, lang}) => name.includes('espeak-ng') && lang.includes('en'));
  u.voice = voice;
  u.onstart = e => {
    recorder.start();
    console.log(e);
  };
  u.onend = e => {
    recorder.stop();
    console.log(e);
  };
  synth.speak(u);
});
The challenge arises when both Google and `espeak-ng` voices are used in complex parsed SSML input, e.g., https://github.com/guest271314/SSMLParser/blob/master/docs/index.html (https://guest271314.github.io/SSMLParser/), which outputs both Google and `espeak-ng` voices. We also do not want to simply capture the system-wide monitor
pactl load-module module-remap-source \
master=virtmic.monitor source_name=virtmic \
source_properties=device.description=Virtual_Microphone
where other tabs could be playing audio that should be excluded from speech synthesis capture.
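One direction that may be worth exploring for the conditional part, sketched here as an untested idea rather than a confirmed solution: listen for new sink-inputs with `pactl subscribe` and move any stream whose properties identify it as speech synthesis into `Web_Speech_Sink` the moment it appears, so the combined sink only ever carries speech audio. The matched property values follow the `pactl list sink-inputs` observations above; the `pactl subscribe` event-line format assumed below should be verified against a live session.

```shell
#!/bin/sh
# Sketch: route speech-synthesis sink-inputs into Web_Speech_Sink as they appear.

# is_speech_stream TEXT: succeed if TEXT contains a property value that
# identifies a Google-voice (Chrome) or espeak-ng (speech-dispatcher) stream.
is_speech_stream() {
  case $1 in
    *'application.name = "speech-dispatcher-espeak-ng"'*) return 0 ;;
    *'application.icon_name = "google-chrome"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# props_for_input INDEX: print only the blank-line-separated block for
# "Sink Input #INDEX" from `pactl list sink-inputs` output on stdin.
props_for_input() {
  awk -v tgt="Sink Input #$1" -v RS='' 'index($0, tgt)'
}

# Event loop; the line format "Event 'new' on sink-input #N" is an assumption.
watch_and_move() {
  pactl subscribe | while read -r line; do
    case $line in
      *"Event 'new' on sink-input #"*)
        idx=${line##*#}
        props=$(pactl list sink-inputs | props_for_input "$idx")
        if is_speech_stream "$props"; then
          pactl move-sink-input "$idx" Web_Speech_Sink
        fi
        ;;
    esac
  done
}

# watch_and_move &   # run in the background while capturing
```

This only reroutes streams whose properties match, so unrelated tab audio would stay on the default sink.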
How can I create a microphone device that conditionally captures Google voices when they play, and `espeak-ng` voices when they play, merged into a single stream that I can capture with `navigator.mediaDevices.getUserMedia()`?