Conditionally merging sink-inputs
I am trying to capture the output of the Web Speech API's `speechSynthesis.speak()` in the Chrome browser.
This presents an interesting challenge for several reasons:

- Chrome fetches audio from a remote service and outputs Google voices via an HTML `<audio>` element in an extension background page, with `application.icon_name = "google-chrome"` and `media.name = "Playback"` per `pactl list sink-inputs`; that application name is listed only while the audio is actually being output;
- When a voice other than a Google voice is used, the audio goes through `speech-dispatcher` and is not routed through an extension background page, with `media.name = "playback"` and `application.name = "speech-dispatcher-espeak-ng"` per `pactl list sink-inputs`;
- Chrome does not support capturing monitor devices with `navigator.mediaDevices.getUserMedia({audio: true})`.
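To check which of these property values PulseAudio reports for the current streams, the values can be extracted from `pactl list sink-inputs`; a minimal sketch, where the `get_prop` helper and the sample output are my own illustration (real output contains many more properties):

```shell
#!/bin/sh
# get_prop KEY: print the value of `KEY = "value"` lines on stdin,
# e.g.  pactl list sink-inputs | get_prop application.icon_name
# (dots in KEY act as regex wildcards here; acceptable for a sketch).
get_prop() {
  sed -n "s/^[[:space:]]*$1 = \"\(.*\)\"\$/\1/p"
}

# Illustrative sample of `pactl list sink-inputs` output (an assumption,
# trimmed to the properties mentioned above).
sample='Sink Input #60
	Properties:
		application.icon_name = "google-chrome"
		media.name = "Playback"
Sink Input #61
	Properties:
		application.name = "speech-dispatcher-espeak-ng"
		media.name = "playback"'

printf '%s\n' "$sample" | get_prop application.icon_name   # google-chrome
printf '%s\n' "$sample" | get_prop application.name        # speech-dispatcher-espeak-ng
```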
So, to capture the output of `espeak-ng` voices via `speech-dispatcher`, I use:
pactl load-module module-combine-sink \
sink_name=Web_Speech_Sink slaves=$(pacmd list-sinks | grep -A1 "* index" | grep -oP "<\K[^ >]+") \
sink_properties=device.description="Web_Speech_Stream" \
format=s16le \
channels=1 \
rate=22050
pactl move-sink-input $(pacmd list-sink-inputs | tac | perl -E'undef$/;$_=<>;/speech-dispatcher-espeak-ng.*?index: (\d+)\n/s;say $1') Web_Speech_Sink
pactl load-module module-remap-source \
master=Web_Speech_Sink.monitor \
source_name=Web_Speech_Monitor \
source_properties=device.description=Web_Speech_Output
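The perl one-liner above can also be written with awk, without reversing the output with `tac`: remember the most recent `index:` line and print it when the `speech-dispatcher-espeak-ng` stream is reached. An equivalent sketch (the sample `pacmd list-sink-inputs` output below is an assumption):

```shell
#!/bin/sh
# find_espeak_input: print the sink-input index of the espeak-ng stream.
# Normal use:
#   pactl move-sink-input "$(pacmd list-sink-inputs | find_espeak_input)" Web_Speech_Sink
find_espeak_input() {
  awk '/index: /{idx=$2} /speech-dispatcher-espeak-ng/{print idx; exit}'
}

# Illustrative sample of `pacmd list-sink-inputs` output (an assumption).
sample='    index: 60
	application.name = "Playback"
    index: 61
	application.name = "speech-dispatcher-espeak-ng"'

printf '%s\n' "$sample" | find_espeak_input   # 61
```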
which provides a virtual microphone whose input is actually the `speech-dispatcher` output, so that I can capture audio Chrome interprets as microphone input, i.e.,
var permission = await navigator.permissions.query({name: 'microphone'});
var devices = await navigator.mediaDevices.enumerateDevices();
console.log(permission, devices);
var device = devices.find(({label}) => label === 'Web_Speech_Output');
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: false,
    autoGainControl: false,
    noiseSuppression: false,
    deviceId: {exact: device.deviceId}
  }
})
.then(stream => {
  const recorder = new MediaRecorder(stream);
  recorder.onstop = () => recorder.stream.getAudioTracks()[0].stop();
  recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
  const synth = speechSynthesis;
  const u = new SpeechSynthesisUtterance('test');
  const voice = speechSynthesis.getVoices().find(({name, lang}) => name.includes('espeak-ng') && lang.includes('en'));
  u.voice = voice;
  u.onstart = e => {
    recorder.start();
    console.log(e);
  };
  u.onend = e => {
    recorder.stop();
    console.log(e);
  };
  synth.speak(u);
});
The challenge arises when both Google and `espeak-ng` voices are used in complex parsed SSML input, e.g., https://github.com/guest271314/SSMLParser/blob/master/docs/index.html (https://guest271314.github.io/SSMLParser/), which outputs both Google and `espeak-ng` voices. We also do not want to simply capture the system-wide monitor
pactl load-module module-remap-source \
master=virtmic.monitor source_name=virtmic \
source_properties=device.description=Virtual_Microphone
where other tabs could be playing audio that should be excluded from speech synthesis capture.
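One direction that may be worth exploring for the conditional part, sketched here as an untested idea rather than a confirmed solution: listen for new sink-inputs with `pactl subscribe` and move any stream whose properties identify it as speech synthesis into `Web_Speech_Sink` the moment it appears, so the combined sink only ever carries speech audio. The matched property values follow the `pactl list sink-inputs` observations above; the `pactl subscribe` event-line format assumed below should be verified against a live session.

```shell
#!/bin/sh
# Sketch: route speech-synthesis sink-inputs into Web_Speech_Sink as they appear.

# is_speech_stream TEXT: succeed if TEXT contains a property value that
# identifies a Google-voice (Chrome) or espeak-ng (speech-dispatcher) stream.
is_speech_stream() {
  case $1 in
    *'application.name = "speech-dispatcher-espeak-ng"'*) return 0 ;;
    *'application.icon_name = "google-chrome"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# props_for_input INDEX: print only the blank-line-separated block for
# "Sink Input #INDEX" from `pactl list sink-inputs` output on stdin.
props_for_input() {
  awk -v tgt="Sink Input #$1" -v RS='' 'index($0, tgt)'
}

# Event loop; the line format "Event 'new' on sink-input #N" is an assumption.
watch_and_move() {
  pactl subscribe | while read -r line; do
    case $line in
      *"Event 'new' on sink-input #"*)
        idx=${line##*#}
        props=$(pactl list sink-inputs | props_for_input "$idx")
        if is_speech_stream "$props"; then
          pactl move-sink-input "$idx" Web_Speech_Sink
        fi
        ;;
    esac
  done
}

# watch_and_move &   # run in the background while capturing
```

This only reroutes streams whose properties match, so unrelated tab audio would stay on the default sink.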
How can I create a microphone device that conditionally captures Google voices when they play, and `espeak-ng` voices when they play, merged into a single stream that I can capture with `navigator.mediaDevices.getUserMedia()`?