Audio-Visual Active Speaker Extraction
for Sparsely Overlapped Multi-talker Speech

Junjie Li1, Ruijie Tao3, Zexu Pan3, Meng Ge3, Shuai Wang1, Haizhou Li1,2,3
1Shenzhen Research Institute of Big Data, Shenzhen, China
2The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
3Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Accepted by ICASSP 2024 [pdf]
Email: mrjunjieli@tju.edu.cn


1. Abstract

Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture, as specified by an auxiliary reference. Most studies focus on scenarios where the target speech is highly overlapped with the interfering speech. However, such scenarios account for only a small percentage of real-world conversations. In this paper, we address sparsely overlapped scenarios, in which the auxiliary reference must perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization and can be used for speaker disentanglement. Experimental results show that our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.
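The full ActiveExtract architecture is described in the paper; the snippet below is only a schematic PyTorch sketch of the idea summarized above, where ASD-derived frame-level activity gates the extraction output and an ASD sync embedding conditions a mask-based separator. All module choices, names, and dimensions are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

class ASDConditionedExtractor(nn.Module):
    """Illustrative sketch (not ActiveExtract itself): an ASD branch supplies
    per-frame target-speaker activity and a speech-lip sync embedding that
    condition a mask-based extractor."""

    def __init__(self, feat_dim=256, emb_dim=128):
        super().__init__()
        # Hypothetical ASD branch: fuses audio and lip features per frame.
        self.asd_fuse = nn.GRU(feat_dim * 2, emb_dim, batch_first=True)
        self.activity_head = nn.Linear(emb_dim, 1)  # frame-level speaking activity
        # Hypothetical separator conditioned on the ASD embedding.
        self.separator = nn.GRU(feat_dim + emb_dim, feat_dim, batch_first=True)
        self.mask_head = nn.Linear(feat_dim, feat_dim)

    def forward(self, mix_feat, lip_feat):
        # mix_feat, lip_feat: (batch, frames, feat_dim)
        sync_emb, _ = self.asd_fuse(torch.cat([mix_feat, lip_feat], dim=-1))
        activity = torch.sigmoid(self.activity_head(sync_emb))  # (B, T, 1)
        hidden, _ = self.separator(torch.cat([mix_feat, sync_emb], dim=-1))
        mask = torch.sigmoid(self.mask_head(hidden))
        # Gate the estimated mask with the detected activity so that frames
        # where the target is silent are suppressed.
        return mix_feat * mask * activity, activity

# Toy forward pass with random features.
model = ASDConditionedExtractor()
out, act = model(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
print(out.shape, act.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100, 1])
```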



2. Videos

2.1 Real Conversation

In this section, instead of using IEMOCAP-2mix, we train a model on a larger collection of data to handle real conversation scenarios.
Datasets:
    Speech: VoxCeleb2, LRS3, MEAD, and an unreleased dataset, CUHK-law
    Noise: MUSAN, NOISEX-92, and WHAM! noise
    Reverberation: SLR26 and SLR28
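The exact simulation recipe for combining these corpora is not described on this page. The following is a minimal sketch, under the assumption that training mixtures are built by reverberating the speech with a room impulse response and adding noise at a sampled SNR; the function name and parameters are illustrative only.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_mixture(target, interferer, noise, rir=None, noise_snr_db=5.0):
    """Toy 2-talker + noise simulation: reverberate the target, sum the talkers,
    then add noise at a chosen SNR. All inputs are 1-D float arrays at the
    same sample rate."""
    if rir is not None:
        # Reverberation: convolve with a room impulse response (e.g. an SLR26/SLR28 RIR).
        target = fftconvolve(target, rir)[: len(target)]
    n = min(len(target), len(interferer), len(noise))
    target, interferer, noise = target[:n], interferer[:n], noise[:n]
    speech_mix = target + interferer
    # Rescale the noise so the speech-to-noise ratio equals noise_snr_db.
    speech_pow = np.mean(speech_mix ** 2) + 1e-8
    noise_pow = np.mean(noise ** 2) + 1e-8
    noise = noise * np.sqrt(speech_pow / (noise_pow * 10 ** (noise_snr_db / 10)))
    return speech_mix + noise, target
```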

2.1.1 Fully-overlapped speech

2.1.2 Conversational speech

2.1.3 Single Speaker Speaking

Thanks to Xinyi Chen and Xi Chen for helping me record this video.

3. Audio Samples

In this section, the model is pretrained on VoxCeleb2-2Mix and fine-tuned on IEMOCAP-2mix.
Baselines: AV-Sepformer, USEV, ASD+AV-Sepformer
Proposed method: ActiveExtract
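The subsections below group samples by overlapping ratio, and the abstract reports results in SI-SNR. For reference, here is a minimal sketch of both quantities. The overlap-ratio definition (overlapped frames over speech-active frames) is an assumption, since the exact grouping rule is defined in the paper; both helpers are illustrative, not the evaluation code.

```python
import numpy as np

def overlap_ratio(target_active, interferer_active):
    """Overlapping ratio from per-frame boolean activity masks, assuming it is
    defined as jointly active frames over all speech-active frames."""
    both = np.logical_and(target_active, interferer_active).sum()
    any_speech = np.logical_or(target_active, interferer_active).sum()
    return float(both) / max(int(any_speech), 1)

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and the clean reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    return 10 * np.log10(((proj ** 2).sum() + eps) / (((est - proj) ** 2).sum() + eps))
```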

3.1 Target Absent Clips

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

3.2 0% overlapping ratio

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

3.3 (0%,20%] overlapping ratio

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

3.4 (20%,40%] overlapping ratio

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

3.5 (40%,60%] overlapping ratio

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

3.6 (60%,80%] overlapping ratio

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

3.7 (80%,100%] overlapping ratio

Mixture | Clean | AV-Sepformer [1] | USEV [2] | ASD+AV-Sepformer | ActiveExtract

[1] Lin, Jiuxin, et al. "AV-Sepformer: Cross-Attention Sepformer for Audio-Visual Target Speaker Extraction." ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[2] Pan, Zexu, Meng Ge, and Haizhou Li. "USEV: Universal speaker extraction with visual cue." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022): 3032-3045.