Assessment of Self-Supervised Denoising Methods for Esophageal Speech Enhancement
By Madiha Amarjouf, El Hassan Ibn Elhaj, Mouhcine Chami, Kadria Ezzine and Joseph Di Martino
Abstract: Esophageal speech (ES) is a pathological voice that is often difficult to understand. Moreover, recordings of a patient's voice made before the laryngectomy are rarely available, which complicates the enhancement of this kind of voice. For that reason, most supervised methods for enhancing ES rely on voice conversion toward healthy target speakers, which may not preserve the speaker's identity. Unsupervised methods for ES, on the other hand, are mostly based on traditional filters, which on their own cannot cope with this kind of noise and are known to introduce musical artifacts, making the denoising process difficult. To address these issues, a self-supervised method based on the Only-Noisy-Training (ONT) model was applied, which denoises a signal without requiring a clean target. Four experiments were conducted for assessment using the Deep Complex U-Net (DCUNET) and the Deep Complex U-Net with a Complex Two-Stage Transformer Module (DCUNET-cTSTM), both of which follow the ONT approach. In addition, for comparison purposes and to compute the evaluation metrics, the pre-trained VoiceFixer model was used to restore clean reference wave files from the esophageal speech recordings. Although ONT-based methods perform better on ordinary noisy wave files, the results show that ES can be denoised without the need for clean targets, and hence the speaker's identity is retained.
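The core ONT idea, training a denoiser from noisy audio alone, can be sketched as follows. The Python snippet below shows one plausible way to build self-supervised (input, target) pairs from a single noisy ES waveform by splitting adjacent samples into two half-rate sub-signals; the function name, the sub-sampling scheme, and the synthetic waveform are illustrative assumptions, not the authors' actual implementation.

import numpy as np

def ont_style_pair(noisy: np.ndarray, rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray]:
    """Build an (input, target) training pair from a single noisy waveform.

    Illustrative sub-sampling: consecutive samples are split at random into
    two half-rate signals. Neighbouring samples carry almost the same speech
    content but (ideally) independent noise, so a network trained to map one
    sub-signal onto the other tends to suppress the noise without ever
    seeing a clean reference.
    """
    n = len(noisy) - len(noisy) % 2          # drop a trailing odd sample
    pairs = noisy[:n].reshape(-1, 2)         # consecutive sample pairs
    swap = rng.random(len(pairs)) < 0.5      # random assignment per pair
    first = np.where(swap, pairs[:, 1], pairs[:, 0])
    second = np.where(swap, pairs[:, 0], pairs[:, 1])
    return first.astype(np.float32), second.astype(np.float32)

# Usage: derive a self-supervised pair from a (here synthetic) noisy recording.
rng = np.random.default_rng(0)
noisy_es = rng.standard_normal(16000).astype(np.float32)  # stand-in for a real ES waveform
x, y = ont_style_pair(noisy_es, rng)
print(x.shape, y.shape)  # (8000,) (8000,)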
ES enhancement using DCUNET and DCUNET-cTSTM
Results of testing with ES wave files using the pre-trained models of the four experiments: for each of Experiments 1 through 4, the original ES test wave file and the corresponding DCUNET and DCUNET-cTSTM outputs are provided.
Results of testing with VoiceFixer (VF) wave files using the pre-trained models of the four experiments
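As a rough illustration of how the VF reference files could be produced with the pre-trained VoiceFixer model, the sketch below runs the restoration over a folder of ES recordings. It assumes the public voicefixer Python package; the directory names and the restoration mode are placeholders rather than the authors' actual setup.

from pathlib import Path
from voicefixer import VoiceFixer

vf = VoiceFixer()  # loads the pre-trained restoration model

es_dir = Path("es_wav")   # original esophageal-speech recordings (assumed layout)
vf_dir = Path("vf_wav")   # VoiceFixer-restored references used for metric computation
vf_dir.mkdir(exist_ok=True)

for wav in sorted(es_dir.glob("*.wav")):
    # mode=0 applies the default restoration; cuda=False keeps it CPU-only
    vf.restore(input=str(wav), output=str(vf_dir / wav.name), cuda=False, mode=0)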