Conference (International) Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model
The International Conference on Acoustics, Speech, & Signal Processing 2023 (ICASSP 2023)
Unsupervised topic clustering of spoken audio is an important research topic for zero-resource, unwritten languages. A classical approach is to discover a set of spoken terms from the audio alone, using dynamic time warping or generative modeling (e.g., a hidden Markov model), and then apply a topic model to classify topics. Spoken term discovery is the most important and most difficult part. In this paper, we propose combining self-supervised representation learning (SSRL) methods, used as a component of spoken term discovery, with probabilistic topic models. Most SSRL methods pre-train a model that predicts high-quality pseudo labels generated from an audio-only corpus. These pseudo labels can be turned into a sequence of pseudo subwords by applying deduplication and a subword model. We then apply a topic model based on latent Dirichlet allocation to these pseudo-subword sequences in an unsupervised manner. Clustering performance is evaluated on the Fisher corpus using normalized mutual information. We confirm that the proposed method is effective and improves on an existing approach based on dynamic time warping and topic models, although the experimental setups are not directly comparable.
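
Two steps named in the abstract, deduplication of pseudo labels and evaluation with normalized mutual information (NMI), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the integer labels are made up, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the abstract does not say which variant was used. The subword model and the latent Dirichlet allocation step are not shown.

```python
from collections import Counter
from itertools import groupby
from math import log


def deduplicate(frame_labels):
    """Collapse each run of identical consecutive pseudo labels into one label."""
    return [label for label, _ in groupby(frame_labels)]


def normalized_mutual_information(true_topics, predicted_clusters):
    """NMI between two labelings of the same documents.

    Assumption: mutual information is divided by the arithmetic mean of the
    two entropies; other normalizations (min, max, geometric mean) exist.
    """
    if len(true_topics) != len(predicted_clusters):
        raise ValueError("both labelings must have the same length")
    n = len(true_topics)
    if n == 0:
        raise ValueError("labelings must not be empty")

    count_true = Counter(true_topics)
    count_pred = Counter(predicted_clusters)
    count_joint = Counter(zip(true_topics, predicted_clusters))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    # I(T; C) = sum over observed pairs of p(t, c) * log(p(t, c) / (p(t) p(c)))
    mutual_information = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    mean_entropy = (entropy(count_true) + entropy(count_pred)) / 2
    # If both labelings are constant, the entropies are zero; by convention
    # this sketch treats that case as perfect agreement.
    return 1.0 if mean_entropy == 0 else mutual_information / mean_entropy
```

For example, `deduplicate([5, 5, 5, 12, 12, 7, 7, 7, 5])` returns `[5, 12, 7, 5]`: only consecutive repeats are merged, so the final `5` is kept. NMI does not depend on how clusters are named, which suits unsupervised clustering where cluster indices carry no meaning.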