Adapting Large-Scale Pre-trained Models for a Unified Dialect Speech Recognition Model
Abstract
Recent advances in deep learning techniques that exploit large-scale data, such as self-supervised learning, have significantly improved the accuracy of speech and language processing technologies for major world languages. However, for dialects with limited transcription resources, technologies such as automatic speech recognition and spoken content search have yet to reach a practical level. This issue is particularly pronounced for Japanese dialects, which are classified into dozens of distinct and mixed varieties, and it remains unresolved. In this study, we focus on two large-scale pre-trained models that have demonstrated top-tier performance in recent automatic speech recognition research, present examples of unified automatic speech recognition systems adapted to Japanese dialects, and discuss their potential application to a content detection task, query-by-example spoken term detection. Both compared models are trained on thousands of hours or more of multilingual speech: one is an automatic speech recognition model based on self-supervised learning, and the other (Whisper) is a model based on multi-task learning, including machine translation. Experiments on the automatic speech recognition models are conducted using several tens of hours of adaptation data covering both standard Japanese and Japanese dialects, which have distinct characteristics depending on the region. The results show that the dialect-independent automatic speech recognition model built on the self-supervised pre-trained model with a 3-step adaptation strategy achieves the best accuracy, a character error rate of 29.2%, suggesting that it is important to account for regional identity given the diversity and limited resources of Japanese dialects.
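To make the adaptation strategy summarized in the abstract more concrete, the sketch below shows, under stated assumptions, how a multilingual self-supervised pre-trained encoder can be fitted with a character-level CTC head and fine-tuned on Japanese (and subsequently dialect) speech. It is not the authors' implementation: the reference list cites toolkits such as fairseq and ESPnet, whereas this sketch uses the Hugging Face Transformers API and the XLS-R 300M checkpoint purely as stand-ins, and the toy vocabulary, dummy batch, and hyper-parameters are illustrative, not values from the study.

```python
import json, pathlib, tempfile

import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Step 1: a character-level output vocabulary. In a real setup this would be
# built from CSJ (standard Japanese) and COJADS (dialect) transcripts; the
# three kana below are only a toy placeholder.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "あ": 3, "い": 4, "う": 5}
vocab_path = pathlib.Path(tempfile.mkdtemp()) / "vocab.json"
vocab_path.write_text(json.dumps(vocab, ensure_ascii=False), encoding="utf-8")

tokenizer = Wav2Vec2CTCTokenizer(
    str(vocab_path), unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Step 2: load a multilingual self-supervised checkpoint (assumed: XLS-R 300M)
# and attach a freshly initialized CTC head sized for the Japanese vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the convolutional front-end fixed while adapting

# Step 3: one illustrative fine-tuning step on a dummy utterance. A staged
# adaptation would repeat this first over standard-Japanese data and then over
# region-specific dialect data.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
waveform = torch.randn(16000)  # 1 s of fake 16 kHz audio standing in for a real utterance
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = tokenizer("あいう", return_tensors="pt").input_ids  # reference transcript

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
optimizer.step()
print(f"dummy adaptation step, CTC loss = {loss.item():.3f}")
```

Freezing the convolutional feature encoder and using a small learning rate are common precautions when only tens of hours of adaptation data are available, which is the data regime described in the abstract.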
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.
References
A. Baevski, Y. Zhou, A. Mohamed, M. Auli, in: Advances in Neural Information Processing Systems, Vol. 33, Eds. H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin, Curran Associates, 2020, p. 12449
W.-N. Hsu, B. Bolte, Y.-H.H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 29, IEEE, 2021, p. 3451
S. Chen, Y. Wu, C. Wang, Z. Chen, Z. Chen, S. Liu, in: 2022 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2022), IEEE, 2022, p. 6152
S. Miwa, A. Kai, in: Proc. Interspeech 2023, 2023, p. 4928
A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, in: Int. Conf. on Machine Learning, 2023, p. 28492
A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, in: Proc. Interspeech 2021, 2021, p. 2426
A. Babu, C. Wang, A. Tjandra et al., in: Proc. Interspeech 2022, 2022, p. 2278
N. Takahashi, S. Miwa, Y. Kamiya, T. Toyama, R. Nahar, A. Kai, in: 2024 IEEE 13th Global Conf. on Consumer Electronics (GCCE 2024), 2024
K. Maekawa, H. Koiso, S. Furui, H. Isahara, in: Proc. of the 2nd Int. Conf. on Language Resources and Evaluation (LREC'00), ELRA, 2000
COJADS, 2024
M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, M. Auli, in: Proc. 2019 Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019) — Demonstrations, Association for Computational Linguistics, 2019, p. 48
S. Watanabe, T. Hori, S. Karita et al., in: Proc. Interspeech 2018, Hyderabad, India, 2018, p. 2207
T. Kudo, J. Richardson, in: Proc. of the 2018 Conf. on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), Eds. E. Blanco, W. Lu, 2018, p. 66
S.-W. Yang, P.-H. Chi, Y.-S. Chuang et al., in: Proc. Interspeech 2021, 2021, p. 1194
Semantic Scholar, "OpenKWS 13 Keyword Search Evaluation Plan 1", 2013