Self-Supervised Model for Speech Tasks with Hugging Face Transformers
Abstract
For many years, speech recognition has been a focus of research. Automatic speech recognition (ASR) is the process of converting a speech signal into its corresponding sequence of words or other linguistic entities using algorithms implemented in a device. As our work and life become increasingly integrated with mobile devices such as tablets and smartphones (e.g., Amazon Alexa, Siri, Google Now, and Cortana), speech recognition technology has quickly become one of the most popular modes of communication. This trend is attributed to significant progress in several areas, such as high computing power and powerful deep learning models, which have led to dramatically lower error rates in speech recognition systems. In this regard, our research focuses on reducing the error rate by using a self-supervised model for speech tasks. This paper presents the XLS-R model for multilingual speech representation learning based on wav2vec 2.0. The XLS-R model learns basic speech units in order to solve a self-supervised task: it is trained to predict the correct speech units for masked parts of the audio while simultaneously learning what those units should be. The XLS-R model is fine-tuned using Connectionist Temporal Classification (CTC), a technique for training neural networks to solve sequence-to-sequence problems such as automatic speech recognition (ASR) and handwriting recognition. We use the Common Voice corpus in the Turkish language. The model performs well, and the word error rate (WER) is significantly decreased.
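The sketch below illustrates the kind of setup the abstract describes: loading a pretrained XLS-R checkpoint with Hugging Face Transformers and attaching a CTC head for fine-tuning. It is a minimal illustration, not the paper's exact configuration; the checkpoint name facebook/wav2vec2-xls-r-300m, the character-level vocab.json, and the dummy audio/transcript pair are assumptions made for the example.

```python
# Minimal sketch of preparing XLS-R for CTC fine-tuning with Hugging Face Transformers.
# Assumptions (not taken from the paper): the "facebook/wav2vec2-xls-r-300m" checkpoint,
# a character-level vocab.json built beforehand from Turkish Common Voice transcripts,
# and a dummy audio/transcript pair standing in for the real corpus.
import numpy as np
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Character-level tokenizer over the Turkish vocabulary.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the self-supervised XLS-R encoder and add a randomly initialised CTC head.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # the convolutional feature encoder is usually kept frozen

# One CTC training step on a placeholder (audio, transcript) pair.
audio = np.random.randn(16_000).astype(np.float32)   # 1 s of dummy 16 kHz audio
transcript = "merhaba dünya"                          # placeholder Turkish sentence
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss            # CTC loss
loss.backward()
```

After fine-tuning, a word error rate of the kind reported here can be computed by decoding the model's predictions and comparing them with the reference transcripts, for example with the `wer` metric from the `evaluate` library.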
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.