tinyML Asia 2021 Chanwoo Kim: A review of on-device fully neural end-to-end speech recognition…



A review of on-device fully neural end-to-end speech recognition and synthesis algorithms
Chanwoo Kim, Corporate Vice President, Samsung

In this talk, we review various end-to-end automatic speech recognition and speech synthesis algorithms and their optimization techniques for on-device applications.

Conventional speech recognition systems comprise a large number of discrete components such as an acoustic model, a language model, a pronunciation model, a text normalizer, an inverse text normalizer, a decoder based on a Weighted Finite-State Transducer (WFST), and so on. To obtain sufficiently high speech recognition accuracy with such conventional systems, a very large language model (up to 100 GB) is usually needed. The corresponding WFST therefore becomes enormous, which prohibits on-device implementation. Recently, fully neural end-to-end speech recognition algorithms have been proposed. Examples include systems based on Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-T), Attention-based Encoder-Decoder (AED) models, Monotonic Chunkwise Attention (MoChA), Transformer-based speech recognition systems, and so on.

The inverse process of speech recognition is speech synthesis, where a text sequence is converted into a waveform. Conventional speech synthesizers are usually based on parametric or concatenative approaches. Even though Text-to-Speech (TTS) systems based on concatenative approaches have shown relatively good sound quality, they cannot easily be employed for on-device applications because of their immense size. Recently, neural speech synthesis approaches based on Tacotron and WaveNet started a new era of TTS with significantly better speech quality. More recently, vocoders based on LPCNet require significantly less computation than WaveNet, which makes it feasible to run these algorithms on on-device platforms. These fully neural network-based systems require much smaller memory footprints than conventional algorithms.
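The CTC-based recognizers mentioned above emit one label per acoustic frame, including a special blank symbol; the output text is recovered by merging consecutive repeats and dropping blanks. A minimal pure-Python sketch of this greedy CTC decoding rule (the blank index of 0 and the integer label set are illustrative assumptions, not from the talk):

```python
BLANK = 0  # assumed blank index, as is conventional in CTC implementations

def ctc_greedy_collapse(frame_labels):
    """Map a per-frame label sequence to an output label sequence
    by merging consecutive repeated labels, then removing blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Frames [a, a, blank, a, b, b, blank] collapse to [a, a, b]:
# the blank separates the two a's, so they are not merged.
print(ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

In a full system this greedy rule is replaced by beam search over the per-frame label distributions, but the collapse step itself is identical.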
