Citation: Wang, D.; Wei, Y.; Zhang, K.; Ji, D.; Wang, Y. Automatic Speech Recognition
Performance Improvement for Mandarin Based on Optimizing Gain Control Strategy.
Sensors 2022, 22, 3027. https://doi.org/10.3390/s22083027
Academic Editors: Enrico Vezzetti,
Gabriele Baronio, Domenico
Speranza, Luca Ulrich and Andrea
Luigi Guerra
Received: 24 March 2022
Accepted: 12 April 2022
Published: 15 April 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and
conditions of the Creative Commons Attribution (CC BY) license
(https://creativecommons.org/licenses/by/4.0/).
Article
Automatic Speech Recognition Performance Improvement for
Mandarin Based on Optimizing Gain Control Strategy
Desheng Wang, Yangjie Wei *, Ke Zhang, Dong Ji and Yi Wang
Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, School of Computer Science
and Engineering, Northeastern University, Shenyang 110169, China; deshengwang001@gmail.com (D.W.);
1910621@stu.neu.edu.cn (K.Z.); jidong@cse.neu.edu.cn (D.J.); wangyi@cse.neu.edu.cn (Y.W.)
* Correspondence: weiyangjie@cse.neu.edu.cn
Abstract: Automatic speech recognition (ASR) is an essential technique for human–computer
interaction, and gain control is a commonly used operation in ASR. However, inappropriate gain
control strategies can increase the word error rate (WER) of ASR. Because theoretical analysis
and proof of the relationship between gain control and WER are currently lacking, various
unconstrained gain control strategies have been adopted in practical ASR systems, and the optimal
gain control, in the sense of the lowest WER, is rarely achieved. This study proposes a gain
control strategy named maximized original signal transmission (MOST) to minimize the adverse
impact of gain control on ASR systems. First, by modeling the gain control strategy, the
quantitative relationship between the gain control strategy and ASR performance is established
using the noise figure index. Second, through an analysis of this quantitative relationship, an
optimal MOST gain control strategy with minimal performance degradation is theoretically deduced.
Finally, comprehensive comparative experiments on a Mandarin dataset show that the proposed MOST
gain control strategy can significantly reduce the WER of the experimental ASR system, with a 10%
mean absolute WER reduction at −9 dB gain.
Keywords: human–computer interaction; automatic speech recognition (ASR); word error rate
(WER); gain control; noise figure; maximized original signal transmission (MOST)
1. Introduction
Automatic speech recognition (ASR) has been widely integrated into human–robot
interactions in the form of voice user interfaces (VUIs) [1–3]. Virtual assistants [4], vehicle
systems [5], and home automation all make daily life more convenient [6–9], and the
application scope of ASR keeps growing as more people come to regard VUIs as
more natural than graphical user interfaces (GUIs) [10,11].
Currently, the performance of ASR systems in many human–robot interaction
scenarios is unsatisfactory because of limited robustness. One critical factor is
that various practical noises make it more challenging to extract features such as
Mel-frequency cepstral coefficients (MFCC) [12–14], log-channel energies [15], and pitch-based
features [12,16]. Some common noises have been widely studied in ASR,
such as background noise [9,17], reverberation [18–21], and squeal noise, as well as noises
closely related to hardware, such as thermal noise from amplifiers [22], quantization noise from
analog-to-digital converters (ADCs) [23], and signal quality loss caused by coding [24],
compression, and transmission [25]. However, noise related to gain control has received
less attention. Gain control is the amplitude adjustment of a signal, and it is one of
the most frequently used operations in ASR systems. An excessively large gain may prevent the
ASR system from working properly, for example through data overflow on the software side and
clipping on the hardware side. Therefore, gain control in this paper refers to gain control
under the premise that no clipping occurs.
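
To make the no-clipping premise concrete, the following minimal sketch applies a gain
expressed in decibels to a list of audio samples and counts how many samples would exceed the
representable range. The 16-bit signed integer sample format, the function names, and the sample
values are assumptions chosen for illustration and are not taken from the paper; the sketch
shows plain amplitude scaling only, not the MOST strategy.

```python
import math

# Assumed sample format: 16-bit signed PCM (an illustrative choice).
INT16_MAX = 32767
INT16_MIN = -32768


def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain_db):
    """Scale samples by gain_db; return (scaled_samples, clipped_count).

    Samples that fall outside the 16-bit range are saturated and counted.
    """
    factor = db_to_linear(gain_db)
    scaled = []
    clipped = 0
    for s in samples:
        v = int(round(s * factor))
        if v > INT16_MAX or v < INT16_MIN:
            clipped += 1
            v = max(INT16_MIN, min(INT16_MAX, v))
        scaled.append(v)
    return scaled, clipped


def max_gain_without_clipping_db(samples):
    """Largest gain (dB) that keeps the peak sample within the 16-bit range."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return math.inf  # silence can be scaled by any gain
    return 20.0 * math.log10(INT16_MAX / peak)


samples = [1000, -2000, 16000]            # hypothetical input samples
attenuated, n_low = apply_gain(samples, -6.0)
amplified, n_high = apply_gain(samples, 12.0)
headroom_db = max_gain_without_clipping_db(samples)
```

In this sketch, attenuation never clips, whereas a gain above the available headroom saturates
the peak sample, which is the case excluded from the analysis in this paper.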
Sensors 2022, 22, 3027. https://doi.org/10.3390/s22083027 https://www.mdpi.com/journal/sensors