Citation: Wang, S.; Huang, H.; Shi, S. Improving Non-Autoregressive Machine Translation Using Sentence-Level Semantic Agreement. Appl. Sci. 2022, 12, 5003. https://doi.org/10.3390/app12105003
Academic Editors: Phivos Mylonas, Katia Lida Kermanidis and Manolis Maragoudakis
Received: 19 April 2022; Accepted: 13 May 2022; Published: 16 May 2022
Article
Improving Non-Autoregressive Machine Translation Using
Sentence-Level Semantic Agreement
Shuheng Wang ¹, Heyan Huang ² and Shumin Shi ²,*
¹ School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; wsh@njust.edu.cn
² School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100811, China; hhy63@bit.edu.cn
* Correspondence: bjssm@bit.edu.cn
Abstract: The inference stage can be accelerated significantly using a Non-Autoregressive Transformer (NAT). However, the training objective of the NAT model aims to minimize the loss between the generated words and the golden words in the reference. Since the dependencies between the target words are lacking, this word-level training objective can easily cause semantic inconsistency between the generated and source sentences. To alleviate this issue, we propose a new method, Sentence-Level Semantic Agreement (SLSA), to obtain consistency between the source and generated sentences. Specifically, we utilize contrastive learning to pull the sentence representations of the source and generated sentences closer together. In addition, to strengthen the capability of the encoder, we integrate an agreement module into the encoder to obtain a better representation of the source sentence. The experiments are conducted on three translation datasets: the WMT 2014 EN→DE task, the WMT 2016 EN→RO task, and the IWSLT 2014 DE→EN task, and the improvements in the NAT model's performance demonstrate the effectiveness of our proposed method.
Keywords: machine translation; non-autoregressive; contrastive learning; semantic agreement
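To make the contrastive objective concrete, the following is a minimal sketch of a sentence-level agreement loss in PyTorch. It is our illustration rather than the paper's exact implementation: the mean pooling, the temperature value, and the use of in-batch negatives are all assumptions.

```python
import torch
import torch.nn.functional as F

def slsa_loss(src_hidden, tgt_hidden, temperature=0.1):
    """Contrastive sentence-level agreement loss (illustrative sketch).

    src_hidden: (batch, src_len, dim) encoder states of the source sentences
    tgt_hidden: (batch, tgt_len, dim) decoder states of the generated sentences
    Each sequence is pooled into a single sentence vector; the matching
    source/target pair in the batch is the positive, all other pairs
    serve as negatives (InfoNCE-style).
    """
    # Mean-pool to one sentence representation per sequence (an assumption;
    # max pooling or a special token could serve the same purpose).
    src_repr = F.normalize(src_hidden.mean(dim=1), dim=-1)  # (batch, dim)
    tgt_repr = F.normalize(tgt_hidden.mean(dim=1), dim=-1)  # (batch, dim)

    # Cosine-similarity matrix between all source/target pairs in the batch.
    logits = src_repr @ tgt_repr.t() / temperature          # (batch, batch)

    # The diagonal holds the true (source, generated) pairs.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Minimizing this loss pulls each generated sentence's pooled representation toward that of its own source sentence while pushing it away from the other sources in the batch, which is one standard way to realize the "pull closer together" behavior described above.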
1. Introduction
Over the years, tremendous success has been achieved in encoder–decoder-based neural machine translation (NMT) [1–3]. The encoder maps the source sentence into a hidden representation, from which the decoder generates the target sentence in an autoregressive manner. This autoregressive approach has helped NMT models attain high accuracy [3]. However, because each step requires the previously predicted words as inputs, it also limits the speed of the inference stage. Recently, Gu et al. [4] proposed the non-autoregressive transformer (NAT) to break this limitation and reduce inference latency. The NAT model also adopts the encoder–decoder framework, but by removing the autoregressive dependency in the decoder, it can significantly expedite the decoding stage. Yet, the performance of the NAT model still lags behind that of the NMT model.
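As a rough illustration of where the latency difference comes from (our own sketch; `decoder_step` and `decoder_parallel` are hypothetical method names, not from the paper), autoregressive decoding is a sequential loop, while NAT decoding is a single parallel pass:

```python
# Autoregressive decoding: each step feeds back the previous prediction,
# so the loop runs once per target token and cannot be parallelized.
def decode_autoregressive(model, src, max_len, bos_id):
    ys = [bos_id]
    for _ in range(max_len):
        next_word = model.decoder_step(src, ys).argmax(-1)  # one token per step
        ys.append(next_word)
    return ys

# Non-autoregressive decoding: all target positions are predicted
# in one forward pass, independently of each other.
def decode_non_autoregressive(model, src, tgt_len):
    return model.decoder_parallel(src, tgt_len).argmax(-1)  # all tokens at once
```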
During training, the NAT model, like the NMT model, uses a word-level cross-entropy loss to optimize the whole model. In the non-autoregressive setting, however, the dependencies among the target words cannot be learned properly with this word-level cross-entropy [5]. Although it encourages the NAT model to generate the correct token at each position, the lack of target dependency means the model cannot account for global correctness. The NAT model cannot efficiently model target dependencies, and the cross-entropy loss further weakens this ability, causing undertranslation or overtranslation [5].
Recently, some research has proposed ways to alleviate this issue. For example, Sun et al. [6] utilized a CRF module to model the global path in the decoder, and Shao et al. [5] used a bag-of-words loss to encourage the NAT model to capture target dependencies. However, this previous research only considered global or partial modeling