Article
One Spatio-Temporal Sharpening Attention Mechanism for Light-Weight YOLO Models Based on Sharpening Spatial Attention
Mengfan Xue 1, Minghao Chen 1,2, Dongliang Peng 1,*, Yunfei Guo 1 and Huajie Chen 1
Citation: Xue, M.; Chen, M.; Peng, D.; Guo, Y.; Chen, H. One Spatio-Temporal Sharpening Attention Mechanism for Light-Weight YOLO Models Based on Sharpening Spatial Attention. Sensors 2021, 21, 7949. https://doi.org/10.3390/s21237949
Academic Editor: Stefanos Kollias
Received: 12 October 2021
Accepted: 23 November 2021
Published: 28 November 2021
1 School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China; xuemf@hdu.edu.cn (M.X.); 192060268@hdu.edu.cn (M.C.); gyf@hdu.edu.cn (Y.G.); chj247@hdu.edu.cn (H.C.)
2 HDU-ITMO Joint Institute, Hangzhou Dianzi University, Hangzhou 310018, China
* Correspondence: dlpeng@hdu.edu.cn
Abstract: Attention mechanisms have shown great potential for improving the performance of deep convolutional neural networks (CNNs). However, many existing methods are dedicated to developing channel or spatial attention modules with large numbers of parameters, and such complex attention modules inevitably affect the performance of CNNs. In our experiments embedding the Convolutional Block Attention Module (CBAM) in the light-weight model YOLOv5s, CBAM slows the model down, increases its complexity, and reduces its average precision, whereas the Squeeze-and-Excitation (SE) module, used within CBAM, has a positive effect on the model. To replace the spatial attention module in CBAM and offer a suitable arrangement of channel and spatial attention modules, this paper proposes a Spatio-Temporal Sharpening Attention Mechanism (SSAM), which sequentially infers intermediate attention maps through a channel attention module and a Sharpening Spatial Attention (SSA) module. By introducing a sharpening filter into the spatial attention module, we obtain an SSA module with low complexity. To find a scheme that combines our SSA module with the SE module or the Efficient Channel Attention (ECA) module and yields the largest improvement in models such as YOLOv5s and YOLOv3-tiny, we perform various replacement experiments and arrive at a best scheme: embed channel attention modules in the backbone and neck of the model and integrate SSAM into the YOLO head. We verify the positive effect of our SSAM on two general object detection datasets, VOC2012 and MS COCO2017: the former is used to obtain a suitable scheme, and the latter to demonstrate the versatility of our method in complex scenes. Experimental results on both datasets show clear improvements in average precision and detection performance, which demonstrates the usefulness of our SSAM in light-weight YOLO models. Furthermore, visualization results show that our SSAM enhances localization ability.
Keywords: attention mechanism; object detection; YOLO; light-weight model; sharpening filter
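For intuition, the sketch below illustrates the overall idea in PyTorch: a channel attention module (an SE-style module is used here) followed by a spatial attention module that sharpens the pooled feature maps with a fixed filter before computing the spatial weights. The class names, the 3x3 sharpening kernel, and the layer settings are illustrative assumptions, not the authors' exact SSAM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEChannelAttention(nn.Module):
    """SE-style channel attention: global average pool + bottleneck MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class SharpeningSpatialAttention(nn.Module):
    """Spatial attention whose pooled maps are sharpened by a fixed filter."""
    def __init__(self):
        super().__init__()
        # Fixed 3x3 sharpening kernel -- an assumption for illustration;
        # the filter actually used in the paper may differ.
        kernel = torch.tensor([[0., -1., 0.],
                               [-1.,  5., -1.],
                               [0., -1., 0.]]).view(1, 1, 3, 3)
        self.register_buffer("kernel", kernel)
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # Channel-wise average and max pooling, as in CBAM's spatial branch.
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        pooled = torch.cat([avg_map, max_map], dim=1)
        # Sharpen each pooled map to emphasize edges before computing weights.
        sharpened = F.conv2d(pooled, self.kernel.expand(2, 1, 3, 3),
                             padding=1, groups=2)
        return x * torch.sigmoid(self.conv(sharpened))

class SSAMSketch(nn.Module):
    """Channel attention followed by sharpening spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = SEChannelAttention(channels)
        self.spatial_att = SharpeningSpatialAttention()

    def forward(self, x):
        return self.spatial_att(self.channel_att(x))
```

In the arrangement the abstract describes, a channel attention module alone (SE or ECA) would be placed in the backbone and neck of the detector, while a combined module along the lines of the sketch above would be integrated into the YOLO head.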
1. Introduction
Convolutional neural networks have achieved great progress in the field of visual object detection and tracking owing to their rich and expressive representations. Researchers typically study innovations in network depth, width, and structure [1–3]. In addition, the most important indicators for evaluating an object detector are accuracy and speed. Broadly, neural-network-based visual object detectors can be divided into one-stage detectors [4–10] and two-stage detectors [11,12]. The most representative two-stage object detector is the R-CNN [13] series, which generally extracts image features with a feature extraction network, feeds the feature maps into a region proposal network to generate regions of interest as a first prediction, and then performs classification and regression as a second prediction. In contrast, a one-stage detector performs the object detection task in a single prediction step, combining classification and localization. Compared with the two-stage detector, the one-stage detector gains a substantial speed increase at