Citation: Guo, Z.; Wang, C.; Yang, G.; Huang, Z.; Li, G. MSFT-YOLO: Improved YOLOv5 Based on Transformer for Detecting Defects of Steel Surface. Sensors 2022, 22, 3467. https://doi.org/10.3390/s22093467
Academic Editor: Jianbo Yu
Received: 20 February 2022
Accepted: 29 April 2022
Published: 2 May 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
MSFT-YOLO: Improved YOLOv5 Based on Transformer for Detecting Defects of Steel Surface
Zexuan Guo 1, Chensheng Wang 2,*, Guang Yang 2, Zeyuan Huang 3 and Guo Li 1
1 School of Modern Post, Beijing University of Posts and Telecommunications, Beijing 100876, China; gzx152@bupt.edu.cn (Z.G.); liguo@bupt.edu.cn (G.L.)
2 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China; yang@bupt.edu.cn
3 Teaching Affairs Office, Beijing University of Posts and Telecommunications, Beijing 100876, China; huangzeyuan@bupt.edu.cn
* Correspondence: cwang@bupt.edu.cn
Abstract: With the development of artificial intelligence technology and the spread of intelligent
production projects, intelligent inspection systems have become an active topic in industry.
Object detection is a fundamental problem in computer vision, and achieving it in industrial
settings with both high accuracy and real-time speed remains an important challenge in the
development of intelligent inspection systems. The detection of defects on steel surfaces is
an important industrial application of object detection: correct and fast detection of surface
defects can greatly improve productivity and product quality. To this end, this paper introduces the
MSFT-YOLO model, an improvement on a one-stage detector. MSFT-YOLO is designed for
industrial scenarios in which background interference is strong, defect categories are easily
confused, defect scale varies widely, and detection results for small defects are poor. Adding the
TRANS module, which is designed based on Transformer, to the backbone and detection heads
allows the features to be combined with global information. A multi-scale feature fusion structure
fuses features at different scales and enhances the dynamic adjustment of the detector to objects
at different scales. To further improve the performance of MSFT-YOLO, we also introduce
several effective strategies, such as data augmentation and
multi-step training methods. Test results on the NEU-DET dataset show that MSFT-YOLO can
achieve real-time detection, and the average detection accuracy of MSFT-YOLO is 75.2, an
improvement of about 7% over the baseline model (YOLOv5) and 18% over Faster R-CNN.
Keywords: steel surface; detected defects; MSFT-YOLO; YOLOv5; TRANS
1. Introduction
With the continued development of artificial intelligence technology, more and more
industrial applications incorporate it. Computer vision methods, such as object
detection, are now widely used for detecting material surface defects [1].
In workpiece manufacturing, surface defects in the material reduce its strength,
which shortens the service life of the workpiece and lowers its quality. These
problems can be avoided if the material is inspected for defects before processing.
Automated and accurate object detection algorithms therefore play a very important
role in workpiece manufacturing.
In the field of computer vision, CNNs (convolutional neural networks) have become
the dominant model for vision tasks since 2012 [2]. As a hot topic in computer vision,
object detection algorithms can be divided into candidate region-based target detectors
(two-stage) and single-target detectors (one-stage). The representative algorithms of the
one-stage detector are the YOLO (you only look once) series [3–6], and the representative