Citation: Chen, M.; Zhao, H.; Liu, P. Monocular 3D Object Detection Based on Uncertainty Prediction of Keypoints. Machines 2022, 10, 19. https://doi.org/10.3390/machines10010019
Academic Editors: Xiaochun Cheng and Daming Shi
Received: 18 November 2021
Accepted: 23 December 2021
Published: 26 December 2021
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
Monocular 3D Object Detection Based on Uncertainty Prediction of Keypoints
Mu Chen 1,2,3,4,*, Huaici Zhao 1,2,3,* and Pengfei Liu 1,2,4
1 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China; liupengfei@sia.cn
2 Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
4 Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016, China
* Correspondence: chenmu@sia.cn (M.C.); hczhao@sia.cn (H.Z.)
Abstract: Three-dimensional (3D) object detection is an important task in machine vision, and detecting 3D objects from monocular images is especially challenging. We observe that most existing monocular methods focus on the design of the feature extraction framework or on embedding geometric constraints, but ignore the errors that may arise in intermediate stages of the detection pipeline. These errors may be further amplified in subsequent stages. After examining existing keypoint-based detection frameworks, we find that the accuracy of keypoint prediction seriously affects the solution for the 3D object position. Therefore, we propose a novel keypoints uncertainty prediction network (KUP-Net) for monocular 3D object detection. In this work, we design an uncertainty prediction module to characterize the uncertainty in keypoint prediction. The uncertainty is then jointly optimized with the object position. In addition, we adopt position encoding to assist the uncertainty prediction, and use a timing coefficient to optimize the learning process. We evaluate our detector on the KITTI benchmark. At the easy and moderate levels, we achieve 17.26 and 11.78 in AP_3D and 23.59 and 16.63 in AP_BEV, respectively, which are higher than those of the latest method, KM3D.
Keywords: keypoints; uncertainty prediction; monocular 3D detection
1. Introduction
The understanding of 3D properties of objects in the real world is critical for vision-based autonomous driving and traffic surveillance systems [1–5]. Compared with the two-dimensional (2D) object detection task, the 3D object detection task involves nine degrees of freedom: the length, width, height, and pose of the 3D bounding box need to be detected. Currently, there are three main approaches to 3D object detection: monocular, stereo-based, and LIDAR-based. Among them, the LIDAR-based and stereo-based methods can usually obtain higher detection accuracy because they have access to reliable depth information. However, LIDAR systems have the disadvantages of high cost, high energy consumption, and short service life. In contrast, the monocular detection method, which is characterized by low cost and low energy consumption, has received extensive attention from researchers. Therefore, our work focuses on improving monocular 3D object detection techniques.
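To make the count of nine degrees of freedom concrete, the following minimal sketch compares a 2D box with a fully oriented 3D box. The class and field names are illustrative assumptions, not notation from this article; the pose is read here as three position parameters plus three rotation angles.

```python
from dataclasses import dataclass, fields


@dataclass
class Box2D:
    """Axis-aligned image box (hypothetical parameterization): 4 parameters."""
    u: float       # box centre, horizontal pixel coordinate
    v: float       # box centre, vertical pixel coordinate
    width: float   # box width in pixels
    height: float  # box height in pixels


@dataclass
class Box3D:
    """Oriented 3D box (hypothetical parameterization): 3 dimensions + 6 pose."""
    length: float  # object dimensions
    width: float
    height: float
    x: float       # position of the box centre in camera coordinates
    y: float
    z: float
    roll: float    # orientation as three rotation angles
    pitch: float
    yaw: float


# Count the free parameters of each representation.
dof_2d = len(fields(Box2D))
dof_3d = len(fields(Box3D))
print(dof_2d, dof_3d)
```

Under this parameterization the 3D box has five more unknowns than the 2D box, and three of them (the position) cannot be read directly from a single image.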
Monocular 3D object detection takes a single RGB image as input and outputs the pose and dimensions of the object in the real world. Due to the lack of depth information, this problem is ill-posed, and ambiguity arises in the inverse projection from the 2D image plane to 3D space. Compared with stereo-based and LIDAR-based methods, the task of monocular 3D object detection is therefore more challenging.
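The ambiguity of inverse projection can be illustrated with a standard pinhole camera model. The sketch below is only an illustration under assumed values: the focal length, principal point, pixel location, and function names are hypothetical and are not taken from this article or from the KITTI calibration files.

```python
def project(point, f=720.0, cx=620.0, cy=190.0):
    """Pinhole projection of a camera-frame point (x, y, z) to a pixel (u, v).

    The intrinsics f, cx, cy are hypothetical example values.
    """
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def back_project(pixel, depth, f=720.0, cx=620.0, cy=190.0):
    """Lift a pixel to 3D, which is only possible once a depth is assumed."""
    u, v = pixel
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


# One observed pixel, three different depth hypotheses (in metres).
pixel = (700.0, 220.0)
candidates = [back_project(pixel, depth) for depth in (5.0, 10.0, 40.0)]

# Every candidate 3D point projects back onto the same pixel.
reprojected = [project(point) for point in candidates]
for point, uv in zip(candidates, reprojected):
    print(point, "->", uv)
```

All three candidate points lie on the same viewing ray and map to the same pixel, so the image location alone cannot determine which depth is correct; additional cues such as object size priors or geometric constraints are needed.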
Thanks to the powerful feature extraction and parameter regression capabilities of the
neural network, some original monocular 3D object detection pipelines [6,7] regress the 3D