Citation: Zhao, M.; Zhou, D.; Song, X.; Chen, X.; Zhang, L. DiT-SLAM: Real-Time Dense Visual-Inertial SLAM with Implicit Depth Representation and Tightly-Coupled Graph Optimization. Sensors 2022, 22, 3389. https://doi.org/10.3390/s22093389
Academic Editors: Luis Payá, Oscar Reinoso García and Helder Jesus Araújo
Received: 19 March 2022
Accepted: 27 April 2022
Published: 28 April 2022
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
DiT-SLAM: Real-Time Dense Visual-Inertial SLAM with Implicit Depth Representation and Tightly-Coupled Graph Optimization
Mingle Zhao 1,2, Dingfu Zhou 2,3,*, Xibin Song 2,3, Xiuwan Chen 1 and Liangjun Zhang 2,3
1 Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China; zhaomingle@pku.edu.cn (M.Z.); xwchen@pku.edu.cn (X.C.)
2 Robotics and Autonomous Driving Laboratory, Baidu Research, Beijing 100085, China; song.sducg@gmail.com (X.S.); liangjunzhang@baidu.com (L.Z.)
3 National Engineering Laboratory of Deep Learning Technology and Application, Beijing 100085, China
* Correspondence: dingfuzhou@gmail.com
Abstract: Generating dense maps in real time has recently become an active research topic in the
mobile robotics community, since dense maps provide more informative and continuous features
than sparse maps. Implicit depth representations (e.g., the depth code) derived from deep
neural networks have been employed in visual-only and visual-inertial simultaneous localization and
mapping (SLAM) systems, which achieve promising performance on both camera motion and local
dense geometry estimation from monocular images. However, existing visual-inertial SLAM
systems that use depth codes are either built on a filter-based SLAM framework, which can
only update poses and maps within a relatively small local time window, or built on a loosely-coupled
framework, in which the prior geometric constraints from the depth estimation network are not
exploited to improve state estimation. To address these drawbacks, we propose DiT-SLAM,
a novel real-time Dense visual-inertial SLAM with implicit depth representation and Tightly-coupled
graph optimization. Most importantly, the poses, sparse maps, and low-dimensional depth
codes are optimized in the tightly-coupled graph by considering the visual, inertial, and depth
residuals simultaneously. Meanwhile, we propose a lightweight monocular depth estimation and
completion network, which combines attention mechanisms with the conditional variational
auto-encoder (CVAE) to predict uncertainty-aware dense depth maps from lower-dimensional
codes. Furthermore, a robust point sampling strategy that takes into account the spatial distribution
of 2D feature points is proposed to provide geometric constraints in the tightly-coupled optimization,
especially for textureless or featureless cases in indoor environments. We evaluate our system on
open benchmarks. The proposed methods achieve better performance on both dense depth
estimation and trajectory estimation than the baseline and other systems.
Keywords: visual-inertial SLAM; depth estimation; implicit representation; graph optimization; dense mapping
1. Introduction
Vision-based SLAM systems have been widely explored over the past 20 years, and
many representative systems have been proposed, including filter-based approaches
(e.g., MonoSLAM [1,2]) and optimization-based approaches (such as PTAM [3], DTAM [4],
and the ORB-SLAM series [5–7]). Recently, visual-inertial odometry or SLAM methods
combined with deep neural networks have achieved more accurate localization results [8–11],
while in real-time applications the dominant SLAM approaches are still based on
key point or corner point extraction and tracking for accurate pose estimation. Furthermore,
to build associations across multiple frames over a long time span, a sparse structure map
is usually constructed and the bundle adjustment technique is utilized for optimizing