Citation: Tan, Q.; Fang, Z.; Jiang, X. KD-PatchMatch: A Self-Supervised Training
Learning-Based PatchMatch. Appl. Sci. 2023, 13, 2224. https://doi.org/10.3390/app13042224
Academic Editor: Silvia Liberata Ullo
Received: 31 December 2022
Revised: 30 January 2023
Accepted: 6 February 2023
Published: 9 February 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an
open access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
KD-PatchMatch: A Self-Supervised Training Learning-Based PatchMatch
Qingyu Tan, Zhijun Fang * and Xiaoyan Jiang
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science,
Shanghai 201620, China
* Correspondence: zjfang@sues.edu.cn
Abstract: Traditional learning-based multi-view stereo (MVS) methods usually need to find the
correct depth value among a large number of depth candidates, which leads to huge memory
consumption and slow inference. To address these problems, we propose probabilistic depth
sampling within the learning-based PatchMatch framework, i.e., sampling a small number of depth
candidates from a single-view probability distribution, which saves computational resources.
Furthermore, to overcome the difficulty of obtaining ground-truth depth for large-scale outdoor
scenes, we also propose a self-supervised training pipeline based on knowledge distillation,
which consists of self-supervised teacher training followed by student training based on
knowledge distillation. Extensive experiments show that our approach outperforms other recent
learning-based MVS methods on the DTU, Tanks and Temples, and ETH3D datasets.
Keywords: multi-view stereo; learning-based PatchMatch; probabilistic depth sampling; knowledge
distillation
1. Introduction
Given multiple RGB images with known camera poses, multi-view stereo (MVS)
aims to reconstruct a dense 3D point cloud of the imaged scene. Multi-view stereo has
a wide range of applications, including mapping [1], self-driving cars [2], infrastructure
inspection [3], and robotics [4].
Owing to the continuing development of deep learning, convolutional neural networks have
proven highly capable in multi-view 3D reconstruction in recent years. Many learning-based
methods [5–8] can incorporate global semantic information, such as specular and reflection
priors, to make matching more robust and thus address challenges that traditional methods
cannot overcome. However, MVS still faces many challenges, such as untextured areas,
occlusion, and non-Lambertian surfaces [9–11].
Since MVSNet [7] was proposed, learning-based MVS methods have constructed the cost
volume of image pairs using fronto-parallel planes and differentiable homography. Many
subsequent networks build on this basis. For example, R-MVSNet [8] innovates the
regularization of the cost volume along the depth dimension by processing it layer by
layer with Conv-GRU to reduce memory consumption; CasMVSNet [12] proposes the first
coarse-to-fine structure paradigm to optimize memory consumption and computational
efficiency; Vis-MVSNet [13] and CVP-MVSNet [14] examine, respectively, how the cost
volumes from multiple views are aggregated and how the range of depth hypotheses is set
in the later coarse-to-fine stages, resulting in substantial performance
improvements. PatchMatchNet [15] is the first model that introduces the traditional stereo
matching algorithm (PatchMatch) into an end-to-end MVS framework.
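
To make the homography-based cost-volume construction mentioned above more concrete, the following is a minimal sketch of the textbook homography induced by a plane that is parallel to the reference image plane at depth d. It is an illustration under stated assumptions, not the formulation of MVSNet [7] or of this paper: the pose convention X_src = R X_ref + t, the zero-skew pinhole intrinsics, and all function names are hypothetical choices made for this example.

```python
# Sketch: homography induced by a fronto-parallel plane at a given depth.
# Assumed convention (not taken from the paper): a point X_ref in the reference
# camera frame maps to the source camera frame as X_src = R @ X_ref + t.
# All function names here are hypothetical.


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def intrinsics(fx, fy, cx, cy):
    """Zero-skew pinhole intrinsic matrix K."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(fx, fy, cx, cy):
    """Closed-form inverse of the zero-skew intrinsic matrix."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def plane_homography(k_src, k_ref_inv, rot, trans, depth):
    """H(d) = K_src (R + t n^T / d) K_ref^{-1} with plane normal n = (0, 0, 1)."""
    # Because n = (0, 0, 1), the term t n^T / d only changes the third column.
    m = [[rot[i][j] + (trans[i] / depth if j == 2 else 0.0) for j in range(3)]
         for i in range(3)]
    return matmul(matmul(k_src, m), k_ref_inv)


def warp_pixel(h, u, v):
    """Apply homography h to reference pixel (u, v) and dehomogenize."""
    x, y, w = (row[0] * u + row[1] * v + row[2] for row in h)
    return x / w, y / w


def project(k, point):
    """Project a 3D point in camera coordinates using zero-skew intrinsics k."""
    x, y, z = point
    return k[0][0] * x / z + k[0][2], k[1][1] * y / z + k[1][2]


if __name__ == "__main__":
    k = intrinsics(500.0, 500.0, 320.0, 240.0)
    k_inv = intrinsics_inv(500.0, 500.0, 320.0, 240.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    baseline = [0.1, 0.0, 0.0]  # pure sideways translation between the views
    for depth in (1.0, 2.0, 4.0):
        h = plane_homography(k, k_inv, identity, baseline, depth)
        print(depth, warp_pixel(h, 320.0, 240.0))
```

In this usage example, the same reference pixel lands at a different source location for each depth hypothesis, and the shift shrinks as the hypothesized depth grows. Comparing features at these warped locations over a set of depth hypotheses is the basic idea behind a plane-sweep cost volume.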
Most learning-based MVS methods [5,7,8,13,16] employ the same set of depth hypothesis
candidates for all pixels (i.e., sampled between hand-picked limits d_min and d_max),