Citation: Liu, L.; Chen, E.; Ding, Y. TR-Net: A Transformer-Based Neural Network for Point Cloud Processing. Machines 2022, 10, 517. https://doi.org/10.3390/machines10070517
Academic Editors: Shuai Li, Dechao Chen, Mohammed Aquil Mirza, Vasilios N. Katsikis, Dunhui Xiao and Predrag Stanimirović
Received: 2 May 2022
Accepted: 22 June 2022
Published: 27 June 2022
Article
TR-Net: A Transformer-Based Neural Network for Point
Cloud Processing
Luyao Liu 1, Enqing Chen 1,2 and Yingqiang Ding 1,*

1 School of Information Engineering, Zhengzhou University, No. 100 Science Avenue, Zhengzhou 450001, China; luyao@stu.zzu.edu.cn (L.L.); ieeqchen@zzu.edu.cn (E.C.)
2 Henan Xintong Intelligent IOT Co., Ltd., No. 1-303 Intersection of Ruyun Road and Meihe Road, Zhengzhou 450007, China
* Correspondence: dyq@zzu.edu.cn
Abstract: The point cloud is a versatile geometric representation that can be applied to many computer vision tasks. Because point clouds are unordered, designing a deep neural network for point cloud analysis is challenging. Furthermore, most existing frameworks for point cloud processing either barely consider local neighboring information or ignore context-aware and spatially-aware features. To address these problems, we propose a novel transformer-based point cloud processing architecture named TR-Net, which reformulates point cloud processing as a set-to-set translation problem. TR-Net operates directly on raw point clouds without any data transformation or annotation, which reduces computing-resource consumption and memory usage. Firstly, a neighborhood embedding backbone is designed to effectively extract local neighboring information from the point cloud. Then, an attention-based sub-network is constructed to learn a semantically rich and discriminative representation from the embedded features. Finally, effective global features are obtained by feeding the features extracted by the attention-based sub-network into a residual backbone. For different downstream tasks, we build different decoders. Extensive experiments on public datasets show that our approach outperforms other state-of-the-art methods; for example, TR-Net achieves 93.1% overall accuracy on the ModelNet40 dataset and a mIoU of 85.3% on the ShapeNet dataset for part segmentation.
Keywords: point cloud; deep learning; classification; part segmentation; transformer
1. Introduction
A point cloud is a set of points in 3D space that can be viewed as a representation of an object's surface. Because it greatly compensates for the lack of spatial structure information in 2D images, the point cloud has been used extensively in fields such as automatic driving [1], virtual reality [2], and intelligent robotics [3,4]. These contemporary applications usually call for advanced point cloud processing methods. As is well known, a point cloud is unordered and irregular [5], which distinguishes it from 2D images: it is a collection of unevenly sampled points, and any algorithm for point cloud feature extraction must therefore be independent of the order of the input points. On the one hand, this makes the relationships between points difficult to exploit for feature extraction. On the other hand, convolutional neural networks, which are widely applied in image and video processing, cannot be applied directly to point clouds. This research focuses on shape classification and part segmentation of point clouds, two basic and challenging tasks that have received much attention from researchers in point cloud processing.
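The order-independence requirement above is commonly satisfied with a symmetric aggregation function over the point dimension, as popularized by PointNet-style architectures. A minimal NumPy sketch (the function name and the use of raw coordinates as "features" are illustrative choices, not details from this paper):

```python
import numpy as np

def global_feature(points: np.ndarray) -> np.ndarray:
    """points: (N, 3) array of an unordered point set.
    Max pooling over the point axis is symmetric, so the result
    does not depend on the order in which the points are listed."""
    return points.max(axis=0)

cloud = np.array([[0.0, 1.0, 2.0],
                  [3.0, 0.5, 1.0],
                  [2.0, 2.0, 0.0]])
shuffled = cloud[np.random.permutation(len(cloud))]

# Identical output for any permutation of the input points.
assert np.array_equal(global_feature(cloud), global_feature(shuffled))
```

Any symmetric operator (max, sum, mean, or attention with order-agnostic weights) yields the same guarantee; the challenge the paper addresses is doing this while still capturing local neighborhood structure.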
In the early stages of point cloud research, most researchers converted point cloud data into regular 3D voxel grids [6] or collections of images before feeding them into