Citation: Li, Y.; Wen, M.; Fei, J.; Shen, J.; Cao, Y. A Fine-Grained Modeling Approach for Systolic Array-Based Accelerator. Electronics 2022, 11, 2928. https://doi.org/10.3390/electronics11182928
Academic Editors: Phivos Mylonas, Katia Lida Kermanidis and Manolis Maragoudakis
Received: 23 August 2022; Accepted: 13 September 2022; Published: 15 September 2022
Article
A Fine-Grained Modeling Approach for Systolic Array-Based Accelerator
Yuhang Li, Mei Wen *, Jiawei Fei, Junzhong Shen and Yasong Cao
School of Computer Science, National University of Defense Technology, Changsha 410073, China
* Correspondence: meiwen@nudt.edu.cn
Abstract: The systolic array provides extremely high efficiency for running matrix multiplication and is one of the mainstream architectures of today’s deep learning accelerators. To develop efficient accelerators, designers usually rely on simulators to make design trade-offs. However, current simulators suffer from coarse-grained modeling methods and idealized assumptions, which limit their ability to describe the structural characteristics of systolic arrays; in addition, they do not support microarchitecture exploration. This paper presents FG-SIM, a fine-grained, event-driven modeling approach for evaluating systolic array accelerators. Thanks to its fine-grained modeling technique and its rejection of idealized assumptions, FG-SIM obtains accurate results and provides the best mapping scheme for different workloads. Experimental results show that FG-SIM plays a significant role in design trade-offs and outperforms state-of-the-art simulators, with an accuracy of more than 95%.
Keywords: modeling; systolic array; accelerator
1. Introduction
Deep neural networks (DNNs) have come to play an increasingly significant role in image recognition, speech recognition, text classification and other fields [1–3]. As application requirements have continued to grow, complex network models and large numbers of parameters have resulted in longer computation times and declining performance. Deploying hardware accelerators is a common way to address these problems. These accelerators [4–10] use different dataflows and hardware architectures to accelerate the computation in different ways. Existing work shows that dataflow in particular has a substantial impact on data reuse and hardware utilization [7,11,12].
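To make the role of the dataflow concrete, the sketch below computes a single tile of a matrix multiplication and estimates its latency under an idealized output-stationary systolic array. The operand shapes, the input skewing scheme and the cycle-count formula are illustrative assumptions chosen for this example; they are not taken from FG-SIM or from the accelerators cited above.

```python
# Minimal sketch (not the authors' FG-SIM): one tile of A (M x K) times
# B (K x N) on an idealized output-stationary R x C systolic array.
import numpy as np

def systolic_matmul_output_stationary(A, B, rows=4, cols=4):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M <= rows and N <= cols, "single-tile case only"

    acc = np.zeros((rows, cols))            # one accumulator per PE
    # With skewed operand injection, PE (i, j) receives its k-th operand
    # pair at cycle i + j + k, so the last MAC finishes at cycle
    # (M-1) + (N-1) + (K-1).
    for k in range(K):
        for i in range(M):
            for j in range(N):
                acc[i, j] += A[i, k] * B[k, j]   # MAC performed by PE (i, j)

    compute_cycles = (M - 1) + (N - 1) + K       # ideal, no stalls assumed
    return acc[:M, :N], compute_cycles

A = np.random.rand(4, 8)
B = np.random.rand(8, 4)
C, cycles = systolic_matmul_output_stationary(A, B)
print(np.allclose(C, A @ B), cycles)             # True 14
```

Even this idealized estimate shows that the fill and drain latency of the array (the M + N − 2 term) is amortized only when K is large; exposing such trade-offs, together with the stalls that the ideal model hides, is exactly what an accelerator simulator must do.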
With the goal of finding efficient dataflows and hardware architectures, many previous works [7,13–18] have attempted to explore the design space with varying degrees of success. The systolic array has also become one of the mainstream DNN accelerator architectures due to its unique structural characteristics. However, in order to improve the generality of their models, these works [16–18] adopt an overly abstract modeling style, making it difficult to fully describe the specific implementation details of the hardware, so the simulation results they produce are far from the real situation. In addition, these models [15–18] are built on certain assumptions, such as the absence of correlation between data, a sufficient data supply during computation, and so on. These assumptions are often difficult to satisfy in practice, and the problems that arise when they fail are difficult to capture in the model. As a result, the stalls caused by these problems also lead to inaccurate simulation results.
On the other hand, even when the dataflow and hardware structure have been predetermined, the hardware can still employ different scheduling operations and data partitioning schemes during computation, which we refer to as different mappings. Since these mappings can have a significant impact on performance, it is also very important to quickly search for and identify the best mapping for each workload under the given conditions.
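As a simple illustration of why the mapping matters, the hypothetical sketch below enumerates the output tile shapes with which a matrix multiplication can be partitioned onto a fixed-size array and ranks them with a deliberately crude cycle model (array fill and drain plus K accumulation steps per tile, assuming an ideal data supply). The function names and the cost model are assumptions made for this example only and do not reflect the search procedure used by FG-SIM.

```python
# Hypothetical mapping search: pick the output tile shape (tm x tn)
# that minimizes an estimated cycle count on a rows x cols array.
import math
from itertools import product

def tile_cycles(tm, tn, K):
    # one output tile: fill/drain skew plus K accumulation steps
    return (tm - 1) + (tn - 1) + K

def best_mapping(M, K, N, rows=16, cols=16):
    candidates = []
    for tm, tn in product(range(1, rows + 1), range(1, cols + 1)):
        n_tiles = math.ceil(M / tm) * math.ceil(N / tn)
        total = n_tiles * tile_cycles(tm, tn, K)
        candidates.append((total, tm, tn))
    return min(candidates)          # (estimated cycles, tile_M, tile_N)

print(best_mapping(64, 128, 64))    # -> (2528, 16, 16) under this toy model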
Much of the previous work [17,18] has not considered this issue and thus has not been able