CLSAC2018 迁移线程架构上稀疏矩阵向量乘法的优化

VIP文档

ID:31001

大小:5.77 MB

页数:1页

时间:2023-01-15

金币:10

上传者:战必胜
0
20000
40000
60000
80000
100000
120000
140000
ford1
cop20k_A
webbase-1M
rmat
nd24k
audikw_1
MB/s
Bandwidth: Reordering Techniques
Broadwell Xeon - 32 threads
NONE RANDOM BFS METIS
0
50
100
150
200
250
300
350
400
ford1
cop20k_A
webbase-1M
rmat
nd24k
audikw_1
MB/s
Bandwidth: Reordering Techniques
8 nodelets - 64 threads per nodelet
NONE RANDOM BFS METIS
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
ford1
cop20k_A
webbase-1M
rmat
nd24k
audikw_1
coefficient of variation
Coefficient of Variation: Mem Instrs Per Nodelet
8 nodelets - 64 threads per nodelet
ROW NON-ZERO
0
50
100
150
200
250
300
350
400
ford1
cop20k_A
webbase-1M
rmat
nd24k
audikw_1
MB/s
Bandwidth: Row VS Non-zero Distribution
8 nodelets - 64 threads per nodelet
ROW NON-ZERO
Optimizations for Sparse Matrix-Vector Multiply
on a Migratory Thread Architecture
Thomas Rolinger
1,2
; Christopher Krieger
2
1
University of Maryland College Park,
2
Laboratory for Physical Sciences
tbrolin@cs.umd.edu, krieger@lps.umd.edu
Sparse Matrices Evaluated
§ Work distribution and load balancing is of similar importance to
reducing migrations in order to achieve high performance.
§ Explicitly enforcing hardware load balancing for the Emu
architecture is difficult due to thread migrations. Specifically, data
placement and access patterns dictate the work performed by a
given hardware resource and is irrespective of how much work is
initially delegated to each processing element.
§ Use of matrix reordering is more beneficial on the Emu
architecture than a traditional cache-memory based system. We
found that performance can be increased by as much as 70% on
Emu while we observed a maximum gain of 16% on a traditional
architecture. A random reordering can exhibit better performance
on Emu than not reordering at all, which contradicts what we
observe on a traditional system.
Conclusions
Implementation
Name Rows Non-Zeros Density
ford1 18K 100K 2.9 x 10
-4
cop20k_A 120K 2.6M 1.79 x 10
-4
webbase-1M* 1M 3.1M 3.11 x 10
-6
rmat* 445K 7.4M 3.74 x 10
-5
nd24k 72K 28.7M 5.54 x 10
-3
audikw_1 943K 77.6M 8.72 x 10
-5
CSR matrix distributed across 4 nodelets
Abstract Work Distribution Strategies
Spy plots for cop20k_A when reordered
Load Balancing via Matrix Reordering
Results: Matrix reordering can improve SpMV performance on Emu by as much as
70% while a comparable SpMV implementation executed on a traditional architecture
only receives at most a 16% improvement from matrix reordering.
Sparse linear algebra kernels are a crucial component in many large-
scale data analytic applications, such as tensor decomposition and
graph analytics. Many of these kernels are comprised of Sparse
Matrix-Vector Multiplication (SpMV), which is one of the
fundamental operations in linear algebra. Achieving high
performance for SpMV on todays cache-memory based systems is
challenging due to irregular access patterns and weak locality. To
address these challenges, novel systems such as the Emu
architecture have been proposed. The Emu design uses light-weight
migratory threads, narrow memory, and near-memory processing
capabilities to address weak locality and reduce the total load on the
memory system.
In this work, we evaluate the impact of traditional optimizations for
SpMV on the Emu migratory thread architecture. Our goal is to gain
insight into the cost-benefit tradeoffs of standard sparse algorithm
optimizations on Emu hardware.
Emu Architecture
The basic building block of an Emu system is a nodelet which
consists of the following:
§ Gossamer Cores (GCs): general purpose, cache-less processors
that support up to 64 concurrent light-weight threads
§ Narrow Channel DRAM: eight 8-bit channels rather than a
single, wider 64-bit interface
§ Memory-side Processor: performs atomic and remote
operations
8 nodelets are combined together to make up a single node in the
Emu architecture, as shown below.
When a thread on a GC makes a memory request to a remote
address, a migration is generated. A migration involves:
§ A GC issuing a request to the Nodelet Queue Manager (NQM) to
migrate the thread context to the nodelet where the desired
data resides
§ The thread contexts waits in the source nodelets migration
queue until it is accepted by the Migration Engine (ME), which is
the communication fabric that connects multiple nodelets
§ Once accepted, the thread context is sent over the ME and is
processed by the destination nodelets NQM.
The size of a thread context is roughly 200 bytes.
We leverage the Compressed Sparse Row (CSR) storage format to
store sparse matrices, where blocks of rows are distributed to each
nodelet. An example of this distribution is shown below.
The optimizations we consider in this work are work distribution
strategies and matrix reordering techniques.
Obtained from the University of Florida Sparse Matrix Collection.
RMAT matrix generated with a = 0.45, b = 0.22 and c = 0.22 All
matrices are square. “*” denotes non-symmetric matrices.
§ Row: Evenly divide the rows amongst the nodelets
§ Non-zero: Assign rows to such each nodelet receives roughly the same number of
non-zeros.
Results: Distributing by non-zero provides more uniform load balancing by enforcing
each nodelet to issue a comparable amount of memory instructions. This leads to
better overall performance for SpMV.
Despite efforts to lay out and distribute work evenly across the nodelets, all of the
threads could migrate to a single nodelet and oversubscribe that nodelet's resources.
Known matrix reordering techniques can be used to encourage more consistent load
balancing.
§ BFS and METIS: cluster non-zeros on the main diagonal and produce balanced
rows.
§ Random: uniformly spreads out the non-zeros.
资源描述:

当前文档最多预览五页,下载文档查看全文

此文档下载收益归作者所有

当前文档最多预览五页,下载文档查看全文
温馨提示:
1. 部分包含数学公式或PPT动画的文件,查看预览时可能会显示错乱或异常,文件下载后无此问题,请放心下载。
2. 本文档由用户上传,版权归属用户,天天文库负责整理代发布。如果您对本文档版权有争议请及时联系客服。
3. 下载前请仔细阅读文档内容,确认文档内容符合您的需求后进行下载,若出现内容与标题不符可向本站投诉处理。
4. 下载文档时可能由于网络波动等原因无法下载或下载错误,付费完成后未能成功下载的用户请联系客服处理。
关闭