Optimizations for Sparse Matrix-Vector Multiply on a Migratory Thread Architecture
Thomas Rolinger¹,²; Christopher Krieger²
¹University of Maryland College Park, ²Laboratory for Physical Sciences
tbrolin@cs.umd.edu, krieger@lps.umd.edu

[Figure: Bandwidth: Reordering Techniques (Broadwell Xeon, 32 threads). MB/s per matrix (ford1, cop20k_A, webbase-1M, rmat, nd24k, audikw_1) for NONE, RANDOM, BFS, and METIS.]
[Figure: Bandwidth: Reordering Techniques (8 nodelets, 64 threads per nodelet). MB/s per matrix for NONE, RANDOM, BFS, and METIS.]
[Figure: Coefficient of Variation: Mem Instrs Per Nodelet (8 nodelets, 64 threads per nodelet). Coefficient of variation per matrix for the ROW and NON-ZERO distributions.]
[Figure: Bandwidth: Row vs. Non-zero Distribution (8 nodelets, 64 threads per nodelet). MB/s per matrix for the ROW and NON-ZERO distributions.]
Conclusions
§ Work distribution and load balancing are as important as reducing migrations for achieving high performance.
§ Explicitly enforcing hardware load balancing on the Emu architecture is difficult due to thread migrations. Specifically, data placement and access patterns dictate the work performed by a given hardware resource, irrespective of how much work is initially delegated to each processing element.
§ Matrix reordering is more beneficial on the Emu architecture than on a traditional cache-memory based system. We found that performance can be increased by as much as 70% on Emu, while we observed a maximum gain of 16% on a traditional architecture. A random reordering can exhibit better performance on Emu than not reordering at all, which contradicts what we observe on a traditional system.

Sparse Matrices Evaluated
Name          Rows    Non-Zeros    Density
ford1         18K     100K         2.9 x 10^-4
cop20k_A      120K    2.6M         1.79 x 10^-4
webbase-1M*   1M      3.1M         3.11 x 10^-6
rmat*         445K    7.4M         3.74 x 10^-5
nd24k         72K     28.7M        5.54 x 10^-3
audikw_1      943K    77.6M        8.72 x 10^-5

Matrices were obtained from the University of Florida Sparse Matrix Collection. The RMAT matrix was generated with a = 0.45, b = 0.22, and c = 0.22. All matrices are square; "*" denotes non-symmetric matrices.
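Density here is simply non-zeros divided by total entries (rows x columns, with rows = columns since all matrices are square). The sketch below recomputes the densities from the rounded row and non-zero counts in the table, so its output only approximately matches the listed values.

```c
/* Recompute matrix density = nnz / (rows * cols) from the table's rounded
 * counts. Because the counts are rounded (18K, 100K, ...), the results only
 * approximate the densities listed above. */
#include <stdio.h>

int main(void)
{
    const char *name[] = {"ford1", "cop20k_A", "webbase-1M", "rmat",
                          "nd24k", "audikw_1"};
    double rows[] = {18e3, 120e3, 1e6, 445e3, 72e3, 943e3};
    double nnz[]  = {100e3, 2.6e6, 3.1e6, 7.4e6, 28.7e6, 77.6e6};

    for (int i = 0; i < 6; i++)
        printf("%-12s density ~ %.2e\n", name[i], nnz[i] / (rows[i] * rows[i]));
    return 0;
}
```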
Abstract
Sparse linear algebra kernels are a crucial component in many large-
scale data analytic applications, such as tensor decomposition and
graph analytics. Many of these kernels rely on Sparse
Matrix-Vector Multiplication (SpMV), which is one of the
fundamental operations in linear algebra. Achieving high
performance for SpMV on today’s cache-memory based systems is
challenging due to irregular access patterns and weak locality. To
address these challenges, novel systems such as the Emu
architecture have been proposed. The Emu design uses light-weight
migratory threads, narrow memory, and near-memory processing
capabilities to address weak locality and reduce the total load on the
memory system.
In this work, we evaluate the impact of traditional optimizations for
SpMV on the Emu migratory thread architecture. Our goal is to gain
insight into the cost-benefit tradeoffs of standard sparse algorithm
optimizations on Emu hardware.
Emu Architecture
The basic building block of an Emu system is a nodelet, which
consists of the following:
§ Gossamer Cores (GCs): general purpose, cache-less processors
that support up to 64 concurrent light-weight threads
§ Narrow Channel DRAM: eight 8-bit channels rather than a
single, wider 64-bit interface
§ Memory-side Processor: performs atomic and remote
operations
Eight nodelets are combined to make up a single node in the
Emu architecture.
When a thread on a GC makes a memory request to a remote
address, a migration is generated. A migration involves:
§ A GC issuing a request to the Nodelet Queue Manager (NQM) to
migrate the thread context to the nodelet where the desired
data resides
§ The thread context waits in the source nodelet’s migration
queue until it is accepted by the Migration Engine (ME), which is
the communication fabric that connects multiple nodelets
§ Once accepted, the thread context is sent over the ME and is
processed by the destination nodelet’s NQM.
The size of a thread context is roughly 200 bytes.
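To make this cost concrete, here is a toy sketch (plain C, not Emu code) of the behaviour described above: the thread is treated as resident on the nodelet owning the data it last touched, every access to data owned by another nodelet counts as a migration, and each migration moves a roughly 200-byte context. The block address-to-nodelet mapping is an illustrative assumption, not the hardware's actual interleaving.

```c
/* Toy model of thread migration: each access to data owned by a different
 * nodelet triggers a migration, and each migration carries ~200 bytes of
 * thread context (figure taken from the text above). */
#include <stdio.h>

#define NODELETS      8
#define CONTEXT_BYTES 200

/* Illustrative owner function: block-distribute n elements over nodelets. */
static int owner(long index, long n)
{
    long block = (n + NODELETS - 1) / NODELETS;
    return (int)(index / block);
}

int main(void)
{
    long n = 1024;                          /* elements of a shared array   */
    long accesses[] = {3, 700, 701, 90, 1000, 5, 800};
    int  current = 0;                       /* nodelet the thread starts on */
    long migrations = 0;

    for (size_t i = 0; i < sizeof accesses / sizeof accesses[0]; i++) {
        int dest = owner(accesses[i], n);
        if (dest != current) {              /* remote access -> migrate */
            migrations++;
            current = dest;
        }
    }

    printf("migrations: %ld, context traffic: %ld bytes\n",
           migrations, migrations * CONTEXT_BYTES);
    return 0;
}
```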
Implementation
The optimizations we consider in this work are work distribution strategies and matrix reordering techniques. We leverage the Compressed Sparse Row (CSR) storage format to store sparse matrices, where blocks of rows are distributed to each nodelet. An example of this distribution is shown below.

[Figure: CSR matrix distributed across 4 nodelets]
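As a concrete reference for the CSR layout and block-row distribution just described, the sketch below runs SpMV over contiguous row blocks, one block per nodelet. It is plain, serial C for illustration only; actual Emu code would spawn threads and place each row block in its nodelet's memory, which is not shown here.

```c
/* Minimal CSR SpMV sketch with a block-row distribution across "nodelets". */
#include <stdio.h>

#define NODELETS 4

/* 4x4 example matrix in CSR form (values, column indices, row pointers). */
static double vals[]   = {10, 2, 3, 4, 5, 6, 7};
static int    cols[]   = { 0, 1, 1, 2, 0, 2, 3};
static int    rowptr[] = { 0, 2, 4, 6, 7};   /* length = nrows + 1 */
static const int nrows = 4;

static void spmv_rows(int row_begin, int row_end, const double *x, double *y)
{
    for (int i = row_begin; i < row_end; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += vals[j] * x[cols[j]];
        y[i] = sum;
    }
}

int main(void)
{
    double x[] = {1, 1, 1, 1};
    double y[4] = {0};

    /* Evenly divide rows into contiguous blocks, one block per nodelet. */
    int rows_per_nodelet = (nrows + NODELETS - 1) / NODELETS;
    for (int n = 0; n < NODELETS; n++) {
        int begin = n * rows_per_nodelet;
        int end   = begin + rows_per_nodelet;
        if (end > nrows) end = nrows;
        if (begin < nrows)
            spmv_rows(begin, end, x, y);
    }

    for (int i = 0; i < nrows; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```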
Work Distribution Strategies
§ Row: evenly divide the rows among the nodelets.
§ Non-zero: assign rows such that each nodelet receives roughly the same number of non-zeros.

Results: Distributing by non-zero provides more uniform load balancing by forcing each nodelet to issue a comparable number of memory instructions. This leads to better overall performance for SpMV.
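A minimal sketch of the non-zero-balanced split described above, using only the CSR row pointer. This is an illustrative, host-side partitioner assuming contiguous row blocks; it is not Emu-specific code.

```c
/* Place partition boundaries so that each contiguous row block holds roughly
 * total_nnz / parts non-zeros, using only the CSR row pointer. */
#include <stdio.h>

static void partition_by_nonzeros(const int *rowptr, int nrows, int parts,
                                  int *row_begin /* length parts + 1 */)
{
    long total_nnz = rowptr[nrows];
    int row = 0;
    row_begin[0] = 0;
    for (int p = 1; p < parts; p++) {
        long target = total_nnz * p / parts;   /* cumulative non-zero goal */
        while (row < nrows && rowptr[row] < target)
            row++;
        row_begin[p] = row;
    }
    row_begin[parts] = nrows;
}

int main(void)
{
    /* Row pointer for a skewed 6-row matrix: row 0 holds most non-zeros. */
    int rowptr[] = {0, 10, 11, 12, 13, 14, 16};
    int nrows = 6, parts = 2;
    int bounds[3];

    partition_by_nonzeros(rowptr, nrows, parts, bounds);
    for (int p = 0; p < parts; p++)
        printf("partition %d: rows [%d, %d), %d non-zeros\n",
               p, bounds[p], bounds[p + 1],
               rowptr[bounds[p + 1]] - rowptr[bounds[p]]);
    return 0;
}
```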
Load Balancing via Matrix Reordering
Despite efforts to lay out and distribute work evenly across the nodelets, all of the threads could still migrate to a single nodelet and oversubscribe that nodelet's resources. Known matrix reordering techniques can be used to encourage more consistent load balancing:
§ BFS and METIS: cluster non-zeros along the main diagonal and produce balanced rows.
§ Random: uniformly spreads out the non-zeros.

[Figure: Spy plots for cop20k_A when reordered]

Results: Matrix reordering can improve SpMV performance on Emu by as much as 70%, while a comparable SpMV implementation executed on a traditional architecture gains at most a 16% improvement from matrix reordering.
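As an illustration of the BFS-style reordering listed above, the sketch below computes a breadth-first ordering of a small symmetric CSR pattern and reports it as a row/column permutation. This is a simplified stand-in (no level-set or degree tie-breaking, start vertex fixed at row 0), not the exact reordering pipeline used in the study.

```c
/* BFS-based reordering sketch: traverse the matrix's adjacency structure
 * breadth-first and use the visit order as a symmetric row/column
 * permutation. The pattern is assumed symmetric. */
#include <stdio.h>

#define N   6
#define NNZ 14

/* CSR pattern of a small symmetric 6x6 matrix (values omitted). */
static int rowptr[N + 1] = {0, 2, 5, 7, 10, 12, 14};
static int cols[NNZ]     = {0, 3,  1, 2, 4,  1, 2,  0, 3, 5,  1, 4,  3, 5};

int main(void)
{
    int perm[N];                  /* perm[k] = original row visited k-th */
    int visited[N] = {0};
    int queue[N], head = 0, tail = 0;
    int next = 0;

    /* Run BFS from every unvisited vertex so disconnected parts are covered. */
    for (int s = 0; s < N; s++) {
        if (visited[s]) continue;
        visited[s] = 1;
        queue[tail++] = s;
        while (head < tail) {
            int u = queue[head++];
            perm[next++] = u;
            for (int j = rowptr[u]; j < rowptr[u + 1]; j++) {
                int v = cols[j];
                if (!visited[v]) {
                    visited[v] = 1;
                    queue[tail++] = v;
                }
            }
        }
    }

    printf("BFS ordering (new position -> original row):\n");
    for (int k = 0; k < N; k++)
        printf("%d -> %d\n", k, perm[k]);
    return 0;
}
```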