KEYBIN: KEY-BASED BINNING FOR DISTRIBUTED CLUSTERING
XINYU CHEN, JEREMY BENSON, TRILCE ESTRADA, COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF NEW MEXICO
PROBLEM
The Big Data era brings new challenges to machine learning. Traditional learning algorithms often require centralized data, but modern data sets are collected and stored in a distributed way. We now face the following problems:
1. Moving data is expensive
2. Privacy concerns restrict data movement
3. The curse of dimensionality
4. Noisy features in high-dimensional data
INTRODUCTION
We present keybin, a scalable and accurate clustering algorithm suitable for distributed and privacy-constrained environments. It learns from statistical summaries of the data, avoiding pair-wise distance computations. Our contributions are:
1. A scalable and accurate clustering approach
2. A mathematical method to discard noisy features
3. Using a limited view of the data to preserve privacy
4. A comparison with other clustering algorithms
AN EXAMPLE OF LEARNING HETEROGENOUS STRUCTURES
Consider data distributed across two sites with different distributions. The left plot shows the two sites and their local data. The shaded plot shows the true global patterns we want to learn. However, moving the data to a central location is either expensive or restricted. keybin instead computes histograms on each site and then aggregates them into a global view of the whole data to assign final clusters.
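The histogram exchange described above can be illustrated with a minimal one-dimensional sketch (the shared grid, random seeds, and variable names are our own illustration, not the paper's API):

```python
import numpy as np

# Two sites hold local data with different distributions; each bins its
# data on a shared grid and only the histograms (not raw points) move.
rng = np.random.default_rng(0)
site_a = np.clip(rng.normal(0.3, 0.05, 1000), 0.0, 1.0)  # site A's local feature
site_b = np.clip(rng.normal(0.7, 0.05, 1000), 0.0, 1.0)  # site B's local feature

edges = np.linspace(0.0, 1.0, 21)            # shared binning grid (the "keys")
hist_a, _ = np.histogram(site_a, bins=edges)
hist_b, _ = np.histogram(site_b, bins=edges)

# Aggregation: summing counts yields the global density without moving data.
global_hist = hist_a + hist_b
```

Because only bin counts are exchanged, each site reveals a limited, aggregated view of its data, which is the privacy-preserving property the poster highlights.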
LIMITATIONS AND FUTURE RESEARCH
keybin assumes features are orthogonal to each other. When correlated features exist, the projections of some clusters overlap on the correlated dimensions, which leads to false positives in keybin. We used synthetic data sets to test our method; we still need to apply keybin to real data sets and to develop methods for dealing with overlapping clusters.
The full paper is available at: http://cs.unm.edu/~xychen/keybin-cluster17.pdf
ACKNOWLEDGEMENTS
This research was supported by the National Science Foundation under the grant entitled CAREER: Enabling Distributed and In-Situ Analysis for Multidimensional Structured Data (NSF ACI-1453430).
EVALUATIONS AND CONCLUSION
• No pair-wise distance computations
• Learns with limited communication
• Scales with data size and dimensionality
METHODS
keybin proceeds in five steps:
1. Assign keys to data points
2. Aggregate global densities
3. Collapse noisy features
4. Build primary clusters
5. Reduce to the final clustering
The approach is inspired by hierarchical clustering algorithms: high-dimensional clusters consist of lower-dimensional primary clusters. Points do not need to know about other points, and features do not affect each other, making keybin ideal for an embarrassingly parallel implementation.
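Step 1 can be sketched as follows (the function name, bin count, and data range are our assumptions for illustration, not the paper's API): each point is mapped to a discrete key, its bin index along every dimension, so densities can be aggregated per key without any pair-wise distances.

```python
import numpy as np

def assign_keys(points, n_bins=10, low=0.0, high=1.0):
    """Map each point to a tuple of per-dimension bin indices (its key)."""
    scaled = (points - low) / (high - low)          # normalize to [0, 1)
    idx = np.clip((scaled * n_bins).astype(int), 0, n_bins - 1)
    return [tuple(row) for row in idx]

pts = np.array([[0.05, 0.95],
                [0.06, 0.94],
                [0.80, 0.10]])
keys = assign_keys(pts)
# The first two points share a key and would seed the same primary
# cluster; the third point lands in a different key.
```

Since key assignment touches each point independently, this step parallelizes trivially across sites, matching the "points do not know other points" property.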
COLLAPSING DIMENSIONS
Some features in high-dimensional data contain only noise. We use the Kolmogorov-Smirnov test to filter them out: we first compute a KS-score for every feature, then discard features whose score deviates by more than 0.5 σ from the median KS-score.
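A rough sketch of this filter, assuming features are scaled to [0, 1] and noise is uniform (the KS implementation, the 0.5 σ cutoff as a deviation from the median, and the demo data are our reading of the poster, not its exact code):

```python
import numpy as np

def ks_uniform(x):
    """One-sample KS statistic of x against the Uniform(0, 1) CDF."""
    x = np.sort(x)
    n = len(x)
    d_plus = np.max(np.arange(1, n + 1) / n - x)
    d_minus = np.max(x - np.arange(0, n) / n)
    return max(d_plus, d_minus)

def keep_features(data):
    """Keep features whose KS-score is within 0.5 sigma of the median score."""
    scores = np.array([ks_uniform(col) for col in data.T])
    med, sigma = np.median(scores), np.std(scores)
    return np.abs(scores - med) <= 0.5 * sigma

# Demo: three informative (non-uniform) features plus one uniform noise
# feature; the noise feature's KS-score falls far from the median.
rng = np.random.default_rng(1)
informative = rng.beta(2, 8, size=(500, 3))
noise = rng.uniform(size=(500, 1))
data = np.hstack([informative, noise])
mask = keep_features(data)   # the noisy last feature is discarded
```

Informative features deviate strongly from uniformity and score similarly to one another, so the median tracks them and the uniform noise feature is the outlier that gets collapsed.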