KEYBIN: KEY-BASED BINNING FOR DISTRIBUTED CLUSTERING
XINYU CHEN, JEREMY BENSON, TRILCE ESTRADA, COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF NEW MEXICO
PROBLEM
The Big Data era brings new challenges to machine learning. Traditional learning algorithms often require centralized data, but modern data sets are collected and stored in a distributed way. We now face the following problems:
1. Moving data is expensive
2. Privacy concerns restrict data movement
3. The curse of dimensionality
4. Noisy features in high-dimensional data
INTRODUCTION
We present keybin, a scalable and accurate clustering algorithm suitable for distributed and privacy-constrained environments. keybin learns from statistical information and avoids pair-wise distance computations. Our contributions are:
1. A scalable and accurate clustering approach
2. A mathematical method to discard noisy features
3. A limited view of the data that preserves privacy
4. A comparison with other clustering algorithms
AN EXAMPLE OF LEARNING HETEROGENEOUS STRUCTURES
Consider data distributed across two sites with different local distributions. The left plot shows the two sites and their local data; the shaded plot shows the true global patterns we want to learn. Moving the data to a central location, however, is either expensive or restricted. keybin instead computes histograms on each site and aggregates them into a global view of the whole data, which is used to assign the final clusters; a minimal sketch of this aggregation follows.
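Below is a minimal sketch of this histogram aggregation, assuming features are scaled to [0, 1] and all sites agree on the bin edges in advance. The function name local_histogram and the bin count are illustrative choices, not part of keybin itself.

```python
import numpy as np

def local_histogram(data, bins=32, value_range=(0.0, 1.0)):
    """Per-feature histogram counts for one site's local data."""
    return np.stack([np.histogram(data[:, j], bins=bins, range=value_range)[0]
                     for j in range(data.shape[1])])

# Two sites whose local data follow different distributions of the same features.
rng = np.random.default_rng(0)
site_a = np.clip(rng.normal(loc=0.3, scale=0.05, size=(500, 2)), 0.0, 1.0)
site_b = np.clip(rng.normal(loc=0.7, scale=0.05, size=(500, 2)), 0.0, 1.0)

# Only the counts travel over the network; the global view is their element-wise sum.
global_counts = local_histogram(site_a) + local_histogram(site_b)
print(global_counts.shape)  # (2, 32): one global density profile per feature
```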
LIMITATION AND FUTURE RESEARCH
keybin assumes features are orthogonal to each other. When correlated features exist, the projections of some clusters overlap on the correlated dimensions, which leads to false positives in keybin. So far we have evaluated our method only on synthetic data sets; we still need to apply keybin to real data sets and to develop methods that handle overlapping clusters.
The full paper is available at: http://cs.unm.edu/~xychen/keybin-cluster17.pdf
ACKNOWLEDGEMENTS
This research was supported by the National Science Foundation under the grant entitled CAREER: Enabling Distributed and In-Situ Analysis for Multi-dimensional Structured Data (NSF ACI-1453430).
EVALUATIONS AND CONCLUSION
No pair-wise distance computations
Learns with limited communication
Scales with data size and dimensionality
METHODS
keybin proceeds in the following steps:
1. Assign keys to data points
2. Aggregate global densities
3. Collapse noisy features
4. Build primary clusters
5. Reduce to final clustering
The design is inspired by hierarchical clustering algorithms: high-dimensional clusters are composed of lower-dimensional primary clusters. Points never need to see other points, and features do not affect each other, which makes keybin ideal for an embarrassingly parallel implementation. A minimal sketch of the pipeline appears below.
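The sketch below walks through the five steps under strong simplifications: keys are per-feature bin indices over [0, 1], noisy-feature collapsing is omitted (see the KS filter sketch in the next section), primary clusters are simply dense 1-D bins, and the final reduction just groups points that share the same tuple of dense keys. The names keybin_sketch, bins, and density_threshold are hypothetical and are not taken from the keybin implementation.

```python
import numpy as np

def keybin_sketch(data, bins=16, density_threshold=10):
    # 1. Assign keys: map each value to a per-feature bin index (features in [0, 1]).
    edges = np.linspace(0.0, 1.0, bins + 1)
    keys = np.clip(np.digitize(data, edges) - 1, 0, bins - 1)

    # 2. Aggregate global densities (in a real run these counts are summed across sites).
    density = np.stack([np.bincount(keys[:, j], minlength=bins)
                        for j in range(data.shape[1])])

    # 3. Collapsing noisy features is skipped in this sketch.

    # 4. Build primary clusters: the 1-D bins whose density exceeds a threshold.
    dense = density >= density_threshold  # boolean mask of shape (n_features, bins)

    # 5. Reduce to the final clustering: points sharing the same tuple of dense
    #    keys form one cluster; points that fall only in sparse bins are noise (-1).
    labels, out = {}, np.empty(len(data), dtype=int)
    for i, row in enumerate(keys):
        sig = tuple(int(k) if dense[j, k] else -1 for j, k in enumerate(row))
        out[i] = -1 if all(k == -1 for k in sig) else labels.setdefault(sig, len(labels))
    return out

# Tiny usage example on synthetic 2-D data with two well-separated blobs in [0, 1]:
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(0.10, 0.15, (200, 2)),
                    rng.uniform(0.70, 0.75, (200, 2))])
print(np.unique(keybin_sketch(x, bins=4)))  # -> [0 1]: one cluster id per blob
```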
COLLAPSING DIMENSIONS
Some features in high-dimensional data contain only noise. We use the Kolmogorov-Smirnov (KS) test to filter them out: we first compute an expected KS score for every feature, and then discard features whose score deviates by more than 0.5σ from the median KS score. A sketch of this filter follows.
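A possible realization of this filter is sketched below using scipy's Kolmogorov-Smirnov test against a uniform reference. The uniform reference and the helper name select_features are assumptions made for illustration; the poster does not specify the exact reference distribution.

```python
import numpy as np
from scipy import stats

def select_features(data):
    """Keep features whose KS score stays within 0.5 sigma of the median score."""
    scores = np.array([stats.kstest(data[:, j], "uniform").statistic
                       for j in range(data.shape[1])])
    median, sigma = np.median(scores), np.std(scores)
    keep = np.abs(scores - median) <= 0.5 * sigma
    return data[:, keep], keep

# Example: the third feature is pure uniform noise and is typically filtered out.
rng = np.random.default_rng(0)
x = np.column_stack([rng.beta(2, 5, 1000), rng.beta(5, 2, 1000), rng.uniform(size=1000)])
filtered, mask = select_features(x)
print(mask)  # expected: [ True  True False]
```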