A Semi-Supervised Learning Approach To Differential Privacy
Geetha Jagannathan
Department of Computer Science
George Washington University
Washington, D.C.
geetha@geeiscool.com
Claire Monteleoni
Department of Computer Science
George Washington University
Washington, D.C.
cmontel@gwu.edu
Krishnan Pillaipakkamnatt
Department of Computer Science
Hofstra University
Hempstead, N.Y.
csckzp@hofstra.edu
Abstract—Motivated by the semi-supervised model in the
data mining literature, we propose a model for differentially-
private learning in which private data is augmented by public
data to achieve better accuracy. Our main result is a
differentially private classifier with significantly improved accuracy
compared to previous work. We experimentally demonstrate
that such a classifier produces good prediction accuracies
even in those situations where the amount of private data is
fairly limited. This expands the range of useful applications
of differential privacy since typical results in the differential
privacy model require large private data sets to obtain good
accuracy.
I. INTRODUCTION
Databases that contain sensitive information about in-
dividuals need to be safeguarded from malicious access
that can compromise privacy. Typical attacks assume an
adversary who can use public information to gain insight
into a target individual’s information stored in the database.
Differential privacy [8] offers protection against such attacks
irrespective of any auxiliary information that may be avail-
able to an adversary trying to breach privacy. In this work,
we consider auxiliary information (i.e., public data) from a
different perspective. We ask the question: Can non-private
data be used to “boost” the accuracy of differentially-private
algorithms? We answer this question in the affirmative.
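For reference, the guarantee cited above from [8] is usually stated as follows. This is the standard definition of ε-differential privacy, not a formula taken from this section; the symbols $\mathcal{M}$ (a randomized algorithm), $D$ and $D'$ (two databases differing in one individual's record), $S$ (a set of outputs), and $\varepsilon$ (the privacy parameter) are introduced here for illustration.

```latex
\Pr\left[\mathcal{M}(D) \in S\right] \;\le\; e^{\varepsilon} \, \Pr\left[\mathcal{M}(D') \in S\right]
\qquad \text{for all neighboring } D, D' \text{ and all } S \subseteq \mathrm{Range}(\mathcal{M}).
```

Because the bound holds for every pair of neighboring databases, it does not depend on what else the adversary knows, which is why the protection is independent of auxiliary information.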
Motivated by the well-known semi-supervised model in
the machine learning literature, we propose a learning model
that uses both private and non-private data. We design a
data mining algorithm where non-private data is used in
conjunction with a small amount of private data to increase
the accuracy of differentially-private learners.
Semi-Supervised Learning: In the semi-supervised
model [5] of machine learning, a learner has access to both
labeled and unlabeled data. Usually, the learner has only a
small amount of labeled data available, while the amount of
unlabeled data is much larger. This imbalance in the amounts
of data is due to the fact that labeling training instances can
be an expensive, time-consuming and difficult process that
requires human domain experts. Unlabeled data is usually
much easier to acquire. In the model of semi-supervised
classification the goal is to use both labeled and unlabeled
data to create a classifier that is better than one created using
the labeled data alone. It has been shown in the machine
learning literature that in many cases, unlabeled data used
in conjunction with a small amount of labeled data can lead
to an increase in the accuracy of classifiers produced by the
learning algorithms.
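As a rough illustration of how unlabeled data can be combined with a small labeled set, the sketch below uses self-training, one common semi-supervised technique. It is not the algorithm proposed in this paper and it has no privacy mechanism. The nearest-centroid classifier, the function names, the confidence threshold, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    """Return predicted labels and a margin-based confidence per row.

    Assumes at least two classes, so that a second-nearest centroid exists.
    """
    classes, centroids = model
    # Distance from every point to every class centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    ordered = np.sort(dists, axis=1)
    confidence = ordered[:, 1] - ordered[:, 0]  # gap between two nearest centroids
    return classes[dists.argmin(axis=1)], confidence

def self_train(X_lab, y_lab, X_unlab, rounds=5, threshold=1.0):
    """Self-training: repeatedly pseudo-label confident unlabeled points."""
    X_lab, y_lab, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    model = fit_centroids(X_lab, y_lab)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        labels, conf = predict(model, pool)
        keep = conf >= threshold
        if not keep.any():
            break
        # Move confidently labeled points from the pool into the training set.
        X_lab = np.vstack([X_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, labels[keep]])
        pool = pool[~keep]
        model = fit_centroids(X_lab, y_lab)
    return model

# Hypothetical data: two Gaussian clusters, three labeled points per class.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(200, 2))
X1 = rng.normal(loc=2.0, size=(200, 2))
X_lab = np.vstack([X0[:3], X1[:3]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unlab = np.vstack([X0[3:], X1[3:]])

model = self_train(X_lab, y_lab, X_unlab)
```

The design choice here is to accept only pseudo-labels with a large margin, since self-training can reinforce its own mistakes if low-confidence predictions are added to the training set.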
Private and Non-private Data: Prior work in the differ-
ential privacy model has assumed that data being analyzed
is entirely private. However, this is not always a realistic
presumption. Consider the two following scenarios:
1) An organization surveys young customers at a clothing
store about their purchasing habits. Individuals for
whom privacy is a substantial concern may insist that
the organization use their data only in ways that do
not reveal anything about them. On the other hand,
some individuals may be willing to give up their
data privacy in exchange for some compensation. The
survey organization would then have a database that
contains both private as well as non-private data. It
would be reasonable to assume that the presence of
non-private data can improve the quality of the data
analysis they would like to perform.
2) A confidential survey about residents in a community
may include information about whether or not each
respondent in the database has an annual salary of
at least $100,000. A public database (such as a voter
registration database) about the same community will
not include such confidential information. An organi-
zation that wants to mine the private database would
perhaps benefit from the public database, even though
it is missing an important confidential attribute.
Our Contributions: The main goal of this paper is to
provide a learning model that addresses the above-mentioned
scenarios. We consider the problem of improving the accuracy
of a differentially private classifier using non-private
data when only a small amount of private data is available.
One naive approach is to consider all the non-private data as
additional private data and construct a differentially private
classifier on the combined data. In doing so, the classifier
pays the cost of privacy even when it is not required to do
so for non-private data. This can result in lowered accuracy.
In contrast, our technique initially constructs a differentially
private classifier from the private data and then uses the non-
2013 IEEE 13th International Conference on Data Mining Workshops
978-0-7695-5109-8/13 $31.00 © 2013 IEEE
DOI 10.1109/ICDMW.2013.131
841