随机化_一般化

ID:32523

大小:0.42 MB

页数:12页

时间:2023-01-29

金币:5

上传者:战必胜
Randomized Generalization for Aggregate Suppression
Over Hidden Web Databases
Xin Jin, Nan Zhang
George Washington University
{xjin, nzhang10}@gwu.edu
Aditya Mone, Gautam Das
University of Texas at Arlington
{aditya.mone, gdas}@uta.edu
ABSTRACT
Many web databases are hidden behind restrictive form-like inter-
faces which allow users to execute search queries over the under-
lying hidden database. While it is important to support such search
queries, many hidden database owners also want to maintain a cer-
tain level of privacy for aggregate information over their databases,
for reasons including business secrets and homeland security. Ex-
isting work on aggregate suppression thwarts the uniform random
sampling of a hidden database, but cannot address recently pro-
posed attacks which accurately estimate SUM and COUNT queries
without the need to first draw a uniform random sample. In this pa-
per, we consider the problem of suppressing SUM and COUNT
queries over a hidden database. In particular, we develop ran-
domized generalization, a novel technique which provides rigid
aggregate-suppression guarantee while maintaining the utility of
individual search queries. We present theoretical analysis and ex-
tensive experiments to illustrate the effectiveness of our approach.
1. INTRODUCTION
1.1 Problem Definition and Motivation
Hidden web databases provide form-like search interfaces that
allow users to issue search queries by specifying desired attribute
values of the sought-after tuple(s), and the system responds by re-
turning a few (e.g., top-k) tuples that satisfy the selection condi-
tions, sorted by a suitable scoring function. Here are two examples:
Example 1: Many online auto dealers (e.g.,Yahoo! Autos) offer
a web form that lets a user choose values for attributes such as
ZIP code, car model, mileage, age, price, etc. The top-k answers,
sorted according to a scoring function such as distance or price,
are presented to the user, where k is a small constant such as 50.
Example 2: Almost all major airline companies (e.g., American
Airlines) provide form-like interfaces for a user to search for flights.
Partially supported by NSF grants 0852673, 0852674, 0845644, 0915834
and a GWU Research Enhancement Fund.
Partially supported by NSF grants 0812601, 0915834, 1018865, a NHARP
grant from the Texas Higher Education Coordinating Board, and grants
from Microsoft Research and Nokia Research.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 37th International Conference on Very Large Data Bases,
August 29th - September 3rd 2011, Seattle, Washington.
Proceedings of the VLDB Endowment, Vol. 4, No. 11
Copyright 2011 VLDB Endowment 2150-8097/11/08... $ 10.00.
American Airlines’ website, for example, lets a user choose values
for six attributes: departure city, arrival city, number of stops, de-
parture date, cabin, and carrier. The top-50 answers are returned.
Note that while a hidden web database supports individual search
queries, it does not allow aggregate queries, e.g., the total number
of used hybrid cars in an online dealer’s database, to be executed
through the interface
1
. Because of the top-k constraint, such aggre-
gates cannot be inferred from a search query answer either. Thus,
it was traditionally believed that aggregate information is shielded
from external applications by the restrictive web interface and can-
not be efficiently retrieved without first crawling a large number
of tuples from the hidden database. Nonetheless, our recent work
(e.g., [7, 16]) shows that, through a carefully designed workload of
search queries, one can indeed leverage the public web interface to
accurately estimate aggregate information over a hidden database.
A recent paper introduced the novel challenge of privacy preser-
vation of aggregates in such hidden databases [8]. This paper ob-
served that while it is important for owners of hidden databases
to support search queries outlined in the above examples, many
of them also want to maintain a certain level of privacy for ag-
gregates over their hidden databases (e.g., as business secrets). In
Example 1, the auto dealer would not be willing to disclose ag-
gregate information that enables competitors to infer its inventory,
e.g., that a certain popular car is in short supply at the dealership
around the country. If the competitors were able to infer such ag-
gregate information, then they would be able to take advantage of
the low inventory by a multitude of tactics (e.g., stock that vehicle,
make appropriate adjustments to price). Likewise, in Example 2,
airline companies, for both security and business reasons, do not
want to make public information that allows terrorists to predict
which flight on which date is more likely to be relatively empty. In
recent hijackings such as 9/11 and Russian aircraft bombing, ter-
rorists’ tactics are believed to be hijacking a relatively empty flight
so there would be less resistance from the passengers.
This novel notion of aggregates suppression is in sharp contrast
to traditional privacy scenarios, where individual tuple needs to be
protected but aggregate information that does not lead to inference
to an individual’s information is considered acceptable disclosure.
As argued in [8], traditional tuple-wise privacy preservation tech-
niques such as encryption [1], data perturbation [2] and general-
ization methods [22] cannot be used for aggregates suppression as
obfuscating individual data tuples is not an option since these tuples
must be made visible to normal search users.
The defense mechanism COUNTER-SAMPLER proposed in [8]
represented an important first step towards addressing this impor-
tant problem. However, it suffers from two crucial shortcomings:
1
Note that even interfaces which choose to display certain aggregates (e.g.,
COUNT) often only offer coarse estimations (e.g., the COUNT returned by
Google is notoriously inaccurate).
资源描述:

当前文档最多预览五页,下载文档查看全文

此文档下载收益归作者所有

当前文档最多预览五页,下载文档查看全文
温馨提示:
1. 部分包含数学公式或PPT动画的文件,查看预览时可能会显示错乱或异常,文件下载后无此问题,请放心下载。
2. 本文档由用户上传,版权归属用户,天天文库负责整理代发布。如果您对本文档版权有争议请及时联系客服。
3. 下载前请仔细阅读文档内容,确认文档内容符合您的需求后进行下载,若出现内容与标题不符可向本站投诉处理。
4. 下载文档时可能由于网络波动等原因无法下载或下载错误,付费完成后未能成功下载的用户请联系客服处理。
关闭