Aggregate Suppression for Enterprise Search Engines
Mingyang Zhang
George Washington University
Washington, DC 20052, USA
mingyang@gwu.edu
Nan Zhang
∗
George Washington University
Washington, DC 20052, USA
nzhang10@gwu.edu
Gautam Das
†
University of Texas at Ar lington
Arlington, TX 76019, USA
gdas@uta.edu
ABSTRACT
Many enterprise websites provide search engines to facilitate cus-
tomer access to their underlying documents or data. With the web
interface of such a search engine, a customer can specify one or
a few keywords that he/she is interested in; and the search engine
returns a list of documents/tuples matching the user-specified key-
words, sorted by an often-proprietary scoring function.
It was traditionally believed that, because of its highly-restrictive
interface (i.e., keyword search only, no SQL-style queries), such a
search engine serves its purpose of answering individual keyword-
search queries without disclosing big-picture aggregates over the
data which, as we shall show in the paper, may incur significant
privacy concerns to the enterprise. Nonetheless, recent work on
sampling and aggregate estimation over a search engine’s corpus
through its keyword-search interface transcends this traditional be-
lief. In this paper, we consider a novel problem of suppressing
sensitive aggregates for enterprise search engines while maintain-
ing the quality of answers provided to individual keyword-search
queries. We demonstrate the effectiveness and efficiency of our
novel techniques through theoretical analysis and extensive experi-
mental studies.
Categories and Subject Descriptors
H.2.7 [Database Administration]; H.3.5 [Online Information Ser-
vices]: Web-based services
General Terms
Algorithms, Measurement, Performance
Keywords
Search Engine, Sampling, Aggregate Suppression
∗
Partially supported by NSF grants 0852674, 0915834, 1117297 and a
GWU Research Enhancement Fund.
†
Partially supported by NSF grants 0812601, 0915834, 1018865, a NHARP
grant from the Texas Higher Education Coordinating Board, and grants
from Microsoft Research and Nokia Research.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD ’12, May 20–24, 2012, Scottsdale, Arizona, USA.
Copyright 2012 ACM 978-1-4503-1247-9/12/05 ...$10.00.
1. INTRODUCTION
Enterprise Search Engine: With web-scale search engines (e.g.,
Google, Bing) becoming the de facto way for web users to lo-
cate and access resources of interest, many enterprises have coped
with the trend by offering their customers enterprise search en-
gines for accessing the large amounts of documents or (structured)
data inside the enterprise’s website. Examples of such enterprise
search engines range from keyword-based product search provided
by online retailers such as Amazon.com, to case search provided
by many government agencies such as the US patent office, and to
article search provided by almost all online content providers (e.g.,
Washington Post).
In general, the web interface of such a search engine offers a
text-box input through which a customer can specify one or a few
keywords that he/she is interested in. The search engine will re-
turn a list of documents or tuples that match the user-specified key-
words
1
. Because of the length limit of a return web page, not all
matching documents/tuples may be returned. Instead, the top-k re-
sults are selected according to a scoring function often proprietary
to the enterprise, and then returned to the customer.
Privacy Concerns on Sensitive Aggregates: While an enterprise
search engine is designed to answer individual search queries spec-
ified by its customers, recent sampling-based techniques have been
developed ( [9, 26]) that allow third party applications to issue queries
to the search interface (queries are randomly selected from a query
pool), and from the answers piece together big-picture aggregate
information about the underlying corpus. Although in some cases
it is advantageous to have such aggregate information disclosed,
there are others for which an enterprise would not willingly like to
disclose aggregate information through its search engine. To un-
derstand why, consider the following examples.
• Commercial Competition: Many online retailers allow a cus-
tomer to search for keywords in product reviews left by other
customers, so that the customer can quickly locate products
with desired properties. Nonetheless, the retailer may not
want a competitor to learn certain aggregate information which
places the retailer in a disadvantageous position for compe-
tition - e.g., if the competitor learns that the total number of
products in the retailer’s website which have ”poor quality"
in a user review is twice as large as the number in the com-
petitor’s own website, then the competitor may use the in-
1
For keyword search over structured data, note that while there has
been recent work [16] on supporting keyword search over the JOIN
of multiple tables, most real-world search engines simply consider
each tuple as a document consisting of all attribute values of the
tuple, and process the keyword-search query in (almost) the same
way as search over unstructured documents.