抑制

ID：32533

阅读量：0

大小：0.54 MB

页数：36页

时间：2023-01-29

金币：5

上传者：战必胜

Aggregate Suppression for Enterprise Search Engines

Mingyang Zhang

George Washington University

Washington, DC 20052, USA

mingyang@gwu.edu

Nan Zhang

∗

George Washington University

Washington, DC 20052, USA

nzhang10@gwu.edu

Gautam Das

†

University of Texas at Ar lington

Arlington, TX 76019, USA

gdas@uta.edu

ABSTRACT

Many enterprise websites provide search engines to facilitate cus-

tomer access to their underlying documents or data. With the web

interface of such a search engine, a customer can specify one or

a few keywords that he/she is interested in; and the search engine

returns a list of documents/tuples matching the user-speciﬁed key-

words, sorted by an often-proprietary scoring function.

It was traditionally believed that, because of its highly-restrictive

interface (i.e., keyword search only, no SQL-style queries), such a

search engine serves its purpose of answering individual keyword-

search queries without disclosing big-picture aggregates over the

data which, as we shall show in the paper, may incur signiﬁcant

privacy concerns to the enterprise. Nonetheless, recent work on

sampling and aggregate estimation over a search engine’s corpus

through its keyword-search interface transcends this traditional be-

lief. In this paper, we consider a novel problem of suppressing

sensitive aggregates for enterprise search engines while maintain-

ing the quality of answers provided to individual keyword-search

queries. We demonstrate the effectiveness and efﬁciency of our

novel techniques through theoretical analysis and extensive experi-

mental studies.

Categories and Subject Descriptors

H.2.7 [Database Administration]; H.3.5 [Online Information Ser-

vices]: Web-based services

General Terms

Algorithms, Measurement, Performance

Keywords

Search Engine, Sampling, Aggregate Suppression

∗

Partially supported by NSF grants 0852674, 0915834, 1117297 and a

GWU Research Enhancement Fund.

†

Partially supported by NSF grants 0812601, 0915834, 1018865, a NHARP

grant from the Texas Higher Education Coordinating Board, and grants

from Microsoft Research and Nokia Research.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for proﬁt or commercial advantage and that copies

bear this notice and the full citation on the ﬁrst page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior speciﬁc

permission and/or a fee.

SIGMOD ’12, May 20–24, 2012, Scottsdale, Arizona, USA.

1. INTRODUCTION

Enterprise Search Engine: With web-scale search engines (e.g.,

Google, Bing) becoming the de facto way for web users to lo-

cate and access resources of interest, many enterprises have coped

with the trend by offering their customers enterprise search en-

gines for accessing the large amounts of documents or (structured)

data inside the enterprise’s website. Examples of such enterprise

search engines range from keyword-based product search provided

by online retailers such as Amazon.com, to case search provided

by many government agencies such as the US patent ofﬁce, and to

article search provided by almost all online content providers (e.g.,

Washington Post).

In general, the web interface of such a search engine offers a

text-box input through which a customer can specify one or a few

keywords that he/she is interested in. The search engine will re-

turn a list of documents or tuples that match the user-speciﬁed key-

words

. Because of the length limit of a return web page, not all

matching documents/tuples may be returned. Instead, the top-k re-

sults are selected according to a scoring function often proprietary

to the enterprise, and then returned to the customer.

Privacy Concerns on Sensitive Aggregates: While an enterprise

search engine is designed to answer individual search queries spec-

iﬁed by its customers, recent sampling-based techniques have been

developed ( [9, 26]) that allow third party applications to issue queries

to the search interface (queries are randomly selected from a query

pool), and from the answers piece together big-picture aggregate

information about the underlying corpus. Although in some cases

it is advantageous to have such aggregate information disclosed,

there are others for which an enterprise would not willingly like to

disclose aggregate information through its search engine. To un-

derstand why, consider the following examples.

• Commercial Competition: Many online retailers allow a cus-

tomer to search for keywords in product reviews left by other

customers, so that the customer can quickly locate products

with desired properties. Nonetheless, the retailer may not

want a competitor to learn certain aggregate information which

places the retailer in a disadvantageous position for compe-

tition - e.g., if the competitor learns that the total number of

products in the retailer’s website which have ”poor quality"

in a user review is twice as large as the number in the com-

petitor’s own website, then the competitor may use the in-

For keyword search over structured data, note that while there has

been recent work [16] on supporting keyword search over the JOIN

of multiple tables, most real-world search engines simply consider

each tuple as a document consisting of all attribute values of the

tuple, and process the keyword-search query in (almost) the same

way as search over unstructured documents.

469

资源描述：

当前文档最多预览五页，下载文档查看全文

侵权申诉



1 1 2 3 4 5 / 36



此文档下载收益归作者所有

当前文档最多预览五页，下载文档查看全文

版权提示

温馨提示：
1. 部分包含数学公式或PPT动画的文件，查看预览时可能会显示错乱或异常，文件下载后无此问题，请放心下载。
2. 本文档由用户上传，版权归属用户，天天文库负责整理代发布。如果您对本文档版权有争议请及时联系客服。
3. 下载前请仔细阅读文档内容，确认文档内容符合您的需求后进行下载，若出现内容与标题不符可向本站投诉处理。
4. 下载文档时可能由于网络波动等原因无法下载或下载错误，付费完成后未能成功下载的用户请联系客服处理。

大家都在看

近期热门

抑制

最近更新

大家都在看

相关文章

相关标签