Approved for Public Release 15-3236; ©2015-The MITRE Corporation. All rights reserved.
Dropped Pronoun Recovery in Chinese SMS
Chris Giannella and Ransom Winder
The MITRE Corporation
7515 Colshire Drive
McLean, VA 22102, USA
{cgiannella,rwinder}@mitre.org
Department of Linguistics
Georgetown University
3700 O Street NW
Washington, DC 20057, USA
sjp62@georgetown.edu
Abstract
In written Chinese, personal pronouns are commonly dropped when they can be
inferred from context. This practice is particularly common in informal genres like
Short Message Service (SMS) messages sent via cell phones. Restoring dropped
personal pronouns can be a useful preprocessing step for information extraction.
Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting
dropped personal pronoun slots and (2) determining the identity of the pronoun for
each slot. We address a simpler version of restoring dropped personal pronouns
wherein only the person numbers are identified. After applying a word segmenter, we
used a linear-chain conditional random field (CRF) to predict which words were at the
start of an independent clause. Then, using the independent clause start information,
as well as lexical and syntactic information, we applied a CRF or a maximum-entropy
classifier to predict whether a dropped personal pronoun immediately preceded each
word and, if so, the person number of the dropped pronoun. We conducted a series of
experiments using a manually annotated corpus of Chinese SMS messages. Our
machine-learning–based approaches substantially outperformed a rule-based
approach based partially on rules developed by Chung and Gildea in 2010. Features
derived from parsing did not help our approaches. We conclude that the parse
information is largely superfluous for identifying dropped personal pronouns if
reasonably accurate independent clause start information is available.
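The per-word labeling scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the label names (NONE, 2SG), the feature names, and the helper functions token_features and restore are hypothetical, and the CRF or maximum-entropy classifier itself is omitted. The sketch only shows how independent clause start predictions could be combined with lexical context as features, and how per-word labels could be turned back into text with person-number placeholders.

```python
from typing import Dict, List

NONE = "NONE"  # hypothetical label: no dropped pronoun precedes this word


def token_features(words: List[str], clause_starts: List[bool], i: int) -> Dict[str, object]:
    """Build a feature dict for word i from lexical context and the
    (assumed, already predicted) independent-clause-start flag."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "clause_start": clause_starts[i],
    }


def restore(words: List[str], labels: List[str]) -> List[str]:
    """Insert a placeholder such as <2SG> before each word whose label
    says a dropped pronoun immediately precedes it."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    restored: List[str] = []
    for word, label in zip(words, labels):
        if label != NONE:
            restored.append(f"<{label}>")
        restored.append(word)
    return restored


# Hypothetical segmented message with a dropped second-person subject.
words = ["吃", "了", "吗"]
clause_starts = [True, False, False]      # assumed stage-one output
labels = ["2SG", NONE, NONE]              # assumed stage-two output
features = [token_features(words, clause_starts, i) for i in range(len(words))]
restored = restore(words, labels)
```

Because only the person number is predicted, the sketch inserts a tag rather than a specific pronoun form.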
1. Introduction
Chinese is commonly characterized as a “pro-drop” language (Baran, Yang, & Nianwen,
2012; Huang, 1989) because pronouns are commonly dropped when they can be
inferred from context. This practice is particularly common in informal genres like
This author’s work was carried out while she was a summer intern at the MITRE Corporation.