Desperately Seeking Cebuano
Douglas W. Oard, David Doermann, Bonnie Dorr, Daqing He, Philip Resnik, and Amy Weinberg
UMIACS, University of Maryland, College Park, MD 20742 USA
(oard,doermann,bonnie,resnik,weinberg)@umiacs.umd.edu
William Byrne, Sanjeev Khudanpur and David Yarowsky
CLSP, Johns Hopkins University, 3400 North Charles Street, Barton Hall, Baltimore, MD 21218
(byrne,khudanpur,yarowsky)@jhu.edu
Anton Leuski
USC Information Sciences Institute, 4676 Admiralty Way, Marina Del Rey, CA 90292
Abstract

At 4:13 A.M. Eastern Standard Time on Wednesday, March 5, Cebuano was designated as the language for the TIDES surprise language dry run. This paper reports the results of the first 60 hours of our data collection and implementation effort.

1 Introduction

The Los Angeles Times reported that at about 5:20 P.M. on Tuesday, March 4, 2003, a bomb concealed in a backpack exploded at the airport in Davao City, the second largest city in the Philippines. At least 23 people were reported dead, with more than 140 injured, and President Arroyo of the Philippines characterized the blast as a terrorist act (?). With the 13 hour time difference, it was then 4:20 A.M. on the same date in Washington, DC. Twenty-four hours later, at 4:13 A.M. on March 5, participants in the Translingual Information Detection, Extraction and Summarization (TIDES) program were notified that Cebuano had been chosen as the language of interest for a “surprise language” practice exercise that had been planned quite independently to begin on that date. The notification observed that Cebuano is spoken by 24and that it is the lingua franca in the south Philippines, where the event occurred.

One goal of the TIDES program is to develop the ability to rapidly deploy a broad array of language technologies for previously unforeseen languages in response to unexpected events. That capability will be formally exercised for the first time during June 2003, in a month-long “Surprise Language Experiment.” To prepare for that event, the Linguistic Data Consortium (LDC) organized a “dry run” for March 5-14 in order to refine their procedures for rapidly developing language resources of the type that the TIDES community will need during the July evaluation.

Development of interactive Cross-Language Information Retrieval (CLIR) systems that can be rapidly adapted to accommodate new languages has been the focus of extensive collaboration between the University of Maryland and The Johns Hopkins University. The capability for rapid development of necessary language resources is an essential part of that process, so we had been planning to participate in the surprise language dry run to refine our procedures for sharing those resources with other members of the TIDES community. Naturally, we chose CLIR as a driving application to focus our effort. Our goal, therefore, was to build an interactive system that would allow a searcher to pose English queries to find relevant Cebuano news articles from the period immediately following the bombing.

This paper describes the first 60 hours of our data collection and implementation efforts. The next section identifies the critical language resources needed for this application and describes our process for assembling and assessing those resources. Section 3 describes the design of our CLIR system and explains the process that we used to adapt that system to Cebuano. Section 4 presents the results of an initial usability study to explore the utility of our CLIR system to searchers with and without Cebuano language skills. Finally, the paper concludes with a brief discussion of our plans for further work with Cebuano and a recounting of some of the lessons that we have already learned.

2 Obtaining Language Resources

Our basic approach to development of an agile system for interactive CLIR relies on three strategies: (1) create an infrastructure in advance for English as a query language that makes only minimal assumptions about the document language; (2) leverage the asymmetry inherent in the problem by assembling strong resources for English in advance; and (3) develop a robust suite of capabilities to exploit any language resources that can be found for
the “surprise language.” We defer the first two topics to the next section, and focus here on the third.

We know of five possible sources of translation expertise:

Informants. People who know the language are an excellent source of insight, and universities are an excellent place to find people who know a wide array of languages. We were able to locate an informant within 50 feet of our office, and to schedule an interview within 36 hours of the announcement of the language.

Academic literature. Major research universities are also an excellent place to find written materials describing a broad array of languages. Within 12 hours of the announcement, reference librarians at the University of Maryland had identified a textbook on “Beginning Cebuano,” and we had located a copy at the University of Southern California. Together with the excellent electronic resources located by the LDC, this allowed us to begin development of a rudimentary stemmer.

Translation lexicons. Simple bilingual term lists are available for many language pairs. Using links provided by the LDC and our own Web searches, we were able to construct an English-Cebuano term list with over 14,000 translation pairs within 12 hours of the announcement. This largely duplicated a simultaneous effort at the LDC, and we later merged our term list with theirs.

Parallel text. Translation-equivalent documents, when aligned at the word level, provide an excellent source of information about not just possible translations, but their relative predominance. Within 24 hours of the announcement we had aligned Cebuano and English versions of the Holy Bible at the word level using Giza++. The Bible’s vocabulary covers only about half of the words found in typical English news text (counted by token), so it is useful to have additional sources of parallel text. For this reason, we have extended the previously developed STRAND system to locate likely translations in the Internet Archive, the largest collection of Web documents that is presently available for research use. The first results from that process were not yet available 60 hours after the announcement, when this paper was submitted.

Printed Dictionaries. People learning a new language make extensive use of bilingual dictionaries, so we have developed a system that mimics that process to some extent. Within 18 hours of the announcement we had zoned page images from a previously scanned Cebuano-English dictionary to identify each dictionary entry, performed optical character recognition, and parsed the entries to construct a bilingual term list. We were aided in this process by the fact that Cebuano is written in a Roman script, but our initial results were adversely affected by poor image quality. During the remaining 42 hours before submission of this paper, we obtained four additional printed dictionaries, broke their bindings and scanned them, and then performed entry zoning, OCR and entry parsing.

As this description illustrates, these five sources provide complementary information. Since there is some uncertainty at the outset about how long it will be before each delivers useful results, we chose a strategy based on concurrency, balancing our investment across the five sources. This allowed us to use whatever resources became available first to get an initial system running, with refinements subsequently being made as additional resources became available. Because Cebuano and English are written in the same script, we did not need character set conversion or phonetic cognate matching in this case. The interactive CLIR system described in the next section was therefore constructed using only English resources that were (or could have been) pre-assembled, a Cebuano-English bilingual term list, the rule-based stemmer that we constructed based on the academic literature and our discussion with our informant, and the Cebuano Bible.

3 Building a Cross-Language Retrieval System

Ideally, we would like to build a system that would find whatever documents the searcher would wish to read in a fully automatic mode. In practice, fully automatic search systems are imperfect even in monolingual applications. We therefore designed an interactive approach that functions something like a typical Web search engine: (1) the searcher poses their query in English, (2) the system ranks the Cebuano documents in decreasing order of likely relevance to the query, (3) the searcher examines a list of document titles in something approximating English, and (4) the searcher may optionally examine the full text of any document in something approximating English. The intent is to support an iterative process in which searchers learn to better express their query through experience. We are only able to provide very rough translations, so we expect that such a system would be used in an environment where searchers could send documents that appear promising off for professional translation when necessary.

At the core of our system is the capability to automatically rank Cebuano documents based on an English
query. We chose a query translation architecture using backoff translation (Resnik et al., 2001) and Pirkola’s structured query method (Pirkola, 1998), implemented using Inquery version 3.1p1. The key idea in backoff translation is to first try to find consecutive sequences of query words on the English side of the bilingual term list; where that fails, to try to find the surface form of each remaining English term; to fall back to stem matching when necessary; and ultimately to fall back to retaining the English term unchanged in the hope that it might be a proper name or some other form of cognate with Cebuano. Accents are stripped from the documents and all language resources to facilitate matching at that final step. The key idea behind Pirkola’s structured query method is to compute term weights in the query language (rather than in the document language) by separately estimating the term frequency and document frequency statistics for each query term based on that query term’s set of known translation alternatives from the bilingual term list. Our present system does not employ blind relevance feedback, which is known to significantly improve cross-language search performance, but potentially at the cost of less explainability, and hence less controllability, in interactive applications. Modern Web search engines typically omit this feature for a similar reason.

Although we have chosen techniques that are relatively robust and therefore require relatively little domain-specific tuning, stemmer design is an area of uncertainty that could adversely affect retrieval effectiveness. We therefore needed a test collection on which we could try out variants of the Cebuano stemmer. We built this test collection using 34,000 Cebuano Bible verses and 600 English questions that we found on the Web for which appropriate Bible verses were known. Each question was posed as a query using the batch mode of Inquery, and the rank of the known relevant verse was taken as a measure of effectiveness. We took the mean reciprocal rank (the inverse of the harmonic mean of the ranks) as a figure of merit for each configuration, and used a Wilcoxon paired signed-rank test (with p < 0.05) to assess the statistical significance of observed differences. Mean reciprocal rank is often used as a measure of effectiveness when modeling known-item retrieval tasks, and it has been found to be useful for detecting poor system configurations. The measure does not typically distinguish well among fairly similar systems, however.

The other key capability that is needed is title and document translation. We accomplish this in the simplest way possible: we reverse the bilingual term list, and we reverse the roles of Cebuano and English in the process described above for query translation. Our user interface is capable of displaying multiple translations for a single term (arranged horizontally for compact depiction or vertically for clearer depiction), but searchers can choose to display only the single best translation. When reliable translation probability statistics (from parallel text) are not available, we use the relative word unigram frequency of each translation of a Cebuano term in a representative English collection as a substitute for that probability.

Our query translation process can operate in a fully automatic mode, but in order to provide greater explainability (and thus improved controllability), we have also implemented an optional user-assisted query translation mode. When that mode is selected, the system displays each Cebuano translation for a query term and allows the searcher to deselect inappropriate translations. The meaning of each known translation is indicated in English by displaying either reverse translations (English words that share the same Cebuano translation) or an example of the usage of that translation (found in the term-aligned Bible) (?).

4 Usability Assessment

We would like to do a small usability study with one person who knows Cebuano but is not a search professional (our informant) and one professional searcher who does not know Cebuano (a reference librarian). But this paper is already too long, so you will have to come to the conference to hear the results!

5 Looking Ahead

Complete details will be available by the time of the conference.

Acknowledgment

The authors are grateful to Tim Hackman, Burcu Karagol-Ayan, Okan Kolak, Anton Leuski, Huanfeng Ma, Dan Melamed, Karen Patterson, Michael Subotin, Jianqiang Wang and the Linguistic Data Consortium for their assistance with this effort. This work has been supported in part by DARPA cooperative agreement N660010028910.

References

Ari Pirkola. 1998. The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55–63, August.

Philip Resnik, Douglas Oard, and Gina Levow. 2001. Improved cross-language retrieval using backoff translation. In First International Conference on Human Language Technologies. http://www.glue.umd.edu/~oard/research.html.
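
Illustrative sketch (not part of the system described in this paper). The figure of merit used in Section 3 to compare stemmer variants, mean reciprocal rank, can be computed in a few lines. The Python fragment below is a minimal sketch under stated assumptions: the function name and the example ranks are hypothetical, and the ranks are assumed to be 1-based positions of the single known relevant verse for each question. It also checks numerically that the measure equals the inverse of the harmonic mean of the ranks.

```python
from statistics import harmonic_mean


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries.

    `ranks` holds, for each query, the 1-based rank at which the single
    known relevant item was retrieved (hypothetical input format).
    """
    if not ranks:
        raise ValueError("need at least one rank")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks must be 1-based positive integers")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks for three questions (made up for illustration only).
example_ranks = [1, 2, 4]

mrr = mean_reciprocal_rank(example_ranks)
inverse_hm = 1.0 / harmonic_mean(example_ranks)

# Both values equal 7/12 (about 0.583) for these ranks.
print(f"MRR = {mrr:.3f}, 1 / harmonic mean = {inverse_hm:.3f}")
```

Because a reciprocal rank shrinks quickly as the rank grows, the measure is dominated by queries whose relevant item appears near the top of the list. This is consistent with the observation in Section 3 that it is useful for detecting poor configurations but does not distinguish well among fairly similar systems.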