Workshop Overview

Traditional approaches to the development and evaluation of Information Extraction (IE) systems have relied on relatively small collections of up to a few hundred documents tagged with detailed semantic annotations. While this paradigm has enabled rapid advances in IE technology, it remains constrained by its dependence on annotated documents and does not make use of the information available in large corpora. Alternative approaches, which make use of large text collections and inter-document information, are now beginning to emerge, in parallel with growing interest in learning from unlabelled data in AI in general. For example, some systems learn extraction patterns by exploiting information about their distribution across corpora; others exploit the redundancy of the Internet by assuming that facts with multiple mentions are more reliable. These approaches require large amounts of unannotated text, which is generally easy to obtain, and employ unsupervised or minimally supervised learning algorithms, as well as related techniques such as co-training and active learning. They are complementary to the established IE paradigm based on supervised training and are forming a cohesive trend in recent research. They constitute the focus of this workshop.

There are several advantages to employing large text collections for IE. They provide enormous amounts of training data, albeit mostly unannotated. Facts can be extracted from, or verified across, multiple documents. Large text collections often contain vast amounts of redundancy in the form of multiple references to or mentions of closely related facts. Redundancy can be exploited in the IE setting to identify trends and patterns within the text, e.g., by means of Data Mining techniques.

This workshop invites new, original work on learning extraction rules or identifying facts across document boundaries while exploiting sizable amounts of unlabelled text in the training stage, in the extraction stage, or both. The workshop hopes to bring together researchers from various related areas, such as Information Extraction, Data Mining, biomedical text processing, Question Answering, Information Retrieval, Machine Learning, identification of lexical relations (hyponymy, meronymy, etc.), multilingual text processing and the Semantic Web. It solicits papers on all relevant aspects, including algorithms, techniques and applications.

Topics of particular interest include:

Workshop Organizers

Mary Elaine Califf (Illinois State University)
Mark A. Greenwood (University of Sheffield)
Mark Stevenson (University of Sheffield)
Roman Yangarber (University of Helsinki)

Program Committee

Markus Ackermann (University of Leipzig)
Amit Bagga (AskJeeves)
Roberto Basili (University of Rome, Tor Vergata)
Antal van den Bosch (Tilburg University)
Neus Catala (Universitat Politècnica de Catalunya)
Walter Daelemans (University of Antwerp)
Jenny Rose Finkel (Stanford University)
Robert Gaizauskas (University of Sheffield)
Ralph Grishman (NYU)
Takaaki Hasegawa (NTT)
Heng Ji (NYU)
Nick Kushmerick (University College Dublin, Ireland)
Alberto Lavelli (ITC-IRST, Italy)
Gideon Mann (Johns Hopkins University)
Ion Muslea (Language Weaver Inc.)
Chikashi Nobata (Sharp, Japan)
Ellen Riloff (University of Utah)
Stephen Soderland (University of Washington)
Yorick Wilks (University of Sheffield)