
IE CORPORA
On this page we make available "corrected" versions of some
standard reference corpora for information extraction (IE), e.g., Seminar
Announcements, Job Postings, and Corporate Acquisitions, some of which are
already available in the RISE repository [1]. This effort is part of an
activity on the evaluation methodology for IE [2,3] carried out by
Mary Elaine Califf (Illinois State University), Fabio Ciravegna
(University of Sheffield), Dayne Freitag (Fair Isaac Corporation),
Nick Kushmerick (University College Dublin), and the Dot.Kom group at
ITC-irst (i.e., Claudio Giuliano, Alberto Lavelli and Lorenza Romano).
Seminar Announcements
Fabio Ciravegna and Leon Peshkin kindly provided their own
"corrected" versions of Seminar Announcements, which were
the starting point for the revision process. Currently, the second
revised (and XML-compliant) version of the Seminar Announcements is
available (more details about the changes with respect to the version
available in RISE can be found below):
- Seminar Announcements v1.2: version 1.2
was produced by Dayne Freitag and merged the slightly divergent
versions available as v1.1 (see below). Main changes with respect to
version v1.1 (further details can be found in the
README.txt file available in the distribution):
  - the Windows convention of naming files was adopted
  - all <sentence> and <paragraph> tags were stripped from the corpus
  - the documents were made XML-compliant
- Seminar Announcements v1.1: the changes in this version with
respect to version v1.0 (i.e., the RISE version) consist only of
corrections of obvious errors (more
details about the corrections)
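As an illustration of the v1.2 changes listed above, the following is a minimal sketch of how <sentence> and <paragraph> tags could be stripped from a document and how XML compliance could then be checked. It is not the procedure actually used to produce the distribution. The helper names, the <doc> root element and the sample text are assumptions made for this example only.

```python
import re
import xml.etree.ElementTree as ET

# Matches opening and closing <sentence> / <paragraph> tags (with any attributes).
# The tag names come from the change list; everything else here is illustrative.
LAYOUT_TAG_RE = re.compile(r"</?(?:sentence|paragraph)\b[^>]*>")


def strip_layout_tags(text: str) -> str:
    """Remove <sentence> and <paragraph> tags while keeping the text they enclose."""
    return LAYOUT_TAG_RE.sub("", text)


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample document; the <doc> root is an assumption of this sketch.
sample = (
    "<doc><paragraph><sentence>Talk by <speaker>Jane Doe</speaker> "
    "at <stime>3:30 PM</stime>.</sentence></paragraph></doc>"
)
cleaned = strip_layout_tags(sample)
print(cleaned)
print(is_well_formed(cleaned))
```

A well-formedness check of this kind only confirms that the annotation tags are properly nested and closed and that special characters are escaped; it does not validate the annotations themselves.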
Corporate Acquisitions
- Acquisitions v1.1: the main
change is that the documents are now XML-compliant. Please note that
this dataset was not available in the RISE repository.
Italian Legal Corpus
The corpus is composed of 197 sentences (i.e., court rulings) from the Corte di Cassazione (the Italian High Court) in HTML format (2500-3000 words each),
downloaded from the public site of the Italian Ministry of Justice between 2000 and 2003. The corpus has been fully annotated,
using Melita, according to a legal ontology dealing with laws, sentences, tribunals, judges, etc. The corpus may be made available
upon request.
References
[1] RISE. A Repository of Online Information Sources Used in Information Extraction Tasks.
Information Sciences Institute / USC, 1998.
[2] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne
Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano.
IE evaluation: Criticisms and recommendations.
In Proceedings of the AAAI-04
Workshop on Adaptive Text Extraction and Mining (ATEM-2004),
San Jose, California, 26 July 2004.
[3] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne
Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano.
A Critical Survey of the Methodology for IE Evaluation.
In Proceedings of the
4th International Conference on Language Resources and Evaluation (LREC 2004),
Lisbon, Portugal, 26-28 May 2004.