

On this page we make available "corrected" versions of some standard reference corpora for information extraction (IE), e.g., Seminar Announcements, Job Postings, and Corporate Acquisitions, some of which are already available in the RISE repository [1]. This effort is part of an activity related to the evaluation methodology for IE [2,3] carried out by Mary Elaine Califf (Illinois State University), Fabio Ciravegna (University of Sheffield), Dayne Freitag (Fair Isaac Corporation), Nick Kushmerick (University College Dublin), and the Dot.Kom group at ITC-irst (i.e., Claudio Giuliano, Alberto Lavelli and Lorenza Romano).

Seminar Announcements

Fabio Ciravegna and Leon Peshkin kindly provided their own "corrected" versions of Seminar Announcements, which were the starting point for the revision process. Currently, the second revised (and XML-compliant) version of the Seminar Announcements is available (more details about the changes with respect to the version available in RISE can be found below):

  • Seminar Announcements v1.2: version 1.2 was produced by Dayne Freitag and merged the slightly divergent versions available as v1.1 (see below). Main changes with respect to version v1.1 (further details can be found in the README.txt file available in the distribution):
    • the Windows convention of naming files was adopted
    • all <sentence> and <paragraph> tags were stripped from the corpus
    • the documents were made XML-compliant
  • Seminar Announcements v1.1: the changes in this version with respect to version v1.0 (i.e., the RISE version) consist only of corrections of obvious errors (more details about the corrections)
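
For readers who want to apply the same kind of clean-up to a v1.1-style document, the v1.2 changes listed above (stripping <sentence> and <paragraph> tags and checking XML compliance) can be sketched in Python as follows. This is only an illustrative sketch, not the procedure actually used to produce v1.2: the function names, the wrapping root element "doc", and the sample text are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches opening and closing <sentence> / <paragraph> tags only;
# all other tags (e.g. field annotations) are left untouched.
_TAG_RE = re.compile(r"</?(?:sentence|paragraph)>")


def strip_structure_tags(text: str) -> str:
    """Remove <sentence> and <paragraph> tags, keeping their contents."""
    return _TAG_RE.sub("", text)


def is_well_formed(text: str, root: str = "doc") -> bool:
    """Check well-formedness after wrapping the text in a root element.

    The root element name is hypothetical; it is needed only because an
    XML document must have exactly one top-level element.
    """
    try:
        ET.fromstring(f"<{root}>{text}</{root}>")
        return True
    except ET.ParseError:
        return False


# Invented sample text, not taken from the corpus.
sample = (
    "<paragraph><sentence>Talk by <speaker>Jane Doe</speaker> "
    "at <stime>3:30 PM</stime>.</sentence></paragraph>"
)
cleaned = strip_structure_tags(sample)
print(cleaned)
print(is_well_formed(cleaned))
```

A regular expression is enough here because the two tags are assumed to carry no attributes; documents with attributes or unescaped characters such as "&" would need additional handling before they parse.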

Corporate Acquisitions

  • Acquisitions v1.1: the main change is that the documents are now XML-compliant. Please note that this dataset was not available in the RISE repository.

Italian Legal Corpus

The corpus is composed of 197 sentences from the Corte di Cassazione (the Italian High Court) in HTML format (2500-3000 words each), downloaded from the public site of the Italian Ministry of Justice between 2000 and 2003. This corpus has been fully annotated, using Melita, according to a legal ontology dealing with laws, sentences, tribunals, judges, etc. The corpus may be made available upon request.


[1] RISE. A Repository of Online Information Sources Used in Information Extraction Tasks. Information Sciences Institute / USC, 1998.

[2] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano. IE evaluation: Criticisms and recommendations. In Proceedings of the AAAI-04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004), San Jose, California, 26 July 2004.

[3] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano. A Critical Survey of the Methodology for IE Evaluation. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 26-28 May 2004.

created and maintained by José Iria