Pascal Challenge

Evaluating Machine Learning for Information Extraction



Corpus

We will collect a corpus of 1100 conference workshop calls for papers (CFPs) from the Web; 600 will be annotated and 500 will be left unannotated. Workshops from a variety of fields will be sampled, e.g. Computer Science, Biomedical, Psychology. However, because such documents are more prevalent on the Web, the majority are likely to be Computer Science based. The exact task will be defined during the preparation phase, but we expect to require extraction of:

  • Name of Workshop
  • Acronym of Workshop
  • Date of Workshop
  • Location of Workshop
  • Name of Conference
  • Acronym of Conference
  • Homepage of Conference
  • Paper Submission Date (of Workshop)
  • Notification of Acceptance Date (of Workshop)
  • Paper Camera Ready Copy Date (of Workshop)
  • Programme Chair/Co-chairs of Workshop
  • Programme Chair/Co-chairs Affiliation

    In the preparation phase, we will define the exact experimental setup (both the numerical proportions between the training and test sets and the procedure adopted to select the documents). The experimental setup described below is representative of the direction of work, but further discussion is still needed. We will also specify all of the following: (1) the set of fields to extract, (2) the permitted number of fillers for each field, (3) whether a particular filler may occur several times in varying forms and (4) how stringently matches are evaluated (exact, overlap or contains).
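To make the three levels of match stringency concrete, the following is a minimal sketch of one possible reading of them. The definitions are assumptions made for illustration only: the actual semantics are to be fixed by the scorer configuration, and `Span` and `is_match` are hypothetical names, not part of the MUC scorer.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # offset of the first character (inclusive)
    end: int    # offset one past the last character (exclusive)


def is_match(predicted: Span, gold: Span, mode: str) -> bool:
    """Compare a predicted filler span with a gold span (assumed semantics)."""
    if mode == "exact":
        # Both boundaries must coincide.
        return predicted == gold
    if mode == "contains":
        # Assumption: the predicted span must cover the whole gold span.
        return predicted.start <= gold.start and gold.end <= predicted.end
    if mode == "overlap":
        # The two spans must share at least one character.
        return predicted.start < gold.end and gold.start < predicted.end
    raise ValueError(f"unknown matching mode: {mode!r}")


gold = Span(10, 20)
for predicted in (Span(10, 20), Span(8, 22), Span(15, 25), Span(20, 30)):
    results = {m: is_match(predicted, gold, m) for m in ("exact", "contains", "overlap")}
    print(predicted, results)
```

Under these assumed definitions, an exact match is also a contains match, and a contains match is also an overlap match, so the three modes run from strictest to most lenient.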

    We will define and implement an evaluation server for preliminary testing and for testing the final results. This server will be based on the MUC scorer (Douthat 1998). We will define the exact matching strategies by providing the configuration file for each of the tasks selected. Finally, we will set up a public location where people will be able to store new corpora and expected results, together with the guidelines to be strictly followed for the evaluation. This will guarantee a reliable comparison of the performance of different algorithms even after the PASCAL competition is over. Moreover, it will allow further fair evaluation settings.

    Corpora will be annotated using Melita (Ciravegna et al. 2002), an existing tool that is already in use in scientific and commercial evaluation. Inter-annotator agreement will be guaranteed by a procedure in which three annotators are given overlapping sets of 600 documents to annotate. Discrepancies in annotations (computed automatically by a program) will be discussed among the annotators. Annotation will be performed in stages (e.g. 30, 100, 300, 600 documents), with discussion of strategies and discrepancies after every stage.
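As an illustration of how discrepancies might be computed automatically, the sketch below reports every annotation that was not produced by all annotators of a document. It is not the program used for the corpus; the data layout, the field names and the function `find_discrepancies` are assumptions made for this example.

```python
# An annotation is assumed to be (field name, start offset, end offset).
Annotation = tuple[str, int, int]


def find_discrepancies(
    annotations_by_annotator: dict[str, set[Annotation]],
) -> dict[Annotation, list[str]]:
    """Return each annotation that some annotator missed, with its supporters."""
    annotators = sorted(annotations_by_annotator)
    all_annotations = set().union(*annotations_by_annotator.values())
    discrepancies = {}
    for annotation in sorted(all_annotations):
        supporters = [a for a in annotators if annotation in annotations_by_annotator[a]]
        if len(supporters) < len(annotators):
            discrepancies[annotation] = supporters
    return discrepancies


# Hypothetical annotations of one document by three annotators.
document = {
    "A": {("workshop_date", 120, 135), ("workshop_acronym", 10, 15)},
    "B": {("workshop_date", 120, 135), ("workshop_acronym", 10, 16)},
    "C": {("workshop_date", 120, 135), ("workshop_acronym", 10, 15)},
}
for annotation, supporters in find_discrepancies(document).items():
    print(annotation, "marked only by", supporters)
```

Here the date is agreed by all three annotators and is not reported, while the two competing acronym boundaries are listed for discussion.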

    Before the evaluation begins, the corpora will be preprocessed using the existing GATE NLP system: documents will be tokenized and annotated with POS tags, gazetteer information and named entities. The different algorithms will have to use this preprocessed data. This ensures that they all have access to the same information; in this way, we believe that we will be able to measure each algorithm's ability on a fair and equal basis, as already done in other evaluations such as CoNLL. Moreover, this will allow researchers to concentrate on the task of learning without having to spend time on the linguistic pre-processor. We also believe that this will enable researchers with limited or no knowledge of language analysis to participate in the task: they will not risk being penalised for their inability to define a good linguistic pre-processor. The pre-processing results will be provided as produced by the system; no human correction will be performed. This is to allow for the noise found in real application environments.


    Corpus - Version 1.5

    This is the latest release of the corpus. I would be grateful if you would report any errors you discover in the corpus to me, Neil Ireson.

    The Workshop CFP corpus consists of 1100 documents divided into three parts:

    Training Document Set: 400 documents divided into 4 sets of 100. Each set is further subdivided into 10 subsets of 10 documents. For convenience, the documents are provided both with (".key") and without (".txt") the CFP annotation, and with and without the GATE annotations. The documents are named in the following format: "versionnumber-setnumber-subsetnumber-acronym_year[-gate].[txt|key]"
    Enrich Document Set: 500 documents divided into 2 sets of 250. The first set contains unannotated workshop CFPs and the second set unannotated conference CFPs.
    Test Document Set: 200 unannotated documents will be provided for a final test in September 2004.

    Click on each of the active links to download the corpus files.
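The file-naming format of the training documents can be split into its parts with a regular expression. The following is a minimal sketch, assuming that the version number contains no hyphen, that the set and subset numbers and the year are written as digits, and that the acronym is everything before the final underscore. The function name and the example file names are hypothetical and are not taken from the corpus.

```python
import re
from typing import Optional

# versionnumber-setnumber-subsetnumber-acronym_year[-gate].[txt|key]
FILENAME_RE = re.compile(
    r"^(?P<version>[^-]+)"            # e.g. "1.5" (assumed to contain no hyphen)
    r"-(?P<set>\d+)"                  # set number
    r"-(?P<subset>\d+)"               # subset number
    r"-(?P<acronym>.+)_(?P<year>\d+)" # acronym, then the year after the last "_"
    r"(?P<gate>-gate)?"               # present when GATE annotations are included
    r"\.(?P<ext>txt|key)$"            # "key" files carry the CFP annotation
)


def parse_name(filename: str) -> Optional[dict]:
    """Split a document file name into its components, or return None."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {
        "version": match["version"],
        "set": int(match["set"]),
        "subset": int(match["subset"]),
        "acronym": match["acronym"],
        "year": match["year"],
        "has_gate": match["gate"] is not None,
        "has_cfp_annotation": match["ext"] == "key",
    }


print(parse_name("1.5-1-3-abc_2004-gate.key"))  # hypothetical file name
```

Returning `None` for names that do not fit the pattern lets a caller skip any other files that may sit alongside the documents in an archive.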

    Training Document Set

    Training Documents: WorkshopCFP-1.5-train.tar.gz
    Training Documents (with CFP annotation): WorkshopCFP-1.5-train-key.tar.gz
    Training Documents (with GATE annotation): WorkshopCFP-1.5-train-gate.tar.gz
    Training Documents (with CFP & GATE annotation): WorkshopCFP-1.5-train-key-gate.tar.gz

    Enrich Document Set

    Enrich Workshop Documents: WorkshopCFP-1.5-enrich.tar.gz
    Enrich Workshop Documents (with GATE annotation): WorkshopCFP-1.5-enrich-gate.tar.gz
    Enrich Conference Documents: ConferenceCFP-1.5-enrich.tar.gz
    Enrich Conference Documents (with GATE annotation): ConferenceCFP-1.5-enrich-gate.tar.gz

    Test Document Set

    Test Documents: WorkshopCFP-1.5-test.tar.gz
    Test Documents (with GATE annotation): WorkshopCFP-1.5-test-gate.tar.gz




    created and maintained by Neil Ireson