
Fabio Ciravegna, Department of Computer Science, University of Sheffield



An Adaptive Information Extraction Tool for the Semantic Web

Amilcare is an adaptive Information Extraction (IE) system designed to support document annotation in the Semantic Web (SW) framework. Adaptive means that it uses machine learning to adapt to new applications and domains. It is rule-based: its learning algorithm induces rules that extract information. Rules are learnt by generalising over a set of examples found in a training corpus annotated with XML tags, so the system learns to reproduce such annotation via Information Extraction.

Amilcare works in three modes of operation: training mode, test mode and production mode.

The training mode is used to induce rules, i.e. to learn how to perform IE in a specific application scenario. Input in training mode is: (1) a scenario (e.g. an ontology in the SW); (2) a training corpus annotated with the information to be extracted. Output of the training phase is a set of rules able to reproduce the annotation on texts of the same type.
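To make the training input more concrete, here is a minimal sketch of how a text annotated with inline XML tags could be split into plain text plus annotation spans. It is not Amilcare's own corpus reader: the function name `parse_annotated`, the tag names `speaker` and `stime`, and the restriction to non-nested tags without attributes are all assumptions made for illustration.

```python
import re

# Matches a non-nested inline annotation such as <speaker>Jane Doe</speaker>.
# Assumption: tags carry no attributes and do not nest.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_annotated(text):
    """Return (plain_text, spans); each span is (tag, start, end) in plain_text."""
    plain_parts = []
    spans = []
    pos = 0      # read position in the tagged text
    offset = 0   # write position in the plain text
    for m in TAG_RE.finditer(text):
        before = text[pos:m.start()]
        plain_parts.append(before)
        offset += len(before)
        content = m.group(2)
        spans.append((m.group(1), offset, offset + len(content)))
        plain_parts.append(content)
        offset += len(content)
        pos = m.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), spans

# Hypothetical training sentence; the tag names are illustrative only.
example = "Talk by <speaker>Jane Doe</speaker> at <stime>3 pm</stime>."
plain, spans = parse_annotated(example)
for tag, start, end in spans:
    print(tag, repr(plain[start:end]))
```

Each span pairs a tag with character offsets into the untagged text, which is the kind of example a rule learner could generalise over.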

The test mode is used to test the induced rules on an unseen tagged corpus, in order to understand how well the system performs for a specific application. When running in test mode, Amilcare first removes all the annotations from the corpus, then re-annotates the corpus using the induced rules. Finally, the results are automatically compared with the original annotations and presented to the user. Output of the test phase is: (1) the corpus re-annotated by the system; (2) a set of accuracy statistics on the test corpus: recall, precision and details of the mistakes the system makes. During testing it is possible to retrain the learner with different system parameters in order to tune its accuracy (e.g. to obtain more recall and/or more precision). Tuning takes only a fraction of the time required for training.
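The comparison between system annotations and original annotations can be illustrated with a small scoring sketch. The page does not say how Amilcare matches annotations, so exact matching on (tag, start, end) is an assumption here, and the function name `score` and the sample spans are hypothetical.

```python
def score(gold, predicted):
    """Compare two sets of (tag, start, end) spans using exact matching.

    Returns precision, recall, and the two kinds of mistakes:
    spurious (predicted but not in gold) and missed (in gold but not predicted).
    """
    gold, predicted = set(gold), set(predicted)
    correct = gold & predicted
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall, sorted(predicted - gold), sorted(gold - predicted)

# Hypothetical original annotations and system output for one document.
gold = {("speaker", 8, 16), ("stime", 20, 24)}
predicted = {("speaker", 8, 16), ("stime", 20, 22), ("location", 30, 35)}

precision, recall, spurious, missed = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
print("spurious:", spurious)
print("missed:", missed)
```

Precision falls when the system proposes annotations that are not in the original corpus, and recall falls when it misses original annotations, which is the trade-off that tuning adjusts.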

The production mode is used when an application is released. Amilcare annotates the provided documents. If a user is available to revise its results, the learner uses the user's corrections to retrain. The training, test and production modes can be interleaved to produce annotation based on active learning, in which user annotation and system annotation alternate in order to minimize the amount of user annotation. This greatly reduces the burden of document annotation.
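The interleaving of system annotation and user correction can be sketched as a simple loop. This is a toy illustration only: the "learner" below just memorises annotated strings and stands in for Amilcare's rule induction, the user is simulated by a list of correct annotations, and all names (`train`, `annotate`, `interleaved_annotation`) are hypothetical.

```python
def train(examples):
    """Toy learner: remember every (tag, string) pair confirmed by the user."""
    return {pair for doc in examples for pair in doc}

def annotate(rules, text):
    """Propose every remembered string that occurs in the text."""
    return {(tag, s) for (tag, s) in rules if s in text}

def interleaved_annotation(docs, user_gold):
    """Process documents one by one; the simulated user only fixes system errors.

    Returns the number of user actions (deletions plus additions) per document.
    """
    corrected = []   # annotations confirmed by the user so far
    effort = []
    for text, gold in zip(docs, user_gold):
        rules = train(corrected)            # retrain on all corrections so far
        proposed = annotate(rules, text)    # system annotation
        effort.append(len(proposed ^ gold)) # user removes wrong, adds missing
        corrected.append(gold)
    return effort

# Hypothetical documents and the annotations a user would accept for each.
docs = [
    "Talk by Jane Doe in Room 12.",
    "Jane Doe speaks again in Room 12.",
    "Seminar by John Roe in Room 12.",
]
user_gold = [
    {("speaker", "Jane Doe"), ("location", "Room 12")},
    {("speaker", "Jane Doe"), ("location", "Room 12")},
    {("speaker", "John Roe"), ("location", "Room 12")},
]

effort = interleaved_annotation(docs, user_gold)
print(effort)
```

In this toy run the user annotates the first document from scratch and afterwards only supplies what the system could not propose, which is the sense in which interleaving reduces annotation effort.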

This adaptive methodology meets a number of the requirements imposed by SW usage scenarios:

  • portability by a wide range of users, from naive users to IE experts;
  • ability to cope with different types of texts (including mixed ones);
  • possibility of being inserted into the usual user annotation environment with minimum disruption to usual annotation activities;
  • portability with a reduced number of texts.

Last updated: November 24, 2002