adaptive IE tool

Fabio Ciravegna, Department of Computer Science, University of Sheffield
[F.Ciravegna@dcs.shef.ac.uk]

   

DEVELOPMENT CYCLE WITH AMILCARE


Amilcare supports the user throughout the whole application development cycle, from design to delivery and on to post-marketing assistance, through a dedicated set of tools. Human-computer interaction experts and information extraction experts worked together on the design of these user-support tools.

 

The application development cycle is shown in the next figure.

 
Amilcare's development cycle
 

Application development is divided into the following steps:

  1. Application design: the goal of this step is to define a template, i.e., a kind of form that the system must fill with the extracted information. Amilcare provides a set of tools that help the user identify the correct application settings: a graphical interface for highlighting information in example texts, coupled with a set of methods for the semi-automatic organization of information into templates and (in future releases) unsupervised methods that help identify the information present in the relevant documents. Because choosing a representative set of texts may be difficult, a number of statistical tools are provided for checking the representativeness of the corpus selected by the user, so as to avoid the (not infrequent) problems caused by poor example selection.
  2. System training: in this phase the system learns how to extract information for a particular application by analysing a number of user-defined examples (i.e. a set of documents annotated with the information to be extracted). A simple graphical interface lets the user highlight information with the mouse. Because providing examples can be tedious, Amilcare provides facilities for reducing the quantity of texts to be tagged via active learning, a strategy that may reduce the need for training examples by up to 80%.
  3. Result validation: a fundamental step in application development is tuning the results to the specific application needs. Given that a 100% accurate information extraction process is beyond the reach of current technology, it is necessary to balance the ability to find information (recall) against the precision of information identification, so as to find the right mix of precision and recall. Amilcare provides a set of tools for result monitoring, both from a qualitative point of view (inspecting the system results on a set of test texts with error highlighting) and from a statistical point of view (accuracy, precision, recall). Amilcare’s tuning interface is designed to bridge the user’s qualitative vision (“you are not capturing enough information”) with the numerical concepts the system is able to manipulate (e.g. moving error thresholds in order to obtain higher recall). The CPU time needed for retuning is 1/10 of the initial learning time.
  4. Application delivery: once the system performance has been tuned to the application needs, the information extraction engine can be delivered as a black-box module to be integrated into the user environment. A powerful API allows texts to be fed in and results to be extracted.
  5. Post-marketing monitoring: Amilcare provides tools that are fundamental once the application has been delivered to the final user. They make it possible to statistically compare the corpus used at training/testing time with the corpus received for analysis, and the results obtained at training/testing time with those obtained on the received corpus. This is fundamental because the kind of texts received can change over time (e.g. initially only very short texts are received, but then long texts start to appear), and the user must be sure that such a change (which may go unnoticed by the system administrator) does not affect system performance. Moreover, Amilcare is able to statistically monitor its accuracy on new texts by measuring the statistical distribution of identified information across texts, and to issue a warning if that distribution differs radically from the one observed on the training corpus.
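
The active learning mentioned in step 2 can be illustrated with a small sketch. The page does not say which selection strategy Amilcare uses, so the example below assumes plain uncertainty sampling: the user is asked to tag next the texts on which the current model is least confident. The function name, the document ids and the confidence scores are all hypothetical.

```python
def select_for_annotation(confidences, batch_size):
    """Return ids of the untagged texts the current model is least sure about.

    confidences maps a text id to the model's confidence (0.0 to 1.0) in its
    own extraction on that text. Ids and scores are hypothetical, not output
    of Amilcare itself.
    """
    # Assumed strategy: uncertainty is highest when confidence is near 0.5.
    ranked = sorted(confidences, key=lambda doc_id: abs(confidences[doc_id] - 0.5))
    return ranked[:batch_size]


untagged = {"doc1": 0.97, "doc2": 0.52, "doc3": 0.10, "doc4": 0.45}
to_tag = select_for_annotation(untagged, batch_size=2)
print(to_tag)
```

Texts the model already handles confidently (here doc1 and doc3) are skipped, which is how this kind of strategy reduces the number of examples a user has to tag.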
 
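
The precision/recall balance described in step 3 can be made concrete with a minimal sketch. It assumes that each extraction carries a confidence score and that tuning means moving a single acceptance threshold; this is an illustration only, not Amilcare's tuning interface, and the data below is invented.

```python
def precision_recall(scored, gold, threshold):
    """Precision and recall when keeping extractions scored >= threshold.

    scored: list of (item, confidence) pairs proposed by an extractor.
    gold: set of items a human marked as correct.
    """
    kept = {item for item, score in scored if score >= threshold}
    correct = len(kept & gold)
    precision = correct / len(kept) if kept else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical extractor output: (extracted item, confidence).
scored = [
    ("speaker:Smith", 0.9),
    ("time:3pm", 0.8),
    ("speaker:Room 5", 0.4),   # a wrong extraction with low confidence
    ("location:Room 5", 0.3),
]
gold = {"speaker:Smith", "time:3pm", "location:Room 5"}

strict = precision_recall(scored, gold, threshold=0.7)
lenient = precision_recall(scored, gold, threshold=0.2)
print(strict, lenient)
```

With the strict threshold every kept item is correct but one gold item is missed; lowering the threshold recovers the missed item at the cost of admitting the wrong one. This is the trade-off a user expresses qualitatively as "you are not capturing enough information".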
 
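
The distribution check described in step 5 might look like the following sketch. The page does not name the statistic Amilcare uses, so the example assumes a simple total variation distance between the relative frequencies of extracted field types, with a warning threshold of 0.2 chosen arbitrarily. All names and data are hypothetical.

```python
from collections import Counter


def field_distribution(extractions):
    """Relative frequency of each field type in a list of extracted field names."""
    counts = Counter(extractions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {field: n / total for field, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    fields = set(p) | set(q)
    return 0.5 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in fields)


def drift_warning(training, incoming, max_distance=0.2):
    """True when the incoming distribution differs too much from training.

    max_distance is an assumed, arbitrary threshold.
    """
    distance = total_variation(field_distribution(training), field_distribution(incoming))
    return distance > max_distance


training = ["speaker", "time", "location", "speaker", "time", "location"]
similar = ["speaker", "time", "location"]
shifted = ["speaker", "speaker", "speaker", "time"]

print(drift_warning(training, similar), drift_warning(training, shifted))
```

A check of this kind needs no human-tagged data on the new texts, which is why it can run unattended after delivery.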

Last updated: November 24, 2002