Amilcare: an adaptive IE tool

Fabio Ciravegna, Department of Computer Science, University of Sheffield



Amilcare tries to learn the best (most reliable and effective) level of language analysis for a specific IE task by mixing deep linguistic and shallow strategies. The learner starts by inducing rules that make no use of linguistic information, as in classic wrapper-like systems. It then progressively adds linguistic information to its rules, stopping when the use of linguistic information becomes unreliable or ineffective. Linguistic information is provided by generic NLP modules and resources that are defined once and for all and are not meant to be modified by users for specific application needs.
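This "lazy" use of NLP can be illustrated with a small sketch (hypothetical feature names and data structures, not Amilcare's implementation): a rule condition starts as an exact surface-word test, wrapper-style, and is only relaxed to deeper linguistic levels such as lemma or part of speech when that relaxation still yields reliable rules.

```python
# Hypothetical illustration of LazyNLP-style conditions: each condition tests
# a token at a different level of linguistic analysis, from shallow (exact
# word) to deeper (lemma, part-of-speech tag).

CONDITION_LEVELS = [
    {"word": "companies"},   # wrapper-like: exact string match, no NLP needed
    {"lemma": "company"},    # requires morphological analysis
    {"pos": "NNS"},          # requires part-of-speech tagging
]

def matches(condition, token):
    """A token matches if it agrees on every feature the condition tests."""
    return all(token.get(feat) == val for feat, val in condition.items())

# A token as produced by the generic NLP modules (word + lemma + POS).
token = {"word": "companies", "lemma": "company", "pos": "NNS"}
print([matches(c, token) for c in CONDITION_LEVELS])  # [True, True, True]
```

Each deeper level matches more tokens than the one before it, which is exactly why the learner must stop relaxing conditions once the added generality stops paying off in reliability.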

Amilcare is based on the (LP)2 algorithm, a supervised algorithm belonging to the class of Wrapper Induction Systems (WIS) that use LazyNLP. (LP)2 induces two types of symbolic rules in two steps: (1) rules that insert annotations into the texts; (2) rules that correct mistakes and imprecision in the annotations inserted by (1).
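The two-step pipeline can be sketched as follows (a minimal sketch with hypothetical rule representations, not Amilcare's actual data structures): tagging rules insert an annotation wherever their pattern matches, and correction rules then adjust the positions of misplaced annotations.

```python
# Step (1): tagging rules insert annotations where their pattern matches.
def apply_tagging_rules(tokens, tagging_rules):
    """Return (position, tag) pairs for every pattern match in the text."""
    annotations = []
    for pos in range(len(tokens)):
        for pattern, tag in tagging_rules:
            if tokens[pos:pos + len(pattern)] == pattern:
                annotations.append((pos, tag))
    return annotations

# Step (2): correction rules fix imprecision, here modelled as position shifts.
def apply_correction_rules(annotations, shifts):
    """Shift each tag by the offset its correction rule prescribes."""
    return [(pos + shifts.get(tag, 0), tag) for pos, tag in annotations]

tokens = "the seminar by Dr. Smith starts".split()
tagging_rules = [(["by"], "<speaker>")]        # hypothetical induced rule
anns = apply_tagging_rules(tokens, tagging_rules)
anns = apply_correction_rules(anns, {"<speaker>": 1})  # move one token right
print(anns)  # [(3, '<speaker>')]
```

In (LP)2 the correction rules are themselves induced from the tagging rules' mistakes on the training corpus; the fixed shift table above merely stands in for that second learning step.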

Rules are learnt by generalising over a set of examples marked via XML tags in a training corpus. A tagging rule is composed of a left-hand side, containing a pattern of conditions on a connected sequence of words, and a right-hand side, an action that inserts an XML tag into the texts. Each rule inserts a single tag, e.g. </speaker>. As positive examples, the rule induction algorithm uses the XML annotations in the training corpus; the rest of the corpus is considered a pool of negative examples. For each positive example the algorithm: (1) builds an initial rule, (2) generalises the rule, and (3) keeps the k best generalisations of the initial rule.
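Step (1), building the initial rule from a positive example, can be sketched like this (a hypothetical rule representation, assumed for illustration only): one condition is created per word of the annotated context, and the action records where the tag is inserted.

```python
# Hypothetical sketch of building an initial tagging rule from one positive
# example: one word=... condition per token, plus the tag-insertion action.

def initial_rule(words, tag, insert_at):
    """Build an initial rule with one exact-word condition per token."""
    pattern = [{"word": w} for w in words]   # left-hand side: conditions
    return {"pattern": pattern, "action": (insert_at, tag)}  # right-hand side

rule = initial_rule(["the", "seminar", "at", "4", "pm"], "<stime>", 3)
print(rule["pattern"][1])  # {'word': 'seminar'}
print(rule["action"])      # (3, '<stime>')
```

Such a rule is maximally specific: it matches only the exact word sequence it was built from, which is why the generalisation step (2) is needed before it can cover new instances.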

The algorithm's main loop starts by selecting a tag in the training corpus and extracting from the text a window of w words to the left and w words to the right. Each piece of information stored in the 2*w word window is transformed into a condition in the initial rule pattern, e.g. if the third word is "seminar", a condition word3="seminar" is created. Each initial rule is then generalised and the k best generalisations are kept; the retained rules become part of the best-rules pool. When a rule enters this pool, all the instances covered by the rule are removed from the pool of positive examples, i.e. they will no longer be used for rule induction ((LP)2 is a sequential covering algorithm). Rule induction continues by selecting new instances and learning rules until the pool of positive examples is empty. Some tagging rules (contextual rules) use tags inserted by other rules: for example, some rules close annotations, i.e. they use the presence of a <speaker> tag to insert a missing </speaker>.
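The sequential-covering main loop described above can be sketched as follows (a deliberately simplified sketch, not Amilcare's implementation): each positive instance is reduced to its window of tokens, the initial rule is the exact window, generalisation relaxes one condition to a wildcard, and every instance covered by a retained rule is removed from the pool.

```python
# Simplified sequential-covering loop: build an initial rule from one
# instance, keep its k best generalisations (scored by coverage), then
# remove every instance the retained rules cover.

def covers(rule, window):
    """A rule covers a window if every non-wildcard condition matches."""
    return all(c == "*" or c == t for c, t in zip(rule, window))

def generalise(rule):
    """Yield rules with each single condition relaxed to a wildcard."""
    for i in range(len(rule)):
        yield rule[:i] + ("*",) + rule[i + 1:]

def learn_rules(windows, k=1):
    pool, best_rules = list(windows), []
    while pool:
        init = pool[0]                       # initial rule: one condition per word
        cands = sorted(generalise(init),
                       key=lambda r: sum(covers(r, w) for w in windows),
                       reverse=True)[:k]     # keep the k best generalisations
        best_rules.extend(cands)
        pool = [w for w in pool              # sequential covering step:
                if not any(covers(r, w) for r in cands)]
    return best_rules

rules = learn_rules([("at", "4", "pm"), ("at", "5", "pm")])
print(rules)  # [('at', '*', 'pm')] — one generalised rule covers both windows
```

The real algorithm scores generalisations against the negative examples as well, so over-general rules (e.g. a pattern reduced to a single wildcard) are penalised rather than preferred; plain coverage is used here only to keep the sketch short.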

Further details about the (LP)2 learning algorithm can be found here.


Last updated: November 24, 2002