Adaptive IE tool

Fabio Ciravegna, Department of Computer Science, University of Sheffield



The use of IE for the SW requires extending the concept of text type to new, unexplored dimensions. Linguistically based methodologies developed for free texts can be difficult to apply, or even ineffective, on highly structured texts such as web pages generated from databases: they cannot cope with the variety of extralinguistic structures (e.g. XML tags, document formatting, and stereotypical language) used to convey information in such documents. Conversely, wrapper-like algorithms designed for highly structured HTML/XML pages are largely ineffective on unstructured (free) texts. Such methodologies make little or no use of NLP and avoid any generalization over the flat word sequence, so they tend to fail on free texts, for example because of data sparseness. The challenge is to develop methodologies that bridge the gap between the two approaches and so cope with different text types. This is particularly important for the SW, because Web pages can contain documents of any type, and even a mix of types: free, semi-structured, and structured text can all appear in the same page.

Moreover, the amount of annotated material needed for training must be reduced in order to minimize the burden on the user (lower cost, shorter development and maintenance time). As for porting across text types, classical Natural Language Processing (NLP) has generally treated adaptation to new text types as a matter of porting across different types of free text.

The IE methodology for the SW must be adaptive in the sense of being able to recognize which (linguistic) resources are useful for analyzing a specific context, sometimes even a context local to a single piece of information. For example, if a piece of information is in a table, a deep linguistic method relying on a parser's results is unlikely to be effective, whereas when the information is in a free-text zone the same method is likely to work. To our knowledge, none of the classic IE methods is adaptive in this sense, but we believe that this is a major requirement for application to the SW.
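
The table versus free-text example can be made concrete with a minimal sketch. It is only an illustration under stated assumptions, not the methodology proposed here: the names `ZoneSplitter`, `choose_method`, and `plan_extraction`, and the labels `wrapper_rules` and `linguistic_analysis`, are hypothetical, and treating everything inside a `<table>` element as a structured zone is a deliberate simplification. An adaptive system would presumably learn such a policy rather than hard-code it.

```python
from html.parser import HTMLParser


class ZoneSplitter(HTMLParser):
    """Split an HTML page into text zones tagged 'table' or 'free_text'."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # greater than 0 while inside a <table> element
        self.zones = []       # list of (zone_type, text) pairs, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            zone = "table" if self.table_depth > 0 else "free_text"
            self.zones.append((zone, text))


def choose_method(zone_type):
    """Hypothetical policy: shallow wrapper-like rules inside tables,
    deeper linguistic analysis in free-text zones."""
    return "wrapper_rules" if zone_type == "table" else "linguistic_analysis"


def plan_extraction(html):
    """Return (method, text) pairs, one per text zone found in the page."""
    splitter = ZoneSplitter()
    splitter.feed(html)
    splitter.close()
    return [(choose_method(zone), text) for zone, text in splitter.zones]


# A page mixing a free-text sentence with a table row.
page = (
    "<p>The seminar will be given by Dr. Smith.</p>"
    "<table><tr><td>Speaker</td><td>Dr. Smith</td></tr></table>"
)
for method, text in plan_extraction(page):
    print(method, "->", text)
```

In this sketch the same page yields different method choices for different zones, which is the local notion of context described above; the actual extraction step for each method is left out.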


Last updated: November 24, 2002