Dot.Kom

PASCAL Challenge

Check out the results of the international Information Extraction competition entitled "Pascal Challenge: Evaluating Machine Learning for Information Extraction". Given a standardised corpus of annotated and pre-processed documents, the participants were asked to perform a number of Information Extraction tasks. [More details...]

Nowadays a large part of knowledge is stored in unstructured textual format. Big companies hold millions of documents, often stored in different parts of the world but available via intranets. Textual documents cannot be queried in simple ways, so the knowledge they contain can neither be used by automatic systems nor be easily managed by humans. This means that knowledge is difficult to capture, share and reuse among employees, reducing the company’s efficiency and competitiveness. Moreover, at a time when companies are increasingly valued for their “intangible assets” (i.e. the knowledge they own), the presence of unmanageable knowledge implies a loss of company value.

The Semantic Web initiative addresses these issues. The effort behind the Semantic Web is to add semantic content to web documents so that knowledge, rather than unstructured material, can be accessed and managed automatically. Much work has been done on (1) the definition of standards for the organization of knowledge (e.g. XML, RDF, OIL), (2) the definition of structures for knowledge organization (e.g. ontologies) and (3) the population of such knowledge structures. (1) and (2) provide the necessary infrastructure for the Semantic Web; (3) requires methodologies for marking up documents. It is probably reasonable to expect users to manually annotate new documents to a certain degree, but this does not solve the problem of old documents containing unstructured material. In any case, we cannot expect everyone to manually mark up every mail or document they produce. Moreover, some users may need to extract and use information different from, or additional to, what the creator provided or is willing to provide.
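To make the contrast with unstructured text concrete, here is a minimal sketch (illustrative names only, not a real RDF library) of how semantic markup represents document content as subject-predicate-object triples, which can then be queried in simple ways that raw text cannot:

```python
# Hand-made illustration: knowledge as subject-predicate-object triples,
# the basic shape standardised by RDF. All names here are invented.

triples = [
    ("PascalChallenge", "isA", "Competition"),
    ("PascalChallenge", "evaluates", "MachineLearning"),
    ("PascalChallenge", "appliesTo", "InformationExtraction"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return every triple matching the non-None fields."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(query(triples, predicate="isA"))
# [('PascalChallenge', 'isA', 'Competition')]
```

Once knowledge is in this form, automatic systems can answer structured questions; the hard part, as discussed below, is getting the triples out of the text in the first place.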

For the reasons mentioned above, it is vital for the Semantic Web to produce automatic or semi-automatic methods for extracting information from web-related documents, either to help annotate new documents or to extract additional information from existing unstructured or partially structured documents. Given the growing use of the Web (and, in the future, of the Semantic Web) for Knowledge Management (KM), this is vital for KM as a whole. In this context, Information Extraction from texts (IE) is one of the most promising areas of Human Language Technologies (HLT) for KM. IE is an automatic method for locating important facts in electronic documents for subsequent use, e.g. for annotating documents or for storing information for later use (such as populating an ontology with instances).
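As a minimal illustration of the kind of extraction described above (a toy pattern-based sketch, not the project's actual technology), an IE system might locate simple facts in free text and store them as instances of ontology classes:

```python
import re

# Toy pattern-based Information Extraction: locate simple facts in free
# text and record them as ontology instances. Real IE systems use far
# richer linguistic analysis; the patterns below are illustrative only.

PATTERNS = {
    # Titles followed by capitalised names, e.g. "Dr. Ada Lovelace"
    "Person": re.compile(r"\b(?:Dr|Prof|Mr|Ms)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"),
    # ISO-style dates, e.g. "2004-05-17"
    "Date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract(text):
    """Populate an ontology (class -> instances) with every match."""
    ontology = {cls: [] for cls in PATTERNS}
    for cls, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            ontology[cls].append(match.group())
    return ontology

document = "Dr. Ada Lovelace presented the results on 2004-05-17."
print(extract(document))
# {'Person': ['Dr. Ada Lovelace'], 'Date': ['2004-05-17']}
```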

IE is the perfect support for identifying and extracting knowledge from Web documents, as it can, for example, support document analysis either automatically (unsupervised extraction of information) or semi-automatically (e.g. by highlighting relevant facts in documents to support human annotators). A main challenge for IE in the coming years is to enable people with knowledge of the Semantic Web but little or no background in IE and Computational Linguistics to build new applications and cover new domains. This is particularly important for the broader field of KM, whether web based or not: IE is just one of many technologies used in building complex applications, and wider acceptance of IE will come only when IE tools require no specific skills apart from notions of KM. The potential market for such tools is quite large.


The consortium will study, design and implement innovative methodologies for KM based on the use of IE. From the scientific point of view, we will focus on two symmetric aspects: how use in KM poses requirements and challenges to IE, and how the use of IE changes KM. From the practical point of view, we will define tools and methodologies for IE-based KM.

Concerning IE, we will focus on the study and implementation of user-driven Information Extraction systems that can easily be ported to new application domains with limited or no knowledge of Natural Language Processing. We intend to support users comprehensively throughout the development of IE applications, from application design to the release of the final application. In particular, we intend to focus on adaptive IE technology using Machine Learning, capitalizing on the experience of the consortium members, who are internationally recognised as being among the leaders in the field.
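The idea behind adaptive IE can be sketched very simply (this is a deliberately naive illustration, not the consortium's actual algorithm): the user only marks up a few examples, and the system induces extraction rules from them, so no NLP expertise is required. Here, a toy learner collects the words that immediately precede user-annotated targets and reuses them as triggers on unseen text:

```python
# Toy adaptive IE sketch: induce left-context trigger rules from
# user-annotated examples, then apply them to unseen text.
# Purely illustrative; real adaptive IE systems are far more robust.

def learn_triggers(annotated):
    """From (tokens, index-of-annotated-target) pairs, collect the
    words that immediately precede each annotated target."""
    triggers = set()
    for tokens, target_index in annotated:
        if target_index > 0:
            triggers.add(tokens[target_index - 1].lower())
    return triggers

def apply_triggers(tokens, triggers):
    """Extract every token whose left neighbour is a learned trigger."""
    return [tokens[i] for i in range(1, len(tokens))
            if tokens[i - 1].lower() in triggers]

# Training: the user annotates the speaker (index 3) in two sentences.
examples = [
    ("the seminar by Smith starts soon".split(), 3),
    ("a talk by Jones at noon".split(), 3),
]
triggers = learn_triggers(examples)  # learns {'by'}
print(apply_triggers("a lecture by Brown".split(), triggers))
# ['Brown']
```

The point of the sketch is the workflow, not the rule language: the user's effort is reduced to annotating examples, and porting to a new domain means annotating new examples rather than writing extraction rules by hand.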

From the KM point of view, we will design and implement a methodology for KM that uses IE to capture the knowledge in textual documents. We will study how IE can impact the reuse, sharing and diffusion of knowledge within a company, and how IE can be integrated with existing KM tools. Here too we will build on the past experience of consortium members, who are internationally recognised as being among the leaders in the field.

Achieved Results

You may find the final reports of the project here.

created and maintained by José Iria