Check out the results of the international Information Extraction competition entitled "Pascal Challenge: Evaluating Machine Learning for Information Extraction".
Given a standardised corpus of annotated and pre-processed documents, the participants were asked to perform a number of Information Extraction tasks.
Nowadays a large part of knowledge is stored in unstructured textual form. Big companies have millions of documents, often stored in different parts of the world but available via intranets. Textual documents cannot be queried in simple ways, so the knowledge they contain can neither be used by automatic systems nor be easily managed by humans. This makes knowledge difficult to capture, share and reuse among employees, reducing a company’s efficiency and competitiveness. Moreover, at a time when companies are increasingly valued for their “intangible assets” (i.e. the knowledge they own), the presence of unmanageable knowledge implies a loss of company value.
The Semantic Web initiative addresses these issues. The effort behind the Semantic Web is to add semantic annotations to web documents, so that users and systems access knowledge instead of unstructured material and knowledge can be managed automatically. Much work is done on (1) the definition of standards for the organization of knowledge (e.g. XML, RDF, OIL), (2) the definition of structures for knowledge organization (e.g. ontologies) and (3) the population of such knowledge structures. (1) and (2) provide the necessary infrastructure for the Semantic Web; (3) requires methodologies for marking up documents. It is probably reasonable to expect users to manually annotate new documents to a certain degree, but this does not solve the problem of legacy documents containing unstructured material. In any case we cannot expect everyone to manually mark up every mail or document they produce, as this would be impossible. Moreover, some users may need to extract and use information different from, or additional to, what the creator provided or is willing to provide.
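As a toy illustration of what machine-assisted markup could produce, the sketch below wraps known entity mentions in inline tags. The lexicon, the `<entity>` tag vocabulary and the `annotate` helper are all invented for this sketch, not part of any Semantic Web standard:

```python
import re

# Toy lexicon mapping known entity mentions to a semantic class.
# Both the mentions and the classes are illustrative only.
LEXICON = {
    "ACME Corp": "organization",
    "London": "location",
}

def annotate(text: str) -> str:
    """Wrap each known mention in an inline <entity type="..."> tag."""
    for mention, etype in LEXICON.items():
        text = re.sub(re.escape(mention),
                      f'<entity type="{etype}">{mention}</entity>',
                      text)
    return text

doc = "ACME Corp opened a new office in London."
print(annotate(doc))
```

A real system would of course locate mentions it has never seen before, rather than look them up in a fixed list; the point here is only the shape of the resulting markup.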
For the reasons mentioned above it is vital for the Semantic Web to produce automatic or semi-automatic methods for extracting information from web-related documents, either to help annotate new documents or to extract additional information from existing unstructured or partially structured documents. Given the increasing use of the Web (and, in the future, of the Semantic Web) for Knowledge Management (KM), this is also vital for KM in general. In this context, Information Extraction from texts (IE) is one of the most promising areas of Human Language Technologies (HLT) for KM. IE is an automatic method for locating important facts in electronic documents for subsequent use, e.g. for annotating documents or for storing information for further use (such as populating an ontology with instances).
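A minimal sketch of the ontology-population idea: a single hard-coded extraction rule locates "X is based in Y" facts and stores them as instances. The pattern, class names and dictionary layout are invented for illustration; real IE systems learn such rules from annotated examples rather than hard-coding them:

```python
import re
from collections import defaultdict

# One illustrative extraction rule (real systems learn many such patterns).
PATTERN = re.compile(r"(\w[\w ]*?) is based in (\w[\w ]*)")

def extract_facts(text):
    """Return (company, location) pairs matching the toy pattern."""
    return PATTERN.findall(text)

# A trivial "ontology": class or relation name -> set of instances.
ontology = defaultdict(set)
for company, location in extract_facts(
        "FooSoft is based in Berlin. BarTech is based in Oslo."):
    ontology["Company"].add(company)
    ontology["Location"].add(location)
    ontology["basedIn"].add((company, location))

print(sorted(ontology["basedIn"]))
```

The extracted pairs could equally be highlighted in the source document for a human annotator to confirm, which is the semi-automatic mode discussed below.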
IE is the perfect support for knowledge identification and extraction from Web documents, as it can, for example, support document analysis either in an automatic way (unsupervised extraction of information) or in a semi-automatic way (e.g. as support for human annotators in locating relevant facts in documents, via information highlighting). A main challenge for IE in the coming years is to enable people with knowledge of the Semantic Web but little or no training in IE and Computational Linguistics to build new applications and cover new domains. This is particularly important for the broader field of KM, whether web-based or not: IE is just one of many technologies used in building complex applications, and wider acceptance of IE will come only when IE tools require no specific skills apart from notions of KM. The potential market for such tools is quite large.
The consortium will study, design and implement innovative methodologies for KM based on the use of IE. From the scientific point of view we will focus on two symmetric aspects: how use in KM poses requirements and challenges to IE, and how the use of IE changes KM. From the practical point of view we will define tools and methodologies for IE-based KM. Concerning IE, we will focus on the study and implementation of user-driven Information Extraction systems that can easily be ported to new application domains with limited or no knowledge of Natural Language Processing. We intend to support users comprehensively throughout the development of IE applications, from application design to the release of the final application. In particular we intend to focus on adaptive IE technology using Machine Learning, capitalizing on the experience of the consortium members, who are internationally recognised to be among the leaders in the field.
From the KM point of view, we will design and implement a methodology for KM that uses IE as a method for capturing knowledge in textual documents. We will study how IE can impact the reuse, sharing and diffusion of knowledge within a company and how IE can be integrated with existing KM tools. Here too we will build on the consortium members’ past experience in the field.
You may find the final reports of the project here.