

INFORMATION EXTRACTION TOOLS

jInFil (ITC-Irst)

jInFil is an open source Java tool for Instance Filtering developed at ITC-irst. Instance Filtering is a preprocessing step for supervised classification-based learning systems for entity recognition. The goal of Instance Filtering is to reduce both the skewed class distribution and the data set size by eliminating negative instances, while preserving positive ones as much as possible. This process is performed on both the training and test set, with the effect of reducing the learning and classification time, while maintaining or improving the prediction accuracy. jInFil is released as free software with full source code, provided under the terms of the Apache License, Version 2.0.
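The filtering idea can be sketched as follows. This is an illustrative toy heuristic, not jInFil's actual algorithm: tokens whose surface form never occurs inside an entity in the training data are dropped, which discards many negative instances while keeping most positives.

```python
# Toy Instance Filtering sketch (illustrative heuristic, not jInFil's
# actual filter): drop tokens whose surface form never appears inside
# an entity in the training data.

def build_keep_set(training):
    """training: list of (token, is_positive) pairs; returns the set of
    token types seen at least once as a positive instance."""
    return {tok.lower() for tok, positive in training if positive}

def filter_instances(tokens, keep):
    """Keep only tokens whose type occurred as a positive in training;
    most negative instances are discarded before classification."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() in keep]

training = [("John", True), ("Smith", True), ("visited", False),
            ("Paris", True), ("yesterday", False), (".", False)]
keep = build_keep_set(training)
kept = filter_instances(["Yesterday", "Mary", "Smith", "visited", "Paris", "."], keep)
```

Note the trade-off stated above: a positive token type unseen in training ("Mary" here) is also dropped, which is why positives can only be preserved "as much as possible".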

Web page: http://tcc.itc.it/research/textec/tools-resources/jinfil.html

jFex (ITC-Irst)

jFex is a Feature Extraction tool for Natural Language Processing applications based on machine learning techniques, developed at ITC-irst. jFex is written in Java, so it runs on Mac OS X, OS/2, Unix, VMS and Windows. jFex generates the features specified by a feature extraction script, indexes them, and returns the example set as well as the mapping between features and their indices (the lexicon). If so configured, it only extracts features for the instances not marked as 'uninformative' by jInFil. jFex is strongly inspired by FEX, but introduces several improvements. First of all, it provides an enriched feature extraction language. Secondly, it makes it possible to further extend this language through a Java API, providing a flexible tool for defining task-specific features. Finally, jFex can output the example set in formats directly usable by SVMlight and SNoW. jFex is released as free software with full source code, provided under the terms of the Apache License, Version 2.0. The project source code is hosted by SourceForge.
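The core pipeline — extract features per instance, index them through a lexicon, and emit sparse examples — can be sketched as below. The feature templates are invented for illustration and are not jFex's script language:

```python
# Hypothetical sketch of a jFex-style pipeline: extract window features
# for each token, index them through a lexicon, and emit sparse
# examples (lists of feature indices) plus the lexicon itself.

def extract_features(tokens, i):
    """Toy feature templates: current word, previous word, word shape."""
    feats = ["w=" + tokens[i].lower()]
    feats.append("prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"))
    feats.append("cap=" + str(tokens[i][0].isupper()))
    return feats

def index_examples(sentences):
    lexicon, examples = {}, []
    for tokens in sentences:
        for i in range(len(tokens)):
            indices = []
            for f in extract_features(tokens, i):
                if f not in lexicon:
                    lexicon[f] = len(lexicon)  # assign next free index
                indices.append(lexicon[f])
            examples.append(indices)
    return examples, lexicon

examples, lexicon = index_examples([["John", "runs"]])
```

The sparse index lists are exactly the shape of example that learners such as SVMlight consume, and the lexicon allows feature indices to be mapped back to readable feature strings.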

Web page: http://tcc.itc.it/research/textec/tools-resources/jfex.html

SIE (ITC-Irst)

SIE (Simple Information Extraction) is an information extraction system based on a supervised machine learning technique for extracting implicit relations from documents. In particular, Information Extraction (IE) is cast as a classification problem by applying Support Vector Machines (SVMs) to build a set of classifiers for detecting the boundaries of the entities to be extracted. SIE was designed with the goal of being easily and quickly portable across tasks and domains. A set of experiments on several tasks in different domains has shown that SIE is competitive with state-of-the-art systems, and it often outperforms systems customized to a specific domain. A key property of SIE is that it reduces computational effort by exploiting instance filtering. This allows it to scale from toy problems to real-world datasets, making SIE attractive in application areas such as bioinformatics.
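The boundary-classification idea can be illustrated as follows, with trivial capitalization rules standing in for the trained SVM classifiers (the real system learns these decisions from annotated data):

```python
# Illustrative sketch of casting IE as boundary classification, as SIE
# does with SVMs. Two toy "classifiers" (plain functions standing in
# for trained SVMs) flag begin and end boundaries; matched begin/end
# pairs become extracted entity spans.

def is_begin(tokens, i):   # stand-in for the begin-boundary classifier
    return tokens[i][0].isupper() and (i == 0 or not tokens[i - 1][0].isupper())

def is_end(tokens, i):     # stand-in for the end-boundary classifier
    return tokens[i][0].isupper() and (i + 1 == len(tokens) or not tokens[i + 1][0].isupper())

def extract_spans(tokens):
    """Pair up begin and end boundary decisions into (start, end) spans."""
    spans, start = [], None
    for i in range(len(tokens)):
        if is_begin(tokens, i):
            start = i
        if is_end(tokens, i) and start is not None:
            spans.append((start, i))
            start = None
    return spans

spans = extract_spans(["the", "John", "Smith", "met", "Mary", "today"])
```

Two classifiers per entity type suffice regardless of span length, which is part of what makes the approach cheap to port across tasks.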

Web page: http://tcc.itc.it/research/textec/tools-resources/sie.html

TRex (Sheffield)

The Trainable Relation Extraction (T-Rex) framework has been developed as a testbed for experimenting with several extraction algorithms and scenarios. The framework aims to be general enough to support a variety of entity extraction and relation extraction algorithms over several input formats. T-Rex is publicly available for download.

Web page: http://tyne.shef.ac.uk/t-rex

Amilcare (Sheffield)

Amilcare is an IE system designed to support document annotation in the Semantic Web framework. Amilcare is an adaptive IE system, i.e. it uses machine learning to adapt to new applications and domains. It is rule-based, i.e. its learning algorithm induces rules that extract information. Rules are learnt by generalising over a set of examples found in a training corpus annotated with XML tags. The system learns how to reproduce such annotation via Information Extraction. Amilcare's rule induction algorithm has recently been generalised to take semantic information into account (previous versions only used linguistic information). This extension allows Amilcare to make use of information in ontologies during the induction process.

Web page: http://nlp.shef.ac.uk/amilcare

Armadillo (Sheffield)

The Armadillo Architecture is a knowledge mining system used to extract information from heterogeneous sources. It is designed to be generic enough to cater for most domains. The underlying idea is a system based around Semantic Web Services (SWS): its underlying functions are distributed in an environment (normally the web), accept an input and produce some form of output. We go a step further by requiring that these functions be semantically enabled, i.e. the inputs and outputs are semantically typed and can therefore refer to anything (whether a concrete object, an abstract concept, etc.) that has some sort of meaning. The task of the system is to take some input information regarding a particular domain and give back to the user a structured view of all semantic relations branching from the input data. Armadillo achieves this by exploiting the redundancy of information on the Internet, apparent in the presence of multiple citations of the same facts in superficially different formats. This redundancy is used to bootstrap the annotation process needed for Information Extraction, thus enabling production of machine-readable content for the Semantic Web.
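The redundancy exploitation can be sketched minimally: a candidate fact is accepted as a seed annotation only when enough distinct sources state it. The data, threshold and fact representation below are illustrative assumptions, not Armadillo's internals:

```python
# Hedged sketch of Armadillo's redundancy idea: a candidate fact
# becomes a seed annotation only if stated by enough independent
# sources. Sources and facts here are toy data.

from collections import defaultdict

def accept_facts(observations, min_sources=2):
    """observations: iterable of (source, fact) pairs; a fact is kept
    when at least min_sources distinct sources mention it."""
    support = defaultdict(set)
    for source, fact in observations:
        support[fact].add(source)
    return {f for f, srcs in support.items() if len(srcs) >= min_sources}

obs = [("pageA", ("J. Smith", "works_for", "Univ. X")),
       ("pageB", ("J. Smith", "works_for", "Univ. X")),
       ("pageA", ("J. Smith", "works_for", "Univ. X")),  # repeat, same source
       ("pageC", ("M. Jones", "works_for", "Univ. Y"))]
seeds = accept_facts(obs)
```

Counting distinct sources rather than raw mentions is what makes the bootstrap robust to a single page repeating itself.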

Web page: http://nlp.shef.ac.uk/armadillo

PANKOW/C-PANKOW (Karlsruhe)

PANKOW (Pattern-based Annotation through Knowledge on the Web) is a system for automatically annotating instances in a web page with respect to a given ontology. It implements an unsupervised approach to information extraction in the sense that no labeled data is needed to train the system. Instead, the system relies on the frequency of occurrence of certain patterns on the Web to derive formal annotations. Recently, the system has been extended to take into account, as context, the web page in which the instance to be annotated appears, and to annotate web pages with respect to WordNet.
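The pattern-counting idea can be sketched as below. Real PANKOW instantiates linguistic patterns and queries a web search engine for hit counts; here a stub dictionary stands in for those counts, and the patterns are simplified:

```python
# Illustrative sketch of PANKOW's pattern-counting approach. A stub
# dictionary stands in for web search hit counts; the patterns are
# simplified stand-ins for the linguistic patterns PANKOW uses.

PATTERNS = ["{i} is a {c}", "{i} and other {c}s", "the {c} {i}"]

HIT_COUNTS = {  # stand-in for search engine result counts
    "Paris is a city": 120_000, "Paris is a person": 400,
    "Paris and other citys": 9_000, "Paris and other persons": 50,
    "the city Paris": 75_000, "the person Paris": 100,
}

def annotate(instance, concepts):
    """Rank candidate concepts by summed pattern hit counts and return
    the best one as the annotation for the instance."""
    def score(c):
        return sum(HIT_COUNTS.get(p.format(i=instance, c=c), 0)
                   for p in PATTERNS)
    return max(concepts, key=score)

best = annotate("Paris", ["city", "person"])
```

No labeled training data enters the loop: the Web's aggregate usage statistics alone decide the annotation, which is what makes the approach unsupervised.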

Web page: http://km.aifb.uni-karlsruhe.de/pankow/

TIES (ITC-Irst)

TIES (Trainable Information Extraction System) is an Adaptive Information Extraction (IE) system currently under development at ITC-irst within the Dot.Kom project. TIES is based on a Java reimplementation of the Boosted Wrapper Induction (BWI) algorithm devised by Dayne Freitag and Nicholas Kushmerick.

TIES is implemented in Java and runs on all platforms supporting Java. Release 1.2 of TIES was delivered in September 2003. TIES is currently available for download only to Dot.Kom partners, but we plan to make it available for research purposes outside the Dot.Kom consortium as well.

Web page: http://tcc.itc.it/research/textec/tools-resources/ties.html

ESpotter (Open University)

ESpotter is a domain-adaptable named entity recognition (NER) system which adapts patterns and lexicons to domains on the Web for efficient and effective NER.

Web page: http://kmi.open.ac.uk/people/jianhan/ESpotter/

IE-BASED KNOWLEDGE MANAGEMENT TOOLS

CORDER (Open University)

CORDER discovers relations from the Web pages of a community. Its approach is based on co-occurrences of named entities (NEs) and the distances between them. For a given NE, there are a number of co-occurring NEs. We assume that NEs that are closely related tend to appear together more often and closer to each other in Web pages. We calculate a relation strength for each co-occurring NE based on its co-occurrences with, and distances from, the given NE. The co-occurring NEs are then ranked by their relation strengths.
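A minimal sketch of such a ranking, assuming an illustrative 1/(1+distance) weighting (the actual CORDER formula may differ):

```python
# Hedged sketch of CORDER-style relation strength: each NE co-occurring
# with a target NE is scored from how often and how closely it appears.
# The 1/(1+distance) weighting is an illustrative assumption.

from collections import defaultdict

def relation_strengths(cooccurrences):
    """cooccurrences: list of (ne, distance_in_tokens) observations for
    one target NE. Score sums 1/(1+distance) over occurrences, so
    frequent and nearby NEs rank highest."""
    scores = defaultdict(float)
    for ne, dist in cooccurrences:
        scores[ne] += 1.0 / (1 + dist)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# toy observations for one target NE
obs = [("Enrico Motta", 2), ("Enrico Motta", 5), ("KMi", 3), ("London", 40)]
ranked = relation_strengths(obs)
```

The weighting realizes the stated assumption directly: repeated, close co-occurrences accumulate a high score, while a single distant co-occurrence contributes almost nothing.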

ASDI (Automated Semantic Data Integration) (Open University)

ASDI is a system for automatically populating semantic web sites: knowledge is extracted from the underlying heterogeneous sources, verified, and integrated in a fully automated way. ASDI provides comprehensive support for adding semantics to web sites by i) targeting entire web sites rather than isolated web pages and ii) ensuring the generation of high-quality mark-up.

K@/SemantiK (Quinary)

K@ (read 'kat') is a collaborative web-based platform for knowledge management. Available since 2002 and leveraging R&D results from the Peking project, it has been enhanced by a semantic layer, the SemantiK plugin. K@/SemantiK is able to maintain the association between documents and semantic annotations with respect to a formal ontology, according to Semantic Web standards. It is a platform featuring integration, searching, presentation and editing of knowledge expressed through Semantic Web languages.

Web page: http://www.quinary.com/pagine/innovation/dotkomprojmatl_frame.htm

AktiveDoc (Sheffield)

AktiveDoc is a system for supporting knowledge management in the process of document editing and reading. Its main feature is to support users (both readers and writers) in timely sharing and reusing relevant knowledge.

Magpie (Open University)

Browsing the Web involves two basic tasks: (i) finding the right web page and (ii) making sense of its content. Whilst there is a body of research related to the former, relatively little work has investigated how the interpretation of web content can be supported. Magpie supports sense-making by semantically marking up web pages on the fly. A browser plugin allows interesting items to be highlighted within a web page and knowledge services to be invoked.

A seminal study of how users browse the web found that 58% of web pages visited are revisits and that 90% of all user actions are related to navigation. Magpie is able to automatically track the items found during a browsing session using a semantic log, facilitating the provision of trigger services. Trigger services are activated when a specific pattern of items has been found within the semantic log. One type of trigger service offered in Magpie is a collector, which collects items from a browsing session using an ontology-based filter.
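A semantic log with a collector-style trigger might be sketched as follows; the class, concept and instance names are illustrative, not Magpie's actual API:

```python
# Hedged sketch of a Magpie-style trigger service: items spotted during
# browsing are appended to a semantic log, and a trigger fires once the
# log contains a matching pattern of items. All names are illustrative.

class SemanticLog:
    def __init__(self):
        self.items = []        # (concept, instance) pairs seen so far
        self.triggers = []     # (required_concepts, callback)

    def add_trigger(self, required_concepts, callback):
        self.triggers.append((frozenset(required_concepts), callback))

    def log(self, concept, instance):
        """Record a spotted item and fire any trigger whose required
        concepts are now all covered by the session log."""
        self.items.append((concept, instance))
        seen = {c for c, _ in self.items}
        for required, callback in self.triggers:
            if required <= seen:
                callback(self.items)

fired = []
log = SemanticLog()
# collector-style trigger: activate once both a Person and a Project
# have been spotted during the session
log.add_trigger({"Project", "Person"}, lambda items: fired.append(list(items)))
log.log("Person", "Enrico Motta")
log.log("Project", "Magpie")
```

The callback receives the whole session log, so a collector can filter the accumulated items through whatever ontology-based criterion it implements.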

Web page: http://kmi.open.ac.uk/projects/magpie/

Aqualog (Open University)

AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input and returns answers drawn from one or more knowledge bases (KBs), which instantiate the input ontology with domain-specific information [Lopez 04]. It addresses the task of searching for concrete answers to a specific question. A key feature of AquaLog is that it is modular with respect to the input ontology, the aim here being that it should be zero cost to switch from one ontology to another when using AquaLog.

Web page: http://kmi.open.ac.uk/projects/akt/aqualog/

OntoBroker (ontoprise)

Ontobroker may be seen as a middleware run-time system providing an information-delivery base for intranet and extranet applications, for knowledge management systems, for e-commerce systems and, in general, for intelligent applications. Ontobroker integrates access to different information sources such as databases, keyword-based search engines, etc. It reads various input formats such as Excel, XML, OXML, RDF, DAML+OIL, F-Logic and Prolog. It thus provides homogeneous access to an inhomogeneous set of information sources and input formats.

Ontobroker is normally accompanied by a set of other tools. OntoEdit is an interactive editor to define ontologies, describe instances, define rules etc. OntoEdit may be configured with a light-weight version of our inference engine to support the development and debugging of rules. SemanticMiner is a web-based solution, where ontologies are used for query expansion, document retrieval, visualization and classification.

Web page: http://ontobroker.semanticweb.org

OntoStudio (ontoprise)

During the course of the project it became clear that a more flexible framework for ontology management was needed, one that is easily extensible and fulfils the following needs: smooth integration of custom plug-ins, one framework for every ontology-relevant task, a scalable architecture, and smart built-in functionality (such as basic style sheets for layout, auto-completion mechanisms, administration, and management of the framework).

One open-source framework fulfilled these needs: the Eclipse project of the Eclipse Foundation. OntoStudio is being built on top of Eclipse and covers the following features:

  • Editing of concepts, attributes, relations, instances, queries, and rules
  • Import and export of OXML/F-Logic/RDF/OWL into the file system and via a WebDAV interface
  • Mapping of concepts, attributes, relations and filtering over attributes
  • Database schema import of MS SQL Server 2000, Oracle 9i, and DB2
  • QueryTool for comfortable editing of queries
  • Inferencing
  • WebService Import
  • SynonymView to enter and manage synonyms
  • Visualizer

We have further developed two different ontology-enlargement plug-ins for OntoStudio:

  • OntoAnnotate is a document-centered semantic annotation framework that can be used for the annotation and enrichment of documents. The editor is able to propose concepts, instances, attributes, and relations.
  • Text2Onto is a framework which supports the ontology engineering process by applying text mining techniques, thus helping the ontology engineer to build and extend an ontology.

Web page: http://www.ontoprise.de/content/e3/e43/index_eng.html

OntoOffice (ontoprise)

ontoprise utilizes semantic technologies to deliver knowledge where it is most needed. Users are not obliged to leave their familiar environment to issue a query in an external application; instead, the knowledge most relevant in the current context is displayed promptly.

While working with MS Word, MS Excel and MS Outlook, OntoOffice automatically checks the input for related information stored in the knowledge bases. On demand, the user can access the discovered knowledge and integrate it into his current work. Additionally, expanded search queries can be issued directly from the application against any kind of structured knowledge base, without leaving the user's familiar environment. At the same time, the corresponding metadata is inserted into the document, which improves the quality of future search queries.

Web page: http://www.ontoprise.de/content/e3/e24/ontooffice_en

TextToOnto (AIFB)

TextToOnto is a tool suite built upon KAON to support the ontology engineering process with text mining techniques. Providing a collection of independent tools for both automatic and semi-automatic ontology extraction, it assists the user in creating and extending ontologies. Moreover, efficient support for ontology maintenance is given by modules for ontology pruning and comparison.

Web page: http://sourceforge.net/projects/texttoonto/

Text2Onto (AIFB)

Text2Onto is a complete re-design and re-engineering of our system TextToOnto, a tool suite for ontology learning from textual data. Text2Onto introduces two new paradigms for ontology learning: (i) Probabilistic Ontology Models (POMs) which represent the results of the system by attaching a probability to them and (ii) data-driven change discovery which is responsible for detecting changes in the corpus, calculating POM deltas with respect to the changes and accordingly modifying the POM without recalculating it for the whole document collection.
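Both ideas can be sketched together: a POM holding candidate concepts with attached probabilities, updated incrementally from a corpus delta rather than recomputed from scratch. The document-frequency estimate used here is an illustrative assumption, not Text2Onto's actual confidence measure:

```python
# Hedged sketch of the two Text2Onto ideas: a Probabilistic Ontology
# Model (candidate concepts with probabilities) plus data-driven change
# discovery (the model is updated from a corpus delta only). The
# document-frequency probability estimate is an illustrative assumption.

class POM:
    def __init__(self):
        self.doc_count = 0
        self.concept_docs = {}  # concept -> number of docs mentioning it

    def add_documents(self, docs):
        """Change discovery: fold only the NEW documents into the model,
        leaving statistics from the existing corpus untouched."""
        for terms in docs:
            self.doc_count += 1
            for t in set(terms):
                self.concept_docs[t] = self.concept_docs.get(t, 0) + 1

    def probability(self, concept):
        """Confidence that `concept` belongs in the learned ontology."""
        return self.concept_docs.get(concept, 0) / self.doc_count

pom = POM()
pom.add_documents([["hotel", "room"], ["hotel", "bar"]])
p1 = pom.probability("hotel")            # from the initial corpus
pom.add_documents([["room", "price"]])   # incremental corpus delta
p2 = pom.probability("hotel")            # updated without a full rescan
```

Attaching probabilities instead of hard decisions lets an ontology engineer inspect and threshold the candidates, and the delta update keeps large, changing corpora tractable.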

Web page: http://ontoware.org/projects/text2onto/

SemanticMiner (ontoprise)

The SemanticMiner is a Knowledge Retrieval platform that combines semantic technologies with conventional retrieval approaches. The improved navigation enables the user to easily pose semantic queries to all kinds of information sources, especially unstructured documents. Semantic information integration allows for different views and deep analysis of hidden knowledge through the externalization of implicit knowledge. SemanticMiner has a client-server architecture. It provides information retrieval over various data sources (e.g. files indexed with an index server, hypertext pages reached via a WWW search engine, and data stored in a database via a DBMS). The SemanticMiner Server (SMS), a specialized OntoBroker system, provides the interface to the data sources as well as the inference engine to retrieve and present implicit knowledge.

Web page: http://www.ontoprise.de/content/e3/e24/index_eng.html

SEMANTIC ANNOTATION TOOLS

OntoStudio (ontoprise)

OntoStudio also serves as a semantic annotation platform; its description and feature list are given under IE-Based Knowledge Management Tools above.

Web page: http://www.ontoprise.de/content/e3/e43/index_eng.html

Melita (Sheffield)

Melita is an ontology-based text annotation tool. It implements a methodology intended to manage the whole annotation process for the user. Several steps in the process that are currently performed manually can easily be automated and handled by the annotation system.

Melita aims to gradually change the role of the user from annotator to supervisor. The system is pro-active in the sense that it takes the initiative to perform any pre-processing that will be needed later. Melita demonstrates how it is possible to actively interact with the IE system in order to meet the requirements of timeliness and tunable intrusiveness. Timeliness refers to the time lag between the moment annotations are inserted by the user and the moment they are learnt by the Information Extraction system; normally this happens sequentially, in a batch. Melita implements intelligent scheduling in order to keep timeliness to a minimum, practically eliminating the learning lag without increasing intrusiveness. Intrusiveness refers to the suggestions which the IE system gives to the user in order to help reduce the burden of annotating tags.

Web page: http://nlp.shef.ac.uk/melita

MnM (Open University)

MnM is a semantic annotation tool which provides manual, automated and semi-automated support for annotating web pages with machine-interpretable descriptions. MnM provides knowledge acquisition forms that allow a user to mark up web resources by instantiating generic concepts from a standardized terminology (an ontology) for a particular domain. It allows the user to choose ontologies from a variety of sources and representation languages, including RDF, DAML+OIL and OCML. All of this can be achieved in a familiar web browser environment.

In order to automate the semantic annotation process MnM has a plug-in mechanism which provides access to information extraction systems, such as Amilcare. Thus MnM reduces the effort required to annotate large web resources. MnM is freely available for non-commercial use.

Web page: http://kmi.open.ac.uk/projects/MnM/

OntoMat Annotizer (AIFB)

OntoMat-Annotizer is a user-friendly interactive webpage annotation tool. It supports the user in the task of creating and maintaining ontology-based OWL markup, i.e. creating OWL instances, attributes and relationships. It includes an ontology browser for exploring the ontology and instances, and an HTML browser that displays the annotated parts of the text. It is Java-based and provides a plugin interface for extensions. The intended user is the individual annotator, i.e. people who want to enrich their web pages with OWL metadata. OntoMat allows the annotator to highlight relevant parts of the web page and create new instances via drag-and-drop interactions. It supports the metadata creation phase of the lifecycle. It contains information extraction plugins (Amilcare, TIES) that offer a wizard suggesting which parts of the text are relevant for annotation, easing the time-consuming annotation task.

Web page: http://annotation.semanticweb.org/tools/ontomat

created and maintained by José Iria