Relation Extraction using Semi-Supervised Learning Techniques

Supported by the EPSRC from 1st November 2004 to April 2007.

Investigator: Mark Stevenson
Researcher: Mark A. Greenwood


Natural Language Processing (NLP) techniques are increasingly being applied in new domains and are now expected to deal with a wide variety of text types. A grand challenge for NLP at the moment is to develop techniques which allow systems to be rapidly adapted to new domains with minimal expert intervention. A good example of this need for flexibility can be found in the field of Information Extraction (IE), where systems are now required to extract diverse forms of information from a wide variety of texts, including web pages and biomedical journal abstracts. Experience has shown that the previously favoured knowledge engineering methods for IE system development often produce systems which perform well but are extremely brittle and difficult to port, even for an expert. One way of avoiding these limitations is to use machine learning (ML) techniques to help with the application development process. This proposal aims to develop methods for using ML to create IE systems in a semi-automatic manner.

The RESuLT project can be summarised as the creation and testing of a system for extracting relational information from text. In this context "relation extraction" means the identification of entities with well-defined connections between them, for example, the names of sportspeople and the teams for which they play, or the names of companies and their locations. To make this system more usable we shall implement a semi-supervised learning algorithm which starts from a small set of example relations and generalises from them. Lexical information from WordNet will be used to guide the generalisation process in a linguistically principled way. The implemented system will be tested under at least two evaluation regimes to verify its portability; one of these has previously been used to evaluate relation extraction and will therefore allow a direct comparison of performance.
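
As an illustration only, the general idea of generalising from a small set of example relations can be sketched as a simple bootstrapping loop. This is not the project's algorithm: the corpus, the entity markup, the pattern representation and the TOY_HYPERNYMS table (a hand-written stand-in for the kind of lexical information WordNet provides) are all assumptions made for the sake of the example.

```python
import re

# Hypothetical stand-in for WordNet: maps a word to a more general class.
TOY_HYPERNYMS = {
    "plays": "COMPETE",
    "played": "COMPETE",
    "competes": "COMPETE",
    "lives": "RESIDE",
}

# Toy corpus in which entities are assumed to be already marked up.
CORPUS = [
    "<PER>Alice Smith</PER> plays for <ORG>Sheffield United</ORG>",
    "<PER>Bob Jones</PER> played for <ORG>Leeds</ORG>",
    "<PER>Carol White</PER> competes for <ORG>Hull City</ORG>",
    "<PER>Dan Brown</PER> lives near <ORG>Leeds</ORG>",
]

# Captures the person, the words between the entities, and the organisation.
SENTENCE = re.compile(r"<PER>(.+?)</PER>\s+(.+?)\s+<ORG>(.+?)</ORG>")


def generalise(context):
    """Replace each word with its toy hypernym class where one is known."""
    return tuple(TOY_HYPERNYMS.get(word, word) for word in context.split())


def bootstrap(seeds, corpus, iterations=2):
    """Alternate between learning patterns from known pairs and
    using those patterns to find new pairs."""
    relations = set(seeds)
    patterns = set()
    for _ in range(iterations):
        # Step 1: learn generalised patterns from sentences that
        # mention an already known (person, organisation) pair.
        for sentence in corpus:
            match = SENTENCE.search(sentence)
            if match and (match.group(1), match.group(3)) in relations:
                patterns.add(generalise(match.group(2)))
        # Step 2: accept new pairs from sentences whose generalised
        # context matches a learned pattern.
        for sentence in corpus:
            match = SENTENCE.search(sentence)
            if match and generalise(match.group(2)) in patterns:
                relations.add((match.group(1), match.group(3)))
    return relations, patterns


if __name__ == "__main__":
    seeds = {("Alice Smith", "Sheffield United")}
    found, learned = bootstrap(seeds, CORPUS)
    print(sorted(found))
    print(sorted(learned))
```

Starting from a single seed pair, the sketch learns one generalised pattern and uses it to pick up the "played for" and "competes for" sentences, while the "lives near" sentence is left out because its generalised context does not match. A realistic system would also need to score patterns and guard against the errors that accumulate over iterations, which this sketch ignores.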


The main aims of this project are to advance the state of the art in the extraction of relational information from text and to apply this technology to a number of scenarios. The project will pursue these aims through the following objectives: