The University of Sheffield
Natural Language Processing Group

Information Access

Building applications to improve access to information in massive text collections, such as the web, newswires and the scientific literature. Subtopics include:

Information Extraction, Text Mining and Semantic Annotation

Information extraction (IE) refers to the activity of automatically recognizing pre-specified sorts of information in short, natural language texts. For instance, one might scan business newswire texts for announcements of management succession events (retirements, appointments, promotions, etc.), identify the names of the participating companies and individuals, the post involved, the vacancy reason, and so on. Or, one might scan biomedical research papers, identify the names of proteins and determine which proteins are engaged in interactions with which other proteins.

Once identified in texts, the specified information may be utilised in various ways. The information may be annotated in the source texts -- so-called semantic annotation -- and used as the basis for semantic search, e.g. for the Semantic Web. Alternatively, the information may be extracted from the source texts and stored in a separate structured information repository or database. This structured database may then be searched or linked using conventional database queries, or analysed using data-mining techniques -- potentially leading to the discovery of novel associations, i.e. text mining. Finally, the extracted information can be used to generate summaries focused on the extraction targets.

Our contribution   The NLP group has worked intensively on IE-related topics since its inception. The group has produced a wide range of IE systems and components, some of which are freely available with the GATE platform, and has embedded them in prototype applications. In these systems we have investigated techniques ranging from relatively deep, linguistically-motivated knowledge engineering approaches, including full parsing and discourse interpretation using models of domain and world knowledge, to supervised and semi-supervised machine learning approaches, such as support vector machines, that exploit labelled corpora. The group has worked in a variety of domains and application areas, including newswire analysis for competitor intelligence, biomedical research paper analysis to support scientific research, and clinical records analysis to support clinical research and patient care.

People

Kalina Bontcheva, Hamish Cunningham, Rob Gaizauskas, Mark Hepple, Diana Maynard, Lucia Specia, Mark Stevenson, Yorick Wilks

Projects

Current and Recent
Past

Question Answering

Open domain question answering (QA) systems aim to support a user who wishes to ask a specific question in natural language and receive a specific answer to that question, where the answer is to be sought in a (potentially huge) collection of natural language texts. QA has become an important application area of natural language processing technologies in the past few years, stimulated by the TREC QA track and more recently by the Text Analysis Conference (TAC).

Rapid advances have been made in developing systems that can answer specific questions, but it is becoming increasingly clear that the questions of most information seekers are not simply of the pub quiz variety (e.g. When was the telephone invented?), but rather questions where the asker seeks a brief summary or synopsis of facts relating to the question. For example, a question such as Who was Tertullian? cannot be answered in one or two words, but requires a number of related facts, giving, for example, Tertullian's nationality, birth date and place, his major achievements, etc. This information may be distributed across multiple documents, many of which will repeat each other in various ways. Thus, QA relates quite naturally both to information extraction and to multi-document summarization.

Our contribution   The NLP group has developed several QA systems to investigate how useful varying amounts of linguistic knowledge are in QA. We have been regular participants in the TREC QA evaluations and have made contributions to the literature on evaluation of QA systems. We have had a specific interest in investigating the role that information retrieval systems have as a first but critical stage in most QA systems -- retrieving a small set of candidate answer-bearing documents to be intensively analyzed by a second stage answer extraction component -- and have run several workshops on this topic.
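The two-stage architecture described above -- retrieval of candidate answer-bearing documents followed by answer extraction -- can be sketched as follows. The word-overlap retriever, the year-only answer extractor, and the example documents are all simplifying assumptions for illustration; they stand in for the far more sophisticated IR and extraction components of real QA systems.

```python
import re

def retrieve(question, docs, k=2):
    """Stage 1: rank documents by simple word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def extract_answer(docs):
    """Stage 2: pull the first four-digit year from the candidate documents
    (a stand-in for a real answer extraction component)."""
    for doc in docs:
        m = re.search(r"\b(1[6-9]\d\d|20\d\d)\b", doc)
        if m:
            return m.group(0)
    return None
```

For example, given a small collection containing "The telephone was invented in 1876 by Alexander Graham Bell.", the pipeline retrieves that document for the question "When was the telephone invented?" and extracts "1876".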

People

Rob Gaizauskas, Mark Greenwood, Mark Hepple, Yorick Wilks

Projects

Current and Recent
Past

Summarization

Summarization systems aim to take one or more documents and produce from them a reduced document which contains the essential information from the source documents. Of course, what constitutes essential information is relative to the goals of the user who wishes to use the summary as a surrogate for the original(s), and varying types of summaries may be produced to meet the needs of various users, whose information needs may be represented in various ways.

Summarization is a natural counterpoint to QA, as both are technologies which aim to support information seekers in finding relevant information as efficiently and effectively as possible.

Our contribution   The NLP group has explored various approaches to single and multidocument summarization, including abstractive as well as extractive approaches. The group has participated in the Document Understanding Conference (DUC) and Text Analysis Conference (TAC) summarization system evaluations. The group distributes the SUMMA toolkit, a robust and customisable toolkit for experimenting with various approaches to single and multidocument summarization. We have contributed to the literature on the evaluation of summarization systems and to the construction of resources for evaluation. We have explored techniques for topic-focussed summarization and pioneered work on multidocument summarization for image captioning.
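A minimal extractive summariser illustrates the simplest of the approaches mentioned above: score each sentence by the frequency of its content words and keep the top-scoring sentences in document order. The stopword list and scoring scheme are illustrative assumptions; toolkits such as SUMMA implement far richer strategies.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not an official resource).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "it"}

def summarise(text, n=1):
    """Extractive summary: keep the n sentences whose content words are
    most frequent across the whole text, preserving document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n]
    return " ".join(s for s in sentences if s in top)
```

Multi-document variants of this idea must additionally detect and suppress the cross-document repetition noted in the QA section above.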

People

Kalina Bontcheva, Rob Gaizauskas

Projects

Current and Recent
Past