Information Access

Building applications to improve access to information in massive text collections, such as the web, newswires and the scientific literature. Subtopics include:
Information Extraction, Text Mining and Semantic Annotation
Information extraction (IE) refers to the activity of automatically recognizing pre-specified sorts of information in short, natural language texts. For instance, one might scan business newswire texts for announcements of management succession events (retirements, appointments, promotions, etc.), identify the names of the participating companies and individuals, the post involved, the vacancy reason, and so on. Or, one might scan biomedical research papers, identify the names of proteins and determine which proteins are engaged in interactions with which other proteins.
Once identified in texts, the specified information may be utilised in various ways. The information may be annotated in the source texts -- so-called semantic annotation -- and used as the basis for semantic search, i.e. for the Semantic Web. Or the information may be extracted from the source texts and stored in a separate structured information repository or database. This structured database may then be used for searching or linking using conventional database queries or analysis using data-mining techniques -- potentially leading to the discovery of novel associations, i.e. text mining. Or the extracted information can be used for generating summaries focused on the extraction targets.
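As a minimal illustration of the kind of extraction described above — purely a sketch, not one of the group's actual systems — the following hypothetical Python fragment scans text for simple protein-interaction statements and emits structured records of the sort that could populate a database. The pattern and protein names are simplified assumptions; real IE systems use far richer linguistic analysis.

```python
import re

# Hypothetical, simplified interaction pattern: "<Protein> <verb phrase> <Protein>".
# Real protein name recognition is much harder than this capitalised-token heuristic.
INTERACTION = re.compile(
    r"(?P<a>[A-Z][A-Za-z0-9-]+) (?:binds|interacts with|phosphorylates) "
    r"(?P<b>[A-Z][A-Za-z0-9-]+)"
)

def extract_interactions(text):
    """Return (protein_a, protein_b) records found in the text."""
    return [(m.group("a"), m.group("b")) for m in INTERACTION.finditer(text)]

# The extracted tuples form a small structured repository that could be
# queried or mined, as the paragraph above describes.
records = extract_interactions(
    "Our assay shows that BRCA1 interacts with BARD1, "
    "while CDK2 phosphorylates RB1 in vitro."
)
print(records)  # [('BRCA1', 'BARD1'), ('CDK2', 'RB1')]
```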
Our contribution   The NLP group has worked intensively on IE-related topics since its inception. The group has produced a wide range of IE systems and components, some of which are freely available with the GATE platform, and has embedded them in prototype applications. In these systems we have investigated techniques ranging from relatively deep, linguistically-motivated knowledge engineering approaches, including full parsing and discourse interpretation using models of domain and world knowledge, to supervised and semi-supervised machine learning approaches, such as support vector machines, that exploit labelled corpora. The group has worked in a variety of domains and application areas, including newswire analysis for competitor intelligence, biomedical research paper analysis to support scientific research, and clinical records analysis to support clinical research and patient care.
Question Answering

Open domain question answering (QA) systems aim to support a user who wishes to ask a specific question in natural language and receive a specific answer to that question, where the answer is to be sought in a (potentially huge) collection of natural language texts. QA has become an important application area of natural language processing technologies in the past few years, stimulated by the TREC QA track and more recently the Text Analysis Conferences (TAC).
Rapid advances have been made in developing systems that can answer specific questions, but increasingly it is becoming clear that the questions of most information seekers are not simply of the pub quiz variety (e.g. When was the telephone invented?), but rather questions where the asker seeks a brief summary or synopsis of facts relating to the question. For example, a question such as Who was Tertullian? cannot be answered in one or two words, but requires a number of related facts, giving, for example, Tertullian's nationality, birth date and place, his major achievements, etc. This information may be distributed across multiple documents, many of which will repeat each other in various ways. Thus, QA quite naturally relates both to information extraction and to multi-document summarization.
Our contribution   The NLP group has developed several QA systems to investigate how useful varying amounts of linguistic knowledge are in QA. We have been regular participants in the TREC QA evaluations and have made contributions to the literature on evaluation of QA systems. We have had a specific interest in investigating the role that information retrieval systems have as a first but critical stage in most QA systems -- retrieving a small set of candidate answer-bearing documents to be intensively analyzed by a second stage answer extraction component -- and have run several workshops on this topic.
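The two-stage architecture mentioned above — a retrieval stage that narrows the collection, followed by an answer extraction stage — can be sketched very crudely as follows. This is a hypothetical toy, not one of the group's TREC systems: the retriever ranks documents by keyword overlap, and the extractor simply looks for a year when the question begins with "When".

```python
import re

DOCS = [  # toy stand-in for a potentially huge collection
    "Alexander Graham Bell invented the telephone in 1876.",
    "The light bulb was commercialised by Thomas Edison.",
    "Bell also founded the Bell Telephone Company.",
]

def retrieve(question, docs, k=2):
    """Stage 1: rank documents by keyword overlap with the question."""
    q_terms = set(re.findall(r"\w+", question.lower()))
    return sorted(
        docs,
        key=lambda d: -len(q_terms & set(re.findall(r"\w+", d.lower()))),
    )[:k]

def extract_answer(question, docs):
    """Stage 2: for a 'When ...' question, look for a year in the candidates."""
    if question.lower().startswith("when"):
        for d in docs:
            m = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", d)
            if m:
                return m.group(1)
    return None

candidates = retrieve("When was the telephone invented?", DOCS)
print(extract_answer("When was the telephone invented?", candidates))  # 1876
```

The sketch makes the dependency visible: if stage 1 fails to return an answer-bearing document, stage 2 cannot recover, which is why the retrieval stage is the critical first step noted above.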
People

Rob Gaizauskas, Mark Greenwood, Mark Hepple, Yorick Wilks
Summarization

Summarization systems aim to take one or more documents and produce from them a reduced document which contains essential information from the source documents. Of course, what constitutes essential information is relative to the goals of the user who wishes to use the summary as a surrogate for the original(s), and varying types of summaries may be produced to meet the needs of various users, whose information needs may be represented in various ways.
Summarization is a natural counterpoint to QA, as both are technologies which aim to support information seekers in finding relevant information as efficiently and effectively as possible.
Our contribution   The NLP group has explored various approaches to single and multidocument summarization, including abstractive as well as extractive approaches. The group has participated in the Document Understanding Conference (DUC) and Text Analysis Conference (TAC) summarization system evaluations. The group distributes the SUMMA toolkit, a robust and customisable toolkit for experimenting with various approaches to single and multidocument summarization. We have contributed to the literature on the evaluation of summarization systems and to the construction of resources for evaluation. We have explored techniques for topic-focussed summarization and pioneered work on multidocument summarization for image captioning.
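To make the extractive approach mentioned above concrete, here is a minimal frequency-based extractive summariser — a hypothetical sketch, unrelated to SUMMA: each sentence is scored by the corpus frequency of its content words, and the top-scoring sentences are kept in their original order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "it"}

def summarise(text, n=1):
    """Keep the n highest-scoring sentences, preserving document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(
        w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS
    )
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(
            freq[w]
            for w in re.findall(r"\w+", sentences[i].lower())
            if w not in STOPWORDS
        ),
    )
    return " ".join(sentences[i] for i in sorted(scored[:n]))

text = ("The telephone changed communication. "
        "Communication over long distances became instant. "
        "Cats are unrelated to this topic.")
print(summarise(text, n=1))  # Communication over long distances became instant.
```

Extending such a scheme to the multidocument setting raises exactly the redundancy problem noted in the QA discussion above: sentences repeated across documents must be recognised and collapsed rather than selected twice.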
People

Kalina Bontcheva, Rob Gaizauskas,