Real-time Text Mining for the Biomedical Literature: A Collaboration between DiscoveryNet & myGrid
Background
  The project's background and goal
Application
  A GO Annotation Service for Biomedical Literature
Architecture
  System Design


Background: A GO Annotation Service for Biomedical Literature

The project aims to develop a unified real-time e-Science text-mining infrastructure that leverages the technologies and methods developed by both DiscoveryNet and myGrid.

To date, both projects have developed complementary methods that enable the analysis and mining of information extracted from biomedical text data sources using grid infrastructures, as well as methods for using such information to support e-Scientists in their research. More specifically, the following techniques have been developed within the DiscoveryNet and myGrid projects:

1. Natural Language Processing and Information Extraction techniques to identify the key terms and entities appearing within retrieved documents (genes, proteins, diseases, etc.), as well as techniques for biomedical terminology and ontology management;
2. Statistical text and data mining techniques, including automatic document categorization techniques that can assign documents to predefined categories based on the key entities extracted from the documents;
3. Grid computing techniques for data access and integration, and for steering the required computational processing.


We aim to use the existing text mining technology developed within the DiscoveryNet and myGrid projects to build an automated service that allows users to retrieve biomedical articles from public repositories and, in real time, automatically attach semantic descriptors (from a user-supplied ontology) to both individual articles and article sets.
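
The idea of attaching descriptors to individual articles and to article sets can be illustrated with a minimal sketch. It assumes a tiny hand-made ontology that maps descriptor IDs to trigger terms; the descriptor IDs, term lists, and function names are hypothetical and are not part of the DiscoveryNet or myGrid software. Plain dictionary matching stands in here for the projects' NLP and categorization components.

```python
import re
from collections import Counter

# Hypothetical user-supplied ontology: descriptor ID -> trigger terms.
ONTOLOGY = {
    "DESC:0001": {"apoptosis", "caspase"},
    "DESC:0002": {"kinase", "phosphorylation"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def annotate_article(text, ontology=ONTOLOGY):
    """Return the descriptor IDs whose trigger terms occur in the text."""
    tokens = set(tokenize(text))
    return {desc for desc, terms in ontology.items() if tokens & terms}


def annotate_article_set(texts, ontology=ONTOLOGY):
    """Count, per descriptor, how many articles in the set carry it."""
    counts = Counter()
    for text in texts:
        counts.update(annotate_article(text, ontology))
    return counts


if __name__ == "__main__":
    articles = [
        "Caspase activation during apoptosis.",
        "A kinase that regulates apoptosis by phosphorylation.",
    ]
    for article in articles:
        print(sorted(annotate_article(article)), article)
    print(annotate_article_set(articles))
```

In this sketch the per-article result is a set of descriptors, while the article-set result is a count of how many articles share each descriptor, which is one simple way to summarise a retrieved collection.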

Application: A GO Annotation Service for Biomedical Literature

A major challenge facing scientific researchers today is that of summarising, analysing and extracting useful information from the scientific knowledge already published by their peers. This knowledge is available in the scientific literature, which is mainly published as unstructured, free text. Content providers sometimes annotate articles using controlled vocabularies and ontologies, such as MeSH (Medical Subject Headings) for Medline abstracts, which helps users identify some of the key concepts appearing in them. Although such annotations can greatly boost the productivity of individual scientists, they also suffer from some major drawbacks:

  • Manual annotation is a time-consuming process, resulting in a considerable time lag (several years) between the publication of a document and the availability of the associated annotations.
  • The annotations themselves may need revision as new knowledge accumulates within the application domain.
  • The annotations are typically restricted to one controlled vocabulary used in one domain and do not include annotations or assignments to other ontologies.

 

With the expertise and tools developed for both myGrid and DiscoveryNet, we believe we can demonstrate an application that automatically assigns Gene Ontology (GO) identifiers to biomedical documents in real time.

Such a GO annotator will be based on analyzing features that can already be extracted automatically from a document, such as known entities (genes, proteins, disease names) and their relationships, and will be trained using data from existing manually curated resources. Such a system would provide an invaluable resource to users who

a) issue a query to Medline and then want to see results grouped according to GO classification, and
b) derive a set of texts from the output of a microarray experiment and then wish to use the GO annotator to group and display the text according to GO annotations.
Because the approach is generic, accomplishing this first instance, the GO annotation service, will also demonstrate the feasibility of generalizing it to other ontologies.
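
As a rough illustration of training an annotator on curated examples and then grouping documents by the predicted GO identifier, the sketch below uses a small multinomial naive Bayes classifier over extracted entity names. This is an assumption made for illustration only: the source does not say which learning method the project uses, the class and function names are hypothetical, and the training pairs are toy data rather than records from a curated resource. The two identifiers are real GO terms (GO:0006915, apoptotic process; GO:0006281, DNA repair).

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical): entity features -> GO identifier.
CURATED = [
    (["caspase-3", "bcl-2", "p53"], "GO:0006915"),  # apoptotic process
    (["caspase-9", "bcl-2"], "GO:0006915"),
    (["brca1", "rad51", "p53"], "GO:0006281"),      # DNA repair
    (["rad51", "atm"], "GO:0006281"),
]


class NaiveBayesAnnotator:
    """Multinomial naive Bayes over entity features (illustrative stand-in)."""

    def fit(self, examples):
        # examples: iterable of (list_of_entity_features, go_id) pairs
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for features, label in examples:
            self.label_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict(self, features):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            # Laplace (add-one) smoothing over the training vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feature in features:
                if feature in self.vocab:  # skip features unseen in training
                    score += math.log((counts[feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def group_by_annotation(annotator, documents):
    """Group document IDs by their predicted GO identifier."""
    groups = defaultdict(list)
    for doc_id, features in documents.items():
        groups[annotator.predict(features)].append(doc_id)
    return dict(groups)


if __name__ == "__main__":
    annotator = NaiveBayesAnnotator().fit(CURATED)
    results = {"doc1": ["bcl-2", "caspase-3"], "doc2": ["rad51", "brca1"]}
    print(group_by_annotation(annotator, results))
```

The same grouping step would apply to both use cases above: the documents could come from a Medline query or from texts derived from a microarray experiment, as long as entity features have been extracted for each document first.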

System Architecture


©2005 NLP group, University of Sheffield