|

|
 |
Background:
A GO Annotation Service for Biomedical Literature |
|
|
The project aims to develop a unified real-time e-Science text-mining
infrastructure that leverages the technologies and methods developed
by both DiscoveryNet and myGrid.
To date, both projects have developed complementary methods that
enable the analysis and mining of information extracted from biomedical
text data sources using grid infrastructures, and both projects
have developed methods for using such information to support e-Scientists
in their research. More specifically, the following techniques have been
developed within both the DiscoveryNet and myGrid projects:
1. Natural Language Processing and Information Extraction techniques
to identify the key terms and entities appearing within retrieved
documents (genes, proteins, diseases, etc.), as well as techniques
for biomedical terminology and ontology management;
2. Statistical text and data mining techniques including automatic
document categorization techniques that can assign documents to
predefined categories based on the key entities extracted from
the documents;
3. Grid computing techniques for data access and integration,
and for steering the required computational processing.
We aim to use existing text-mining technology developed within the
Discovery Net and myGrid projects to build an automated service
that allows users to retrieve biomedical articles from public repositories
and, in real-time, automatically attach semantic descriptors (from
a user-supplied ontology) to both individual articles and to article
sets.
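To make the idea of attaching descriptors to individual articles and to article sets concrete, here is a minimal sketch. It assumes the simplest possible approach, plain dictionary lookup of ontology synonyms, which stands in for (and is far cruder than) the information extraction techniques listed above. The `ONTOLOGY` table, the function names and the sample texts are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical user-supplied ontology: descriptor id -> synonyms (toy data).
ONTOLOGY = {
    "GO:0006915": ["apoptosis", "apoptotic process"],
    "GO:0006281": ["dna repair"],
}


def annotate_article(text, ontology):
    """Return the sorted descriptor ids whose synonyms occur in the text."""
    lowered = text.lower()
    found = set()
    for descriptor, synonyms in ontology.items():
        for synonym in synonyms:
            # Word-boundary match so a synonym does not fire inside a longer token.
            if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
                found.add(descriptor)
                break
    return sorted(found)


def annotate_set(texts, ontology):
    """Count how many articles in a set carry each descriptor."""
    counts = Counter()
    for text in texts:
        counts.update(annotate_article(text, ontology))
    return counts


if __name__ == "__main__":
    articles = [
        "Caspase activation drives apoptosis in tumour cells.",
        "BRCA1 is required for DNA repair and limits apoptosis.",
    ]
    for article in articles:
        print(annotate_article(article, ONTOLOGY))
    print(annotate_set(articles, ONTOLOGY))
```

The per-set counts are one possible form of a set-level descriptor; a real service would presumably also weigh how specific each term is within the ontology.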
|
 |
Application:
A GO Annotation Service for Biomedical Literature |
|
|
A major challenge facing scientific researchers today is that of summarizing,
analyzing and extracting useful information from the scientific
knowledge already published by their peers. This knowledge is available
in the scientific literature, which is mainly published as unstructured
or free text. Content providers sometimes use controlled
vocabularies and ontologies, such as MeSH
(Medical Subject Headings) for Medline
abstracts, to annotate articles, which helps users identify some of
the key concepts appearing in them. Although annotations based on
such controlled vocabularies and ontologies can greatly boost the
productivity of individual scientists, producing them manually has some major
drawbacks:
- It is a time-consuming process, resulting in a considerable time
lag (several years) between the publication of a document and
the availability of the associated annotations;
- The annotations themselves may need revision as new knowledge is
accumulated within the application domain;
- The annotations are typically restricted to one controlled vocabulary
used in one domain and do not include annotations or assignments
to other ontologies.
With the expertise
and tools we developed for both myGrid
and DiscoveryNet,
we believe we can demonstrate an application that automatically
assigns identifiers from the Gene
Ontology (GO) to biomedical documents in real time.
Such a GO annotator will be based on analyzing features that can
already be extracted automatically from a document, such as known
entities (genes, proteins, disease names) and their relationships,
and will be trained using data from existing manually curated resources.
Such a system would provide an invaluable resource to users who
a) issue a query to Medline and then want to see the results grouped
according to GO classification, or
b) derive a set of texts from the output of a microarray experiment
and then wish to use the GO annotator to group and display the
texts according to GO annotations.
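As a rough illustration of an annotator trained on curated data and then used to group documents, the sketch below assumes a multinomial naive Bayes classifier over extracted entity names. This is an assumption made for the example, not the method the project commits to. It also assigns a single GO identifier per document, whereas a real annotator would normally assign several. The `CURATED` examples, document ids and function names are hypothetical toy data.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a manually curated resource: (extracted entities, GO id).
CURATED = [
    (["TP53", "CASP3"], "GO:0006915"),   # apoptotic process
    (["CASP3", "BAX"], "GO:0006915"),
    (["BRCA1", "RAD51"], "GO:0006281"),  # DNA repair
    (["RAD51", "XRCC1"], "GO:0006281"),
]


def train(examples):
    """Collect class and per-class entity counts from curated examples."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for entities, go_id in examples:
        class_counts[go_id] += 1
        feature_counts[go_id].update(entities)
        vocabulary.update(entities)
    return class_counts, feature_counts, vocabulary


def predict(entities, model):
    """Return the GO id with the highest naive Bayes log-score."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_id, best_score = None, -math.inf
    for go_id in sorted(class_counts):
        n_features = sum(feature_counts[go_id].values())
        score = math.log(class_counts[go_id] / total)
        for entity in entities:
            if entity not in vocabulary:
                continue  # ignore entities never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (feature_counts[go_id][entity] + 1) / (n_features + len(vocabulary))
            )
        if score > best_score:
            best_id, best_score = go_id, score
    return best_id


def group_by_go(documents, model):
    """Group document ids by their predicted GO id."""
    groups = defaultdict(list)
    for doc_id, entities in documents.items():
        groups[predict(entities, model)].append(doc_id)
    return dict(groups)


if __name__ == "__main__":
    model = train(CURATED)
    results = {"doc1": ["CASP3", "BAX"], "doc2": ["RAD51", "BRCA1"], "doc3": ["TP53"]}
    print(group_by_go(results, model))
```

The same `group_by_go` step would serve both usage scenarios: the input documents could come from a Medline query or from texts linked to a microarray experiment.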
Because the approach is generic, accomplishing this first instance,
the GO annotation service, will also demonstrate the feasibility
of generalizing it to other ontologies.
|
 |
System Architecture |
|

|
|