Ontologies are a key component of both the Semantic Web and
Knowledge Management (KM). In the Semantic Web, they will provide
a machine-interpretable knowledge infrastructure for a large
variety of applications, including personal agents and B2B
systems. In Knowledge Management, an
ontology acts as a representation of an organization's world view,
as a `corporate memory' and as a tool for encoding corporate
experience and knowledge. An ontology is, in essence, the relevant
Knowledge Representation, a notion always assumed by Artificial
Intelligence (AI) applications. While much has been written on
their application and use, the real challenge lies in constructing
them and keeping them up to date. A third of all major UK firms
are believed to have some form of ontology as the basis of their
corporate KM, yet these are constructed and maintained by hand,
and their structure and function are almost certainly not well
understood by their designers and users. Building ontologies is
labour intensive, error prone, and much like lexicography: as soon
as the product is ready, it is out of date. As a result,
ontology-associated costs are very high.

This research proposal is a contribution to efforts to automate
the ontology building process. Our starting point is that,
although ontologies attempt to represent the knowledge present in
people's minds, the only easy access we have to what people think,
at least on a large scale, is through the texts they produce. A
number of authors (including ourselves) have attempted to build
ontologies from texts and, in doing so, have rediscovered a
problem known since the earliest days of AI: the difficulty lies
not so much in coding detailed technical knowledge as in
extracting and representing the largely implicit knowledge that
underpins our understanding of the world, what McCarthy called
"common sense" knowledge, to distinguish it from expertise. This
knowledge is not normally written down in texts (e.g. that objects
pushed off an edge fall, or that a phone is held to the mouth and
ear, etc.) but is assumed by anyone who reads them.

Ontologies are "shared conceptualisations" and as
such represent beliefs common to a particular community of
practice. An ontology, in order to be useful, must go through a
continuous cycle of definition, discussion/evaluation, and
revision. This research will contribute to steps in that
development cycle, and will provide components for larger, more
complex ontology development environments.

Although there are now established ontology representation
languages (e.g. DAML, OWL, etc.), there remains a
dispute about the fundamental basis of ontological relations and
how they are to be interpreted.
We do not want to enter into these issues here, but would argue
that the way forward for such essentially philosophical disputes
is to develop a defensible, replicable and empirical strategy for
the construction and evaluation of ontologies, whose products will
then, hopefully, meet some equally objective functional
requirement. It is, however, the former of these goals that is our
motive for this proposal.

In the research proposed here, we focus on two complementary
computational tasks. First, we intend to use machine learning and
adaptive information extraction methods to automate the
recognition in texts of explicit environments where ontological
information is expressed. We wish to build on ideas proposed in
earlier work and explore how effective it is to train a system on
an existing, established and effective scientific ontology. The
system would learn how and where in corpora the terms of that
ontology occur in explicit textual relations, so that the
relations binding the terms can themselves be learned. The system
will thus be trained to recognise a set of lexico-syntactic
contexts that reflect a particular ontological relationship
between terms. These might include, for example:
| Relation | Pattern (lexico-syntactic) | Example |
| HYPERONYMY | such NP as {NP, }* {(or|and)} NP | such cars as the Mercedes C-Class, the Lexus ES 300 |
| | NP, NP*, and other NP | Ferrari, Honda, McLaren, Porsche, and other cars |
| MERONYMY | NP's NP, * | car's cooling system / car's gas tank / etc. |
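
The patterns in the table can be tried out directly on raw text.
The following is a minimal sketch in Python, not the adaptive
extraction system proposed here: it hand-codes the three patterns
as regular expressions and approximates a noun phrase (NP) by a
run of words, where a real system would use a chunker or parser.
The function names and the (relation, broader term, narrower term)
tuple layout are illustrative assumptions.

```python
import re

# Crude NP handling (an assumption of this sketch): strip a leading
# determiner and split coordinated lists on commas and "and"/"or".
DETERMINER = re.compile(r"^(?:the|a|an)\s+", re.IGNORECASE)
LIST_SEP = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")

# "such NP as NP, NP ... (or|and) NP"
SUCH_AS = re.compile(r"\bsuch\s+(\w+)\s+as\s+([^.;]+)")
# "NP, NP, ... and other NP"
AND_OTHER = re.compile(r"([^.;]+?),?\s+(?:and|or)\s+other\s+(\w+)")
# "NP's NP": the part is approximated by one or two following words
POSSESSIVE = re.compile(r"\b(\w+)'s\s+(\w+(?:\s+\w+)?)")


def split_np_list(text):
    """Split 'the A, the B and C' into ['A', 'B', 'C']."""
    parts = [DETERMINER.sub("", p.strip()) for p in LIST_SEP.split(text)]
    return [p for p in parts if p]


def extract_relations(sentence):
    """Return (relation, broader_or_whole, narrower_or_part) tuples."""
    relations = []
    for match in SUCH_AS.finditer(sentence):
        for hyponym in split_np_list(match.group(2)):
            relations.append(("HYPERONYMY", match.group(1), hyponym))
    for match in AND_OTHER.finditer(sentence):
        for hyponym in split_np_list(match.group(1)):
            relations.append(("HYPERONYMY", match.group(2), hyponym))
    for match in POSSESSIVE.finditer(sentence):
        relations.append(("MERONYMY", match.group(1), match.group(2)))
    return relations


if __name__ == "__main__":
    for text in (
        "such cars as the Mercedes C-Class, the Lexus ES 300",
        "Ferrari, Honda, McLaren, Porsche, and other cars",
        "car's cooling system",
    ):
        print(extract_relations(text))
```

Because everything before "and other" is treated as the list of
narrower terms, the sketch over-generates on full sentences; that
is one reason learned, validated contexts are preferable to an
intuitive pattern list.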
The intention is to produce a set of pairs, each linking an
ontological relation to a lexico-syntactic environment, which can
be used in the development of new ontologies in a variety of
domains: the derived pairs are then used to search corpora for
ontological relations in a new, untried domain. We aim to
determine what, and how much, can be obtained from a corpus of
texts in this way. This proposal represents a significant advance
on the standard approach, which is to draw up an intuitive list of
surface contexts that reflect ontological knowledge and attempt to
build an ontology by searching texts with them.
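
As a rough illustration of how relation/environment pairs might be
derived rather than listed by intuition, the sketch below takes
term pairs already related in a seed ontology, finds sentences in
which both terms occur, and counts the words that link them. It is
a deliberately simple stand-in for the machine learning and
adaptive information extraction methods proposed here; the function
name, the `max_gap` limit and the BROADER/NARROWER placeholders are
assumptions made for the example.

```python
import re
from collections import Counter


def harvest_contexts(sentences, known_pairs, max_gap=5):
    """Count linking contexts for term pairs taken from a seed ontology.

    known_pairs holds (relation, broader_term, narrower_term) triples.
    Matching is case-insensitive and purely string-based.
    """
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for relation, broader, narrower in known_pairs:
            broader, narrower = broader.lower(), narrower.lower()
            # Try both orders in which the two terms may appear.
            for first, second, template in (
                (broader, narrower, "BROADER {} NARROWER"),
                (narrower, broader, "NARROWER {} BROADER"),
            ):
                pattern = re.compile(
                    r"\b" + re.escape(first) + r"\b(.*?)\b"
                    + re.escape(second) + r"\b"
                )
                match = pattern.search(lowered)
                if not match:
                    continue
                gap = match.group(1).strip()
                # Keep only short, non-empty linking contexts.
                if 0 < len(gap.split()) <= max_gap:
                    counts[(relation, template.format(gap))] += 1
    return counts


if __name__ == "__main__":
    corpus = [
        "Cars such as Ferrari are fast.",
        "Trucks such as Volvo are heavy.",
        "Ferrari and other cars were on display.",
    ]
    seed = [("HYPERONYMY", "cars", "ferrari"),
            ("HYPERONYMY", "trucks", "volvo")]
    for context, n in harvest_contexts(corpus, seed).most_common():
        print(n, context)
```

Contexts that recur across many known pairs are candidates for the
relation/environment pairs described above; in the proposed
scenario the ontology developer would still validate them.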
Finding knowledge that is explicitly and systematically expressed
in a computationally extractable manner is hard. In a
subject-specific (core) corpus, much knowledge will be taken for
granted and therefore implicit. Solving this problem is our second
research task. One approach we plan to investigate is to use
other texts, external to the core corpus, to find explicit
statements of knowledge that the core corpus leaves unstated. An
obvious resource is tutorial texts such as encyclopedias, manuals
and textbooks. These are the types of texts where expert knowledge
is usually made most explicit, together with knowledge that is
usually left implicit. Using the Internet for this
purpose raises a number of challenges including text
categorisation, identification of co-referents and the evaluation
of the trustworthiness of a source. Nonetheless, the expectation
is that by integrating a number of knowledge sources, the ontology
learning process can be largely, if not completely, automated.

The Scenario

For the Semantic Web to function there needs to be a rapidly
growing number of domain specific ontologies. In Berners-Lee's
vision, and in the opinion of a number of other authorities, the
Semantic Web will show its real power in the deployment of intelligent
personal software agents which will handle a variety of tasks for
people. For this to become reality, each software agent must have
an internal representation of the world, or its relevant subset of
the world, in other words an ontology. Thus the physiotherapists'
agents in Berners-Lee's example will have a representation of
knowledge relevant to physiotherapy, patients, appointments etc.

We propose a scenario where such an ontology can be built very
rapidly using as input a set of relevant documents, whether in
free or structured text, and minimising human input as much as
possible. We exploit the advantages of using a variety of strong,
tested extraction packages, capitalising on their ability to: a)
analyse large quantities of texts at high speeds; b) find
regularities and identify all occurrences of a given regularity;
c) cluster words and other patterns into groups; d) establish that
a relationship exists between any given term x and another term y;
and e) access external resources to provide supplementary ontological
knowledge. The ontology developer's role is focussed on the
minimum necessary tasks, i.e. they must i) draft an ontology, or
select or reuse an existing one, and provide this as input to the
system; ii) validate sentences which are exemplars of a particular
relation between two terms; iii) name/label a relation exemplified
in a particular sentence, and recognise further instances of such
a relation when they encounter them; iv) select or accept proposed
external resources; and v) validate or edit the final output. Thus
manual intervention is focussed where it is most effective.

This scenario can provide a much more effective and convincing
approach to knowledge acquisition for the Semantic
Web and for Knowledge Management than the dependence on manual
approaches taken by the majority of researchers in this field.

Aims and Objectives

The proposed project has the following aims:
- To provide a methodology for automatically constructing
ontologies needed by the Semantic Web and Knowledge Management
activities.
- To establish the limits of what kinds of information can be
derived automatically from corpora, what can be derived from other
resources (i.e. texts of a different genre), and what remains
dependent on manual input.
The project will address these issues by pursuing the following
objectives:
- A linguistic and machine learning approach to extraction from
textual environments which express ontologically relevant
information in an explicit manner, using existing, attested
ontologies as training material along with appropriate corpora.
- A number of algorithms will be developed and implemented
which will support the knowledge engineer in the construction of
ontologies by maximally automating the overall task. These
algorithms are intended to form one or more components or plugins
to an ontology development environment such as KAON. The
algorithms will focus especially on the following functionalities:
  - Using Sheffield's approach to adaptive information extraction
  both to hypothesise the existence of specific relations between
  the terms and to label those relations.
  - Developing techniques to aid the knowledge engineer in
  identifying relevant sources of ontological knowledge external to
  the domain corpus, and permit the integration of externally
  derived data into the Knowledge Acquisition process.
- The construction of data sets that permit the training, testing
and evaluation of the ontology learning methodology. The data sets
will contain corpora and corresponding ontologies, constructed
both manually and automatically using base-line methods.
- Techniques will be developed for evaluating the adequacy of a
given ontology for the representation of the knowledge associated
with a domain. This is virtually uncharted territory, but in a
language processing world driven by, and to some extent created
by, evaluation, the general need for this missing component is
obvious.
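
The proposal does not fix an evaluation measure. One simple
baseline, given data sets that pair a corpus with a manually
constructed ontology, would be to compare the relations proposed
by the system against the manual ones and report precision, recall
and F1. The sketch below assumes, purely for illustration, that
both ontologies are reduced to sets of (relation, term, term)
triples; it says nothing about how well either ontology represents
the domain itself, which is the harder question raised above.

```python
def score_ontology(proposed, gold):
    """Compare proposed relation triples with a gold-standard set.

    Returns (precision, recall, f1). Exact triple matching is an
    assumption of this sketch; partial matches are not credited.
    """
    proposed, gold = set(proposed), set(gold)
    correct = proposed & gold
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {
        ("HYPERONYMY", "cars", "Ferrari"),
        ("HYPERONYMY", "cars", "Honda"),
        ("MERONYMY", "car", "cooling system"),
        ("MERONYMY", "car", "gas tank"),
    }
    proposed = {
        ("HYPERONYMY", "cars", "Ferrari"),
        ("MERONYMY", "car", "gas tank"),
        ("MERONYMY", "car", "owner"),
    }
    print(score_ontology(proposed, gold))
```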