Abraxas


Ontologies are a key component of both the Semantic Web and Knowledge Management (KM). In the Semantic Web, they will provide a machine-interpretable knowledge infrastructure for a large variety of applications, including personal agents and B2B systems. In Knowledge Management, an ontology acts as a representation of an organization's world view, as a 'corporate memory' and as a tool for encoding corporate experience and knowledge. An ontology is, in essence, the relevant Knowledge Representation, a notion always assumed by Artificial Intelligence (AI) applications. While much has been written on their application and use, the real challenge lies in constructing them and keeping them up to date. A third of all major UK firms are believed to have some form of ontology as the basis of their corporate KM, yet these are hand-constructed and maintained, and their structure and function are almost certainly not well understood by their designers and users. Building ontologies is labour intensive, error prone, and much like lexicography: as soon as the product is ready, it is out of date. All this means that ontology-associated costs are very high.

This research proposal is a contribution to efforts to automate the ontology building process. Our starting point is that, although ontologies attempt to represent the knowledge present in people's minds, the only easy access we have to what people think, at least on a large scale, is through the texts they produce. A number of authors (including ourselves) have attempted to build ontologies from texts and, in doing so, have rediscovered a problem known since the earliest days of AI: the difficulty lies not so much in coding detailed technical knowledge as in extracting and representing the largely implicit knowledge that underpins our understanding of the world, what McCarthy called "common sense" knowledge, to distinguish it from expertise. It is this knowledge that is not normally written down in texts (e.g. that objects pushed off an edge fall, or that a phone is held to the mouth and ear) but is assumed for the purpose of their understanding.
Ontologies are "shared conceptualisations" and as such represent beliefs common to a particular community of practice. An ontology, in order to be useful, must go through a continuous cycle of definition, discussion/evaluation, and revision. This research will contribute to steps in that development cycle, and will contribute components to larger more complex ontology development environments. Although there are now established ontology representational languages (e.g. DAML, OWL, etc.), there remains a dispute about the fundamental basis of ontological relations and how they are to be interpreted. We do not want to enter into these issues here, but would argue that the way forward for such essentially philosophical disputes is to develop a defensible, replicable and empirical strategy for the construction and evaluation of ontologies, whose products will then, hopefully, meet some equally objective functional requirement. It is, however, the former of these goals that is our motive for this proposal.

In the research proposed here, we focus on two complementary computational tasks. First, we intend to use machine learning and adaptive information extraction methods to automate the recognition in texts of explicit environments where ontological information is expressed. We wish to build on the ideas first proposed in to explore how effective it is to train a system on an existing, established and effective scientific ontology, learning how and where in corpora the ontology's terms occur in explicit text relations, so that the text relations that bind those terms can themselves be learned. The system will thus be trained to recognise a set of lexico-syntactic contexts that reflect a particular ontological relationship between terms. These might include, for example:
Relation    Pattern (lexico-syntactic)         Example
HYPONYMY    such NP as {NP,}* {(or|and)} NP    such cars as the Mercedes C-Class, the Lexus ES 300
HYPONYMY    NP {, NP}* {,} and other NP        Ferrari, Honda, McLaren, Porsche, and other cars
MERONYMY    NP's NP                            car's cooling system / car's gas tank / etc.
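As an illustration (not part of the proposal's implementation), patterns like those above can be approximated with regular expressions. The NP stand-in, the pattern set and the function name below are our own simplifications; a real system would match parsed noun phrases rather than raw word sequences:

```python
import re

# Crude stand-in for a noun phrase: 1-4 words, each starting with a letter.
# Illustrative only; a real system would use a parser's NP chunks.
NP = r"(?:[A-Za-z][\w-]*(?:\s+[A-Za-z][\w-]*){0,3})"

PATTERNS = [
    # "such NP as NP, NP ... (and|or) NP"  -> hyponymy
    ("HYPONYMY", re.compile(
        rf"such\s+(?P<hyper>{NP})\s+as\s+"
        rf"(?P<hypos>{NP}(?:\s*,\s*{NP})*(?:\s*,?\s*(?:and|or)\s+{NP})?)")),
    # "NP, NP, ... and other NP"           -> hyponymy
    ("HYPONYMY", re.compile(
        rf"(?P<hypos>{NP}(?:\s*,\s*{NP})*)\s*,?\s*and\s+other\s+(?P<hyper>{NP})")),
    # "NP's NP"                            -> meronymy
    ("MERONYMY", re.compile(rf"(?P<whole>{NP})'s\s+(?P<part>{NP})")),
]

def extract_relations(sentence):
    """Return (relation, match-groups) pairs for every pattern that fires."""
    hits = []
    for relation, pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            hits.append((relation, m.groupdict()))
    return hits
```

For instance, `extract_relations("such cars as the Mercedes C-Class, the Lexus ES 300")` labels `cars` as the hypernym, while the possessive pattern picks out part-whole candidates like `car's cooling system`.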

The intention is to produce a set of ontological relation + lexico-syntactic environment pairs which can be used in the development of new ontologies in a variety of domains, by then using these derived pairs to search corpora for ontological relations in a new, untried domain. We aim to determine what, and how much, can be obtained from a corpus of texts in this way. This proposal represents a significant advance on the standard approach, which is to draw up an intuitive list of surface contexts reflecting ontological knowledge and attempt to build an ontology by searching texts with them.
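The training step can be sketched as follows. Assuming a seed ontology that supplies term pairs known to stand in some relation, the system scans a corpus for sentences containing both terms of a pair and records the connecting word sequence as a candidate lexico-syntactic pattern. All names and the length threshold here are illustrative:

```python
from collections import Counter

def candidate_contexts(sentences, known_pairs):
    """For each term pair known (e.g. from a seed ontology) to stand in some
    relation, record the text linking the two terms wherever both occur in a
    sentence; frequent contexts become candidate extraction patterns."""
    contexts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for t1, t2 in known_pairs:
            i, j = lowered.find(t1.lower()), lowered.find(t2.lower())
            if i == -1 or j == -1 or i == j:
                continue
            if i > j:
                (i, t1), (j, t2) = (j, t2), (i, t1)    # order by position
            between = sentence[i + len(t1):j].strip()
            if 0 < len(between) <= 40:                 # skip empty/long gaps
                contexts[between] += 1
    return contexts

corpus = ["A dog is a kind of mammal.", "The cat is a kind of mammal too."]
pairs = [("dog", "mammal"), ("cat", "mammal")]
# candidate_contexts(corpus, pairs) counts "is a kind of" twice, making it
# the strongest candidate pattern for the hyponymy relation.
```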

Finding knowledge that is explicitly and systematically expressed in a computationally extractable manner is hard. In a subject-specific (core) corpus, much knowledge will be taken for granted and therefore left implicit. Solving this problem is our second research task. One approach we plan to investigate is to use other texts, external to the core corpus, to obtain explicit expressions of knowledge that is normally left unexpressed. An obvious resource is tutorial texts such as encyclopedias, manuals and textbooks: these are the types of texts where expert knowledge is usually made most explicit, together with knowledge usually left implicit elsewhere. Using the Internet for this purpose raises a number of challenges, including text categorisation, identification of co-referents and the evaluation of the trustworthiness of a source. Nonetheless, the expectation is that by integrating a number of knowledge sources, the ontology learning process can be largely, if not completely, automated.

The Scenario

For the Semantic Web to function there needs to be a rapidly growing number of domain specific ontologies. In Berners-Lee's vision and in the opinion of a number of other authorities, the Semantic Web will show its real power in the deployment of intelligent personal software agents which will handle a variety of tasks for people. For this to become reality, each software agent must have an internal representation of the world, or its relevant subset of the world, in other words an ontology. Thus the physiotherapists' agents in Berners-Lee's example will have a representation of knowledge relevant to physiotherapy, patients, appointments etc.
We propose a scenario where such an ontology can be built very rapidly, using as input a set of relevant documents, whether in free or structured text, and minimising human input as much as possible. We exploit the advantages of a variety of strong, tested extraction packages, capitalising on their ability to: a) analyse large quantities of text at high speed; b) find regularities and identify all occurrences of a given regularity; c) cluster words and other patterns into groups; d) establish that a relationship exists between any given term x and another term y; and e) access external resources to provide supplementary ontological knowledge. The ontology developer's role is focussed on the minimum necessary tasks, i.e. they must i) draft an ontology, or select or reuse an existing one, and provide this as input to the system; ii) validate sentences which are exemplars of a particular relation between two terms; iii) name/label a relation exemplified in a particular sentence, and recognise further instances of such a relation when they encounter them; iv) select or accept proposed external resources; and v) validate or edit the final output. Thus manual intervention is focussed where it is most effective.
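The division of labour in this scenario might be sketched as below: the system proposes co-occurring term pairs from the documents, while the developer's steps ii) and iii) are reduced to validate/label callbacks. This is a hypothetical skeleton of our own, not the proposed architecture:

```python
def find_cooccurrences(sentences, terms):
    """Yield (sentence, term1, term2) wherever two known terms co-occur."""
    for s in sentences:
        present = sorted(t for t in terms if t in s.lower())
        for i, t1 in enumerate(present):
            for t2 in present[i + 1:]:
                yield s, t1, t2

def build_ontology(seed_terms, sentences, validate, label):
    """validate(sentence, t1, t2) -> bool and label(sentence) -> str stand in
    for the ontology developer (steps ii and iii); the rest is automatic."""
    relations = {}
    for sentence, t1, t2 in find_cooccurrences(sentences, seed_terms):
        if validate(sentence, t1, t2):                 # developer step ii
            rel = label(sentence)                      # developer step iii
            relations.setdefault(rel, set()).add((t1, t2))
    return relations                                   # step v: validate/edit
```

For example, with `validate=lambda s, a, b: "kind of" in s` and `label=lambda s: "HYPONYMY"`, the sentence "A dog is a kind of mammal." yields a single validated HYPONYMY pair.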

This scenario can provide a much more effective and convincing approach to knowledge acquisition for the Semantic Web and for Knowledge Management than the dependence on manual approaches taken by the majority of researchers in this field.

Aims and Objectives

The proposed project has the following aims:
  1. To provide a methodology for automatically constructing the ontologies needed by Semantic Web and Knowledge Management activities.
  2. To establish the limits of what kinds of information can be derived automatically from corpora, what can be derived from other resources (i.e. texts of a different genre), and what remains dependent on manual input.
The project will address these issues by pursuing the following objectives:
  1. To develop a linguistic and machine learning approach to extraction from textual environments which express ontologically relevant information in an explicit manner, using existing, attested ontologies as training material along with appropriate corpora.
  2. To develop and implement a number of algorithms which will support the knowledge engineer in the construction of ontologies by maximally automating the overall task. These algorithms are intended to form one or more components or plugins for an ontology development environment such as KAON. The algorithms will focus especially on the following functionalities:
    1. Using Sheffield's approach to adaptive information extraction both to hypothesise the existence of specific relations between terms and to label those relations.
    2. Developing techniques to aid the knowledge engineer in identifying relevant sources of ontological knowledge external to the domain corpus, and permit the integration of externally derived data into the Knowledge Acquisition process.
  3. The construction of data sets that permit the training, testing and evaluation of the ontology learning methodology. The data sets will contain corpora and corresponding ontologies, constructed both manually and automatically using base-line methods.
  4. Techniques will be developed for evaluating the adequacy of a given ontology for the representation of the knowledge associated with a domain. This is virtually uncharted territory but in a language processing world, driven by, and to some extent created by, evaluation, the general need for this missing component is obvious.
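As one concrete and deliberately simple illustration of what such evaluation might measure, the sketch below computes lexical coverage: the share of ontology terms attested in a domain corpus, and the share of the corpus's most frequent words covered by the ontology. This baseline metric is our own illustration, not a method specified in the proposal:

```python
import re
from collections import Counter

def coverage(ontology_terms, corpus, top_n=100):
    """Crude baseline for ontology/corpus fit: a precision-like score (share
    of ontology terms attested in the corpus) and a recall-like score (share
    of the corpus's top_n most frequent words covered by the ontology)."""
    words = Counter(re.findall(r"[a-z]+", corpus.lower()))
    attested = sum(1 for t in ontology_terms if t.lower() in words)
    frequent = {w for w, _ in words.most_common(top_n)}
    covered = len(frequent & {t.lower() for t in ontology_terms})
    return attested / len(ontology_terms), covered / len(frequent)
```

A real evaluation would need to go well beyond word lists, to relations and structure, but even this trivial measure gives a replicable number to compare candidate ontologies against a domain corpus.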

created and maintained by Christopher Brewster and José Iria