Report on CLUK 3rd Annual Doctoral Research Colloquium

By John Tait and Aline Villavicencio

This meeting is the third in an annual series of colloquia organised by Computational Linguistics UK, an informal association of people interested in computational linguistics. Previous events took place at the Universities of Sunderland and Essex. The focus is on giving doctoral students an opportunity to present their work, at an early stage of development, in an informal atmosphere, but there are also invited talks from major names in the field: this year Johanna Moore, Harold Somers and John Carroll. Fourteen submitted papers were presented, and there were over 30 attendees from 13 different universities, all in the UK. However, the voices and faces came from pretty much all round the world.

After a welcome by Donia Scott, head of ITRI, Johanna Moore spoke about "Discourse Planning for Tutorial Dialogue Systems". This was a wide-ranging talk, moving from historical work (e.g. on the shortcomings of MYCIN's answers to the question "why") to recent developments in planning, modal logic and more, and their relationship to automatically producing tutorial explanations. It ended on an optimistic note, indicating that advances in these related areas should allow significant improvements in quality compared to previous systems.

Sarah Oates (Brighton) gave the first student paper, on some corpus-based work on discourse markers like "and", "if", "but", "and so therefore", using an RST-based system of relations.

Yvonne Canning (Sunderland) described SYSTAR, the syntactic simplification component of PSET, which is an attempt to make newspapers easier to understand for aphasic readers. The talk pointed out the need for high-accuracy coreference tracking to ensure that the rearranged text maintains coherence.

Next, Afrodite Papagiannopoulou (Essex) presented an attempt to produce an NL-based Greek Unix assistant using recent ideas on bridging the "generation gap" between text planning and linguistic realisation. This was followed by another talk from Essex, in which Nigel Perez Ramirez outlined a logical model which more completely and naturally accounts for imperatives (like "come here!" or "go away!") and indeed actions in general.

Stephen Clark (Sussex) then described a fascinating technique for combining corpus-derived probability data with WordNet hierarchies to improve prepositional phrase attachment. The work is constrained by the limited size of the Penn Treebank used for training and testing. However, the results are promising.
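The report does not give the details of Clark's method, but the baseline probabilistic decision that such work builds on can be sketched in a few lines. Everything below — the toy counts, the smoothing, the function name — is invented for illustration; the WordNet generalisation that addresses the sparse-data problem is not shown.

```python
from collections import Counter

# Toy counts of (head, preposition) co-occurrences, as might be
# extracted from a treebank. All data here is invented for illustration.
verb_prep = Counter({("eat", "with"): 9, ("see", "with"): 2})
noun_prep = Counter({("pizza", "with"): 1, ("telescope", "with"): 7})
verb_total = Counter({"eat": 20, "see": 15})
noun_total = Counter({"pizza": 10, "telescope": 12})

def attach(verb, noun, prep):
    """Attach the PP to whichever head co-occurs with the preposition
    more often (relative frequency, with add-one smoothing so unseen
    pairs do not give a zero probability)."""
    pv = (verb_prep[(verb, prep)] + 1) / (verb_total[verb] + 2)
    pn = (noun_prep[(noun, prep)] + 1) / (noun_total[noun] + 2)
    return "verb" if pv >= pn else "noun"

print(attach("eat", "pizza", "with"))       # → verb
print(attach("see", "telescope", "with"))   # → noun
```

The limitation the talk mentioned is visible here: with a small treebank most (head, preposition) pairs are unseen, which is exactly where backing off to WordNet classes helps.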

Ivandré Paraboni (Brighton) put forward an algorithm for appropriately generating within-document deictic expressions, for example noun phrases referring to parts of pictures, while Elenor Maclaren (Brunel), presenting work undertaken with Chris Reed (Dundee), described some work on stylistics intended to support generation in different styles.

In the last talk of the first day, Daniel Paiva (Brighton) presented work on a similar topic, but using Biber's analysis methodology, applied to relating linguistic factors to genre in a limited domain.

In the second invited talk, kicking off the second day, Harold Somers (UMIST) gave an extensive review of the history of and recent developments in Example Based Machine Translation (EBMT). His conclusion seemed to be that EBMT has moved things on, but he remains sceptical.

Freddy Choi (Manchester) gave a very interesting talk on linear text segmentation. His work has close analogues in image retrieval, in that it relies on breaking text down into its smallest fragments and then progressively merging "similar" regions. The problem, of course, is defining what counts as "similar". The algorithm is as yet too poor for practical use, but it already achieves an error rate of only 12%.
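This is not Choi's actual algorithm (the report does not give its details), but the general merge-similar-regions idea can be sketched as follows. The cosine similarity measure, the threshold value and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def segment(sentences, threshold=0.2):
    """Greedy bottom-up segmentation: start with one block per sentence,
    then merge each sentence into the previous block when their
    word-overlap (cosine) similarity is high enough."""
    blocks = [[sentences[0]]]
    for sent in sentences[1:]:
        prev = Counter(w for s in blocks[-1] for w in s.lower().split())
        cur = Counter(sent.lower().split())
        if cosine(prev, cur) >= threshold:
            blocks[-1].append(sent)
        else:
            blocks.append([sent])
    return blocks

text = [
    "the cat sat on the mat",
    "the cat licked the mat",
    "stock prices fell sharply today",
    "prices fell on the stock market",
]
print([len(b) for b in segment(text)])   # → [2, 2]
```

The hard part, as the talk pointed out, is the similarity function: simple word overlap fails whenever two sentences on the same topic share no vocabulary.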

The post-coffee session on the second morning focussed on learning. Aline Villavicencio and Benjamin Waldron (Cambridge) gave two papers describing different aspects of the same work, which is being undertaken with Ted Briscoe. Villavicencio concentrated on learning word order within a categorial probabilistic grammar framework. Waldron is working on learning "semantics", interpreted, so far as I could see, as word sense distinction and disambiguation and a syntax-to-semantics mapping, again within a probabilistic framework. This all works surprisingly well, although Gerald Gazdar was critical of an implicit assumption that there is a bounded set of possible human languages. In contrast, Menno van Zaanen (Leeds) learns bracketings and constituent types using the notion that similar structures can be substituted for one another, combined with a minimum edit distance measure of similarity. It can learn recursion and achieves over 85% precision and recall.
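The substitutability intuition behind van Zaanen's approach can be illustrated in miniature: align two word sequences, and the parts that differ while the surrounding context matches are hypothesised to be interchangeable constituents. The sketch below uses Python's standard-library sequence alignment rather than his actual edit distance machinery, and the example sentences are invented.

```python
from difflib import SequenceMatcher

def substitutable_parts(sent_a, sent_b):
    """Align two word sequences; the spans that differ while the
    surrounding context matches are hypothesised constituents."""
    a, b = sent_a.split(), sent_b.split()
    pairs = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs

print(substitutable_parts("she sees the big dog", "she sees a cat"))
# → [('the big dog', 'a cat')]
```

Here "the big dog" and "a cat" occur in the same context ("she sees ..."), so both are hypothesised to be constituents of the same type.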

The final two submitted talks were what some might call real old-fashioned Cog Sci approaches to language: psychological plausibility, story understanding and all that. Elliot Smith (Birmingham) described a story comprehension system which utilises the notion of incoherence of the input text, in contrast with more probabilistic approaches. Schank's Scripts and MOPs have been resurrected by Georgios Dimitrios Kalantzis (also Birmingham). He has produced an Integrated Schema Model which overcomes some problems of the earlier work, especially its excessive rigidity. However, some aspects of the evaluation were controversial, to say the least.

The Colloquium closed with an invited talk by John Carroll, who contrasted the two approaches which have predominated in parsing over the last two decades: deep processing (HPSG, LFG, etc.) and shallow processing with statistical methods. Carroll argued that for some applications, for example some forms of MT and generation, deep processing (with wide-coverage grammars) is needed, and this has led to the development of the LinGO grammar, lexicalised tree grammars, and so on. Efficient parsing with these requires efficient feature structure operations (structure sharing, etc.) and also some new parsing strategies (key-driven parsing, hyper-active parsing).

With respect to generation (using minimal recursion semantics), lexical lookup, chart generation and the addition of modifiers (adjectives) are some of the key issues. Lexicalised tree grammars have proved especially useful here.

There was a marked contrast between the two days. Much, though not all, of the material of the first day could be described as linguistics in the service of natural language engineering (to borrow from Henry Thompson) whereas day two was predominantly computation in the service of linguistics and cognitive science.

Clearly an event like this is not necessarily a representative sample of UK doctoral work. For example there were no submitted papers from at least two of the UK's major centres: Sheffield and Edinburgh. However, based on this sample there is a wide variety of good quality doctoral work going on in the UK.

Thanks are due to the programme and local arrangements committees, especially Carole Tiberius, for an interesting and well organised event. Thanks are also due to the EPSRC for providing financial support, and to ITRI at the University of Brighton for hosting the event.

Planning for next year's event has already started. It will probably take place in Sheffield very early in the New Year.