10 July, 2007 - Simon Overell (Imperial College) Proposing a geographic co-occurrence model as a tool for GIR slides
The motivation behind developing such a tool is to improve performance on Geographic Information Retrieval problems such as placename disambiguation (if "Sheffield" appears in text, which Sheffield is it?) and geographic relevance (if "Sheffield" appears in a query, are "Yorkshire", "Manchester" or "Derby" relevant?). The talk will cover the development of a geographic co-occurrence model mined from Wikipedia and similar user-generated content. The co-occurrence model is similar to a language model but contains only geographic entities. The accuracy and clarity of the co-occurrence model are also quantified. The talk will begin with a description of how Wikipedia can be mined for named-entity associations and an introduction to the area of Geographic Information Retrieval, followed by details of the co-occurrence model and its application. The talk will conclude with future directions and the application of the described techniques to the CLEF corpora.
3 July, 2007 - Steve Young (Cambridge University) Using POMDPs for Spoken Dialogue Management
Modelling dialogue management as a Markov Decision Process offers many potential advantages including the ability to learn dialog strategies from data, increased robustness to noise and on-line adaptation. However, attempts to exploit MDPs in real systems have met with limited success primarily due to the fact that they cannot model the uncertainty which is inherent in all spoken dialogue systems.
This talk will explain how partially observable Markov Decision Processes (POMDPs) can provide a principled mathematical framework for modelling the inherent uncertainty in spoken dialog systems. It briefly summarises the basic mathematics and explains why exact optimisation is intractable. It then describes a form of approximation called the Hidden Information State model which does scale and which can be used to build practical systems.
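As a rough illustration of the "basic mathematics" mentioned above, the standard POMDP belief update can be sketched in a few lines of Python. This is a generic textbook update, not the Hidden Information State model from the talk; the two-city user-goal example, the action name "ask" and the 0.8 recognition accuracy are all assumptions made up for illustration.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """Return the normalised posterior belief over hidden states.

    belief:            dict state -> probability
    transition:        dict (state, action) -> dict next_state -> probability
    observation_model: dict (next_state, action) -> dict observation -> probability
    """
    states = list(belief)
    unnormalised = {}
    for s_next in states:
        # Prediction step: probability of reaching s_next from the current belief.
        predicted = sum(
            transition[(s, action)].get(s_next, 0.0) * belief[s] for s in states
        )
        # Correction step: weight by how likely the observation is in s_next.
        likelihood = observation_model[(s_next, action)].get(observation, 0.0)
        unnormalised[s_next] = likelihood * predicted
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical toy dialogue: the hidden state is the city the user wants.
# The goal does not change between turns (identity transition), and the
# recogniser is assumed to report the correct city 80% of the time.
transition = {
    ("london", "ask"): {"london": 1.0},
    ("leeds", "ask"): {"leeds": 1.0},
}
observation_model = {
    ("london", "ask"): {"london": 0.8, "leeds": 0.2},
    ("leeds", "ask"): {"london": 0.2, "leeds": 0.8},
}
prior = {"london": 0.5, "leeds": 0.5}
posterior = belief_update(prior, "ask", "london", transition, observation_model)
```

The belief is a full distribution over hidden states, so the number of reachable beliefs grows quickly with the state space, which is one way to see why exact optimisation does not scale.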
19 June, 2007 - Katerina Pastra (ILSP, Athens) REVEAL THIS and the COSMOROE cross-media relations framework Slides1, Slides2
Media convergence and the ever increasing availability of digital audiovisual content have intensified research on multimedia content processing; the recent boom of online video search services and TV and radio content monitoring and aggregation are just some attempts to help not only professional users but also laymen in making sense out of bulks of multimedia content. In the first part of this talk, we will present REVEAL THIS, an FP6 STREP project on multimedia and multilingual digital content processing; we will demonstrate the research prototype developed within this project, a prototype that is addressed to the general, everyday users and which offers search, retrieval, summarization and translation functionalities for audiovisual content. In an attempt to conquer the semantics within audiovisual data, a number of REVEAL THIS modules implement methods for fusing the pieces of information expressed by different modalities in the same document. However, multimedia content forms a distinct type of discourse, one that goes beyond textual discourse and has its own idiosyncrasies; developing modules that will process this discourse successfully and will be scalable requires a) a “language” to talk about and describe semantic links between different modalities and b) corpora annotated with such links for developing/training the modules. Could this “language” be the one provided in the Rhetorical Structure Theory, as suggested in the literature? Could multimedia corpora be annotated with such relations? In the second part of this talk, we will explore these questions and we will present COSMOROE, an alternative corpus-based cross-media relations framework that was built to capture semantic interrelations between images, language and body movements. 
COSMOROE is currently being tested for descriptive power and computational applicability through its use for annotating a corpus of TV travel programmes with cross-media relation-metadata; we present all particulars of the annotation process and conclude with a discussion on the usability and scope of such annotated corpora.
12 June, 2007 - Swati Gupta (University of Sheffield) Generating Politeness in Task Based Interaction: An Evaluation of the Effect of Linguistic Form and Culture
Politeness is an integral part of human language variation, e.g. consider the difference in the pragmatic effect of realizing the same communicative goal with either “Get me a glass of water mate!” or “I wonder if I could possibly have some water please?” This paper presents POLLy (Politeness for Language Learning), a system which combines a natural language generator with an AI Planner to model Brown and Levinson’s theory of politeness (B&L) in collaborative task-oriented dialogue, with the ultimate goal of providing a fun and stimulating environment for learning English as a second language. An evaluation of politeness perceptions of POLLy’s output shows that: (1) perceptions are generally consistent with B&L’s predictions for choice of form and for discourse situation, i.e. utterances to strangers need to be much more polite than those to friends; (2) our indirect strategies, which should be the politest forms, are seen as the rudest; and (3) English and Indian native speakers of English have different perceptions of the level of politeness needed to mitigate particular face threats.
5 June, 2007 - Genevieve Gorrell (University of Sheffield) - Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing Slides
The current surge of interest in search and comparison tasks in natural language processing has brought with it a focus on vector space approaches and vector space dimensionality reduction techniques. Presenting data as points in hyperspace provides opportunities to use a variety of well-developed tools pertinent to this representation. Dimensionality reduction allows data to be compressed and generalised. Eigen decomposition and related algorithms are one category of approaches to dimensionality reduction, providing a principled way to reduce data dimensionality that has time and again shown itself capable of enabling access to powerful generalisations in the data. Issues with the approach, however, include computational complexity and limitations on the size of dataset that can reasonably be processed in this way. Large datasets are a persistent feature of natural language processing tasks, with techniques such as statistical language modelling relying for their success on huge amounts of data. In the presentation the Generalized Hebbian Algorithm will be introduced, which is an algorithm that allows the eigen decomposition of a dataset to be learned from a series of observations. The advantages of the approach from the point of view of applicability will be discussed. Secondly, several novel extensions to the algorithm will be presented that extend its applicability. Thirdly, promising initial results will be presented from applying the extended algorithm to the task of smoothing stochastic language models.
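To make the idea of learning an eigen decomposition "from a series of observations" concrete, here is a minimal sketch of the basic Generalized Hebbian Algorithm update (Sanger's rule) in plain Python. It shows only the textbook rule, not the extensions presented in the talk; the synthetic two-dimensional data, the learning rate and the number of steps are assumptions chosen for illustration.

```python
import math
import random


def gha_step(weights, x, learning_rate):
    """Apply one Sanger's-rule update for a single observation x.

    weights is a list of k rows, each an estimate of one eigenvector.
    Row i is updated with: eta * y_i * (x - sum_{j <= i} y_j * w_j).
    """
    outputs = [sum(w * v for w, v in zip(row, x)) for row in weights]
    updated = []
    for i, row in enumerate(weights):
        # Reconstruction of x from components 0..i; subtracting it keeps
        # row i from re-learning directions captured by earlier rows.
        recon = [
            sum(outputs[j] * weights[j][d] for j in range(i + 1))
            for d in range(len(x))
        ]
        updated.append(
            [w + learning_rate * outputs[i] * (v - r) for w, v, r in zip(row, x, recon)]
        )
    return updated


def sample_point(rng):
    """Draw a synthetic 2-D point whose main variance lies along (1, 1)."""
    a = rng.gauss(0.0, 2.0)  # spread along (1, 1)
    b = rng.gauss(0.0, 0.5)  # spread along (1, -1)
    s = 1.0 / math.sqrt(2.0)
    return [s * (a + b), s * (a - b)]


rng = random.Random(0)
weights = [[0.3, 0.1], [0.1, -0.3]]  # arbitrary starting estimates
for _ in range(5000):
    weights = gha_step(weights, sample_point(rng), learning_rate=0.005)

# Absolute cosine between the first learned row and the true direction (1, 1).
first = weights[0]
alignment = abs(first[0] + first[1]) / math.sqrt(2.0) / math.hypot(first[0], first[1])
```

Each observation is used once and then discarded, so the data never has to be held in memory as a full matrix, which is the property that makes the approach attractive for large datasets.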
29 May, 2007 - Goran Nenadic (University of Manchester) - Mining hypotheses from biomedical literature Slides
Discovering new links and relationships is one of the main challenges in biomedical research, as biologists are interested in uncovering entities that have similar functions, take part in the same or related processes or are co-regulated. This is typically done by mining experimental data. In this talk I will discuss the prospects and challenges of extracting semantically related entities from biomedical literature, in order to support hypothesis generation. For this task we are currently exploring a combination of various text-based features (including lexical, syntactical and contextual profiles) assigned to biomedical entities with suitable kernels for estimating similarities between them. The results of initial experiments will be presented.
25 May, 2007 - Ellen Campana (University of Rochester) - Comparing "Natural" and "Standardized" Approaches to Dialogue System Design Slides
The design of spoken dialogue systems is dominated by two approaches, which I call “the natural approach” and “the standardized approach”. The natural approach takes as a gold standard human-human interaction, while the standardized approach advocates design decisions that reduce computational complexity while providing the user with consistent language to facilitate learning and adaptation. I argue that there is no need to assume, a priori, that either approach will be more useful to all people in all situations. Comparing the two approaches involves pitting powerful human language capacities against powerful human learning abilities, pitting the distribution of language a person has produced and understood over the course of a lifetime against a smaller and more constrained situation-specific distribution, and possibly even pitting automatic language mechanisms against higher-level perspective-taking mechanisms. Until now, there has been little empirical research directly comparing the two approaches in terms of usability, in part due to the lack of an evaluation metric that is directly related to ease of use, yet fine-grained enough to be used at the utterance level. In this research I demonstrate how it is possible to extend a classic tool from Cognitive Psychology, the dual-task paradigm, to system evaluation in order to address this question. I focus specifically on the generation of referring expressions because 1) referring expressions are fundamental to all spoken communication, and 2) the two approaches advocate different methods. In this talk I will present behavioral research investigating the roles of both discourse context and visual context in system generation of referring expressions.
8 May, 2007 - Bill Clocksin (Oxford Brookes University) - A High Accuracy Character Recognition System for Old Printed Books in Connected Script languages such as Syriac and Arabic
This talk describes recently developed software for converting page images from old printed books (c1600-c1900) into machine-readable text form. The system does not require a lexicon. The only language-specific information built into the system is the set of orthographic rules common to all connected script languages. The script-specific language information required is a template for each character in each variant, together with a symbol model for diacriticals, vowel marks, and other punctuation. The output from the system is a no-loss transliteration including all the diacritical marks. This transcription can then be post-processed to any desired format such as a standard transliteration or a modern Syriac or Arabic font edition.
The system has been designed specifically with the requirements of scanning old printed books. These sources offer new challenges to OCR. Old printed books, typeset by hand, have significant variation compared to modern typesetting. Connection rules are usually broken, and diacritical marks are placed with wide variation. The system has been used for transliterating text in the three major Syriac scripts: Estrangelo, West Syriac, and East Syriac. Texts were scanned copies of old manually typeset works kindly provided by Beth Mardutho, the Syriac Institute. In addition, the generality of the system has been tested by transliterating modern printed text in a Geeza-style Arabic font from the standard DARPA corpus.
Performance details on historical manually typeset text are as follows. Character recognition rate is near 100%. Diacritical recognition rate is usually above 99%. Transliteration speed ranges from 25-40 words per minute depending on the complexity of the script. The only factors that limit recognition rate to less than 100% have been found to be a rare few severely broken characters, and a few diacritics too small to be identified.
1 May, 2007 - Zornitsa Kozareva (University of Valencia) - The Role and Resolution of Textual Entailment in Natural Language Processing
A fundamental phenomenon in Natural Language Processing concerns language variability. Identifying that two text fragments express the same meaning is a challenging problem known as textual entailment. In this talk, we will discuss the role of textual entailment in several Natural Language Processing applications such as Question Answering, Information Extraction, Information Retrieval and Text Summarization. We will give a brief overview of the currently developed approaches and then present our machine-learning approach to the resolution of textual entailment.
We model lexical and semantic attributes which capture the in-sequence, overlap and skip n-gram information, as well as the inter-syntactic word-to-word semantic similarity, the synonym and antonym relations, and the relevant domain information between two text fragments. In order to integrate this information effectively, we explore stacking and voting classifier combination schemes.
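The lexical attributes mentioned above can be illustrated with a small Python sketch that computes n-gram overlap, skip-bigram overlap and a longest-common-subsequence score between a text and a hypothesis. The exact feature definitions used in the talk are not given, so the function names, the normalisation by hypothesis length and the reading of "in-sequence" as longest common subsequence are all assumptions.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def skip_bigrams(tokens, max_gap=2):
    """Return ordered word pairs separated by at most max_gap other tokens."""
    pairs = set()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.add((left, tokens[j]))
    return pairs


def overlap(hypothesis_items, text_items):
    """Fraction of hypothesis items that also occur in the text."""
    if not hypothesis_items:
        return 0.0
    return len(hypothesis_items & text_items) / len(hypothesis_items)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(
                table[i - 1][j], table[i][j - 1]
            )
    return table[len(a)][len(b)]


def entailment_features(text, hypothesis):
    """Build a small feature dictionary for one text/hypothesis pair."""
    t, h = text.lower().split(), hypothesis.lower().split()
    return {
        "unigram": overlap(ngrams(h, 1), ngrams(t, 1)),
        "bigram": overlap(ngrams(h, 2), ngrams(t, 2)),
        "skip_bigram": overlap(skip_bigrams(h), skip_bigrams(t)),
        "lcs": lcs_length(t, h) / len(h) if h else 0.0,
    }


features = entailment_features("the cat sat on the mat", "a cat sat on the mat")
```

A feature dictionary like this would then be passed to one or more classifiers, whose outputs could be combined by voting or stacking as the abstract describes.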
17 April, 2007 - Harold Somers (University of Manchester) - Latest Developments in Machine Translation
MT Wars II: The empire strikes back? Slides
In this talk, a survey will be given of the major developments in MT during the last decade, with special attention to the interaction of rule-based and stochastic approaches (example-based and statistical MT). I will summarize developments first in EBMT then in SMT, then ask to what extent the two approaches are converging. In each case I will also look at the extent to which "linguistics", however defined, is used in these approaches. I will briefly (attempt to) present two recent classifications of MT systems, by Wu and Carl.
20 March, 2007 - François Mairesse (University of Sheffield) - PERSONAGE: Personality Generation for Dialogue. Slides
Over the last fifty years, the "Big Five" model of personality traits has become a standard in psychology, and research has systematically documented correlations between a wide range of linguistic variables and Big Five traits. A distinct line of research has explored methods for automatically generating language that varies along personality dimensions. While this work suggests a clear utility for generating personality-rich language: (1) these generation systems have not been evaluated to see whether they produce recognizable personality variation; (2) they have primarily been based on template-based generation with limited paraphrases for different personality settings; (3) the use of psychological findings has been heuristic rather than systematic. We present PERSONAGE (PERSONAlity GEnerator), a language generator with 29 parameters previously shown to correlate with extraversion, an important aspect of personality. We explore two methods for generating personality-rich language: (1) direct generation with particular parameter settings suggested by the psychology literature; and (2) overgeneration and selection using statistical models trained from judges’ ratings. An evaluation shows that both methods reliably generate utterances that vary along the extraversion dimension, according to human judges.
13 March, 2007 - Jamie Henderson (University of Edinburgh) - Hybrid Reinforcement/Supervised Learning of Dialogue Policies. Slides
Reinforcement learning usually implies collecting new data as learning progresses, which can be very expensive with human-computer dialogue systems. On the other hand, supervised learning is not entirely appropriate for training dialogue management policies, because we want to learn a new policy, not simply mimic the systems which were used to collect the data. In this talk I will present a method for applying RL to learning a dialogue management policy using a fixed set of dialogue data. This method uses supervised learning to model the range of policies for which we can expect to get reasonably accurate estimates from the data. RL is then used to find a policy within this range which maximises a given measure of dialogue reward. We test this method within the framework of Information State Update (ISU)-based dialogue systems, using linear function approximation to handle the large state space efficiently. We trained this model on the COMMUNICATOR corpus, to which we have added annotations for user actions and Information States. When tested with a user simulation trained on the same dataset, our hybrid model outperforms the systems in the COMMUNICATOR data, and outperforms a pure supervised learning model. When tested with human users, the hybrid model outperforms a hand-crafted policy. We see this work as important for bootstrapping and automatic optimisation of dialogue management policies from limited initial datasets. (Work with Oliver Lemon and Kallirroi Georgila).
6 March, 2007 - Ramesh Krishnamurthy (Aston University) - Corpora - from language description and lexicography to language teaching and learning: the ACORN (Aston Corpus Network) project Slides
From the 1960s to the 1990s, corpora were used mainly for research in language description and lexicography. In the past decade, they have increasingly been used for language teaching and learning. This talk will discuss some of the changes involved, from the practical perspective of the ACORN project at Aston University: http://corpus.aston.ac.uk
27 February, 2007 - Udo Kruschwitz (University of Essex) - Towards Adaptive Information Retrieval - Step 1: Collecting Real Data
One of the most exciting areas of research in search engine technology and information retrieval is the move towards "adaptive" search systems. A particularly promising aspect of this wide field is to move log analysis right into the centre of attention. The challenge is to exploit the user interaction (as recorded in the log files) to make the search system adapt to the users' search behaviour. Instead of looking at the Web in general we are interested in smaller document collections with a more limited range of topics.
We are focusing on a search paradigm where automatically extracted domain knowledge is incorporated in a simple dialogue system in order to assist users in the search process. In task-based evaluations we could show that a search engine that does not simply return the results but instead offers the user suggestions to widen or narrow down the search has the potential of being a much more useful tool. Automatically constructed knowledge can however never be as good as manually created structures. The challenge is to mine the log files in order to automatically improve the suggestions made by the system, in other words to "adapt" to the users' search behaviour. We are interested in a specific aspect of this search behaviour, namely the selection of query modification terms which provides us with "implicit feedback" from the users and should be sufficient to come up with a model to automatically adjust the domain knowledge without having to rely on other forms of explicit or implicit user feedback.
This whole process requires real data. We have made a start by running a prototype of our own search system that combines a standard search engine with automatically extracted domain knowledge. The system has been running on the University of Essex intranet for more than 6 months and we have collected more than 25,000 queries. The log files we keep collecting are an extremely valuable resource because they are a reflection of real user interests (unlike TREC-like scenarios, which are always a bit artificial). The data collected so far are a justification for a system that guides a user in the search process: more than 10% of user queries are query modification steps, i.e. the user either replaces the initial query or adds terms to the query to make it more specific. Adding a term happens more often than replacing the query with a completely new one. We also observe that a user is more likely to select one of the suggestions made by our search engine than to modify the query manually. The talk will focus on our ongoing research and present some preliminary analysis of the log files collected so far.
6 February, 2007 - Ralph Steinberger/Bruno Pouliquen (JRC-IPSC) - NewsExplorer – Multilingual News Analysis with Cross-lingual Linking. Slides
Overview of software tools to provide multilingual and cross-lingual information access (without Machine Translation (MT) or the use of bilingual dictionaries). The European Union is a highly multilingual environment with its twenty-three official languages (most recently Irish, Romanian and Bulgarian; 253 language pairs).
Our approach consists of first producing a language-independent representation, based mostly on subject domains, normalised named entities and cognates, and then applying a similarity measure to this interlingua representation.
- We use the multilingual Eurovoc thesaurus to classify multilingual documents into the same subject domains.
- Place names are first disambiguated and then represented by their geographical co-ordinates.
- Person and organisation names are normalised to achieve an automatic match of the name variants found in different languages.
We will describe the challenges for each individual software component (e.g. homography and other types of ambiguity; word variants in highly inflected languages) and will present the adopted solutions to these problems.
A demonstration of the publicly accessible, fully-automatic news aggregation and analysis system NewsExplorer (http://press.jrc.it/NewsExplorer) will show that the presented approach is workable and that information gathering can be enhanced by combining information extracted from texts in many different languages.
Additionally we will present other applications:
- name relations (statistically extracted correspondence between persons in news)
- quotation detection
- Event extraction (man-made violent events in the English news)
- Medisys (Medical open source news analysis)
30 January, 2007 - Marilyn Walker (University of Sheffield) - Learning to Generate Naturalistic Utterances for Spoken Dialogue Systems by Mining User Reviews slides
Spoken language generation for dialogue systems requires a dictionary of mappings between semantic representations of concepts the system wants to express and realizations of those concepts. Dictionary creation is a costly process; it is currently done by hand for each dialogue domain. This talk describes a novel weakly supervised method for learning such mappings from user reviews in the target domain, and experiments using the method on restaurant reviews. We test the hypothesis that user reviews that provide individual ratings for distinguished attributes of the domain entity make it possible to map review sentences to their semantic representation with high precision. Experimental analyses show that the mappings learned cover most of the domain ontology, and provide good linguistic variation. Subjective user evaluations show that the consistency between the semantic representations and the learned realizations is high, and that the naturalness of the realizations is higher than a hand-crafted baseline.
11 January, 2007 - Adam Kilgarriff (Lexicography MasterClass Ltd) - Googleology is bad science slides
The web is enormous and largely linguistic. As we discover, on ever more fronts, that language analysis and generation benefit from big data, so it becomes appealing to use the web as a data source. The question, then, is how. The straightforward approach is to use a search engine such as Google. For seven years now, researchers have been using Google, Altavista and other commercial search engines to collect data and statistics from the web. We discuss the pros and cons of this approach and present the case for the alternative, in which we, the NLP/MT community, crawl and index the web ourselves.
28 November, 2006 - Ehud Reiter (Aberdeen University) - Data-To-Text: Generating English Summaries of Data.
Data-to-text systems generate textual summaries of non-linguistic data. For example, such systems can generate textual weather forecasts from numerical weather predictions, and textual medical summaries from clinical medical data. I will present an overview of why data-to-text is useful, and how data-to-text systems work. I will highlight linguistic pragmatic issues which are perhaps more important in data-to-text than in most other NLP applications, such as choosing the best word(s) to communicate data values in a context. See the URL below for more details about our work: http://www.csd.abdn.ac.uk/~ereiter/data2text.html
21 November, 2006 - Angelo Dalli (University of Sheffield/ University of Malta) - Automatic Dating of Documents and Temporal Text Classification.
The frequency of occurrence of words in natural languages exhibits a periodic and a non-periodic component when analysed as a time series. This work presents an unsupervised method of extracting periodicity information from text, enabling time series creation and filtering to be used in the creation of sophisticated language models that can discern between repetitive trends and non-repetitive writing patterns. The algorithm performs in O(n log n) time for input of length n. The temporal language model is used to create rules based on temporal-word associations inferred from the time series. The rules are used to guess automatically at likely document creation dates, based on the assumption that natural languages have unique signatures of changing word distributions over time. Experimental results on news items spanning a nine year period show that the proposed method and algorithms are accurate in discovering periodicity patterns and in dating documents automatically solely from their content.
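One simple way to picture the periodic component of a word-frequency time series is to look for the lag at which the series correlates most strongly with itself. The Python sketch below does this with a direct autocorrelation. It is not the method from the talk and does not achieve the O(n log n) bound quoted above (an FFT-based computation would be the usual route to that); the daily counts and the weekly pattern are invented for illustration.

```python
def autocorrelation(series, lag):
    """Normalised autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    if variance == 0:
        return 0.0
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag):
    """Return the lag in 2..max_lag with the highest autocorrelation."""
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Hypothetical daily counts of a word that spikes once a week over 8 weeks.
daily_counts = [10 if day % 7 == 0 else 2 for day in range(56)]
period = dominant_period(daily_counts, max_lag=20)
```

Once a word's dominant period is known, the periodic part of its series can be filtered out, leaving the non-periodic trend, which is the component more likely to carry the dating signal the abstract describes.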
31 October, 2006 - Peter Wallis (University of Sheffield) - Revisiting the DARPA Communicator Data using Conversation Analysis
The DARPA Communicator project was a large scale evaluation of state-of-the-art human-computer dialogue systems. Five years on, it seems the trials did not lead to commercial or military exploitation of the technology on a scale envisaged at the time. This paper describes work in which I used conversation analysis to look at why. The paper provides a survey of CA methodology, and puts it in the context of the wider movement of ethnomethodological studies. The conclusion is that a) mixed initiative at the discourse level is required so that humans can help in the repair process, and b) that CA can be used to identify the strategies needed for effective repair.
24 October, 2006 - Rada Mihalcea (University of Oxford) - Random Walks on Text Structures
Since the early ages of artificial intelligence, associative or semantic networks have been proposed as representations that enable the storage of language units and the relationships that interconnect them, allowing for a variety of inference and reasoning processes, and simulating some of the functionalities of the human mind. The symbolic structures that emerge from these representations correspond naturally to graphs -- relational structures capable of encoding the meaning and structure of a cohesive text, following closely the associative or semantic memory representations. The activation or ranking of nodes in such graph structures mimics to some extent the functioning of human memory, and can be turned into a rich source of knowledge useful for several language processing applications.
In this talk, I will present a framework for the application of graph-based ranking algorithms implementing random-walk models to structures derived from text, and show how the synergy between graph-theoretical algorithms and graph-based text representations can result in efficient unsupervised methods for several natural language processing tasks. I will illustrate the framework with several text processing applications, including word sense disambiguation, extractive summarization, and keyphrase extraction. I will also outline a number of other applications that can find successful solutions within this framework, and conclude with a discussion of opportunities and challenges for future research.
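A graph-based ranking algorithm of the kind described above can be sketched as a PageRank-style random walk over a word co-occurrence graph. The Python code below is a generic illustration, not the framework presented in the talk; the window size, the damping factor of 0.85, the fixed iteration count and the toy sentence are assumptions.

```python
def build_cooccurrence_graph(tokens, window=2):
    """Link each word to the distinct words that follow it within `window` positions."""
    graph = {}
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other == word:
                continue
            # Undirected edge: record the link in both directions.
            graph.setdefault(word, set()).add(other)
            graph.setdefault(other, set()).add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Run a fixed number of power-iteration steps of the random-walk model."""
    nodes = list(graph)
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbour passes on an equal share of its current score.
            incoming = sum(score[m] / len(graph[m]) for m in graph[node])
            updated[node] = (1.0 - damping) / len(nodes) + damping * incoming
        score = updated
    return score


tokens = "graph ranking uses graph structure and graph walks".split()
graph = build_cooccurrence_graph(tokens, window=1)
scores = pagerank(graph)
top_word = max(scores, key=scores.get)
```

For keyphrase extraction the highest-scoring words would be kept as candidates; the same ranking step can be reused for summarization by building the graph over sentences instead of words.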
17 October, 2006 - Joao Magalhaes (Imperial College London) - Semantic Multimedia Information: Mining, Fusion and Extraction slides
The extraction of semantic information from multimedia content is a research area that faces multiple challenges: scalability; data scarcity; specific statistical models for each modality; computational limitations when processing large-scale training datasets; incorrect ground truth... To address some of the issues hindering multimedia retrieval applications we propose a novel learning framework to extract semantic multimedia information. The framework combines both knowledge and statistical data, and it is divided into three parts: (1) multimedia mining, (2) multi-modal information fusion, and (3) semantic information extraction. We will discuss several aspects of the framework, such as scalability, its solid statistical foundation (borrowed from Generalized Linear Models and Information Theory), how it is able to elegantly cope with different modalities, and its performance on semantic image retrieval and large-scale semantic video retrieval.
10 October, 2006 - Steve Whittaker and Simon Tucker (Sheffield University) - Temporally Compressing Meeting Records slides
Although speech is a potentially rich information source, a major barrier to using it is the extreme tedium of accessing lengthy speech recordings. This talk will describe efforts to develop and evaluate novel techniques for temporal compression that reduce the amount of time taken to listen to a recording whilst still retaining the important content. We use both acoustic (silence removal, speedup) and semantic (insignificant word or phrase removal) methods to allow users to focus on important material in the recording. Our first study examined subjective user evaluations of 8 compression techniques. From this we identified promising techniques which we then examined more rigorously in a second experiment. Techniques which make use of semantic properties are promising; techniques which remove less significant words or phrases allow listeners to extract the gist of a meeting record in a reduced amount of time. If time permits, we will also describe initial efforts to apply similar techniques to the design of interfaces to support textual summarisation.
12 September, 2006 - John Barnden (University of Birmingham) - Metaphor and Metonymy: A Practical Deconstruction slides
Defining and distinguishing between metaphor and metonymy has long been a contentious matter. The problems are ever more pressing, for empirical reasons and problems with the notion of domains, as noted by a number of authors, and for system-building reasons related to these difficulties. In my work on building a knowledge/reasoning-based metaphor understanding system (ATT-Meta) I have abandoned domains as providing defining qualities of metaphor and metonymy, and indeed have concluded that "metaphor" and "metonymy" are just rough labels, of heuristic utility only. Instead of metaphor and metonymy themselves, what is theoretically and operationally important is a collection of different dimensions along which different "metaphorical" or "metonymic" utterances can lie. These dimensions are concerned with matters such as the extent and type of mapping involved, the hypotheticality of source items, and the extent to which the links to source items are part of the meaning. Although prototypical cases of metaphor and typical cases of metonymy may lie largely in different regions of the space defined by the dimensions, metaphor and metonymy in general can overlap on each dimension separately. Another feature of the approach is that metaphoricity, metonymicity and positioning of utterances on the underlying dimensions are not objective qualities but are, instead, language-user-relative, although there can be rough agreement between language users in many cases. Although the approach has not yet been subject to an extensive empirical study, it is plausible that it is relatively friendly to handling metaphor and metonymy satisfyingly in empirical work on language.