Seminars
2011 - 2012
19th January, 2012 Maria Liakata Aberystwyth University / European Bioinformatics Institute (EMBL-EBI), Cambridge - Towards reasoning with scientific articles: identifying conceptualisation zones and beyond
8th December, 2011 Sascha Kriewel Universität
Duisburg-Essen - Introduction to Daffodil / ezDL
The agent-based architecture of the backend can be easily extended to add new services and a tool-based user client can be configured into
different perspectives for specific tasks. Since 2009 the software has been undergoing re-implementation as ezDL (easy access to Digital Libraries). EzDL is
currently used within several running projects and provides a platform for user-based evaluations, e.g. within the INEX iTrack.
17th November, 2011 Elaine Toms The University of Sheffield -
Designing the next generation information appliance
10th November, 2011 Ahmet Aker The University of Sheffield - Conceptual
Modelling for Multi-Document Summarization
This paper presents a novel approach to automatic captioning of geo-tagged images by summarizing multiple web-documents that contain information related to an
image's location. The summarizer is biased by dependency pattern models towards sentences which contain features typically provided for different scene
types such as those of churches, bridges, etc. Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than
both n-gram language models reported in previous work and also Wikipedia baseline summaries. Summaries generated using dependency patterns also lead to more
readable summaries than those generated without dependency patterns.
3rd November, 2011 Ayman Alhelbawy University of Sheffield -
Disambiguating Named Entities against a Reference Knowledge Base
20th October, 2011 Udo Kruschwitz University of Essex -
Exploiting Implicit Feedback: From Search to Adaptive Search
13th October, 2011 Mark Stevenson University of Sheffield -
Disambiguation of Medline Abstracts using Topic Models
The models generated by LDA consist of sets of terms associated with each topic and these are used to provide context for a Word Sense Disambiguation (WSD) system.
It is found that using this context leads to a statistically significant improvement in the performance of a graph-based WSD system when applied to a standard
evaluation resource in the biomedical domain.
Information about the topic of a document has already been shown to be useful for WSD of Medline abstracts. Previous approaches have relied on using MeSH codes
but these have to be added manually. We demonstrate that information about the topic of abstracts can be identified without the need for manual annotation, by
using an unsupervised technique, and can also be used to improve WSD performance.
6th October, 2011 Chris Dyer Carnegie Mellon University - Unsupervised Word Alignment and Part of
Speech Induction with Undirected Models
Joint work with Noah Smith, Desai Chen, Shay Cohen, Jon Clark, and Alon Lavie
15th September, 2011 Rao Nawab University of Sheffield - External Plagiarism
Detection using Information Retrieval and Sequence Alignment
6th July, 2011 Paola Velardi Universita di Roma -
A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch Slides
In this talk I present a novel graph-based approach aimed at learning a lexical taxonomy automatically, starting from a domain corpus and the Web. Unlike many taxonomy learning approaches in the literature, the algorithm learns both concepts and relations entirely from scratch via the automated extraction of terms, definitions and hypernyms. This results in a very dense, cyclic and possibly disconnected hypernym graph. The algorithm then induces a taxonomy from the graph via optimal branching. Experiments show high-quality results, both when building brand-new taxonomies and when reconstructing WordNet sub-hierarchies.
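The last step described in this abstract, inducing a tree from a dense and possibly cyclic hypernym graph via optimal branching, can be illustrated with a small sketch. It assumes the standard Chu-Liu/Edmonds formulation of optimal branching (the abstract does not say which variant is used), a known root concept, and invented edge scores. The function name max_branching_score and the toy graph are hypothetical and not taken from the talk.

```python
def max_branching_score(n, root, edges):
    """Total score of the best spanning arborescence rooted at `root`.

    edges: list of (hypernym, hyponym, score), nodes numbered 0..n-1.
    Returns None if some node cannot be reached from the root.
    """
    INF = float("inf")
    # Chu-Liu/Edmonds minimises cost, so negate the scores to maximise.
    edges = [(u, v, -w) for u, v, w in edges]
    total = 0
    while True:
        # 1. Pick the cheapest incoming edge for every node.
        best_in = [INF] * n
        parent = [-1] * n
        for u, v, w in edges:
            if u != v and w < best_in[v]:
                best_in[v], parent[v] = w, u
        if any(v != root and best_in[v] == INF for v in range(n)):
            return None
        best_in[root] = 0
        # 2. Look for cycles among the chosen edges.
        comp = [-1] * n   # contracted-node id, -1 while unassigned
        seen = [-1] * n   # which start node last visited this node
        n_comp = 0
        for v in range(n):
            total += best_in[v]
            x = v
            while x != root and seen[x] != v and comp[x] == -1:
                seen[x] = v
                x = parent[x]
            if x != root and comp[x] == -1:
                # x lies on a new cycle: label every node on it.
                u = parent[x]
                while u != x:
                    comp[u] = n_comp
                    u = parent[u]
                comp[x] = n_comp
                n_comp += 1
        if n_comp == 0:
            return -total  # no cycles left: the chosen edges form a tree
        # 3. Contract each cycle to one node and re-weight the edges.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = n_comp
                n_comp += 1
        edges = [(comp[u], comp[v], w - best_in[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        n, root = n_comp, comp[root]


# Toy hypernym graph (scores invented): 0 = entity, 1 = animal, 2 = dog.
# The pair animal -> dog and dog -> animal forms a cycle.
toy_edges = [(0, 1, 5), (1, 2, 4), (2, 1, 6), (0, 2, 1)]
best = max_branching_score(3, 0, toy_edges)  # 9: entity -> animal -> dog
```

The sketch returns only the total score of the best branching. Recovering the selected edges would require tracking them through each contraction, which is omitted here for brevity.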
30th June, 2011 Ann Copestake University of Cambridge - Formal semantics and dependency structures
Logical representations and dependency structures are both used to describe aspects of the meaning of natural language sentences, but are formally very different. In this talk, I will show that one widely used form of logical representation can be transformed into graph structures comparable to dependency representations without loss of information. This has some significant practical advantages for language processing.
23rd June, 2011 Peter Wallis The University of Sheffield - Engineering Spoken Dialogue Systems
Having a conversation with a machine has many commercial applications and has a certain sex appeal for the students. What is more, it is a grand challenge that could provide a unifying theme for much of the departmental research. The dialog manager is, I believe, where there is the greatest opportunity for improvement in spoken dialogue systems and in this talk I contrast my approach with POMDPs. Partially Observable Markov Decision Processes are an elegant approach to the problem of structuring conversation but it is not clear the work being done on them will lead to useful systems. In this talk I argue for an agent based approach to dialogue and provide a set of algorithms from the literature.
9th June, 2011 Internal Research Student Presentations
Niraj Aswani - Evolving a General Framework for Text Alignment: Case Studies with Two South Asian Languages
A gold standard is an essential requirement for automatic evaluation of text alignment algorithms and approaches such as semi-automatic or incremental learning can be used to speed up the process of creating one. In this talk, I will describe a general framework for text alignment that supports manual creation of a gold-standard while in the background updating the language resources used to suggest an initial alignment. In particular, the talk will cover a case study of developing language resources for the English-Hindi language pair. Our focus is on the South Asian languages that are similar to the Hindi language for which the resources are scarce. I will demonstrate the generality of the approach by adapting the resources for the English-Gujarati language pair.
Danica Damljanovic - Usability Enhancement Methods in Natural Language Interfaces for Querying Ontologies
Recent years have seen a tremendous increase of structured data on the Web, with the Linked Open Data project encouraging publication of even more. This massive amount of data requires effective exploitation, which is now a big challenge largely because of the complexity and syntactic unfamiliarity of the underlying triple models and the query languages built on top of them. Natural Language Interfaces (NLIs) are increasingly relevant for information systems fronting rich structured data stores such as RDF and OWL repositories, largely because they are perceived as intuitive for humans. Many NLIs to ontologies have been developed; however, little work has been done in testing the usability of these systems and the usability enhancement methods which can improve their performance. In this paper, we assess the effect of these methods through two user-centric studies of two systems: QuestIO and FREyA. The first study assesses the usability of QuestIO, which is fully automatic, in comparison to the traditional ways of search. The second one assesses the usability of FREyA, which involves the user in the loop, with special emphasis on feedback. Our results highlight the expressiveness of the language supported by QuestIO and FREyA, and also the importance of feedback, which is shown to improve the overall usability and user experience. In addition, the combination of feedback and clarification dialogs in FREyA is shown to outperform the state-of-the-art systems.
2nd June 2011 Piek Vossen Vrije Universiteit Amsterdam - The KYOTO project: a cross-lingual platform for open text mining
The European-Asian project KYOTO developed a platform for mining concepts and events from text across different languages. It uses a layered stand-off representation of text that is shared by 7 languages: English, Dutch, Italian, Spanish, Basque, Chinese and Japanese. The KYOTO Annotation Format (KAF) distinguishes separate layers for structural and semantic aspects of the text that can be stacked on top of each other and that can be extended easily. Once a structural representation of the text in KAF is created, semantic layers are added using modules that work the same for all the languages, creating an interoperable semantic interpretation of the text. The semantic layers are based on wordnet concepts linked to a shared ontology and named entities.
19th May 2011 Internal Research Student Presentations
Kumutha Swampillai - Overview of Research Topic
Douwe Gelling - Overview of Research Topic
12th May 2011 Leon Derczynski University of Sheffield - Processing Temporal Relations
Language requires a description of time in order to allow us to describe change, to plan, and to discuss history. Temporal information extraction has been a persistently difficult task over the past decade. I will discuss my PhD research in this area and outline a partially data-driven method to extract temporal relations from natural language text, with good results.
5th May, 2011 David Weir University of Sussex - Exploiting Distributional Semantics: exploring asymmetry and non-standard contextual features
The distributional hypothesis asserts that words that occur in similar contexts tend to have similar meanings. A growing body of research has been concerned with exploiting the connection between language use and meaning, and much of this work has involved measuring the distributional similarity of words based on the extent to which they share similar contexts. In this talk I look at two particular aspects of how distributional similarity can be measured: the value of asymmetry and the choice of co-occurrence features. These issues will be considered in the contexts of various applications, including cross-domain sentiment analysis and detection of non-compositionality.
14th April, 2011 Paul Rayson Lancaster University - Extreme NLP - Co-presenting with Will Simm, Scott Piao and Maria-Angela Ferrario
In this talk, we will describe Natural Language Processing research and applications which can be loosely described as 'Extreme NLP'. At Lancaster, there are a number of projects which apply NLP techniques in extreme or harsh circumstances and to controversial or challenging topics. For example, we will describe the problems faced when applying corpus-based NLP methods and tools to historical data (Early Modern English) and to online varieties of language (social networks, emails, blogs). Short texts, informal messages and high volumes of data cause multiple issues for existing tools trained on modern standard varieties of language.
The novel application areas such as online child protection, crime, environmental issues, serendipity etc. also mean that it is sometimes difficult to be precise about the exact techniques that are employed.
7th April, 2011 Edward Grefenstette University of Oxford - Categorical Compositionality for Distributional Semantics, Without Tears
Coecke, Sadrzadeh, and Clark (arXiv:1003.4394v1 [cs.CL]) developed a compositional model of meaning for distributional semantics, in which each word in a sentence has a meaning vector and the distributional meaning of the sentence is a function of the tensor products of the word vectors. Abstractly speaking, this function is the morphism corresponding to the grammatical structure of the sentence in the category of finite dimensional vector spaces. In this paper, we provide a concrete method for implementing this linear meaning map, by constructing a corpus-based vector space for the type of sentence. Our construction method is based on structured vector spaces whereby meaning vectors of all sentences, regardless of their grammatical structure, live in the same vector space. Our proposed sentence space is the tensor product of two noun spaces, in which the basis vectors are pairs of words each augmented with a grammatical role. This enables us to compare meanings of sentences by simply taking the inner product of their vectors.
31st March, 2011 Alexander Clark Royal Holloway University of London - Distributional Lattice Grammars: a learnable representation for syntax
A central problem for NLP is grammar induction: the development of unsupervised learning algorithms for syntax. In this paper we present a lattice-theoretic representation for natural language syntax, called Distributional Lattice Grammars.
17th March, 2011 Stephen Clark University of Cambridge - Practical Linguistic Steganography using Synonym Substitution - joint work with Ching-Yun (Frannie) Chang
Linguistic Steganography is concerned with hiding information in a natural language text, for the purposes of sending secret messages. A related area is natural language watermarking, in which information is added to a text in order to identify it, for example for the purposes of copyright. Linguistic Steganography algorithms hide information by manipulating properties of the text, for example by replacing some words with their synonyms. Unlike image-based steganography, linguistic steganography is in its infancy with little existing work. In this talk I will motivate the problem, in particular as an interesting application for NLP and especially generation. Linguistic steganography is a difficult NLP problem because any change to the cover text must retain the meaning and style of the original, in order to prevent detection by an adversary.
10th March, 2011 Internal Research Student Presentations
Xingyi Song - Overview of research topic
Daniel Preotiuc-Pietro - Overview of research topic
Samuel Fernando - Enriching knowledge bases from Wikipedia
Lexical knowledge bases, such as WordNet, have been shown to be useful in a wide range of language processing applications. Enriching such resources using the usual manual approach is costly. This thesis explores methods for enriching WordNet using information from Wikipedia.
3rd March, 2011 John Carroll University of Sussex - Text Mining from User-Generated Content
Over the past five years or so, technology has made it possible for members of the general public to create and publish digital media content, for example in the form of video, audio, or text. Being able to process such content automatically to derive relevant information from it will be of great societal and commercial benefit. In this talk I will present a number of research and commercial applications which I and collaborators are developing, in which we process digital text from sources as diverse as mobile phone text messages, non-native language learner essays, and primary care medical notes. These applications involve a number of language processing challenges, and I will outline how we have overcome them.
24th February, 2011 Leon Derczynski University of Sheffield - ESSLLI course - Word Senses
In an introduction to the tasks of word sense disambiguation and word sense induction, we will discuss a wide range of techniques for the two tasks, from fundamental concepts to state of the art. Further, we survey tools for the development of systems able to participate in past and current evaluation exercises for WSD and WSI (ref: Semeval).
17th February, 2011 Lucia Specia University of Wolverhampton - Quality Estimation for Machine Translation
One of the most popular ways to incorporate Machine Translation (MT) into the human translation workflow is to have humans checking and post-editing the output of MT systems. However, the post-editing of a proportion of the translated segments may require more effort than translating those segments from scratch, without the aid of an MT system. In this talk I will introduce some of my work on quality estimation for MT: the task of predicting the quality of sentences produced by machine translation systems, where "quality" is defined in terms of post-editing effort. A quality estimation system can be used to filter out bad quality translations to prevent human translators spending time post-editing them. I will present the outcomes of experiments with different ways of estimating quality which demonstrate that it is possible to predict post-editing effort using standard machine learning techniques with a relatively small number of training examples and a number of shallow features.
10th February, 2011 Rao Nawab University of Sheffield - Automatic Plagiarism Detection
The task of plagiarism detection using automatic methods has attracted the attention of the academic, commercial and publishing communities. The main objective of my PhD thesis is to explore the problem of automatically detecting extrinsic plagiarism (when the plagiarized text is created by paraphrasing) using IR and NLP techniques.
3rd February, 2011 Adam Kilgarriff Lexical Computing Ltd. - Using Corpora Without the Pain
Corpora are large objects and querying them efficiently is non-trivial. There are substantial costs to building them, storing them, maintaining them, and building and maintaining software to access them. We propose a model where this work is done by a corpus specialist and NLP systems then use corpora via web services or (if there is a local installation) a command-line API. Our corpus tool is fast, even for billion-word corpora, and offers a wide range of queries via its web API. We have large corpora available for twenty-six languages, and are experts in preparing large corpora from the web, with particular expertise in web text cleaning and de-duplication. To increase our coverage of the world's languages, we have a 'corpus factory' programme. For English, we are building corpora that are both bigger and more richly marked up than others available. The 'big corpus' thread is BiWeC (BIg WEb Corpus) for which we currently have 5.5 billion words fully encoded. The 'more richly marked up' thread is the New Model Corpus, which we are setting up as a collaborative project for multiple annotation. The combination of the API model, the corpora, and the tools, will allow many NLP researchers to use bigger and better corpora in more sophisticated ways than would otherwise be possible.
27th January, 2011 Leon Derczynski University of Sheffield - Review of courses from ESSLLI 2010
Last year, I attended the first week of the European Summer School for Logic, Language and Information. In this talk I will briefly recap two of the classes taken there.
13th January, 2011 Diana Maynard University of Sheffield - The National Archives: The GATE-way to Government Transparency
In this talk I will describe work we are undertaking in a short project for the National Archives, improving access to the huge volumes of information they are making available as part of the data.gov.uk initiative publishing government-related material in open and accessible forms as linked data. Together with our partners Ontotext, we have developed tools to import, store and index structured data in a scalable semantic repository, making links from regularly crawled web archive data into this repository storing hundreds of millions of documents, and enabling search via semantic annotation. Document annotation is first carried out using GATE, and then indexed via MIMIR, a new massively scalable multiparadigm index that forms part of the GATE and Ontotext product family.
9th December, 2010 Bill Byrne University of Cambridge - Hierarchical Phrase-based Translation with Weighted Finite State Transducers
I will present recent work in statistical machine translation which uses Weighted Finite-State Transducers (WFSTs) to implement a variety of search and estimation algorithms. I will describe HiFST, a lattice-based decoder for hierarchical phrase-based statistical machine translation. The decoder is implemented with standard WFST operations as an alternative to the well-known cube pruning procedure. I will discuss how improved modelling in translation results from the efficient representation of translation hypotheses and their derivations and scores under translation grammars. We find that the use of WFSTs in translation leads to fewer search errors, better parameter optimisation, improved translation performance, and the ability to extract useful confidence measures under the translation grammar.
8th November, 2010 John Tait Information Retrieval Facility - Slides
7th October, 2010 Danica Damljanovic University of Sheffield - Natural Language Interfaces to Conceptual Models
Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. We study the usability of NLIs from two perspectives: that of the developer who is customising the NLI system, and that of the end-user who uses it for querying. We investigate whether usability methods such as feedback and clarification dialogs can increase the usability for end users and reduce the customisation effort for the developers. To that end, we have developed FREyA - an interactive NLI to ontologies which will be described and demoed during this talk.
2009 - 2010
5th August, 2010 David Guthrie University of Sheffield - Storing the Web in Memory: Space Efficient Language Models using Minimal Perfect Hashing
The availability of the text on the web and very large text collections, such as the Gigaword corpus of newswire and the Google Web1T 1-5gram corpus, have made it possible to build language models incorporating counts of billions of n-grams. In this talk we present novel methods for efficiently storing these large models. We introduce three novel data structures that take advantage of the distribution of n-grams in corpora and make use of various numbers of minimal perfect hashes to compactly store language models containing full frequency counts of billions of n-grams. Our methods use significantly less space than all known approaches and have retrieval speed faster than current language modelling toolkits.
22nd July, 2010 Alberto Diaz Universidad Complutense de Madrid -
In the talk I'll give a short introduction to my research group (members and high-level details about the main research areas), and afterwards I'll explain my research lines and projects in more detail. In particular, I'll talk about personalization for digital newspapers through user modelling and text classification tasks, and about text processing for biomedical documents, including text summarization and ICD-9-CM indexing tasks.
8th July, 2010 Laura Plaza (University of Sheffield Visiting Researcher) - Improving Summarization of Biomedical Documents using Word Sense Disambiguation
We describe a concept-based summarization system for biomedical documents and show that its performance can be improved using Word Sense Disambiguation. The system represents the documents as graphs formed from concepts and relations from the UMLS. A degree-based clustering algorithm is applied to these graphs to discover different themes or topics within the document. To create the graphs, the MetaMap program is used to map the text onto concepts in the UMLS Metathesaurus. This paper shows that applying a graph-based Word Sense Disambiguation algorithm to the output of MetaMap improves the quality of the summaries that are generated.
24th June, 2010 Ronald Denaux (University of Leeds) -
Ronald will first present his work on involving domain experts in ontology engineering through the use of the Rabbit controlled natural language, a tailored ontology engineering methodology and a tailored user interface based on Protege (this all in the context of the Confluence project in a collaboration between the Ordnance Survey and the University of Leeds). In the second part, Ronald will present his current work on Multi-perspective Ontology Engineering where he is investigating a mechanism for capturing the perspective of ontology authors in order to enhance tool support for ontology creation and reuse. In particular, Ronald is working on formalising the purpose of ontologies and eliciting the goals of ontology authors through dialogue games (the second part is in the context of Ronald's PhD).
17th June, 2010 Hector Llorens (University of Sheffield Visiting Researcher) - Temporal information extraction using semantic roles and semantic networks
In recent years, there has been intensive research on the temporal elements of natural language text. The TimeML scheme has recently been adopted as the standard for annotating temporal expressions (TIMEX3), events (EVENT), and their relations ([T,A,S]LINK). This research analyzes the advantages of applying semantic information to the automatic annotation of TimeML elements. For that purpose, a system addressing the automatic annotation of TimeML elements is presented. The system implements an approach which uses semantic roles and semantic networks as additional information, extending classic approaches based on morphosyntactic information. A multilingual analysis, carried out by evaluating the system for Spanish, demonstrated that the approach is valid for different languages, achieving results of the same quality and an improvement over classic approaches. In the talk, I will include an "application proposal" which I intend to develop during my stay there and which will be the application of my thesis. Suggestions and feedback from you and your group about my current and further work will be of great value to me.
30th April, 2010 Atefeh Farzindar (NLP Technologies Inc) - Successful cooperation between the university and industry
NLP Technologies and RALI (Applied Research in Computational Linguistics, Université de Montréal) have developed an automated monitoring system for the automatic summarization and translation of legal decisions. During this seminar, Atefeh Farzindar will discuss the successful cooperation between the university and industry leaders, a milestone in applied research and technology transfer. Experience shows that when industry players combine their strengths and work alongside university experts with the same vision, the result is far greater than what can be achieved separately. She will present her experience with domain-based technologies in the legal and military fields.
22nd April, 2010 Miles Osborne (University of Edinburgh) - What is happening now? Finding events in Massive Message Streams
Social media (e.g. Twitter, blogs, forums, Facebook) has exploded over the last few years. Facebook is now the most visited site on the Web, with Blogger being the 7th and Twitter the 13th. These sites contain the aggregated beliefs and opinions of millions of people on an epic range of topics, and in a large number of languages. Twitter in particular is an example of a massive message stream and finding events embedded in it poses hard engineering challenges. I will explain how we use a variant of Locality Sensitive Hashing to find new stories as they break. The approach scales well, easily dealing with the more than 1 million Tweets a day we process and only needing a single processor. For June 2009, the fastest growing stories all concerned deaths of one kind or another.
15th April, 2010 Peter Wallis (University of Sheffield) - Conversation in Context: what should a robot companion say?
Language as used by humans is a truly amazing thing with multiple roles in our lives. Academics have tended to focus on the way languages convey meaning, and disciplines that come new to the problem such as computer science tend to start with reference semantics and progress to models of meaning that look mathematical and hence solidly academic. Language as used is however beautifully messy. People sing, they lie and swear, they use metaphor and poetry, play word games and talk to themselves. Is there a better way to look at language? Interdisciplinary research is hard not only because each discipline has its own terminology, but also because they usually have different interests. Those of us interested in spoken language interfaces (computer science) however have a shared interest with applied linguistics in how language works in situ. This paper outlines a theory about how language works from applied linguistics and shows how the theory can be used to guide the design of a robot companion.
25th March, 2010 Adam Funk (University of Sheffield) - Ontology-Based Categorization of Web Services with Machine Learning
We discuss the problem of categorizing web services according to a shallow ontology for presentation on a specialist portal. We treat it as a text classification problem and apply first information extraction techniques (using keywords and rules), then machine learning (ML), and finally a combined approach in which ML has priority over keywords. The techniques are evaluated according to standard measures for flat categorization as well as the Balanced Distance Metric for ontological classification and compared with related work in web service categorization. The ML and combined categorization results are good and the system is designed to take users' contributions through the portal's Web 2.0 features as additional training data.
18th March, 2010 Elena Lloret (University of Alicante) - Text Summarization and its Applications in NLP Tasks
Text Summarization, which aims to condense the information contained in one or more documents and present it in a more concise way, can be very useful for helping users to manage the large amounts of information available due to the rapid growth of the Internet. In this talk, I will present the Natural Language Processing and Information Systems Research Group of the University of Alicante (Spain), and next I will focus on Text Summarization as the research topic of my PhD. I will describe a knowledge-based approach to generate extractive summaries, and how this approach has been successfully applied to neighbouring NLP tasks, such as Question Answering, Sentiment Analysis or Text Classification. Finally, some issues regarding the difficult task of evaluating summaries will also be outlined, suggesting preliminary ideas for new directions in the evaluation task.
26th February, 2010 René Witte (Concordia University in Montréal) - Software Engineering and Natural Language Processing: Friends or Foes?
This talk will investigate some connections between software engineering (SE) and natural language processing (NLP). It will attempt to answer questions such as "Why do software engineers use natural language artifacts everywhere, but no NLP?" and "Why, after more than 10 years of modern NLP research, do we still not have the most basic NLP functionalities integrated into our desktops?". In the first part, we examine NLP for SE: Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. However, while source code artifacts are well-managed by today's software development tools, documents are not integrated on a semantic level with their corresponding code artifacts. This results in a number of problems, like the loss of traceability links between code and its documentation (requirements specifications, user guides, design documents). We show how natural language processing approaches can be used to retrieve semantic information from software documents and connect them with source code using ontology alignment techniques. The second part of the talk will investigate the integration of existing NLP techniques (such as summarization or question-answering) into end-user desktop programs (such as email clients or word processors). This work is motivated by the observation that none of the impressive advances in NLP and text mining over the last decade has materialized in the tools and desktop environments in use today. The "Semantic Assistants" project aims to provide effective means for the integration of natural language processing services into existing applications, using an open service-oriented architecture based on OWL ontologies and W3C Web services.
25th February, 2010 Claude Roux (Xerox Research Labs) - TBA
4th February, 2010 Peter Wallis (University of Sheffield) - High Recall Search in Practice
Internet search engines do an amazing thing, but what they can do well has coloured our view of the general problem of search. There are cases where a search engine would be better if the searcher knew he or she had found everything relevant, but how often and how significant these cases are is an open question. One popular notion is that high recall is not that useful as we can get by without it. Although this is sound reasoning, it does not mean there is not an opportunity to be had: Xerox faced this marketing problem with the photocopier, and jet engines had been in use for quite a while before their advantages were quantifiable. One situation where the need for high recall is acknowledged is defence intelligence. Defence has both the will and the resources to develop bespoke systems for its particular needs, and in this talk I describe, in some detail, the needs of the "Health Intelligence" community. I go on to describe how we addressed these needs using an Information Extraction system based on a library of "Fact Extractors".
17th December, 2009 Jose Iria (University of Sheffield) - Machine Learning Approaches to Text and Multimedia Mining
Today's search engines are able to retrieve and index several billion web pages, but the analysis that they perform on the content of these pages is still very shallow, as is, consequently, the functionality that they are able to offer the user. What if these search engines could, for example, extract the factual content from the pages they retrieve, classify the pictures that accompany the text, disambiguate namesakes or mine opinions expressed in the pages? Undoubtedly, this would open a world of possibilities for new functionality and an enhanced user experience, fuelled by richer underlying data models. In this talk, I will describe my research, spanning a number of years, on these topics. The common denominator of the approaches that I will present is that they rely heavily on machine learning techniques to train systems to classify and extract target information. The talk will also give an overview of real-world applications of the systems originating from the research. For instance, in one case we trained one of our systems to extract information from a collection of jet engine reports provided by Rolls-Royce, with a positive impact on the way their engineers search for information in the course of their work.
15th December, 2009 Donia Scott (University of Sussex) - Summarisation and Visualization of Electronic Health Records
10th December, 2009 Roberto Navigli (Universita di Roma "La Sapienza") - Comparing Graph Connectivity Measures for Word Sense Disambiguation
Word sense disambiguation (WSD), the task of identifying the intended meanings (i.e. senses) for words in context, has been a long-standing research objective for Natural Language Processing. While supervised systems typically achieve better performance, they require large amounts of sense-tagged training instances. An alternative solution is that of adopting knowledge-based approaches, that exploit existing knowledge resources to perform WSD and do not need annotated training sets. In this talk, we present an objective comparison of graph-based algorithms for alleviating the data requirements for large-scale WSD. Under this framework, finding the right sense for a given word amounts to identifying the most "important" node among the set of graph nodes representing its senses. We present a variety of measures that exploit the connectivity of graph structures, thereby identifying the most relevant word senses. We assess their performance on standard datasets, and show that the best measures perform comparably to state-of-the-art systems. We also provide interesting insights into the relevance of the underlying knowledge resource on WSD performance. 26th November, 2009 Serge Sharoff (University of Leeds) - Classifying the Web into Domains and Genres
The jungle metaphor is quite common in corpus studies. The subtitle of David Lee's seminal paper on genre classification is 'navigating a path through the BNC jungle'. According to Adam Kilgarriff, the BNC is a jungle only when compared to smaller Brown-type corpora, while it looks more like an English garden when compared to the Web. At the moment we know little about the domains and genres of webpages. In the seminar I'm going to talk about approaches to understand the composition of the Web as a corpus. 19th November, 2009 Luke Zettlemoyer (University of Edinburgh) - Learning to Follow Orders: Reinforcement Learning for Mapping Instructions to Actions
In this talk, I will address the problem of relating linguistic analysis and control: specifically, mapping natural language instructions to executable actions. I will present a reinforcement learning algorithm for inducing these mappings by interacting with virtual computer environments and observing the outcome of the executed actions. This technique has enabled automation of tasks that until now have required human participation, for example automatically configuring software by consulting how-to guides. Our results demonstrate that this method can rival supervised learning techniques while requiring few or no annotated training examples.
29th October, 2009 Allan Ramsay (University of Manchester) - Using English to Express Commonsense Rules
The talk will discuss some issues arising from an attempt to provide natural language access to a body of simple information about diet and its effect on various common medical conditions. Expressing this knowledge in natural language has a number of advantages. It also raises a number of difficult issues. I will outline the reasons why it seemed like a good idea and the reasons why it is difficult, and sketch our solution to these problems. 15th October, 2009 Diana Maynard (University of Sheffield) - Using Lexico-Syntactic Patterns for Ontology Enrichment: the case of ODd SOFAS
This talk describes the use of information extraction techniques involving lexico-syntactic patterns to generate ontological information from unstructured text and augment an existing ontology with new entities. We refine the patterns using a term extraction tool and some semantic restrictions derived from WordNet and VerbNet, in order to prevent the overgeneration that typically occurs with general patterns. We present two applications developed in GATE and available as plugins for the NeOn Toolkit: one for general use on all kinds of text, and one for specific use in the fisheries domain. Both make use of a new plugin for GATE which generates ontologies on the fly. Furthermore, we integrate support for ontology lifecycle development via a change log mechanism that enables logging of ontology versions and application of changes from one version to another.
1st October, 2009 Trevor Cohn (University of Sheffield) - Bayesian Non-Parametric Models for Parsing and Translation Slides
Many natural language processing tasks require inference over partially observed input data. Traditionally these models are trained using the expectation maximisation (EM) algorithm. However, for many models EM finds poor or degenerate solutions. Bayesian methods provide an elegant and theoretically principled way to address these problems, by including a prior over the model and integrating over uncertain events. In this talk I'll describe how we developed non-parametric Bayesian models for two related tasks: 1) learning a tree substitution grammar (DOP) for syntactic parsing and 2) learning a grammar-based machine translation model. The models learn compact and simple grammars, uncovering latent linguistic structures, and in doing so outperform competitive baselines.
2008 - 2009
14th May, 2009 Sivaji Bandyopadhyay (Jadavpur University, India) - Emotion Analysis in Blog texts
Emotion analysis on blog texts is being carried out for a less privileged language like Bengali. A set of six attitude types, namely happy, sad, anger, fear, disgust and surprise, has been selected for this emotion detection task to allow reliable, semi-automatic annotation of the blog texts. An automatic classifier has been applied to recognize these six basic attitude types for the individual words of a sentence. Different scoring strategies have been applied to identify the sentence-level emotion type from the acquired word-level emotion information. Unsupervised techniques have been applied to the classified test output to improve accuracy. The same method has been applied to the English SemEval 2007 Affect Sensing corpus, with satisfactory performance.
7th May, 2009 Leon Derczynski (University of Sheffield) - Sequencing of Events and Their Durations Based on Event Descriptions Slides
Temporal Information Extraction is the elicitation of accurate data on events in a discourse. This specifies both tense and aspect of actions, both explicitly given by text and implicit from world knowledge. Events can occur at any point along a timeline, and are often only loosely specified in terms of upper and/or lower bounds relative to other events. Being able to identify and annotate times in discourse enables us to build a richer representation of the knowledge present in text. Given a document, for example a news article, only a subset of facts within that document ever hold true at any one time. For example, we cannot concurrently assert "The silver and black Scott bike was chained to railings" and "An hour later it was gone". Extracting and temporally linking information is the only way to know which sets of facts hold true at the same time. A brief summary of literature and models surrounding tense and temporal location will be presented, followed by a review of recent work in the field.
We will look at the normalisation of temporal data (anchoring vague expressions to a fixed interval on an absolute time scale), how events in text relate to each other and ways of reasoning about them, and different representations of temporal data - logical, textual and visual. 30th April, 2009 Marta Sabou (Open University) - Exploiting Semantic Web Ontologies: An Experimental Report Slides As a side effect of the Semantic Web research activities, a large collection of ontologies is now available online constituting one of the largest and most heterogeneous knowledge sources in the history of AI. In this talk we report on the characteristics of this novel source and on its successful use for relation discovery. Our experiments show that, in the context of an ontology matching task, relations between the concepts of two ontologies can be discovered with a precision of 70% when using online ontologies. We conclude by exploring the potential of this novel knowledge resource for language technology applications. 16th April, 2009 Kumutha Swampillai (University of Sheffield) - Inter-Sentential and Intra-Sentential Relations in IE Corpora Some information extraction systems are limited to extracting binary relations from single sentences. This constraint means that relations occurring across sentence boundaries cannot possibly be extracted by such systems. We examine the distribution of inter-sentential and intra-sentential relations in the MUC6 and ACE03 corpora. It was found that inter-sentential relations constitute 31.4% and 9.4% of the total number of relations in MUC6 and ACE03 respectively. These results show a 69.6% and a 90.6% recall upper bound of single sentence approaches to relation extraction. As such, any comprehensive approach to relation extraction will have to treat linguistic units larger than a sentence. 
2nd April, 2009 Danica Damljanovic (University of Sheffield) - Natural Language Interfaces to Conceptual Models: Usability and Performance Slides
Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SeRQL or SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain or ontology. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. Many methods are under development to reduce this training and increase the usability of NLIs. We have developed the Question-based Interface to Ontologies (QuestIO), which translates natural language text-based queries into SeRQL/SPARQL queries; these are then executed against the given ontology/knowledge base and the results are shown to the user. Customisation of this system is performed automatically from the ontology vocabulary. QuestIO is quite flexible in terms of the complexity and syntax of the supported queries, as both keyword-based searches and full-blown questions are supported. However, in the user-centric evaluation of this system we noticed that performance was degraded because the users did not get sufficient help from their interaction with the system. In this talk, we propose three methods to assist the user while interacting with the system (feedback, creating a personalised vocabulary, and query refinement) and show how these can be used in combination to improve the usability of NLIs to conceptual models.
19th March, 2009 Peter Wallis (University of Sheffield) - Social Engagement with Robots and Agents (SERA) Slides
Getting people to engage with robotic and virtual artifacts is easy, but keeping them engaged over time is hard: robots and agents lack some fundamental capabilities which can be summarized as sociability. The research community has realized the problem, but approaches, so far, have been dispersed and disjoint. If robots and agents are to become companions in people's lives, they will have to blend into these lives seamlessly. SERA is innovative in that it addresses sociability holistically, by advancing knowledge about what sociability in robots and agents entails, by developing methodology to analyze and evaluate it, and by making available research resources and platforms. SERA will, to this purpose, undertake real-life extended field studies of users' engagement with robotic devices. Sociability also has to be built into robot and agent architectures from scratch, and the goal here is to implement an architecture that caters for both background (cultural, normative etc.) and situational individual (theory of mind, adaptivity, responsiveness) practices and needs of users, with the guiding principle of pervasive affectivity. Assistive robots and agents that are to become true companions have to be versatile in functionality and identity (style, personality) depending on the service they are required to deliver, such as (reactive) social mediators, (in turn reactive and proactive) information assistants, or (proactive) coaches or monitors, e.g. with health-related tasks. SERA will develop pilots of such intertwined interactive service applications for a robotic device.
12th March, 2009 Chris Huyck (Middlesex University) - A Psycholinguistic Model of Natural Language Parsing Implemented in Simulated Neurons Slides
One of the central activities in natural language processing is parsing.
There are a wide range of engineering solutions to parsing but none perform at human levels. The understanding of how humans process language is far from complete, but there is little doubt that humans use their neurons for all mental activities including parsing. There are several psychological models of parsing, but this talk will describe the first neuro-psychological model of parsing. That is, the parser is implemented entirely in simulated neurons. It makes use of Hebb's Cell Assembly hypothesis to form the basis of memories including words, clauses and sentences. Neural parsers require variable binding, and this parser binds via short-term potentiation. The parser produces correct semantic output. As neural cycles have an associated time, time can be measured, and the parser parses in times similar to humans. Prepositional phrase attachment ambiguities are resolved based on the semantics of the sentence. Finally, the parser is embedded in a functioning agent. 5th March, 2009 Monica Schraefel (University of Southampton) - The Path to Joyful Interaction or Why doesn't your computer make you happy? The common computing interaction paradigm is task oriented and task silo'd. We go to a specific application that supports a specific task and do that specific thing. There is some boundary crossing within applications - calendars and address books share data; email is forced into being as flexible as a paper notebook, spreadsheets can be linked into word processing documents. Yet perhaps not too many would say they feel particularly empowered by their computers; that their quality of life is enhanced by interacting with these machines. There are several ways at least in which we might consider why this lack of joy and delight is the more usual experience of computers in our world. One may be this sense of having to do too many things FOR the computer in order for it to do things for us. 
Another may be that even when it has the information, it does not DO what we want with it. It is functionally obtuse. Another may be that the cost of trying to explain what to do is simply too high for the benefit that might accrue. In the past year or so, a few of us have been looking at some of these problems that appear to be quite light weight issues, and yet have been substantial road blocks towards delightful computing. We have been prototyping some approaches to explore new interactions and new types of services that might be both practically effective in freeing us from serving the computer to get on with our own missions, and may, in so doing, serve to enhance our quality of life along the way. In this talk, I'll go over some of these projects, the motivation behind them and how far we've gotten on the path to joyful computing and the perfect digital assistant. 26th February, 2009 Mark Stevenson (University of Sheffield) - Disambiguation of Biomedical Text Slides Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of these texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including the context in which the ambiguous term is used and domain-specific resources (such as UMLS). We compare a range of knowledge sources which have been previously used and introduce a novel one: MeSH terms. The best performance is obtained using linguistic features in combination with MeSH terms. Performance exceeds previously reported results on a standard test set. Our approach is supervised and therefore relies on annotated training examples. A novel approach to automatically acquiring additional training data, based on the relevance feedback technique from Information Retrieval, is presented. 
Applying this method to generate additional training examples is shown to lead to a further increase in performance.
19th February, 2009 Mark A. Greenwood (University of Sheffield) - IR4QA: An Unhappy Marriage Slides
Over a decade of recent question answering (QA) research has relied on using off-the-shelf information retrieval (IR) engines in order to find relevant documents from which exact answers can be extracted. In this talk I will explain why most QA systems follow this approach and summarise the recent research into what has become known as IR4QA. It is becoming increasingly clear, however, that the use of IR within QA systems is nothing more than a marriage of convenience: in general, QA researchers don't want to develop IR engines and IR researchers are not interested in the QA task. I believe that this marriage is doomed and will never lead to the production of high performance QA systems. The second half of the talk will highlight the main problems inherent in modern QA systems which use IR engines and suggest some possible avenues that QA research may take in the future.
12th February, 2009 Ehud Reiter (University of Aberdeen) - BabyTalk: Generating English Summaries of Clinical Data Slides
I will give an overview of the BabyTalk project, whose goal is to generate English summaries of complex clinical data from a neonatal intensive care unit, for doctors, nurses, parents, and other family members. BabyTalk is based on the hypothesis that a textual summary of the most important information in a data set can in some cases be more useful than a visualisation which presents all of the data, or an expert system which explicitly gives advice based on the data. I will primarily focus on NLP challenges in BabyTalk, such as generating good narratives and effectively communicating temporal information. I will also present the results of our first evaluation, which were mixed but overall quite encouraging.
5th February, 2009 Julien Bourdon (Kyoto University) - Language Grid: An Infrastructure for Intercultural Collaboration Slides The Language Grid is an on-line multilingual service platform which enables easy registration and sharing of language services such as on-line dictionaries, bilingual corpora, and machine translations. Unlike existing machine translation systems, the Language Grid allows users to register and combine user-created dictionaries and bilingual corpora with existing machine translations to realize user-oriented translation programs with greater accuracy. The main goals of this project are to combine the existing standard language services provided by linguistic professionals and to assist users to create new language services for their own purpose by permitting them to add their own language resources to the ones made by professionals. Currently, services such as translators, dictionaries, parallel texts, morphological analysers, concept dictionaries, available in 10 languages are deployed on the Language Grid. The Language Grid is used for applications such as multilingual collaboration in NPOs, intercultural coexistence in Japanese schools or hospitals.
4th December, 2008 Diana McCarthy (University of Sussex) - Evaluating Lexical Inventories and Disambiguation Systems with Lexical Substitution Slides There has been a surge of interest within Computational Linguistics over the last decade into methods for word sense disambiguation (WSD). A major catalyst has been the series of SENSEVAL evaluation exercises which have provided standard datasets for the field. Whilst researchers believe that WSD will ultimately prove useful for applications which need some degree of semantic interpretation; the jury is still out on this point. One significant problem is that there is no clear choice of inventory for any given task, other than the use of a parallel corpus for a specific language pair for a machine translation application. Many of the evaluation datasets produced, certainly in English, have used WordNet. Whilst WordNet is a useful resource, it would be beneficial if systems using other inventories could enter the WSD arena without the need for mappings between the inventories which may mask results. This is particularly important since there is no consensus that WordNet sense distinctions are the right ones to make for any given application. As well as the work in disambiguation, there is a growing interest in automatic acquisition of inventories of word meaning. It would be useful to investigate the merits of predefined inventories themselves, aside from their use for disambiguation, and compare these with inventories which have been acquired automatically. In this talk I will discuss these issues and some results in the context of the English Lexical Substitution Task, organised by myself and Roberto Navigli (University of Rome, "La Sapienza") last year under the auspices of SEMEVAL. 27th November, 2008 David Guthrie (University of Sheffield) - Unsupervised Detection of Anomalous Text Slides, PhD Thesis Situations abound that rely on the ability of computers to detect differences from what is normal or expected. 
Credit card companies identify possible fraud by detecting spending patterns that differ from what is 'normal' for a given cardholder, and network analysts detect possible attacks by spotting network traffic that is out of the ordinary. The focus of this talk is the development of unsupervised technologies to similarly detect anomalies in text. We use the term "anomalous" to refer to text that is irregular, or unusual, with respect to the writing style in the majority of a text. In this talk we show that identifying such abnormalities in text can be viewed as a type of outlier detection because these anomalies will deviate significantly from their surrounding context. We consider segments of text which are anomalous with respect to topic (i.e. about a different subject), author (written by a different person), or genre (written for a different audience or from a different source) and experiment with whether it is possible to identify these anomalous segments automatically. Several different innovative approaches to this problem are introduced and we present results over large document collections, created to contain randomly inserted anomalous segments.
18th November, 2008 Seemab Latif (University of Manchester) - Novel Automatic Technique for Linguistic Quality Assessment of Students' Essays Using Automatic Summarizers Slides
In this seminar, I will talk about experiments that measured inter-annotator inconsistency in content selection in both manual and automatic summarization of sample TOEFL essays. A new finding is that the linguistic quality of the source essay has a very strong positive correlation with the degree of disagreement among human assessors as to what should be included in a summary. This leads to a fully automated essay evaluation technique based on the degree of disagreement among automated summarizers.
ROUGE evaluation is used to measure the degree of inconsistency among the participants (human summarizers and automatic summarizers). This automated essay evaluation technique is potentially an important contribution with wider significance.
6 November, 2008 Niraj Aswani (University of Sheffield) - Tools for Alignment Tasks Slides
For some tasks, such as text alignment and cross-document co-reference resolution, one needs to refer to more than one document at the same time. Hence, a need arises for Processing Resources (PRs) which can accept more than one document as parameters. For example, given two documents, a source and a target, a Sentence Alignment PR would need to refer to both of them to identify which sentence of the source document aligns with which sentence of the target document. Similarly, for cross-document co-reference resolution, the respective PR would need to access both documents simultaneously. The standard behaviour of GATE PRs does not meet these requirements: GATE PRs process one document at a time. Even a corpus pipeline, which accepts a corpus as input, considers only one document at a time. That said, it is not impossible to make PRs accept more than one document, but this would require a lot of re-engineering. Recently, we have introduced a few new resources in GATE (e.g. CompoundDocument, CompositeDocument, AlignmentEditor etc.) to address these issues. In this short presentation, I will describe these components and show how to use them.
28 October, 2008 Rob Gaizauskas (University of Sheffield) - Generating Image Captions using Topic Focused Multi-document Summarization Slides
21 October, 2008
Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
Mark A. Greenwood (University of Sheffield) - Evaluation of Automatically Reformulated Questions in Question Series Slides
Having gold standards allows us to evaluate new methods and approaches against a common benchmark. In this paper we describe a set of gold standard question reformulations and associated reformulation guidelines that we have created to support research into automatic interpretation of questions in TREC question series, where questions may refer anaphorically to the target of the series or to answers to previous questions. We also assess various string comparison metrics for their utility as evaluation measures of the proximity of an automated system's reformulations to the gold standard. Finally we show how we have used this approach to assess the question processing capability of our own QA system and to pinpoint areas for improvement.
14 October, 2008 Jordi Poveda (UPC Catalunya) - A Combination of Machine Learning Methods for the Recognition of Temporal Expressions Slides
The recognition of time expressions, and the representation of the time information they convey in a suitable normalized form, is a central part of Information Extraction (IE), for it paves the way for the extraction of events and temporal relations. The most common approach to time expression recognition in the past has been the use of handmade extraction rules (grammars), which also served as the basis for normalization. Our aim is to explore the possibilities afforded by applying machine learning techniques to the recognition of time expressions, in order to see where it stands in relation to grammar-based approaches. We focus on recognizing the appearances of time expressions in text (not normalization) and transform the problem into one of chunking, where the aim is to correctly assign IOB tags to tokens. We will explain the knowledge representation used and compare the results obtained in our experiments with two different supervised methods, one statistical (support vector machines) and one of rule induction (FOIL), where the superiority of SVMs is revealed. Next, we will present a semi-supervised approach, based on bootstrapping, to the extraction of time expression mentions in large unlabelled corpora. The only supervision is in the form of seed examples, hence it becomes necessary to resort to heuristics to rank and filter out spurious patterns and candidate time expressions. We will summarize our preliminary results with this bootstrapping architecture, which is currently in a testing and improvement stage. The ultimate benefit of developing an end-to-end machine-learning-based framework for information extraction is that it can be carried to new domains and tasks with little customization.
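As an illustration of the chunking formulation mentioned in the abstract above, the following minimal Python sketch shows what assigning IOB tags to tokens can look like for time expressions. It is not taken from the talk: the function name `spans_to_iob`, the `TIMEX` label, the example sentence and the hand-picked spans are all assumptions made for illustration. A learned tagger (such as an SVM) would predict these tags rather than derive them from known spans.

```python
from typing import List, Tuple


def spans_to_iob(tokens: List[str],
                 spans: List[Tuple[int, int]],
                 label: str = "TIMEX") -> List[str]:
    """Convert token spans (start inclusive, end exclusive) to IOB tags.

    Hypothetical helper: B- marks the first token of a time expression,
    I- marks the tokens inside it, and O marks everything else.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Illustrative sentence with two hand-annotated time expressions:
# "next Tuesday" (tokens 5-6) and "noon" (token 8).
tokens = "The meeting was moved to next Tuesday at noon .".split()
spans = [(5, 7), (8, 9)]
tags = spans_to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this encoding each token receives exactly one tag, so recognizing time expressions reduces to per-token classification, which is what makes standard supervised classifiers applicable.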