PhD Projects


For a list of awarded PhDs and a downloadable version (where available) please click here
Milan Agatonovic (Supervisor: Prof. Hamish Cunningham)


Ayman Alhelbawy (Supervisor: Prof. Rob Gaizauskas)


Ahmet Aker (Supervisor: Prof. Rob Gaizauskas)

Automatic Image Captioning Techniques, Automatic Text Summarization Techniques, Automatic Multi-document Summarization Techniques

The number of images tagged with location information on the web is growing rapidly. GPS-equipped cameras and phones make it possible to take pictures that are already indexed with GPS coordinates, and online social sites like Flickr and Facebook ensure that these images spread quickly. However, GPS coordinates and/or minimal user-written captions are the only descriptions for the majority of these images. This makes image indexing, organization and search a difficult task. Therefore, methods which could automatically supplement the information available for image indexing and lead to improved image retrieval would be extremely useful. I am working on a technique for automatic image captioning or enhancement of existing captions. My captioning technique takes as input only a set of place names pertaining to an image. For example, an image showing the Eiffel Tower needs to be tagged with the name 'Eiffel Tower'. We have methods to obtain these place names using GPS information only. Each place name is passed to my system, which then generates a short description of it. This technique has an advantage over related work in automatic image captioning in that the GPS information associated with the image is sufficient to generate captions.


Niraj Aswani (Supervisor: Prof. Rob Gaizauskas)

Bootstrapping - A part of speech tagger for the Hindi language

Developing a POS tagger for the Hindi language has become a recent interest of researchers in the NLP community, and as yet only a small number of publications describe POS tagging for the Hindi language. Recently a new technique has been proposed whereby parallel corpora are used to bootstrap new taggers. Given a parallel corpus for two languages, the source language (SL) and the target language (TL), sentences from both languages are aligned as accurately as possible. A tagger for SL is then used to tag the words of SL sentences, and these tags are projected onto the respective aligned words of TL. In this way the text in TL is automatically tagged and can later be used for bootstrapping a POS tagger for the TL.
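
The projection step described above can be illustrated with a minimal sketch. It assumes the word alignment is already available as a list of (source index, target index) pairs; the function name `project_tags`, the tag labels and the toy alignment are hypothetical and are not taken from the project itself.

```python
from typing import List, Optional, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy each source-word tag onto the target word it is aligned with.

    Target words without an alignment link keep the value None.
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        # Assumption: if several source words align to one target word,
        # the first projected tag wins. Other policies are possible.
        if projected[tgt_idx] is None:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


# Toy example: a three-word SL sentence aligned to a four-word TL sentence
# whose word order differs and whose last word has no SL counterpart.
source_tags = ["PRON", "VERB", "NOUN"]      # hypothetical SL tagger output
alignment = [(0, 0), (1, 2), (2, 1)]        # hypothetical word alignment
projected = project_tags(source_tags, alignment, target_length=4)
print(projected)
```

The unaligned target word is left untagged here; how such gaps are filled is exactly the kind of issue a bootstrapped tagger would need to handle.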

We aim to experiment with the same method, projecting POS tags from English words onto the aligned Hindi words. This requires English-Hindi texts to be aligned at the sentence level and subsequently at the word level. Since English and Hindi differ widely in structure and style, a very large number of phenomena need to be dealt with when translating between such a language pair. The final expected outcome of the PhD is not only a Hindi part-of-speech tagger (and other alignment tools) but also an investigation of how similar the two languages are, through a thorough analysis of the issues which need to be taken care of while developing a POS tagger.


Danica Damljanovic (Supervisor: Prof. Hamish Cunningham)

Natural Language Interfaces to Conceptual Models

Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SeRQL or SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain or ontology. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. In this thesis, we study the usability of NLIs from two perspectives: that of the developer who is customising the NLI system, and that of the end-user who uses it for querying. We investigate whether methods such as feedback, clarification dialogs, and query refinement can increase the usability for end users and reduce the customisation effort for the developers. We will test our methods through the development of an NLI system, which will be evaluated using the Mooney dataset. This dataset contains a knowledge base in OWL format, a set of questions classified by complexity, and a set of correct SPARQL queries which correspond to each question.


Leon Derczynski (Supervisor: Prof. Rob Gaizauskas)

Temporal Information Extraction

Documents describe information using natural language. A critical part of language is tense - describing when things happen. As the state of the world around us changes constantly over time, it is critical to understand what is true at any one time. To this end, we need to be aware of temporal information in text, and to process this accurately. The aim of my research is to develop and improve methods for temporal information extraction, processing and understanding when events described in text occur, and what depends on what. Reichenbach, Vendler and Allen have all contributed some useful foundations to this kind of work. My work will tackle some outstanding problems in the field of temporal IE, providing practical solutions, improving the performance of time-related text processing systems and perhaps even providing theoretical contributions.


Samuel Fernando (Supervisor: Dr. Mark Stevenson)

Enriching knowledge bases using relation extraction

Knowledge sources such as WordNet are widely used in language processing applications. These are typically manually constructed. The benefit of this is high accuracy. However, one key problem is that they are time-consuming to adapt to new words and new domains. The aim of my research is to develop and evaluate automatic methods to extract new relations from text. These are then used to enrich WordNet. I am currently using Wikipedia as the source for new relations.

Key tasks in the work:

  • Align WordNet synsets with Wikipedia articles to ensure they both refer to the same concept (using text similarity metrics).
  • Use relation extraction methods to find new relations within the Wikipedia articles. These can be supervised (using existing WordNet relations as training data), or unsupervised (using Hearst patterns e.g. X such as Y).
  • Insert new relations into appropriate places in the ontology and evaluate for accuracy and improvement in external tasks.
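
As an illustration of the unsupervised option in the second task, the following is a minimal sketch of matching the single Hearst pattern "X such as Y". It works on single words rather than noun phrases, and the names `SUCH_AS`, `LIST_SEPARATOR` and `extract_hyponym_pairs` are hypothetical; a real system would use a parser or chunker and many more patterns.

```python
import re
from typing import List, Tuple

# "<hypernym> such as <hyponym>, <hyponym> and <hyponym>"
# Assumption: hypernym and hyponyms are single words; the list ends at
# the first character that is not a word character, space or comma.
SUCH_AS = re.compile(r"(\w+),? such as ([\w ,]+)")
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_hyponym_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in LIST_SEPARATOR.split(match.group(2)):
            hyponym = hyponym.strip()
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


pairs = extract_hyponym_pairs(
    "He plays instruments such as guitar, piano and drums."
)
print(pairs)
```

Because the pattern has no notion of where the list of hyponyms ends, it will over-generate on sentences that continue after the list, which is one reason the extracted relations need to be evaluated before insertion into the ontology.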


    Dowe Gelling (Supervisor: Dr. Trevor Cohn)


    Wei Liu (Supervisor: Prof. Yorick Wilks)

    Identifying and Correcting Transliterated Word Misuse in Chinese

    The Internet has become the most popular platform for communication. However, because most modern computer keyboards are Latin-based, the characters (Hanzi) of Asian languages such as Chinese cannot be input directly with these keyboards. As a result, methods for representing Chinese characters using the Latin alphabet were introduced. The most popular of these is the Pinyin input system. Pinyin is also called "Romanised" Chinese in that it phonetically represents a Chinese character. Due to the highly ambiguous mapping from Pinyin to Chinese characters, word misuse can occur when using a standard computer keyboard, and more commonly so in internet chat rooms or instant messengers, where the language used is less formal. We aim to develop a system that can automatically identify such word misuse, whether it is a simple typo or intentional. After identifying the word misuse, the system should suggest the correct word to be used.


    Rao Nawab (Supervisor: Dr. Mark Stevenson)

    Plagiarism Detection

    Incidences of plagiarism in higher education are widely believed to be on the increase. A large number of documents are freely available on the World Wide Web in multiple languages, making it easier and more tempting for students to plagiarise. There is also evidence that current plagiarism detection systems cannot detect plagiarism when the source document has been obfuscated, for example when it has been translated from another language or heavily paraphrased. The aim of my research is to develop and evaluate automatic methods that can detect plagiarism and overcome the limitations of existing approaches.


    Daniel Preotiuc-Pietro (Supervisor: Dr. Trevor Cohn)

    My thesis deals with temporal modeling of Online Social Network (OSN) textual data. OSNs represent a new data source that is much richer than traditional sources in terms of size, time granularity and multimodality. Due to this richness of data, time patterns are interesting to study in order to discover the evolution of textual information over different time periods, the emergence of new content and the correlation in time of textual information with the other aspects of the data (e.g. location).

    The thesis will analyse the temporal dynamics of three aspects of the textual data in OSNs: the topic, the linguistic form and the location from where it was emitted. The objectives are to gain insight into some real world aspects like topic emergence and volatility, human behavior and language evolution. With these objectives in mind, we adapt existing or build novel machine learning techniques, like graphical models, and evaluate them qualitatively as well as by integrating them in end-user applications (e.g. recommender systems) in order to assess their usefulness.

    The aim of the thesis is to show that temporal analysis of OSN data is a better choice than analysis of data from traditional sources, as it gives us more insight into real-world dynamics and at more fine-grained time intervals.


    Angus Roberts (Supervisor: Prof. Rob Gaizauskas)

    Ontologies for information extraction and integration in bioscience and clinical research


    Kumutha Swampillai (Supervisor: Dr. Mark Stevenson)

    Cross-sentence relation extraction

    Extracting semantic relationships between known entities in natural language text is an active research area in NLP. Current techniques for relation extraction identify relations contained within a single sentence; however, studies have shown that a significant proportion of relations occurring in major information extraction corpora (ACE 2003 and MUC6) occur over sentence boundaries. Cross-sentential relation extraction poses many challenges such as increased computational complexity and data sparsity issues.

    This research aims to see whether supervised machine learning approaches to single sentence relation extraction can be adapted to recognise cross-sentence relations. This requires a representation for the relations which captures features of the two sentences in which the entities occur and the intervening text. Instances of these relations are used for training a relation extraction system which identifies relations occurring across sentences.
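
One possible shape for such a representation is sketched below. It is only an illustration under simple assumptions: sentences are already tokenised, each entity is a single token given as a (sentence index, token index) pair, and the feature names and the function `cross_sentence_features` are hypothetical rather than the features used in this research.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (sentence index, token index)


def cross_sentence_features(
    sentences: List[List[str]],
    entity_a: Position,
    entity_b: Position,
) -> Dict[str, object]:
    """Collect features from both entity sentences and the text between them."""
    (sent_a, tok_a), (sent_b, tok_b) = entity_a, entity_b
    if sent_a >= sent_b:
        raise ValueError("entity_a must occur in an earlier sentence than entity_b")

    # Intervening text: the rest of the first sentence, any whole sentences
    # in between, and the start of the second sentence up to entity B.
    intervening = list(sentences[sent_a][tok_a + 1:])
    for middle in sentences[sent_a + 1:sent_b]:
        intervening.extend(middle)
    intervening.extend(sentences[sent_b][:tok_b])

    return {
        "sentence_distance": sent_b - sent_a,
        "entity_a_sentence": sentences[sent_a],
        "entity_b_sentence": sentences[sent_b],
        "intervening_tokens": intervening,
    }


sentences = [
    ["Acme", "hired", "Smith", "."],
    ["She", "starts", "in", "London", "."],
]
features = cross_sentence_features(sentences, entity_a=(0, 2), entity_b=(1, 3))
print(features["intervening_tokens"])
```

A feature dictionary of this kind would then be vectorised and passed to a supervised classifier, in the same way as for single-sentence relation extraction.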


    Mark Tice (Supervisor: Prof. Rob Gaizauskas)

    Investigating the use of grid technologies to aid information extraction

    Automated training methods are increasingly being used in the development of natural language systems. Bootstrapping algorithms may be used to remedy the lack of sufficient annotated data by attempting to grow a much larger set, given only a small initial 'seed' set of pre-annotated examples. The seed data is analysed, statistics are gathered and generalisations are made; these are then used to annotate a pool of unannotated data, and the most accurate examples are put forward as correct annotations; these may be added to the seed pool, and the process repeats.
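
The loop described above can be sketched as follows. This is a toy illustration, not the system under investigation: the "model" is just per-label feature counts, the confidence measure is the share of feature votes won by the best label, and the names `train`, `predict`, `bootstrap` and the threshold value are all assumptions.

```python
from collections import Counter, defaultdict


def train(seed):
    """Gather per-label feature counts from (features, label) examples."""
    counts = defaultdict(Counter)
    for features, label in seed:
        counts[label].update(features)
    return counts


def predict(model, features):
    """Return (label, confidence); confidence is the winning share of votes."""
    votes = Counter()
    for label, feature_counts in model.items():
        votes[label] = sum(feature_counts[f] for f in features)
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    label, best = votes.most_common(1)[0]
    return label, best / total


def bootstrap(seed, pool, threshold=0.8, max_rounds=10):
    """Grow the seed set by repeatedly annotating the pool and keeping
    only the examples the current model is most confident about."""
    seed, pool = list(seed), list(pool)
    for _ in range(max_rounds):
        model = train(seed)
        accepted, remaining = [], []
        for features in pool:
            label, confidence = predict(model, features)
            if label is not None and confidence >= threshold:
                accepted.append((features, label))
            else:
                remaining.append(features)
        if not accepted:  # nothing new was learned, so stop early
            break
        seed.extend(accepted)
        pool = remaining
    return seed, pool


seed = [(("in", "city"), "LOC"), (("mr", "said"), "PER")]
pool = [("in", "town"), ("town", "of"), ("mr", "told"),
        ("told", "reporters"), ("xyz",)]
grown_seed, leftover = bootstrap(seed, pool)
print(len(grown_seed), leftover)
```

In the toy data, examples such as `("town", "of")` can only be labelled in the second round, after `("in", "town")` has been added to the seed pool, which is the growth effect the paragraph describes.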

    This class of algorithms should be well suited to distribution, given that a relatively small amount of data needs to be communicated, and the iterative processing is generally quite intensive. At the same time, the application of distribution in these cases does present some interesting problems. In particular, learning models from subsets of the seed data will produce a number of individual models which give individual sets of results, without an obvious way of combining them. It could be possible to use these separate models in a combined manner, taking the majority vote as a confidence measure. Alternatively, some form of model-recombination may be attempted whereby each node puts forward its results after a number of local iterations, allowing a central node to gather and evaluate a global view. The nodes are then provided with an updated model and seed pool, and continue for another round of iterations.
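
The first of these options, using the majority vote as a confidence measure, might look like the sketch below. It assumes each node's model has already produced one label for the same example; the function name `majority_vote` and the labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(predictions: List[str]) -> Tuple[str, float]:
    """Combine one label per model into (winning label, share of models agreeing)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Labels proposed for one example by three models, each trained on a
# different subset of the seed data.
label, confidence = majority_vote(["LOC", "LOC", "PER"])
print(label, confidence)
```

An example would then be accepted as a new annotation only when the agreement share exceeds some chosen threshold, which plays the same role as the per-model confidence in the non-distributed algorithm.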


    Milena Yankova (Supervisor: Prof. Hamish Cunningham)

    TERMS: Text Extraction from Redundant and Multiple Sources

    This work is focused on providing a general solution to the identity problem and on recognising different mentions of one and the same fact coming from different sources. Our main hypothesis is that variations of one and the same fact are filterable and even increase the correctness of fact extraction. Fact variations, presented in different ways, will improve the ability of a system to recognise at least one mention of the fact. Once aggregated, the information can be analysed and searched more easily, yielding more accurate and relevant results. Further, we aim to generalise the approach so as to cover identity resolution of entities regardless of their data structure and the source they come from.