The University of Sheffield
Natural Language Processing Group

Language Resources and Architectures

Using computational methods to study language, or developing prototype language processing applications, requires both data resources and processing resources. Data resources (corpora, both annotated and unannotated) are needed for analysis and for training and testing components and systems. Reusable processing resources, such as tokenizers, part-of-speech taggers and parsers, enable new research and development to build on earlier efforts and free researchers from re-implementing components for each new project.

Making multiple data and processing resources accessible and interoperable within a single environment is a challenging task, and one that requires a language processing platform or architecture.

Our contribution

The NLP group has developed perhaps the best known and most widely used architecture for language engineering: the General Architecture for Text Engineering (GATE). GATE is a powerful open-source, Java-based platform for language engineering, with capabilities for processing a wide range of document formats (XML, HTML, PDF, Word, email, plain text, etc.), building modular systems from reusable components, and storing, evaluating and visualising results. It can be used as a research platform or as an integrated development environment for building complex language processing systems, which can then be embedded in larger end-user applications. GATE is delivered with a set of information extraction tools developed at Sheffield; in addition, a wide range of contributed or third-party modules have been integrated within it.


Kalina Bontcheva, Hamish Cunningham, Rob Gaizauskas, Diana Maynard, Wim Peters, Yorick Wilks


Current and Recent