Using computational methods to study language, or to develop prototype language processing applications, requires both data resources and processing resources. Data resources -- corpora, both annotated and unannotated -- are needed for analysis and for training and testing components and systems. Reusable processing resources -- such as tokenizers, part-of-speech taggers and parsers -- allow new research and development to build on earlier efforts and free researchers from re-implementing components for each new project.

Making multiple data and processing resources accessible and interoperable within a single environment is a challenging task, and it requires a language processing platform or architecture.

Our contribution

The NLP group has developed perhaps the best-known and most widely used architecture for language engineering -- the General Architecture for Text Engineering (GATE). GATE is a powerful open-source, Java-based platform for language engineering. It can process a wide range of document formats (XML, HTML, PDF, Word, email, plain text, etc.), build modular systems from reusable components, and store, evaluate and visualise results. It can be used as a research platform or as an integrated development environment for building complex language processing systems, which can then be embedded in larger end-user applications. GATE is delivered with a set of information extraction tools developed at Sheffield, and in addition a wide range of contributed or third-party modules have been integrated within it.


Kalina Bontcheva, Hamish Cunningham, Rob Gaizauskas, Diana Maynard, Wim Peters, Yorick Wilks


