The METER Corpus
Download page
Introduction

The METER corpus was created by the Departments of Journalism and Computer Science at Sheffield University as part of the EPSRC-funded METER project. The METER corpus is a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events as published in nine British newspapers: The Sun, Daily Mirror, Daily Star, Daily Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The Independent.

In some cases the newspaper stories are rewritten from the PA source; in other cases they have been written independently by the newspapers' own journalists. The corpus provides a snapshot of contemporary news reports in the British press and a resource for the analysis of text reuse in journalism. A list of publications on the METER corpus and text reuse can be found here.

Obtaining the corpus

The METER corpus is available for research purposes only and can be downloaded free of charge for academic use. To obtain it, fill out the registration form; once you have submitted it, you will be sent an email with details on how to download the corpus.

The METER corpus is available in three formats:

(1) Plaintext - newswire and newspaper stories are stored in a parallel directory structure. The stories are organised by date and catchline.

(2) SGML format - as above, except that texts are stored in SGML to encapsulate annotations added manually by trained journalists.

(3) XML format - texts are stored in the Text Encoding Initiative (TEI) format in separate files, with one file for each day containing both newswire and newspaper stories.

More information about the plaintext and SGML versions can be found here, and about the XML version here.
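
As a rough illustration of working with the plaintext version, the sketch below groups newswire and newspaper story files that share a date and catchline. The directory layout it assumes (root/<source>/<date>/<catchline>/<story>.txt, with the top-level folders named "newswire" and "newspapers"), the .txt extension and the function name pair_stories are all hypothetical and are not taken from the corpus documentation; check the actual layout of the downloaded corpus and adjust the paths accordingly.

```python
from collections import defaultdict
from pathlib import Path


def pair_stories(root):
    """Group story files by (date, catchline).

    Assumed, hypothetical layout (verify against the real corpus):
        root/<source>/<date>/<catchline>/<story>.txt
    where <source> is "newswire" or "newspapers".
    """
    groups = defaultdict(lambda: {"newswire": [], "newspapers": []})
    for source in ("newswire", "newspapers"):
        # Two directory levels (date, catchline) sit above each story file.
        for path in sorted(Path(root, source).glob("*/*/*.txt")):
            date, catchline = path.parts[-3], path.parts[-2]
            groups[(date, catchline)][source].append(path)
    return dict(groups)
```

Each key of the returned dictionary is a (date, catchline) pair, and each value lists the newswire files and the newspaper files found for that news event, which is a convenient starting point for comparing a PA story with the newspaper versions of it.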

For more information

If you require more information about the METER corpus, please contact either Paul Clough or Rob Gaizauskas:

Paul Clough, Department of Information Studies, University of Sheffield, UK (p.d.clough@sheffield.ac.uk)

Robert Gaizauskas, Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK (R.Gaizauskas@dcs.shef.ac.uk)

Acknowledgements

We would like to thank the UK Press Association for giving us access to their newswire service and for their invaluable help and advice.

Last Modified: February 2005

By: Paul Clough