corpus was created by the Departments of Journalism and
Computer Science at Sheffield
University as part of the EPSRC-funded
METER project. The METER
corpus is a novel resource for the study and analysis of journalistic text
reuse. The corpus consists of a set of news stories written by the
Press Association (PA), the
major UK news agency, and a set of stories about the same news events as
published in nine British newspapers: The Sun, Daily Mirror, Daily Star, Daily
Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The
Independent. In some cases the newspaper stories are rewritten from the PA source; in other cases
they have been independently written by the newspapers' own journalists. The
corpus provides a snapshot of contemporary news reports in the British Press
and a resource for the analysis of text reuse in journalism. A list of
publications on the METER corpus and text reuse can be found here.
The METER corpus is available for research purposes only and can
be downloaded free of charge for academic use. We require that you fill out a
form; after you submit it, you will be sent an email containing details
on how to download the corpus.
The METER corpus is available in three formats:
(1) Plaintext - newswire and
newspaper stories are stored in a parallel directory structure. The stories are
organised by date and catchline.
(2) SGML format - as above, except that texts are stored in SGML
to encapsulate annotations added manually by trained journalists.
(3) XML format - texts are
stored in the Text Encoding Initiative (TEI) format, with one file
for each day containing both newswire and newspaper stories.
More information about the plaintext
and SGML versions can be found here, and about the XML version here.
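As a rough illustration of working with the plaintext format, the sketch below groups newswire and newspaper files that share a date and catchline. The directory names (`PA`, `newspapers`), the nesting `<tree>/<date>/<catchline>/<file>.txt`, and the function name `pair_stories` are assumptions made for this example only; the actual layout of the distributed corpus may differ and should be checked against its documentation.

```python
from collections import defaultdict
from pathlib import Path


def pair_stories(root, newswire_dir="PA", newspaper_dir="newspapers"):
    """Group story files by (date, catchline) across two parallel trees.

    Hypothetical layout assumed here (not taken from the corpus itself):
        <root>/<tree>/<date>/<catchline>/<file>.txt
    """
    pairs = defaultdict(lambda: {"newswire": [], "newspapers": []})
    for kind, tree in (("newswire", newswire_dir), ("newspapers", newspaper_dir)):
        base = Path(root) / tree
        if not base.is_dir():
            continue  # tolerate a missing tree rather than fail
        for path in sorted(base.glob("*/*/*.txt")):
            # The last three path components are date, catchline, file name.
            date, catchline = path.parts[-3], path.parts[-2]
            pairs[(date, catchline)][kind].append(path)
    return dict(pairs)
```

Each key of the returned dictionary is a `(date, catchline)` pair, and each value lists the newswire and newspaper files found for that event, which is a convenient starting point for comparing a PA source text with the newspaper versions of the same story.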
We would like to thank
the UK Press Association for giving us access to their newswire service and
for their invaluable help and advice.
Last Modified: February