The METER Corpus: A corpus for analysing journalistic text reuse 



  1. The METER Corpus: A corpus for analysing journalistic text reuse
     Robert Gaizauskas¹, Jonathan Foster², Yorick Wilks¹, John Arundel², Paul Clough¹, Scott Piao¹
     ¹Department of Computer Science, ²Department of Journalism
     University of Sheffield

  2. Outline of Talk
     • The METER Project and the METER Corpus
     • Text Reuse in the British Press
     • Construction of the Corpus
     • Structure of the Corpus
     • Annotation of the Corpus
     • Preliminary Experiments with the Corpus
     • Conclusion/Discussion
     Corpus Linguistics 2001, Lancaster

  3. The METER Project and the METER Corpus
     • The MEasuring TExt Reuse (METER) project aims
        • to investigate how text is reused in the production of newspaper articles from newswire sources
        • to determine whether algorithms can be discovered to detect and quantify such reuse automatically
     • From this, hope to gain broader insights into the nature of text derivation and paraphrase
        • newspaper-newswire scenario provides an ideal initial case study
        • newspaper-newswire scenario has considerable potential application
     • To assist in this study, have constructed the METER corpus containing
        • newswire source texts
        • newspaper articles reporting the same stories
           • some derived from the newswire texts
           • some not derived from the newswire texts

  4. The Text Derivation Game
     [Diagram: three texts labelled A, B and C, each shown with a question mark (A? B? C?)]

  5. Text Reuse in the British Press
     • The Press Association (PA) is the national news agency for the UK and Ireland
        • provides regional, national and international news 24 hours a day, 365 days a year to media customers throughout Britain and abroad
        • sources 1,500 news, sport and feature stories daily
        • also supplies finance, arts and entertainment and television listings, and materials for websites, magazines, and periodicals
     • PA performs a critical function for the British media in setting the news agenda
        • widely regarded as a credible, authoritative and trustworthy journalistic source
     • PA is widely reused
        • directly: cut and paste; paraphrase
        • indirectly: fact checking; “copy tasting”

  6. Text Reuse in the British Press: Example

     PA version
     A drink-driver who ran into the Queen Mother's official Daimler was fined £700 and banned from driving for two years today. Eamon Reidy, 32, was two-and-a-half times over the drink-drive limit when he rammed the royal car, magistrates in Woking, Surrey, were told. The 99-year-old Queen Mother was not in the vehicle when the accident happened on July 4 in Bishopsgate, Egham, Surrey. Magistrates were told that Reidy sped off before abandoning his car, running across fields and hiding in undergrowth until he was spotted by the police helicopter. Prosecuting Robin Bowen said: "At 8pm the defendant was driving towards Englefield Green in a black Citroen BX and collided with a Daimler limousine, a vehicle which was used on a daily basis by the Queen Mother. She was not in it at the time. It was being driven by a chauffeur
     + 11 sentences

     The Times
     Eamon Reidy, 32, a drink-driver who rammed into Queen Elizabeth the Queen Mother's Daimler, was fined £700 and banned from driving for two years. The Queen Mother was not in car when the accident happened on July 4 in Surrey.

     The Telegraph
     A driver was almost three times over the limit when he crashed into Queen Elizabeth the Queen Mother's Daimler then fled, a court was told yesterday. Eamon Reidy, 32, reversed away but crashed his Citroen BX into a wall at Egham, near Windsor Great Park, Surrey. He then ran off and was caught after a mile-and-a-half chase.

     The Mirror
     A BOOZY driver who smashed into the Queen Mum's chauffeur-driven Daimler minutes after she had been dropped off was banned for two years and fined £700 yesterday. Eamon Reidy, 32, fled across fields in Windsor Great Park after the crash, the court heard. Grandad John Horton, 56, head gardener on the royal estate, chased him in his slippers for one and a half miles as armed cops, dogs and helicopter joined in the pursuit. John caught up with the fugitive and grabbed his arm. But when Reidy threatened him - "he decided discretion was the better part of valour and let him go," Woking magistrates were told. Police discovered airport worker Reidy lying in undergrowth near the Queen Mum's Royal Lodge on the Crown estate. He was found to be two-and-a-half times over the legal limit. Reidy, of Langley, Berks, admitted drink-driving and failing to stop.

     The Sun
     A DRUNK driver who ploughed into the Queen Mother's limo was fined £700 and banned for two years yesterday. Eamon Reidy, 32, was 2½ times over the legal limit when he rammed the parked Daimler in a country lane. The Queen Mum - 99 last week - was not in the car at the time but her chauffeur was. Airport worker Reid sped off after the smash near Egham, Surrey, on July 4. He glanced off a wall and flattened some bushes before abandoning his Citroen. Chased Then he ran 1½ miles across fields chased by crash witness John Horton. Mr Horton finally cornered, him - but Reidy threatened him and fled. Reidy, of Langley, Berks, tried to hide in some undergrowth. But he was spotted by a police helicopter and arrested, magistrates in Woking, Surrey, heard. Defending, Lesley Barry said Reidy was trying to buy a house and had money worries. He had drunk two glasses of champagne at his parents' wedding anniversary party before drinking three pints of strong lager at a pub.

     The Star
     A DRUNK driver who crashed into the back of the Queen Mum's limo was banned for two years yesterday. Airport worker Eamon Reidy, 32, was nearly three times the drink-drive limit when he hit the royal Daimler after a two-and-a-half hour session in the pub. He reversed his black E-reg Citroen BX after the crash and hit a wall before fleeing the crash scene. But he was chased for a mile-and-a-half by a passer-by who gave police a description of the Citroen driver. A helicopter and armed police were drafted into the search and Reidy was found hiding in bushes. The Queen Mother who uses the Daimler daily, was not in the car when it was hit. Reidy refused to comment after the case at Woking magistrates' court. He hit the chauffeur-driven car, registration NLT 2, in Bishopsgate Road, Egham, Surrey, last month. Head gardener John Horton, 56, chased Reidy, who told his pursuer to leave him alone or he would "have him". Reidy was found in bushes by police, but ran off again before he was finally arrested.
     + 11 sentences

  7. Text Reuse in the British Press: Utility of Measuring Reuse
     • Like most newswire agencies, PA does not monitor uptake or dissemination of copy they release because they lack
        • tools
        • technologies
        • conceptual framework for measuring reuse
     • Potential applications of accurately measuring reuse include:
        • monitoring of source take-up to identify unused or little used stories
        • identifying the most reused stories within the British media
        • determining customer dependencies on PA copy
        • new methods for charging customers based upon the amount of copy reused

  8. Construction of the Corpus
     • Texts of the METER corpus were collected manually
        • from the PA online service
        • the paper editions of nine British newspapers: The Sun, Daily Mirror, Daily Star, Daily Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The Independent
     • Scope of corpus is limited to two domains
        • British law court reporting
        • show business stories
     • Court stories
        • substantial amount of data in newspapers and PA
        • regular recurrence in British news
        • revolve around “facts” -- name of the accused, charge, etc. -- limited scope for journalistic interpretation
     • Show business
        • more expansive style -- greater freedom of expression/interpretation
        • more frivolous, light-hearted manner

  9. Construction of the Corpus (cont)
     • Temporal extent of corpus is limited to
        • 24 days for court domain
        • 13 days for show business domain
        • Spread over 1 year period from July 1999 to June 2000
     • PA stories are classified
        • Into broad categories: Courts, Showbiz
        • Stories within these categories called catchlines – e.g. Courts(Axe), Courts(Strangle), Courts(Gamekeeper)
        • Updates for each catchline, called PA pages, throughout the day
     • For each selected catchline
        • All PA pages downloaded
        • Final Southern paper editions of 9 dailies from next day examined
     • Selected newspaper articles were scanned and spell-corrected

  10. Construction of the Corpus: Statistics

     Source       Law and Court            Show Business            Total
                  Words    WD   PD   ND    Words    WD   PD   ND    Words    Texts
     PA           206,354  661 texts       33,325   112 texts       239,679  773
     Other        1,269    0    3    2     0        0    0    0     1,269    5
     Times        34,794   24   41   46    2,966    5    7    2     37,760   125
     Star         14,021   15   28   27    7,590    10   19   7     21,611   106
     Express      21,956   17   27   18    5,270    1    8    5     27,226   76
     Mirror       17,359   22   32   28    4,211    7    11   6     21,570   106
     Mail         31,686   21   29   7     6,414    0    7    7     38,100   71
     Guardian     38,499   12   46   37    3,805    4    6    3     42,304   108
     Telegraph    45,768   30   62   35    2,985    6    7    2     48,753   142
     Sun          18,597   18   37   15    6,010    4    24   7     24,607   105
     Independent  28,689   7    37   46    3,582    2    7    1     32,271   100
     Total        458,992  1,430 texts     76,158   287 texts       535,150  1,717

     PA catchlines: 205 (Law and Court), 60 (Show Business), 265 (total).
     WD, PD and ND give the number of newspaper texts classified as wholly derived, partially derived and non-derived.
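The per-newspaper counts in the statistics table can be cross-checked against the stated totals. The following is a small sketch in Python, with the WD/PD/ND counts copied from the table; the variable and function names are illustrative only and are not part of the corpus distribution.

```python
# Newspaper text counts copied from the statistics slide:
# (WD, PD, ND) for Law and Court, then (WD, PD, ND) for Show Business.
COUNTS = {
    "Other":       ((0, 3, 2),    (0, 0, 0)),
    "Times":       ((24, 41, 46), (5, 7, 2)),
    "Star":        ((15, 28, 27), (10, 19, 7)),
    "Express":     ((17, 27, 18), (1, 8, 5)),
    "Mirror":      ((22, 32, 28), (7, 11, 6)),
    "Mail":        ((21, 29, 7),  (0, 7, 7)),
    "Guardian":    ((12, 46, 37), (4, 6, 3)),
    "Telegraph":   ((30, 62, 35), (6, 7, 2)),
    "Sun":         ((18, 37, 15), (4, 24, 7)),
    "Independent": ((7, 37, 46),  (2, 7, 1)),
}
# Number of PA texts per domain, also from the slide.
PA_TEXTS = {"courts": 661, "showbiz": 112}


def domain_totals(index):
    """Sum the WD, PD and ND counts over all newspapers.

    index 0 selects Law and Court, index 1 selects Show Business.
    """
    wd = sum(row[index][0] for row in COUNTS.values())
    pd = sum(row[index][1] for row in COUNTS.values())
    nd = sum(row[index][2] for row in COUNTS.values())
    return wd, pd, nd


courts = domain_totals(0)
showbiz = domain_totals(1)

for name, (wd, pd, nd) in (("courts", courts), ("showbiz", showbiz)):
    newspaper_texts = wd + pd + nd
    print(f"{name}: WD={wd} PD={pd} ND={nd} "
          f"newspaper texts={newspaper_texts} "
          f"all texts={newspaper_texts + PA_TEXTS[name]}")
```

Adding the PA texts to the newspaper texts for each domain reproduces the "Total" row of the table (1,430 and 287 texts).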

  11. Construction of the Corpus: Story Overlap

  12. Structure of the Corpus
     [Diagram: directory tree of the corpus. Root: meter corpus. Branches: newspapers and PA. Further levels are labelled rawtext / annotated, Courts / Showbiz, dates (12.07.99 ... 21.06.00) and Catchline 1 ... Catchline N. Leaves: Page 1 ... Page N (PA) and Newspaper 1 ... Newspaper N (newspapers), marked "Lowest level of alignment".]
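As an illustration of this hierarchy, the sketch below splits a corpus file path into its levels. The level order is taken from the filename attribute in the annotated Telegraph example on slide 17 (meter_corpus/newspapers/annotated/courts/16.07.99/banker/banker125_telegraph.sgml). Whether every file in the corpus, including the PA side, follows exactly this layout is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class CorpusPath:
    source: str     # e.g. "newspapers"
    layer: str      # e.g. "annotated" or "rawtext"
    domain: str     # e.g. "courts" or "showbiz"
    date: str       # e.g. "16.07.99"
    catchline: str  # e.g. "banker"
    filename: str   # e.g. "banker125_telegraph.sgml"


def parse_corpus_path(path: str) -> CorpusPath:
    """Split a path of the assumed form
    meter_corpus/<source>/<layer>/<domain>/<date>/<catchline>/<file>."""
    parts = PurePosixPath(path).parts
    if len(parts) != 7 or parts[0] != "meter_corpus":
        raise ValueError(f"unexpected corpus path: {path!r}")
    _, source, layer, domain, date, catchline, filename = parts
    return CorpusPath(source, layer, domain, date, catchline, filename)


example = parse_corpus_path(
    "meter_corpus/newspapers/annotated/courts/16.07.99/banker/"
    "banker125_telegraph.sgml"
)
print(example)
```

Grouping files by (domain, date, catchline) would then bring together the PA pages and newspaper articles for one story, which is the level the slide marks as the lowest level of alignment.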

  13. Annotation of the Corpus
     • The METER corpus is annotated at two levels:
        • The document level – indicating degree of derivation from PA
        • The word sequence level – indicating extent of text reuse
     • All annotations were carried out by a single professional journalist
     • Second judgments are being collected for 5% of the material to validate the annotations

  14. Annotation of the Corpus: Classification at the Document Level
     • Each document in the newspaper portion of the corpus is classified to indicate its derivational relation to the PA:
        • Wholly derived (WD) – all content of the target text is derived only from the PA source text
        • Partially derived (PD) – some content of the target text is derived from the source text. Other sources have also been used
        • Non-derived (ND) – no content of the target text is derived from the source text. Although verbatim and rewritten text may appear in the target text, the context, overlap of entities or use of source text is not indicative of reuse

  15. Annotation of the Corpus: Classification at the Word Sequence Level
     • About ½ of the newspaper texts (~450) are annotated at the level of word sequences
        • Verbatim: text that is reused from PA word-for-word in the same context
        • Rewrite: text that is reused from PA, but paraphrased to create a different surface appearance. The context is still the same
        • New: text that does not appear in PA, or that appears to be verbatim or rewritten but is used in a different context

  16. Annotation of the Corpus: DTD
     <METER document> (required)
        Attributes:
        filename: filename of the text (required)
        newspaper: the newspaper name (required)
        domain: courts or showbiz (required)
        classification: either wholly-derived, partially-derived or non-derived (optional)
        pagenumber: the newspaper page number (optional)
        date: the date of publication (required)
        catchline: the catchline as given by the journalist (required)
     <Title> (optional)
        No attributes
     <Body> (required)
        No attributes
     <Verbatim> (optional)
        Attributes: PA_src: the source PA sentence(s) (optional)
     <Rewrite> (optional)
        Attributes: PA_src: the source PA sentence(s) (optional)
     <New> (optional)
        Attributes: PA_src: the source PA sentence(s) (optional)

  17. Annotation of the Corpus: DTD -- Example

     Original PA version:
     BANKER'S BITTERNESS LED TO SYSTEMATIC THEFTS
     By Lyndsay Moss, PA News
     A middle-aged banker who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years today. Trusted Derek Boe, 48, used some of the money to splash out on holidays, buy a car and a caravan, and pay for expensive home improvements.

     Telegraph version:
     A BANKER who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years yesterday. Derek Boe, 48, used some of the money for holidays, to buy a car and a caravan, and to pay for home improvements.

     Annotated Telegraph version:
     <!DOCTYPE meterdocument SYSTEM "meter_corpus/dtds/meter.dtd">
     <meterdocument filename="meter_corpus/newspapers/annotated/courts/16.07.99/banker/banker125_telegraph.sgml"
                    newspaper="telegraph" domain="courts" classification="wholly-derived"
                    pagenumber="4" date="16.07.99" catchline="banker">
     <body>
     <verbatim PA_src="">A </verbatim>
     <verbatim PA_src="">BANKER who stole more than </verbatim>
     <rewrite PA_src="">£270,000 </rewrite>
     <verbatim PA_src="">from his bosses because he resented younger staff being promoted over his head, was jailed for four years </verbatim>
     <rewrite PA_src="">yesterday. </rewrite>
     <verbatim PA_src="">Derek Boe, 48, used some of the money </verbatim>
     <rewrite PA_src="">for </rewrite>
     <verbatim PA_src="">holidays, </verbatim>
     <rewrite PA_src="">to </rewrite>
     <verbatim PA_src="">buy a car and a caravan, and </verbatim>
     <rewrite PA_src="">to </rewrite>
     <verbatim PA_src="">pay for </verbatim>
     <verbatim PA_src="">home improvements. </verbatim>
     </body>
     </meterdocument>
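The word-sequence annotation lends itself to simple summaries, such as how many words of an article are verbatim, rewritten or new. The sketch below is one way to compute that in Python, using the first sentence of the annotated Telegraph example as input. It uses a tolerant regular expression rather than an SGML parser, which is an assumption made for brevity; the function name reuse_profile is hypothetical and counting whitespace-separated tokens as words is a simplification.

```python
import re
from collections import Counter

# Matches <verbatim ...>text</verbatim>, <rewrite ...>...</rewrite> and
# <new ...>...</new>, ignoring case and any attributes on the open tag.
SEGMENT_RE = re.compile(
    r"<(verbatim|rewrite|new)\b[^>]*>(.*?)</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def reuse_profile(sgml: str) -> Counter:
    """Count whitespace-separated words per annotation category."""
    counts = Counter({"verbatim": 0, "rewrite": 0, "new": 0})
    for tag, text in SEGMENT_RE.findall(sgml):
        counts[tag.lower()] += len(text.split())
    return counts


# First sentence of the annotated Telegraph example from the slide.
SAMPLE = """
<body>
<verbatim PA_src="">A </verbatim>
<verbatim PA_src="">BANKER who stole more than </verbatim>
<rewrite PA_src="">£270,000 </rewrite>
<verbatim PA_src="">from his bosses because he resented younger staff being promoted over his head, was jailed for four years </verbatim>
<rewrite PA_src="">yesterday. </rewrite>
</body>
"""

profile = reuse_profile(SAMPLE)
total = sum(profile.values())
for label in ("verbatim", "rewrite", "new"):
    share = profile[label] / total
    print(f"{label}: {profile[label]} words ({share:.0%})")
```

A full corpus reader would more likely validate documents against meter.dtd with an SGML-aware parser; the regular expression is only meant to show what the tags encode.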

  18. Preliminary Experiments with the Corpus
     • Initial experiments are underway to explore techniques for detecting whether a candidate reused text is wholly derived, partially derived or non-derived from a PA source text.
     • Techniques being investigated include:
        • Dotplot
        • Information retrieval text similarity measures (tf.idf)
        • Word n-gram overlap measures
           • 50-70% correct identification of document level classification
        • Statistical alignment techniques
           • 80-90% correct identification of document level classification
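To make the idea of a word n-gram overlap measure concrete, here is a minimal Python sketch of one common variant, n-gram containment: the fraction of the candidate's word n-grams that also occur in the source. The slides do not say which overlap measure, tokenisation or value of n was used in the experiments, so the choices below (lower-cased \w+ tokens, n = 3, the function names) are assumptions for illustration only.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in text (lower-cased, \\w+ tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(candidate: str, source: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that also appear in the source.

    1.0 means every candidate n-gram occurs in the source; 0.0 means none
    do, or the candidate is too short to contain any n-gram.
    """
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


# Opening sentences of the PA and Telegraph versions from slide 17.
pa = ("A middle-aged banker who stole more than £270,000 from his bosses "
      "because he resented younger staff being promoted over his head, "
      "was jailed for four years today.")
telegraph = ("A BANKER who stole more than £270,000 from his bosses "
             "because he resented younger staff being promoted over his "
             "head, was jailed for four years yesterday.")

score = ngram_containment(telegraph, pa, n=3)
print(f"trigram containment of Telegraph in PA: {score:.2f}")
```

A document-level classifier along these lines would compare such a score against thresholds separating wholly derived, partially derived and non-derived texts; any thresholds would have to be estimated from the annotated corpus.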

  19. Conclusion/Discussion
     • Have presented the METER corpus
        • first corpus to attempt to support the study of (legitimate) text reuse
        • first corpus to attempt to systematically align source/derived text in the journalistic world
     • Texts are derived from two domains (Courts and Showbiz) over a period of one year
     • Texts are annotated at two levels
        • Document level – a coarse indication of derivation/reuse
        • Word sequence level – a fine-grained indication of derivation/reuse

  20. Conclusion/Discussion
     • Corpus is limited in terms of
        • Scope (2 domains only)
        • Temporal extent (36 days over 1 year only)
        • Size (1717 stories in total)
        • Annotation content (no links back to source texts)
        • Annotation accuracy (one annotator; evolving conception of annotation guidelines)
     • Primary purpose is to serve as a pilot – if useful/interesting, subsequent versions or related corpora can be created
     • Limited free copies of beta release version available *only* at Corpus Linguistics 2001
     • Distribution through ELRA/LDC being investigated
