Homing in on the Text-Initial Cluster

Mike Scott, School of English, University of Liverpool. Aston Corpus Symposium, Friday May 4th 2007. This presentation is at www.lexically.net/downloads/corpus_linguistics.

Presentation Transcript


  1. Homing in on the Text-Initial Cluster • Mike Scott, School of English, University of Liverpool • Aston Corpus Symposium, Friday May 4th 2007 • This presentation is at www.lexically.net/downloads/corpus_linguistics

  2. Starting Questions • Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position? • Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text? • If so, are there any common patterns in text-initial clusters?

  3. Context • Textual Priming Project, University of Liverpool • Michael Hoey • Michaela Mahlberg • Matthew O’Donnell • Mike Scott

  4. Textual Priming Project: Aims • to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position • to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts • to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis (from O’Donnell et al. 2007)

  5. Hard News Corpus • “Home News” sections of the Guardian and Observer • 1998 to 2004 • 115,654 articles • each divided into: headline & lead; 1st sentence of 1st paragraph (TISC); all other sentences • TISC contains 3.2 million tokens • The rest: 51.2 million tokens • About 470 words per article

  6. Research Questions Using the hard news corpus, • How many 3-5 word clusters are found to be key in TISC sections? • How many are positively and how many are negatively key? • What recurrent patterns can be found in the two types of key cluster?

  7. Methods (1) • Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell) • Use WordSmith’s WordList tool to compute wordlist indexes of (a) all the text and (b) all the TISC sections • Using WordList, compute 3-5 word clusters for each index and save as .lst
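
The cluster-counting step on this slide can be sketched in a few lines of Python. This is only an illustration of what a 3-5 word cluster count involves, not WordSmith's implementation; the function name `count_clusters` and the sample sentence are made up for the example, and real tools apply further settings (sentence boundaries, minimum frequency) that are not modelled here.

```python
from collections import Counter

def count_clusters(tokens, n_min=3, n_max=5):
    """Count every contiguous run of n_min..n_max tokens (an n-gram)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

# Hypothetical text-initial sentence, upper-cased as on the slides.
sentence = "a man was jailed for life yesterday"
counts = count_clusters(sentence.upper().split())

for cluster, freq in sorted(counts.items()):
    print(freq, cluster)
```

Running the same function over all the text and over the TISC sections separately would give the two cluster lists that the later keyness comparison needs.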

  8. Top clusters, all sections: GUARDIAN CO UK; ONE OF THE; A HREF HTTP, WWW GUARDIAN CO and similar web links; THE PRIME MINISTER; THE END OF; AS WELL AS; THE NUMBER OF; THERE IS A; SOME OF THE; THERE IS NO

  9. Top clusters, TISC: ONE OF THE; ACCORDING TO A; LAST NIGHT AFTER; FOR THE FIRST; THE FIRST TIME; IS TO BE; FOR THE FIRST TIME; THE MURDER OF; ARE TO BE; THE DEATH OF; OF THE MOST; THE HOME SECRETARY; WAS LAST NIGHT; IT EMERGED YESTERDAY; AS PART OF; AN ATTEMPT TO; THE UNITED STATES; THE NUMBER OF; ONE OF THE MOST; ACCORDING TO THE

  10. Methods (2) • Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus • Identify patterns in the KW clusters
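
Keyness compares how often a cluster occurs in the study corpus (here the TISC clusters) with how often it occurs in the reference corpus (here all the text). A minimal sketch, assuming the standard log likelihood (G2) statistic over a 2x2 contingency table; the function name, the signed return value and the sample counts are assumptions for illustration, and the KeyWords tool may differ in detail.

```python
import math

def signed_log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """G2 for one item: positive if relatively more frequent in the study corpus."""
    observed = [
        [freq_study, size_study - freq_study],
        [freq_ref, size_ref - freq_ref],
    ]
    row_totals = [size_study, size_ref]
    total = size_study + size_ref
    col_totals = [freq_study + freq_ref, total - freq_study - freq_ref]

    g2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            if observed[r][c] > 0:  # a zero cell contributes nothing
                g2 += observed[r][c] * math.log(observed[r][c] / expected)
    g2 *= 2

    # Attach a sign so that under-use in the study corpus is negative.
    if freq_study / size_study < freq_ref / size_ref:
        g2 = -g2
    return g2

# Hypothetical counts: a cluster seen 500 times in a 3.2 million token
# study corpus and 1,000 times in a 54.4 million token reference corpus.
print(signed_log_likelihood(500, 3_200_000, 1_000, 54_400_000))
```

Returning a signed value is a design choice made here so that positively and negatively key items can be told apart from a single number.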

  11. TISC key clusters: WERE LAST NIGHT; YESTERDAY AFTER A; TONY BLAIR YESTERDAY; COURT HEARD YESTERDAY; WAS TOLD YESTERDAY; WAS JAILED FOR; THE DEATH OF; YEAR OLD BOY; YESTERDAY WHEN THE; WITH THE MURDER OF; ACCORDING TO A; LAST NIGHT AFTER; IT EMERGED YESTERDAY; WAS LAST NIGHT; ARE TO BE; THE MURDER OF; LAST NIGHT WHEN; THE GOVERNMENT YESTERDAY; LAST NIGHT AS; IS TO BE

  12. Numbers of Key Clusters

  13. RQs 1 & 2: Numbers of KW clusters • Using the log likelihood statistic with a p value of 0.0000001 and a minimum frequency of 3: • 8,132 key clusters altogether (in 3.2 million words of text) • of which 7,631 were positively key • and 501 negatively key • though there is repetition, as these are 3-5 word n-grams
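
The thresholds on this slide can be applied as a simple filter. The sketch below assumes that a log likelihood score is tested against the chi-square distribution with one degree of freedom, for which the upper-tail probability has the closed form erfc(sqrt(G2 / 2)); it also assumes the minimum frequency applies to the study corpus count. The function names and the signed-score convention (negative means under-used in the study corpus) are illustrative, not taken from the slides.

```python
import math

def p_value(g2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(abs(g2) / 2))

def classify(signed_g2, freq, p_max=0.0000001, min_freq=3):
    """Label a cluster using the slide's p value and minimum frequency."""
    if freq < min_freq or p_value(signed_g2) > p_max:
        return "not key"
    return "positively key" if signed_g2 > 0 else "negatively key"

# Hypothetical scores for three clusters, each with frequency 5.
for score in (30.0, -30.0, 10.0):
    print(score, classify(score, freq=5))
```

With a p value this strict, a score that would pass a conventional 0.05 test (about 3.84) falls well short of being key.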

  14. Repetition: YESTERDAY FOUND GUILTY; YESTERDAY FOUND GUILTY OF; YESTERDAY FROM A; YESTERDAY FROM THE; YESTERDAY GAVE A; YESTERDAY GAVE HIS; YESTERDAY GAVE THE; YESTERDAY GIVEN A; YESTERDAY GIVEN THE; YESTERDAY GIVEN THE GO; YESTERDAY GIVEN THE GO AHEAD
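
One way to gauge how much of the key-cluster count is repetition of this kind is to drop every cluster that sits inside a longer one. The slides only note the repetition and do not describe any such step, so the following is a hypothetical sketch; `maximal_clusters` is a made-up name.

```python
def maximal_clusters(clusters):
    """Keep only clusters that are not a contiguous word run inside a longer one."""
    # Padding with spaces makes substring matching respect word boundaries.
    padded = [f" {c} " for c in clusters]
    keep = []
    for cluster, own in zip(clusters, padded):
        if not any(own in other and own != other for other in padded):
            keep.append(cluster)
    return keep

sample = [
    "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO",
    "YESTERDAY GIVEN THE GO AHEAD",
    "YESTERDAY GAVE A",
    "YESTERDAY GIVEN A",
]
print(maximal_clusters(sample))
```

Here the two shorter GIVEN THE clusters are absorbed by YESTERDAY GIVEN THE GO AHEAD, while the unrelated three-word clusters survive.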

  15. Negatively key: SPOKESMAN FOR THE; PER CENT OF; WE HAVE TO; SAID THAT THE; BUT IT IS; AT A TIME; A SPOKESMAN FOR THE; SAID HE WAS; IT IS NOT; THERE WAS NO; A LOT OF; A SPOKESMAN FOR; THERE IS NO; HE SAID THE; SAID IT WAS; THERE IS A; THIS IS A; THE FACT THAT; AS WELL AS; IT WOULD BE

  16. RQ 1: Numbers of KW clusters • Is 8 thousand a large number of distinct key text-initial clusters? • In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether… • about one in 10 is associated with text initial position at the .0000001 level of significance

  17. RQ 1, continued • … is 1 in 10 a large number to be key? • In the case of SISC (sentences from paragraphs with only one sentence in), we get • 507 thousand clusters, of which • 2,192 are key (1,747 positively and 445 negatively) • which is about 1 in 230
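
The "1 in 10" and "1 in 230" figures on slides 16 and 17 follow directly from the reported counts. A quick check, using the numbers as given on the slides (the two totals are rounded there, so the results are approximate):

```python
# Figures as reported on the slides; the totals are rounded to thousands.
tisc_key, tisc_total = 8_132, 84_000
sisc_key, sisc_total = 2_192, 507_000

def one_in(key, total):
    """Express key/total as 'one in N'."""
    return total / key

print(f"TISC: 1 in {one_in(tisc_key, tisc_total):.1f}")  # TISC: 1 in 10.3
print(f"SISC: 1 in {one_in(sisc_key, sisc_total):.1f}")  # SISC: 1 in 231.3

# The positive and negative counts add up to the reported totals.
assert 7_631 + 501 == tisc_key
assert 1_747 + 445 == sisc_key
```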

  18. PATTERNS

  19. RQ 3: patterns • recency: • in the top 200, seventy express time, generally using yesterday or last night

  20. Recency clusters: YESTERDAY IN A; IT EMERGED LAST NIGHT; A COURT HEARD YESTERDAY; YESTERDAY WHEN A; YESTERDAY AFTER THE; EMERGED LAST NIGHT; LAST NIGHT TO; YESTERDAY AS THE; YESTERDAY WHEN THE; WAS TOLD YESTERDAY; COURT HEARD YESTERDAY; TONY BLAIR YESTERDAY; YESTERDAY AFTER A; WERE LAST NIGHT; LAST NIGHT AS; THE GOVERNMENT YESTERDAY; LAST NIGHT WHEN; WAS LAST NIGHT; IT EMERGED YESTERDAY; LAST NIGHT AFTER
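
A count such as "seventy of the top 200 express time" can be approximated by checking each key cluster for a time marker. The marker list below contains only the two expressions named on slide 19 and is therefore an assumption that may undercount; the function name and the sample are illustrative.

```python
# Only the two markers the slides mention; a fuller list is likely needed.
TIME_MARKERS = ("YESTERDAY", "LAST NIGHT")

def is_recency(cluster):
    """True if the cluster contains a time marker as whole words."""
    padded = f" {cluster} "
    return any(f" {marker} " in padded for marker in TIME_MARKERS)

sample = [
    "IT EMERGED LAST NIGHT",
    "A COURT HEARD YESTERDAY",
    "ONE OF THE MOST",
    "ACCORDING TO A",
]
recency = [c for c in sample if is_recency(c)]
print(len(recency), "of", len(sample), "clusters express recency")
```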

  21. Superlatives: ONE OF BRITAIN'S MOST; ONE OF THE MOST; OF THE WORLD'S; THE FIRST TIME; OF BRITAIN'S MOST; FOR THE FIRST; FOR THE FIRST TIME

  22. Research, Report etc.: ACCORDING TO A REPORT; A COURT HEARD (YESTERDAY); ACCORDING TO RESEARCH; TO A SURVEY; IT EMERGED LAST NIGHT; IT WAS ANNOUNCED YESTERDAY; IT WAS REVEALED YESTERDAY; A REPORT PUBLISHED; ACCORDING TO A STUDY; TO RESEARCH PUBLISHED

  23. Attention-grabbers: IT EMERGED THAT; OBSERVER CAN REVEAL; THE OBSERVER CAN REVEAL

  24. Indefinite articles, positively key: A LABOUR MP; A LANDMARK RULING; A LAST DITCH ATTEMPT TO; A LAST MINUTE; A LEADING BRITISH; A LEADING SCIENTIST; A LEGAL BATTLE; A LEGAL CHALLENGE; A BABY GIRL; A BAN ON; A BEACH IN; A BID TO; A BITTER ROW; A BLACK MAN; A BLISTERING ATTACK ON; A JURY WAS TOLD YESTERDAY

  25. Indefinite articles, negatively key: A KIND OF; A COUPLE OF; A GREAT DEAL; A KIND OF; A LOT MORE

  26. IT + reporting verb – positively key: IT WAS ANNOUNCED LAST NIGHT; IT WAS CLAIMED LAST NIGHT; IT WAS CONFIRMED LAST NIGHT; IT IS REVEALED TODAY

  27. IT otherwise negatively key: IT IS A; IT IS ABOUT; IT IS EXPECTED; IT IS GOING; IT IS ONLY; IT IS POSSIBLE; IT SEEMS TO

  28. SAID YESTERDAY – positively key: SAID YESTERDAY AFTER; SAID YESTERDAY THAT HE; SAID YESTERDAY THEY HAD

  29. SAID without time – negatively key: SAID AT THE; SAID HE HAD; SAID HE WOULD; SAID THE GOVERNMENT; SAID THERE WAS NO

  30. Conclusions • The “once upon a time” syndrome seems to be much more common than might be thought. • In text-initial sections of 115 thousand hard news stories (3.2 m. words), out of 84 thousand 3-5 word clusters, about 8 thousand (1 in 10) had text-initial significance • whereas in non-text-initial sections only 1 in 230 was key

  31. Other patterns • recency • superlatives • research, report • attention-grabbers • indefinite articles • IT + reporting verb; SAID + time

  32. References • O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’ Exploring the Implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.
