Homing in on the Text-Initial Cluster. Mike Scott School of English University of Liverpool Aston Corpus Symposium Friday May 4th 2007 This presentation is at www.lexically.net/downloads/corpus_linguistics. Starting Questions.


Homing in on the Text-Initial Cluster

Mike Scott

School of English

University of Liverpool

Aston Corpus Symposium

Friday May 4th 2007

This presentation is at www.lexically.net/downloads/corpus_linguistics

Starting Questions
  • Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position?
  • Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text?
  • If so, are there any common patterns in text-initial clusters?
Context
  • Textual Priming Project, University of Liverpool
    • Michael Hoey
    • Michaela Mahlberg
    • Matthew O’Donnell
    • Mike Scott
Textual Priming Project: Aims
  • to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position
  • to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts.
  • to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis.

from O’Donnell et al. 2007

Hard News Corpus
  • “Home News” sections of the Guardian and Observer
  • 1998 to 2004
  • 115,654 articles
  • divided thus:
    • headline & lead
    • 1st sentence of 1st paragraph (TISC)
    • all other sentences
  • TISC contains 3.2 million tokens
  • The rest: 51.2 million tokens
  • About 470 words per article
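
The figures on this slide can be checked against one another. The sketch below only re-derives the words-per-article figure from the counts given above, on the assumption that "the rest" covers everything outside TISC; the variable names are illustrative.

```python
# Token and article counts as given on the slide.
tisc_tokens = 3_200_000
rest_tokens = 51_200_000
articles = 115_654

# Assumption: TISC plus "the rest" makes up the whole corpus.
total_tokens = tisc_tokens + rest_tokens

words_per_article = total_tokens / articles       # roughly 470
tisc_words_per_article = tisc_tokens / articles   # roughly 28

print(round(words_per_article), round(tisc_words_per_article))
```

On these figures the first sentence accounts for a little under 6% of each article's running words.
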
Research Questions

Using the hard news corpus,

  • How many 3-5 word clusters are found to be key in TISC sections?
  • How many are positively and how many are negatively key?
  • What recurrent patterns can be found in the two types of key cluster?
Methods (1)
  • Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell)
  • Use WordSmith’s WordList tool to compute wordlist indexes of
    • all the text
    • all the TISC sections
  • Using WordList, compute 3-5 word clusters for each index, save as .lst
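
The cluster step can be illustrated outside WordSmith. What follows is a minimal sketch, not WordList's own procedure: the function name `clusters` and the tokenisation rule (upper-case, letters and apostrophes only) are assumptions made for the example.

```python
import re
from collections import Counter

def clusters(text, min_n=3, max_n=5):
    """Count every contiguous 3-5 word sequence in one text."""
    # Assumed tokenisation: upper-case, keep runs of letters and apostrophes.
    tokens = re.findall(r"[A-Z']+", text.upper())
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

sample = "It emerged yesterday that the court heard yesterday"
counts = clusters(sample)
print(counts["IT EMERGED YESTERDAY"])
```

Running the same function once over all the text and once over the TISC sections would give the two cluster lists that the later keyness comparison needs.
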
Top clusters, all sections

GUARDIAN CO UK

ONE OF THE

A HREF HTTP, WWW GUARDIAN CO and similar web links

THE PRIME MINISTER

THE END OF

AS WELL AS

THE NUMBER OF

THERE IS A

SOME OF THE

THERE IS NO

Top clusters, TISC

ONE OF THE

ACCORDING TO A

LAST NIGHT AFTER

FOR THE FIRST

THE FIRST TIME

IS TO BE

FOR THE FIRST TIME

THE MURDER OF

ARE TO BE

THE DEATH OF

OF THE MOST

THE HOME SECRETARY

WAS LAST NIGHT

IT EMERGED YESTERDAY

AS PART OF

AN ATTEMPT TO

THE UNITED STATES

THE NUMBER OF

ONE OF THE MOST

ACCORDING TO THE

Methods (2)
  • Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus
  • Identify patterns in the KW clusters
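
Keyness compares a cluster's frequency in the study list with its frequency in the reference list. The sketch below uses one common two-term formulation of the log-likelihood statistic; the exact calculation inside the KeyWords tool may differ, and the function name and the example frequencies are invented for illustration.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Signed log-likelihood keyness for one cluster (two-term form)."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    # A zero frequency contributes nothing (0 * ln 0 is taken as 0).
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll *= 2
    # A negative sign marks clusters that are relatively rarer in the study corpus.
    relatively_more = freq_study / size_study >= freq_ref / size_ref
    return ll if relatively_more else -ll

# Invented frequencies, with corpus sizes taken from the Hard News Corpus slide.
positive = log_likelihood(500, 3_200_000, 600, 54_400_000)
negative = log_likelihood(10, 3_200_000, 5000, 54_400_000)
print(positive > 0, negative < 0)
```

The sign convention here is what separates positively key clusters from negatively key ones.
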
TISC key clusters

WERE LAST NIGHT

YESTERDAY AFTER A

TONY BLAIR YESTERDAY

COURT HEARD YESTERDAY

WAS TOLD YESTERDAY

WAS JAILED FOR

THE DEATH OF

YEAR OLD BOY

YESTERDAY WHEN THE

WITH THE MURDER OF

ACCORDING TO A

LAST NIGHT AFTER

IT EMERGED YESTERDAY

WAS LAST NIGHT

ARE TO BE

THE MURDER OF

LAST NIGHT WHEN

THE GOVERNMENT YESTERDAY

LAST NIGHT AS

IS TO BE

RQs 1 & 2: Numbers of KW clusters

Using a p value of 0.0000001, a minimum frequency of 3 and the log-likelihood statistic:

  • 8,132 key clusters altogether (in 3.2 million words of text)
  • of which 7,631 were positively key
  • and 501 negatively key

though there is repetition as these are 3-5 word n-grams
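
The thresholds can be expressed as a simple filter. This is a hedged sketch: it assumes the log-likelihood value is referred to a chi-square distribution with one degree of freedom, which is the usual asymptotic treatment; `p_value` and `is_key` are hypothetical helper names.

```python
import math

def p_value(ll):
    """Upper-tail p for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(abs(ll) / 2))

def is_key(ll, freq, p_max=1e-7, min_freq=3):
    """Apply the slide's thresholds: p of 0.0000001 and minimum frequency of 3."""
    return freq >= min_freq and p_value(ll) <= p_max

print(is_key(30.0, 5), is_key(20.0, 5), is_key(30.0, 2))
```

Under this assumption a cluster needs a log-likelihood value a little over 28 to count as key, so the cut-off is a strict one.
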

Research Question 2

Repetition

YESTERDAY FOUND GUILTY

YESTERDAY FOUND GUILTY OF

YESTERDAY FROM A

YESTERDAY FROM THE

YESTERDAY GAVE A

YESTERDAY GAVE HIS

YESTERDAY GAVE THE

YESTERDAY GIVEN A

YESTERDAY GIVEN THE

YESTERDAY GIVEN THE GO

YESTERDAY GIVEN THE GO AHEAD
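
One possible way to discount this repetition is to drop any cluster that is wholly contained in a longer one. The slides do not say that this was done; the sketch below, with its hypothetical `collapse_nested` function, only shows how much of the list above is nested.

```python
def collapse_nested(clusters):
    """Drop any cluster that occurs as a contiguous word sequence inside a longer one."""
    # Padding with spaces makes the substring test respect word boundaries.
    padded = [" " + c + " " for c in clusters]
    kept = []
    for cluster, own in zip(clusters, padded):
        inside_longer = any(own in other and own != other for other in padded)
        if not inside_longer:
            kept.append(cluster)
    return kept

slide_list = [
    "YESTERDAY FOUND GUILTY", "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY FROM A", "YESTERDAY FROM THE",
    "YESTERDAY GAVE A", "YESTERDAY GAVE HIS", "YESTERDAY GAVE THE",
    "YESTERDAY GIVEN A", "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO", "YESTERDAY GIVEN THE GO AHEAD",
]
kept = collapse_nested(slide_list)
print(len(slide_list), len(kept))
```

This ignores frequency, so a short cluster that is far more frequent than its longer extension would still be dropped; a real analysis might want to keep both.
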

Negatively key:

SPOKESMAN FOR THE

PER CENT OF

WE HAVE TO

SAID THAT THE

BUT IT IS

AT A TIME

A SPOKESMAN FOR THE

SAID HE WAS

IT IS NOT

THERE WAS NO

A LOT OF

A SPOKESMAN FOR

THERE IS NO

HE SAID THE

SAID IT WAS

THERE IS A

THIS IS A

THE FACT THAT

AS WELL AS

IT WOULD BE

RQ 1: Numbers of KW clusters
  • Is 8 thousand a large number of distinct key text-initial clusters?
  • In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether…
  • about one in 10 is associated with text initial position at the .0000001 level of significance
RQ 1, continued
  • … is 1 in 10 a large number to be key?
  • In the case of SISC (sentences from paragraphs with only one sentence in), we get
  • 507 thousand clusters, of which
  • 2,192 are key (1,747 positively and 445 negatively)
  • which is about 1 in 230
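
The two proportions can be re-derived from the counts on the slides. The cluster totals (84 thousand and 507 thousand) are rounded figures, so the results are approximate.

```python
# Counts as given on the slides; totals are rounded to the nearest thousand.
tisc_key, tisc_positive, tisc_all = 8_132, 7_631, 84_000
sisc_key, sisc_positive, sisc_all = 2_192, 1_747, 507_000

tisc_one_in = tisc_all / tisc_key   # roughly 1 in 10
sisc_one_in = sisc_all / sisc_key   # roughly 1 in 231

# Share of key clusters that are positively key in each section type.
tisc_positive_share = tisc_positive / tisc_key
sisc_positive_share = sisc_positive / sisc_key

print(round(tisc_one_in), round(sisc_one_in))
```

On these figures a cluster in TISC is over twenty times more likely to be key than a cluster in SISC.
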
RQ 3: patterns
  • recency:
  • in the top 200, seventy express time, generally using yesterday or last night
Recency clusters

YESTERDAY IN A

IT EMERGED LAST NIGHT

A COURT HEARD YESTERDAY

YESTERDAY WHEN A

YESTERDAY AFTER THE

EMERGED LAST NIGHT

LAST NIGHT TO

YESTERDAY AS THE

YESTERDAY WHEN THE

WAS TOLD YESTERDAY

COURT HEARD YESTERDAY

TONY BLAIR YESTERDAY

YESTERDAY AFTER A

WERE LAST NIGHT

LAST NIGHT AS

THE GOVERNMENT YESTERDAY

LAST NIGHT WHEN

WAS LAST NIGHT

IT EMERGED YESTERDAY

LAST NIGHT AFTER
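
A recency count of this kind can be approximated mechanically. The sketch below assumes that recency is signalled only by YESTERDAY or LAST NIGHT, as the slide suggests; the pattern and the short sample list are illustrative and not the procedure used in the study.

```python
import re

# Assumption: only these two time expressions mark recency.
RECENCY = re.compile(r"\b(YESTERDAY|LAST NIGHT)\b")

def expresses_recency(cluster):
    """True if the cluster contains one of the assumed time expressions."""
    return bool(RECENCY.search(cluster))

sample = [
    "IT EMERGED YESTERDAY",
    "LAST NIGHT AFTER",
    "THE DEATH OF",
    "ARE TO BE",
    "WAS LAST NIGHT",
]
recent = [c for c in sample if expresses_recency(c)]
print(len(recent), "of", len(sample))
```

Applied to the top 200 key clusters, a filter like this would give a first approximation that could then be checked by hand.
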

Superlatives

ONE OF BRITAIN'S MOST

ONE OF THE MOST

OF THE WORLD'S

THE FIRST TIME

OF BRITAIN'S MOST

FOR THE FIRST

FOR THE FIRST TIME

Research, Report etc.

ACCORDING TO A REPORT

A COURT HEARD (YESTERDAY)

ACCORDING TO RESEARCH

TO A SURVEY

IT EMERGED LAST NIGHT

IT WAS ANNOUNCED YESTERDAY

IT WAS REVEALED YESTERDAY

A REPORT PUBLISHED

ACCORDING TO A STUDY

TO RESEARCH PUBLISHED

Attention-grabbers

IT EMERGED THAT

OBSERVER CAN REVEAL

THE OBSERVER CAN REVEAL

Indefinite articles positively key….

A LABOUR MP

A LANDMARK RULING

A LAST DITCH ATTEMPT TO

A LAST MINUTE

A LEADING BRITISH

A LEADING SCIENTIST

A LEGAL BATTLE

A LEGAL CHALLENGE

A BABY GIRL

A BAN ON

A BEACH IN

A BID TO

A BITTER ROW

A BLACK MAN

A BLISTERING ATTACK ON

A JURY WAS TOLD YESTERDAY

Indefinite articles negatively key

A KIND OF

A COUPLE OF

A GREAT DEAL

A LOT MORE

IT + reporting verb – positively key

IT WAS ANNOUNCED LAST NIGHT

IT WAS CLAIMED LAST NIGHT

IT WAS CONFIRMED LAST NIGHT

IT IS REVEALED TODAY

IT otherwise negatively key:

IT IS A

IT IS ABOUT

IT IS EXPECTED

IT IS GOING

IT IS ONLY

IT IS POSSIBLE

IT SEEMS TO

SAID YESTERDAY – positively key

SAID YESTERDAY AFTER

SAID YESTERDAY THAT HE

SAID YESTERDAY THEY HAD

SAID without time – negatively key

SAID AT THE

SAID HE HAD

SAID HE WOULD

SAID THE GOVERNMENT

SAID THERE WAS NO

Conclusions
  • The “once upon a time” syndrome seems to be much more common than might be thought.
  • In text-initial sections of 115 thousand hard news stories (3.2 m. words), out of 84 thousand 3-5 word clusters, about 1 in 10 (some 8 thousand) had text-initial significance
  • whereas in SISC (sentences from single-sentence paragraphs) only about 1 in 230 was key
Other patterns
  • recency
  • superlatives
  • research, report
  • attention-grabbers
  • indefinite articles
  • IT + reporting verb; SAID + time
References
  • O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’: Exploring the Implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.