Frequency 2.0: Applying corpus research intelligently

Tom Cobb, UQAM • TC Columbia Colloquium • 2011 Feb 11

Colloquium: "Frequency 2.0: Applying corpus research intelligently" • Abstract Colloq: "Frequency analysis should tell us what is important to know about a language, starting with vocabulary, but the many frequency approaches in the pedagogies of the past have seemed inexplicably ineffective. That was not because the frequency principle is incorrect, but because the corpora underlying the analysis were small and the unit of analysis was the word form. The larger corpora now available have generated new frequency lists that, while remaining stuck on the word form, nonetheless point toward a more intelligent analysis penetrating within and beyond the word. This colloquium will explore the means and motivation of an intelligent approach to frequency."

Frequency from the bottom up In two senses Sense 1

Bottom 1. In an ESL career… • ESP reading in the Middle East boom years • Medicine, engineering, physics… • Years of the “authentic text” • Weak learners • Most obviously in vocab • Common experience of getting hung up for 20 minutes on a “hard word” • That never occurs again in the text • Only gradually realizing which words are recurring • Or whether we had seen them before

From the bottom of a career… • Endless unsequenced texts • Books rarely arrive on time for term • Type them up for first week, second week… • Gradually it’s a corpus

Enter the Osborne luggable • Running a frequency program on a “floppy” • What are the words to ignore in this passage? • What are the high repeaters?

e.g., • Relativistic heavy ion physics is of international and interdisciplinary interest to nuclear physics, particle physics, astrophysics, condensed matter physics and cosmology. The primary goal of this field of research is to re-create in the laboratory a novel state of matter, the quark-gluon plasma (QGP), which is predicted by the standard model of particle physics (Quantum Chromodynamics) to have existed ten millionths of a second after the Big Bang (origin of the Universe) and may exist in the cores of very dense stars. • STAR searches for signatures of quark-gluon plasma formation and investigates the behavior of strongly interacting matter at high energy density by focusing on measurements of hadron production over a large solid angle. It utilizes a large volume Time Projection Chambers (TPC) for tracking and particle identification in a high track density environment. STAR will measure many observables simultaneously on an event-by-event basis to study signatures of a possible QGP phase transition and the space-time evolution of the collision process at their respective energy. The goal is to obtain a fundamental understanding of the microscopic structure of hadronic interactions, at the level of quarks and gluons, at high energy densities. • STAR is one of two large-scale experiments under construction at the Relativistic Heavy Ion Collider (RHIC) at the Brookhaven National Laboratory (BNL) on Long Island (New York) for operation in 1999. It has been designed to focus primarily on hadronic observables and features a large acceptance for high precision tracking and momentum analysis at center of mass (c.m.) rapidity. Specific to RHIC will be: • significantly increased particle production (thousands of particles produced). • hard parton-parton scattering in heavy ion collisions.
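A frequency program of the kind described could be sketched roughly as follows. This is a minimal illustration, not the program actually run on the Osborne: the function name `frequency_list` and the tokenizing rule are assumptions, and the sample is a shortened adaptation of the passage above.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first.

    Tokenizing rule (an assumption): lowercase alphabetic strings,
    with internal hyphens kept so "quark-gluon" stays one word.
    """
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens).most_common()

# Shortened adaptation of the physics passage, for illustration only.
sample = ("STAR searches for signatures of quark-gluon plasma formation and "
          "investigates the behavior of strongly interacting matter at high "
          "energy density. The goal is to obtain a fundamental understanding "
          "of hadronic interactions at high energy densities.")

freq = frequency_list(sample)
repeaters = [(w, n) for w, n in freq if n > 1]   # the "high repeaters"
one_offs = [w for w, n in freq if n == 1]        # candidates to ignore

print(repeaters)
print(one_offs)
```

Words that occur once are the ones a learner can afford to skip on a first reading; the repeaters are where teaching time pays off.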

Yet the limits of a bare freq list • Case: Big pre-teaching of the word “respect” • Admiration for aged persons, etc. • While the only use in the text is… • “with respect to” • “in this respect” • Solution: Frequency with context

So let’s combine Freq and Context • http://www.lextutor.ca/concordancers/text_concord
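Combining frequency with context amounts to a keyword-in-context (KWIC) listing. The sketch below is only an illustration of the idea, not the code behind the Lextutor concordancer; the function name `concordance`, the `width` parameter, and the two-sentence passage are all invented for the example.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify the left context so the keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented passage showing the two uses mentioned on the slide.
passage = ("With respect to particle production, the two detectors differ. "
           "In this respect the design is conventional.")

for line in concordance(passage, "respect"):
    print(line)
```

Seen in context, both hits are the prepositional uses, so pre-teaching the "admiration" sense would have missed the text entirely.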

If I were doing this now, I would… • 1. Familize the text • 2. Calculate “keyness” of each family • Frequency in this text cf. frequency in a standard corpus • http://www.lextutor.ca/keyword
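The two steps can be sketched as below. Everything here is an assumption made for illustration: the tiny `FAMILIES` table, the reference counts, and the choice of a simple ratio of relative frequencies as the keyness measure (other measures, such as log-likelihood, are also in common use). It is not a description of how the Lextutor keyword routine is implemented.

```python
from collections import Counter

# Hypothetical family table: inflected or derived form -> family headword.
FAMILIES = {
    "collisions": "collide", "collision": "collide", "collider": "collide",
    "particles": "particle", "densities": "dense", "density": "dense",
}

def familize(tokens):
    """Step 1: replace each token by its family headword where known."""
    return [FAMILIES.get(t, t) for t in tokens]

def keyness(text_tokens, ref_counts, ref_size, smoothing=0.5):
    """Step 2: rate in this text divided by rate in a reference corpus.

    smoothing avoids division by zero for families absent from the
    reference counts (an arbitrary choice for this sketch).
    """
    counts = Counter(familize(text_tokens))
    n = len(text_tokens)
    scores = {}
    for family, c in counts.items():
        text_rate = c / n
        ref_rate = (ref_counts.get(family, 0) + smoothing) / ref_size
        scores[family] = text_rate / ref_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up reference counts per 1,000,000 tokens.
reference = {"the": 60000, "of": 30000, "and": 27000,
             "particle": 20, "collide": 10, "dense": 15}
tokens = "the particle collisions and the collision density of the particles".split()

ranked = keyness(tokens, reference, ref_size=1_000_000)
for family, score in ranked:
    print(f"{family:10s} {score:10.1f}")
```

Function words are frequent in the text but equally frequent everywhere, so they sink to the bottom; the families that are unusually frequent in this text rise to the top.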

Frequency analysis made the job doable

But the frequency analysis gets more sophisticated… • When we add in frequency-in-the language data • This makes coverage calculations possible • For particular word sets • Or particular knowledge levels • We can say, e.g., that a learner who knows the 1000 most frequent words of English… • Knows about 65% of the items in a science text • Or about 70% of the items in a newspaper text • Or about 80% of the non-proper items in a TV sitcom
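A coverage calculation is simply the share of running words in a text that fall inside a given word set. The sketch below uses a ten-word stand-in for a real frequency band, so the figure it produces is illustrative only; the function name `coverage` and the tokenizing rule are assumptions.

```python
import re

def coverage(text, known_words):
    """Percentage of tokens in text that belong to known_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in known_words)
    return 100.0 * known / len(tokens)

# Stand-in for a real first-1000 list (hypothetical, far too small).
known = {"the", "of", "is", "to", "a", "in", "and", "this", "field", "goal"}

sentence = ("The primary goal of this field of research is to re-create "
            "in the laboratory a novel state of matter")

print(f"{coverage(sentence, known):.0f}% of tokens are in the known set")
```

With a genuine 1000-family list in place of `known`, the same function gives the kind of coverage figures quoted above for science, newspaper, and sitcom texts.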

1970’s corpora showed us the typical distributions of words in a language

A few words give a lot of coverage While most words are extremely infrequent

This frequency analysis seemed like a boon… • SUDDENLY it became possible to analyze what lexis a learner would need to know to read a certain kind of text • Or pass a certain kind of test

Many mysteries resolved by frequency testing…

But frequency analysis gets even more sophisticated… • So lexis comes in a small high-frequency pack • And lower freq trail-off • In a ZIPFIAN distribution • Practical utility of learners knowing this pack seems indubitable • But it goes beyond that

Zipf • There is a predictable correlation between (logs of) type rank (1, 2, 3) and token frequency (19, 15, 9) • Zipf, G. K. (1935). The psycho-biology of language: An introduction to dynamic philology. Cambridge, MA: MIT Press.

A few words used heavily • Many words used little • In a systematic few-many ratio • All lex frequency patterns are typically “Zipfian” • Type:token = Rank:Freq = Rank:Coverage (r = -.98)
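The rank-frequency relation can be checked on any frequency list by correlating the log of rank with the log of frequency. The following is a minimal sketch; `zipf_fit` is a hypothetical helper name, and the "ideal" counts are constructed so that frequency is exactly proportional to 1/rank.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def zipf_fit(counts):
    """Correlate log(rank) with log(frequency) for a word -> count table."""
    freqs = sorted(counts.values(), reverse=True)
    log_rank = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_freq = [math.log(f) for f in freqs]
    return pearson(log_rank, log_freq)

# Constructed ideal case: frequency = 1200 / rank.
ideal = {f"w{r}": 1200 // r for r in (1, 2, 3, 4, 5, 6)}
# The three frequencies from the slide (19, 15, 9) at ranks 1, 2, 3.
slide = {"first": 19, "second": 15, "third": 9}

print(round(zipf_fit(ideal), 3))
print(round(zipf_fit(slide), 3))
```

A value close to -1 means frequency falls off with rank along a near-straight line on log-log axes, which is the pattern the r = -.98 on the slide reports.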

Is lexis the only zone with a Zipfian distribution? • I.e. with a few VHF items • And a “long tail” of lower frequency items • I have been collecting instances of studies showing possibly Zipfian distributions for three years • To write a review paper • Showing a general Zipfian portrait of English • Across as many aspects of language as possible • With a view to identifying “the parts of English we need to teach” • Why? • A belief that we language teachers systematically focus learners’ attention on low frequency items • In all dimensions, but notably idioms • “Raining cats and dogs” at 3 occurrences per 100 million • Because they are interesting to linguists, psychologists, us… • But not what learners need first

More on why A quick diversion • Al-Ain text? • Brossard vocab test results?

Multiple Zipfian Zones: The grist for my review • Word stress. Of all the many possible English stress patterns, only a few are used. (Murphy and Kandil, 2004).

• Pronunciation segmentals. Of all the phonetically important features, only a few make a real difference. (Jenkins, 2002) • Grammar. In the Longman Grammar of Spoken and Written English, Biber et al identify a syllabus of the “few most used,” as opposed to all possible, grammatical forms. (Biber et al, 200x) • Multi-Word Units. Simpson and Mendis (2003) suggest that, at least within genres, we may find high frequency core items.

MWUs (cont’d) • Gardner and Davies (2007), focusing specifically on phrasal verbs (verb-particle combinations like look up) • found that 20 lexical verbs combine with just eight adverbial particles • in 160 combinations • to account for more than 50% of the >500,000 phrasal verbs in the British National Corpus (BNC) • Martinez & Schmitt (2010) “A phrasal expressions list.” • Many VHF MWUs are more frequent than many VHF words • Viz, 505 of these belong in the most frequent 5,000 lex items of the BNC freq list • (Or >10% of the most common words)

• Biber et al. Longman Grammar of Spoken and Written English. • Gardner, D. & Davies, M. (2007). Pointing out frequent phrasal verbs: A corpus-based analysis. TESOL Quarterly. • Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an International Language. Applied Linguistics, 23(1), 83-103. • Martinez & Schmitt (2010). A phrasal expressions list. (First author’s PhD). • Murphy, J. & Kandil, M. (2004). Word-level stress patterns in the academic word list. System, 32(1), 61-74. • Simpson, R. & Mendis, D. (2003). A corpus-based study of idioms in academic speech. TESOL Quarterly, 37(3), 419-441.

Only to find… About two weeks ago • That Nick Ellis and colleagues have developed a thoroughgoing model • Of Zipfian patterning pervading language at all levels • Including semantic • And linked it to the big questions of language acquisition

Ellis: the Zipfian distribution provides a cognitive explanation of language acquisition • Longstanding question • Do we need an LAD? • To explain the acquisition of something so massively complex as language? • Ellis argues no • And the Zipfian pile-ups are the reason LAD is unnecessary • Language can emerge from usage alone • Zipfian patterning is what creates language learnability

Nick C. Ellis, Diane Larsen-Freeman (2009) Constructing a Second Language: Analyses and Computational Simulations of the Emergence of Linguistic Constructions From Usage • Language Learning 59: pp. 90–125

“…in natural language, the type-token frequency distributions of construction islands, their prototypicality and generality of function in these roles, and their reliability of mappings between these, together, conspire to optimize learning.”

…These data demonstrate that learner VAC development [Verb Argument Constructions] is seeded by the highest frequency, prototypical, and generic exemplar across learners and VACs. These are the exemplars that are provided in NS-nonnative speaker interaction. The use of such exemplars presumably facilitates comprehension in the micro-discursive moment and perhaps their subsequent emergence and ultimate acquisition. (p 106) • Frequency promotes learning, and psycholinguistics demonstrates that language learners are exquisitely sensitive to input frequencies of patterns at all levels (Ellis, 2002). In the learning of categories from exemplars, acquisition is optimized by the introduction of an initial, low-variance sample centered on prototypical exemplars (Elio & Anderson, 1981, 1984; Posner & Keele, 1968, 1970), which allows learners to get a “fix” on what will account for most of the category members. Then the bounds of the category can later be defined by experience of the full breadth of exemplars.

Probably Zipfian

Bottom-up # 2 • In other words • Language can be/is learned bottom-up • Or is “data driven” • Data is adequate for acquisition • Through an initial high-repetition focus • on a high-potency mini-language of prototypes • And then gradually expanded outward Or “down the tails” of the distribution

So if we accept this argument… • Would this revolutionize language instruction? • Not necessarily • Zipfian patterns have been known to hold for lexis • for >50 years • Yet have not been simple to implement

A disagreement with Ellis • Nick feels that (L1) learners “get the Zipfian experience for free” (Jan 27 2011) • A part of natural exposure to language • His goal is to explain language acquisition • Not to design instruction • But do L2 learners get the Zipfian version of L2 for free? • Do course book writers, teachers etc know how to work the proto-zones of language? • DO they know what they are?

Case study: Quebec ESL learners with 3500 word families • Are nonetheless weak readers • Why? • Is the vaunted prediction of reading comprehension from vocabulary knowledge untrue?

Levels Test (10k) at Time 1 (Scores/10, by K-level)
K-level   1     2     3     4     5     6     7     8     9     10
Group 1   6.33  5.03  4.13  4.90  4.87  2.73  2.93  2.80  1.70  1.47
Group 2   6.07  4.76  4.20  4.87  4.84  2.73  2.83  2.74  1.64  1.45
Cobb & Horst (in press, Calico) research in Brossard Qc

[Figure repeated: Levels Test (10k) at Time 1, Scores/10 by K-level for Groups 1 and 2, annotated “50%>>”]
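One way to read these scores is to convert them into an overall size estimate and, separately, to see which K-levels fall under half marks. The sketch below assumes that each K-level stands for 1,000 word families and that a score out of 10 scales linearly to families known; both are assumptions made here for illustration, not procedures stated in the slides.

```python
K_LEVEL_SIZE = 1000  # assumed number of word families per K-level

# Scores out of 10 at K-levels 1 to 10, taken from the chart data.
group1 = [6.33, 5.03, 4.13, 4.90, 4.87, 2.73, 2.93, 2.80, 1.70, 1.47]
group2 = [6.07, 4.76, 4.20, 4.87, 4.84, 2.73, 2.83, 2.74, 1.64, 1.45]

def estimated_families(scores, max_score=10):
    """Scale each level's score to families known, then sum over levels."""
    return sum(s / max_score * K_LEVEL_SIZE for s in scores)

def levels_below(scores, threshold=0.5, max_score=10):
    """K-levels (1-based) where the proportion correct is under threshold."""
    return [k for k, s in enumerate(scores, start=1) if s / max_score < threshold]

for name, scores in (("Group 1", group1), ("Group 2", group2)):
    print(name, round(estimated_families(scores)), levels_below(scores))
```

On these assumptions both groups come out at roughly the 3500 families mentioned in the case study, but the total is spread thinly: even at the first K-level only about 60% of items are known, which is one possible reading of why a 3500-family total coexists with weak reading.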

What would a thorough frequency analysis look like? • We know what the “islands” are in any area of LL • We have tests to measure learners’ grasp of these islands • We know when it is time to start moving down the tails • We know what must be taught and what can be left to independent learning

How much work this would be… • And the potential benefits to be gained • Can be inferred from progress in the one area already tackled, lexis • We review • What has been achieved • Remaining challenges • Ongoing work

The frequency movement is old in lexis • Michael West, GSL, AWL, Harold Palmer, Paul Nation • Yielding many fruits • But still not 100% accepted

Nation’s contribution • From the base of a fully lemmatized frequency list of most of the entire language • Full familization of the BNC frequency lists • but much beyond • FREQ + RANGE • GET PIC OF LISTS • We have a powerful battery of frequency based tests • and materials creation tools • All depending on very recent developments in text computing

Described e.g. in Nation (2006) • Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63, 59–81.
