
The Leo Corpus




  1. The Leo Corpus: German L1 Learner Corpus

  2. Overview • corpora in child language research • CHILDES project • Leo corpus • CLAN language analysis tools

  3. Corpora in acquisition research? • linguistic intuitions of native speakers? • adult speakers’ intuitions fail • child will not speak on demand • child can’t judge own sentences
      Leo (1;11,16): Leiter hoch. ‘ladder up/high’
      ? particle: (pull/push/...) up
      ? adjective: (the ladder is) long
      → utterance context is needed!

  4. Corpora in acquisition research? • linguistic intuitions of native speakers? • adult speakers’ intuitions fail • child will not speak on demand • child can’t judge own sentences
      Leo (1;11,16): Leiter hoch. (looking at a long ladder in a book)
      adult German: (The) ladder (is) long
      → adjective (in adult terms!)

  5. Corpora in acquisition research • corpora contain utterances that were actually produced • situated in natural contexts • data are verifiable • frequency analyses possible Kinds of corpora: • diary studies (e.g., Preyer 1882) • experimental data • speech corpora (longitudinal / cross-sectional)

  6. CHILDES • Child Language Data Exchange System • Brian MacWhinney / Catherine Snow • founded in 1984 • part of TalkBank (adult corpora) • > 1500 published articles • 4500 Members

  7. CHILDES 3 parts: • language data, i.e. corpora • CHAT transcription system • CLAN computer programs

  8. CHILDES corpora • 130 corpora publicly available (via www) • 26 languages • L1 normally developing • L1 language disorders • bi- and trilingual children and adults TalkBank adult corpora: • L2, aphasics (English, German, Hungarian, Chinese, Italian), ...

  9. CHAT and CLAN • CHAT: • Codes for the Human Analysis of Transcripts • ensure a standard format for all corpora • CLAN: • Computerized Language Analysis • several commands for analyzing data in CHAT format • single program interface

  10. Leo corpus • Leo, monolingual L1 German boy • recorded 1999 – 2002 • Heike Behrens, MPI Leipzig • transcribed in CHAT format • analysable with CLAN programs • not publicly available

  11. Leo corpus 2;0 – 3;0: 5 x 1hr / week = 20-22hrs / month + diary for new structures 3;0 – 5;0: 5 x 1hr / month = 5hrs / month → ca. 400hrs total recording time • includes utterances of child and conversation partners • spontaneous interaction (free play) • no book-reading • experimenter present • some sessions videotaped

  12. Leo corpus • 1.8 million words of spoken speech • child: ca. 500,000 words • for comparison, BNC: largest balanced corpus • 100 million words • 10% spoken speech: 10 million words • “dense” corpus

  13. “Dense” corpora • longitudinal databases with denser recording intervals • traditional: 0.5 – 1hr / week • Leo: 1.25 – 5hrs / week • assumption: • child is awake and talks 10hrs / day • traditional: ca. 1% of output • Leo: 2% - 7% of output (Tomasello / Stahl 2004)
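The percentages on this slide follow directly from the stated assumption of 10 waking, talking hours per day. A minimal sketch of that arithmetic (the function name `sampled_share` is introduced here for illustration):

```python
# Assumption taken from the slide: the child is awake and talks 10 hours a day.
WAKING_HOURS_PER_WEEK = 10 * 7


def sampled_share(recorded_hours_per_week: float) -> float:
    """Return the share of the child's weekly output that is recorded, in percent."""
    return 100 * recorded_hours_per_week / WAKING_HOURS_PER_WEEK


# Recording densities quoted on the slide.
for label, hours in [
    ("traditional, 0.5 hr/week", 0.5),
    ("traditional, 1 hr/week", 1.0),
    ("Leo, 1.25 hrs/week", 1.25),
    ("Leo, 5 hrs/week", 5.0),
]:
    print(f"{label}: {sampled_share(hours):.1f}% of output")
```

The traditional range comes out at roughly 0.7% to 1.4% ("ca. 1%"), the Leo range at roughly 1.8% to 7.1%.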

  14. “Dense” corpora • advantages: • capture of infrequent phenomena • better estimate of vocabulary size • age of emergence • smoother developmental curves • input / production frequency measures

  15. “Dense” corpora Likelihood to capture a target token in a year of recording (Tomasello / Stahl 2004) [figure: curves for target frequencies of 1 token / day and 10 tokens / day]
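To give a feel for why density matters, here is a deliberately simplified sketch. It is not the calculation from Tomasello / Stahl (2004); it assumes that every token the child produces is recorded independently with a probability equal to the sampled share of output, and asks how likely it is that at least one token ends up in the corpus. The function name and the example rates are illustrative.

```python
def capture_probability(tokens_per_day: float, sampled_share: float, days: int = 365) -> float:
    """Probability that at least one target token is recorded.

    Simplifying assumption: each token is recorded independently
    with probability `sampled_share` (a fraction between 0 and 1).
    """
    expected_tokens = tokens_per_day * days
    return 1 - (1 - sampled_share) ** expected_tokens


# Compare a traditional density (about 1%) with a dense one (about 7%)
# for a frequent and a rare structure (rates are illustrative).
for rate in (1.0, 1 / 30):
    for share in (0.01, 0.07):
        p = capture_probability(rate, share)
        print(f"{rate:.3f} tokens/day, {share:.0%} sampled: {p:.2f}")
```

Under this model the gain from dense sampling is largest for rare structures, which is the point of the first advantage listed on the previous slide.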

  16. Drawback: only 1 child! • no generalizations possible • drawback? • usage-based approach • child is believed to construct language individually • based on personal experience with language • no help from language-specific knowledge

  17. Usage-based approach • child moves gradually from lexically specific to abstract knowledge • no adult categories • input and frequency play a role (→ corpus needed!) • close studies of individuals highly valuable • dense longitudinal vs. traditional cross-sectional corpora

  18. “Control” corpora Kerstin & Simone • Max Miller, MPI Nijmegen • 1;3 / 1;9 – 4;0 Kerstin: • 0.5 – 2.7 recordings / month • ca. 270,000 words (child: 55,000) Simone: • 1.25 – 3.5 recordings / month • ca. 450,000 words (child: 86,000)

  19. “Control” corpora Pauline & Sebastian • Prof. Rigol Pauline: • 0;0 – 7;11 / 1 – 2 recordings / month • 340,000 words (child: 85,000) Sebastian: • 0;0 – 7;4 / 1 – 2 recordings / month • 350,000 words (child: 75,000)

  20. Leo corpus • CHAT-format • 1 transcription file per session • txt-format • no running text
      @Headers (file explanations)
      *Main tier lines (utterances)
      %Dependent tiers (annotations of utterances)
      @End

  21. CHAT: Headers
      @Begin
      @Languages: de
      @Participants: CHI Leo Target_Child, MUT Maren Mother, VAT Thorsten Father, MEC Mechthild Observer
      @ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education|
      @ID: de|mpi_evan|MUT|30;00.00|female|group|middle|Mother|Abitur_Lehre|
      @ID: de|mpi_evan|VAT|35;00.00|male|group|middle|Father|university|
      @ID: de|mpi_evan|MEC|24;00.00|female|group|middle|Observer|university|
      @Filename: le020608.cha
      @Date: 11-SEP-1999
      @Age of CHI: 2;06.08
      @Comment: Dependent: exp, vrb, act, par,
      @Comment: in der Wohnung, beim Einkaufen

  22. CHAT: Main tiers • Each utterance on own line • Each line starts with a tier • Each speaker has own tier: *CHI, *VAT, ... • Annotations on dependent tier: %mor, %pho...
      Child: Yes. Fish! – Father: Fish? – Child: Yes.
      *CHI: ja .
      %mor: $INTER|ja .
      *CHI: Fisch !
      %mor: $N:03:m:NOM:SG|Fisch !
      *VAT: Fisch ?
      %mor: $N:03:m:CAS|Fisch ?
      *CHI: ja .
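The three line types (@ headers, * main tiers, % dependent tiers) make CHAT files easy to read programmatically. The following is a minimal sketch, not a replacement for CLAN: the class `Utterance` and the function `parse_chat` are names introduced here, and the sketch ignores continuation lines and other details of the full CHAT specification.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker: str                                # e.g. "CHI" or "VAT"
    text: str                                   # content of the main tier
    tiers: dict = field(default_factory=dict)   # dependent tiers, e.g. {"mor": "..."}


def parse_chat(lines):
    """Split CHAT-style lines into header lines and utterances with their dependent tiers."""
    headers, utterances = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
        elif line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            utterances.append(Utterance(speaker.strip(), text.strip()))
        elif line.startswith("%") and utterances:
            # A dependent tier belongs to the most recent main tier.
            name, _, value = line[1:].partition(":")
            utterances[-1].tiers[name.strip()] = value.strip()
    return headers, utterances


sample = """@Begin
*CHI: ja .
%mor: $INTER|ja .
*CHI: Fisch !
%mor: $N:03:m:NOM:SG|Fisch !
*VAT: Fisch ?
%mor: $N:03:m:CAS|Fisch ?
*CHI: ja .
@End""".splitlines()

headers, utterances = parse_chat(sample)
for u in utterances:
    print(u.speaker, "|", u.text, "|", u.tiers.get("mor", "(no %mor tier)"))
```

Attaching each dependent tier to the preceding main tier mirrors the way CLAN commands combine `+t*CHI` with `+t%mor`.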

  23. CHAT: Transcription • orthographic or not? • depends on purpose • orthographic transcription: ease of retrieval • additional information via dependent tiers (%pho) • utterances can be linked to digitized sound files (Sonic-CHAT) • or to video files

  24. SONIC CHAT

  25. CHAT: Transcription • spoken speech not as orderly as written texts • coding scheme for spoken speech phenomena: • overlaps • trailing off • noncompletions: is(t) • retracing: Schrei [//] Scheibenwischer • non-words: hm@o • replacements: nix [: nichts]

  26. CHAT: Annotation • annotations to an utterance (tagging) on the dependent tiers of that utterance
      *CHI: [D] ist drin .
      %mor: $VCOP:S:POS:PRES:3s|sein $ADV|drin .
      %exp: es ist noch etwas Kakao im Becher . (there is still some cocoa in the cup)
      • here: • %mor: morphology • %exp: explanation of utterance situation

  27. CHAT: Annotation • annotations to an utterance (tagging) on the dependent tiers of that utterance
      *CHI: [D] ist drin .
      %mor: $VCOP:S:POS:PRES:3s|sein $ADV|drin .
      • VCOP: copula • S: suppletive • POS: (empty) • PRES: tense • 3s: agreement • sein: citation form
      → ist is the 3rd pers. sing. present tense of the suppletive copula verb sein
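A tag such as `$VCOP:S:POS:PRES:3s|sein` has a regular shape: a category, colon-separated features, and the citation form after the bar. A minimal sketch of splitting it apart (the function name `parse_mor_tag` is introduced here; the tag layout is taken from the slide examples only):

```python
def parse_mor_tag(tag: str):
    """Split a %mor tag into (category, feature list, citation form)."""
    analysis, _, citation_form = tag.lstrip("$").partition("|")
    category, *features = analysis.split(":")
    return category, features, citation_form


for tag in ["$VCOP:S:POS:PRES:3s|sein", "$ADV|drin", "$N:03:m:NOM:SG|Fisch"]:
    print(parse_mor_tag(tag))
```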

  28. CHAT: Annotation • tagging is based on theoretical notions of adult language! • e.g., when ist is tagged as VCOP etc., this doesn’t mean that it constitutes a VCOP for the child
      *CHI: [D] Leiter hoch .
      %mor: $N:02:f:AKK:SG|Leiter $PT|hoch .
      • hoch a verb particle for the child?

  29. CHAT: Annotation • transcription and annotation in CLAN editor • converts txt- and SALT-format files to CHAT • automatic tagging (%mor-tiers) • lexicon file with word information • tag disambiguation (manual / probabilistic) • computes coding reliability • checks conformity with CHAT-conventions • works on different workstations (unlike TRANSANA) • access files on network drive

  30. CLAN / CHAT interface

  31. CLAN Commands • search commands, e.g. • simple and combined strings in utterances and annotations • interaction blocks • imitations, repetitions, overlaps • computing commands, e.g. • mlu / mlt (mean length of utterances / turns) • longest words / utterances • vocabulary diversity: TTR, measure D • frequency of phonemes in different positions
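As an illustration of what a computing command does, here is a minimal sketch of mean length of utterance counted in words. This is a simplification: CLAN's mlu command can count morphemes from the %mor tier, whereas the function below (the name `mlu_in_words` is introduced here) only counts whitespace-separated words and skips utterance delimiters.

```python
DELIMITERS = {".", "!", "?"}


def mlu_in_words(utterances):
    """Mean number of words per utterance, ignoring final punctuation tokens."""
    lengths = [
        len([tok for tok in utterance.split() if tok not in DELIMITERS])
        for utterance in utterances
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0


print(mlu_in_words(["ja .", "Fisch !", "was machst du ?"]))
```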

  32. CLAN Commands • commands in DOS-like style Examples research question: WH-words • emergence • frequency • use

  33. Emergence of Interrogative Pronouns
      coding:
      *CHI: was machst du ? (lit.: what do you?)
      %mor: $PRO:int|was [...]
      search:
      kwal +t*CHI +t%mor +s$pro:int* le020*.cha
      • kwal: search
      • +t*CHI +t%mor: in the child’s mor-tiers
      • +s$pro:int*: for strings starting with $pro:int
      • le020*.cha: in all files up to 2;9

  34. Emergence of Interrogative Pronouns • output:
      kwal (08-Dec-2004) is conducting analyses on:
      ONLY speaker main tiers matching: *CHI;
      and those speakers' ONLY dependent tiers matching: %MOR;
      […]
      From file <le020006.cha>
      From file <le020007.cha>
      From file <le020008.cha>
      ----------------------------------------
      *** File "le020008.cha": line 2603. Keyword: $pro:int|wo
      *CHI: Mama, wo bis(t) du ! (Mama, where are you!)
      %mor: $N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du $N:01:f:VOC:SG|Mama !
      Triple-click to access utterance!
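The logic of this kwal search (restrict to one speaker, then match a tag prefix on the %mor tier, ignoring case) can be sketched in a few lines. This is an illustration of the search idea only, not of how CLAN is implemented; the function name `find_tagged` and the tuple layout of the sample data are assumptions made here.

```python
def find_tagged(records, speaker, tag_prefix):
    """Yield (utterance, matching tags) for one speaker's utterances.

    `records` holds (speaker, utterance, mor_tier) tuples; a tag matches
    when it starts with `tag_prefix`, compared case-insensitively.
    """
    prefix = tag_prefix.lower()
    for spk, utterance, mor_tier in records:
        if spk != speaker:
            continue
        hits = [tag for tag in mor_tier.split() if tag.lower().startswith(prefix)]
        if hits:
            yield utterance, hits


records = [
    ("CHI", "ja .", "$INTER|ja ."),
    ("VAT", "Fisch ?", "$N:03:m:CAS|Fisch ?"),
    ("CHI", "Mama, wo bis(t) du !",
     "$N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du !"),
]

for utterance, hits in find_tagged(records, "CHI", "$pro:int"):
    print(utterance, "->", hits)
```

Running the search over files in age order and stopping at the first hit gives the age of emergence for each pronoun.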

  35. Frequency of Interrogative Pronouns • Does the child’s use match the input frequency?
      Child: freq +t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o
      • freq: give frequency
      • +t*CHI +t%mor: for child’s mor-tier
      • +s"$PRO:INT%|*": of $pro:int
      • le020*.cha: for all files up to 2;9
      • +u: count together
      • +o: sort output
      Input: freq -t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o
      • -t*CHI: for other people’s

  36. Frequency of Interrogative Pronouns • result:
      child                  input
      536 $pro:int|was       13162 $pro:int|was
      305 $pro:int|wo         3486 $pro:int|wo
       70 $pro:int|wie        1608 $pro:int|wer
       31 $pro:int|wer        1255 $pro:int|wie
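Raw counts are hard to compare because the input is far larger than the child's output. A short sketch that turns the counts from this slide into relative shares and rank orders (the helper names `shares` and `ranking` are introduced here):

```python
# Counts copied from the slide.
child = {"was": 536, "wo": 305, "wie": 70, "wer": 31}
adult_input = {"was": 13162, "wo": 3486, "wer": 1608, "wie": 1255}


def shares(counts):
    """Relative frequency of each pronoun within one speaker group."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def ranking(counts):
    """Pronouns ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


for label, counts in [("child", child), ("input", adult_input)]:
    formatted = ", ".join(f"{w}: {s:.1%}" for w, s in shares(counts).items())
    print(f"{label} (n={sum(counts.values())}): {formatted}")
    print(f"  rank order: {ranking(counts)}")
```

Both groups use was and wo most often; the order of wie and wer is reversed between child and input.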

  37. Non-interrogative use of wh-words • Make a file with all words for interrogative pronouns Leo uses:
      freq […] +s"$PRO:INT%|*" +u +d1 >> Leowh.cut
      • +u: for all files together
      • +d1: without frequency count numbers
      • >> Leowh.cut: direct output to file

  38. Non-interrogative use of wh-words
      Leowh.cut:
      $pro:int|was
      $pro:int|wo
      $pro:int|wie
      $pro:int|wer
      • strip file of all $pro:int| , so that just the wh-words are left:
      chstring +s"$pro:int|" "" +y leowh.cut
      • +s"$pro:int|" "": change from $pro:int| to nothing
      • +y: file not in CHAT-format

  39. Non-interrogative use of wh-words • then look for uses of these words in sentences that do not contain $pro:int
      combo +t*CHI +t%mor le*.cha +s@leowh.cut^*^%mor:^*^!$pro:int*
      • +s@leowh.cut: take words from file as search string for utterances
      • ^*^%mor:^*^: followed by %mor
      • !$pro:int*: not containing $pro:int
      • combo: search with Boolean operators

  40. Word order in wh-questions • German: verb has to follow wh-word directly – any errors? • search for all utterances that do not follow this pattern:
      combo +t*CHI +t%mor +s$pro:int*^!$v* le*.cha
      • search for child’s $pro:int not directly followed by any $v
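The pattern behind this combo search can be sketched as a check over adjacent tags on a %mor tier. This is an illustration under assumptions: the function name is introduced here, the second test tier is a made-up example, and the sketch only flags a wh-tag when some following tag exists and is not a verb tag.

```python
def wh_not_followed_by_verb(mor_tier: str) -> bool:
    """True if an interrogative pronoun tag is directly followed by a non-verb tag."""
    # Keep only tags (tokens starting with "$"); this drops punctuation such as "?".
    tags = [token.lower() for token in mor_tier.split() if token.startswith("$")]
    for current, following in zip(tags, tags[1:]):
        if current.startswith("$pro:int") and not following.startswith("$v"):
            return True
    return False


# Tier from the corpus example: wo is directly followed by the copula.
target_like = "$N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du !"
# Made-up tier for illustration: wo is followed by a pronoun instead of the verb.
deviant = "$PRO:int|wo $PRO:pers:NOM:SG|du $VCOP:S:POS:PRES:2s|sein ?"

print(wh_not_followed_by_verb(target_like), wh_not_followed_by_verb(deviant))
```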

  41. Co-occurrences of wh-words • What words does was (what) co-occur with when used as an interrogative pronoun?
      kwal +t*CHI +t%mor +s$pro:int%"|"was +d +o* le*.cha | cooccur +swas +t*CHI +u
      • kwal looks for all uses of was as $pro:int
      • the results are directed to cooccur (“piping”)
      6 was da
      32 was das
      1 was denkst
      2 was denn
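The counting step performed by cooccur amounts to tallying adjacent word pairs that begin with the target word. A minimal sketch, with made-up utterances and a function name introduced here; unlike the piped command, it does not check whether was is tagged as an interrogative pronoun:

```python
from collections import Counter

DELIMITERS = {".", "!", "?"}


def following_words(utterances, target):
    """Count (target, next word) pairs across utterances."""
    counts = Counter()
    for utterance in utterances:
        words = [w for w in utterance.lower().split() if w not in DELIMITERS]
        for first, second in zip(words, words[1:]):
            if first == target:
                counts[(first, second)] += 1
    return counts


# Made-up utterances for illustration.
counts = following_words(["was das ?", "was machst du ?", "was das ?"], "was")
for (first, second), n in counts.most_common():
    print(n, first, second)
```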

  42. Measuring lexical diversity • traditional: type-token-ratio (TTR) • number of different word types • against total number of words • every word is a new word: TTR 1.0 • the lower the TTR, the less lexical diversity • problem: depends on sample size • in a large sample, the total vocabulary will finally be exhausted • TTR levels out because highly frequent words will increase the number of tokens disproportionately • rarely occurring types will have little influence on TTR
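The sample-size problem is easy to demonstrate. In the sketch below (function name and toy data are illustrative), the same small vocabulary is simply repeated, so the number of types stays fixed while the number of tokens grows and the TTR drops:

```python
def ttr(tokens):
    """Type-token ratio: number of different word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


words = "was das was ist das".split()   # 3 types, 5 tokens

print(ttr(["ja", "nein"]))   # every word is a new word
print(ttr(words))            # small sample
print(ttr(words * 10))       # same vocabulary, ten times as many tokens
```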

  43. Measure D • measure D is obtained by • randomly sampling the corpus • calculating the actual leveling out of the TTR • and comparing this to theoretical models of TTR curves • the probability of new types being introduced in the corpus is calculated, regardless of sample size • In CLAN: • TTR: freq • Measure D: VOCD
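The comparison with a theoretical curve can be sketched as a curve fit. The sketch assumes the model commonly given for D, TTR = (D/N) · (√(1 + 2N/D) − 1), and uses a plain grid search; it is not the VOCD implementation, and the function names, sample sizes, and candidate grid are choices made here for illustration.

```python
import math


def ttr_model(n: int, d: float) -> float:
    """TTR predicted for a sample of n tokens under diversity parameter d (assumed model)."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def fit_d(points, candidates):
    """Return the candidate d whose curve is closest (least squares) to the observed points.

    `points` holds (sample size, mean observed TTR) pairs, e.g. averaged
    over many random samples of each size drawn from the transcript.
    """
    def squared_error(d):
        return sum((ttr_model(n, d) - observed) ** 2 for n, observed in points)

    return min(candidates, key=squared_error)


# Illustration: generate points from the model itself with d = 60 and recover that value.
points = [(n, ttr_model(n, 60)) for n in range(35, 51)]
candidates = [x / 10 for x in range(10, 2001)]   # 1.0, 1.1, ..., 200.0
print(fit_d(points, candidates))
```

Because D is a parameter of the whole curve rather than a ratio at one sample size, it is less sensitive to transcript length than the raw TTR.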
