
Exploring the tiers of Japanese vocabulary: Academic, literary and beyond

Exploring the tiers of Japanese vocabulary: Academic, literary and beyond. Tatsuhiko Matsushita LALS, Victoria University of Wellington tatsuhiko.matsushita@vuw.ac.nz. Main findings. VDRJ is useful for designing curriculum (material, tests etc.)


Presentation Transcript


  1. Exploring the tiers of Japanese vocabulary: Academic, literary and beyond Tatsuhiko Matsushita LALS, Victoria University of Wellington tatsuhiko.matsushita@vuw.ac.nz

  2. Main findings • VDRJ is useful for designing curriculum (material, tests etc.) • The more domains a word is shared by as AW or LAD, the more abstract its meaning is. • Conversation and non-academic texts contain more general words and LW • Academic texts: more AW and LAD but fewer LW in any academic domain • Wikipedia: more proper nouns and low-frequency words • Newspapers and academic items of Wikipedia can be a good resource for learning AW and LAD. • Natural science texts contain more academic domain words at lower frequency levels than arts and social science texts • The origins of academic and literary words are quite clearly separated: 3/4 of LW originate in Japanese, while 3/4 of AW and LAD originate in Chinese • LAD contains more Western-origin words (Gairaigo)

  3. Contents • Motive for this research • Goals of this presentation • Vocabulary Database for Reading Japanese • Tiers of Japanese vocabulary (Basic words, academic words, limited-academic domain words, literary words) • Text coverage by word tier • Proportions of word origin types by word tiers • Number of characters required to cover the word tiers • Implications from the findings • Conclusion

  4. 1. Motive for this research How efficiently can we learn vocabulary? • Learning burden is big! • More effective choice of target words • More efficient order for learning the words • Effective choice and efficient order: to maximize the coverage of text which the learner would encounter in his/her domain = Reading comprehension and lexical density (Hu & Nation, 2000; Komori et al., 2004) ⇒ Q. What words should learners learn first? And second and next?

  5. Studies on EAP vocabulary • Basic: General Service List (West, 1953) • Academic: AWL (Coxhead, 2000) UWL (Xue & Nation, 1984) • EGAP-A/S, EGAP-HM/SS etc. (Tajino, Dalsky, & Sasao, 2009) • Science-specific Word List (Coxhead & Hirsh, 2007) • Technical: e.g. Chung (2003) • Literary vocabulary?

  6. Studies on JAP vocabulary • Basic: The former JLPT list, Tamamura (1987) etc. • Academic: Butler (2010), Matsushita (2011) • ? • Technical: Komiya (1995), Oka (1992) etc. • Others • No list for words between academic and technical words • Literary vocabulary?

  7. 2. Goals of this presentation To introduce • the Vocabulary Database for Reading Japanese • extracted domain-specific words such as Academic Words (AW), Limited-Academic-Domain Words (LAD), Literary Words (LW) To argue about • how the word tiers work in different types of text (register variation) • how learner’s language background possibly affects the understanding of texts in different genres

  8. 3. Vocabulary Database for Reading Japanese • Vocabulary Database for Reading Japanese (VDRJ) (Matsushita, 2010; 2011) • Created from the Balanced Corpus of Contemporary Written Japanese, 2009 monitor version (NINJAL, 2009) • 33 million tokens (28 million from books and 5 million from Internet forum sites (Yahoo Chiebukuro)) • 19 million content words and 14 million function words • Unit of counting: Lexeme – considerably inclusive but less inclusive than the word family (Level 6 in Bauer & Nation, 1993) in English • “Short units of lexemes” are ranked by U (usage coefficient) (Juilland & Chang-Rodriguez, 1964) • Short unit of lexeme: more inclusive than “lemma”, less inclusive than “word family”
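The usage coefficient mentioned on this slide can be illustrated with a minimal sketch. It assumes the common formulation U = F × D, where F is a word's total frequency and D is Juilland's dispersion, D = 1 − V/√(n−1), computed over n equally sized sub-corpora (V is the coefficient of variation of the per-sub-corpus frequencies). The exact normalisation used for VDRJ is not stated in the slides, and the function names and sample counts below are hypothetical.

```python
import math

def juilland_d(freqs):
    """Juilland's D for one word; freqs holds its frequency in each sub-corpus.

    Assumption: all sub-corpora are the same size (otherwise normalise first).
    """
    n = len(freqs)
    mean = sum(freqs) / n if n else 0.0
    if n < 2 or mean == 0:
        return 0.0
    # Population standard deviation divided by the mean = coefficient of variation V.
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    return 1.0 - (sd / mean) / math.sqrt(n - 1)

def usage_coefficient(freqs):
    """U = total frequency x dispersion (assumed formulation, see lead-in)."""
    return sum(freqs) * juilland_d(freqs)

# Hypothetical counts: two words with the same total frequency (200).
evenly_spread = [50, 50, 50, 50]   # used equally in every sub-corpus
one_domain_only = [200, 0, 0, 0]   # concentrated in a single sub-corpus

print(usage_coefficient(evenly_spread))
print(usage_coefficient(one_domain_only))
```

Under these assumptions the evenly spread word keeps its full frequency as U, while the word confined to one sub-corpus is downgraded to roughly zero, which is the behaviour a dispersion-weighted ranking is meant to provide.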

  9. Some problems of existing Japanese word frequency lists • Lack of representativeness • Too old • The corpus size is not large enough: low reliability for low frequency words • No good sub frequency data which enable us to calculate dispersion to downgrade unevenly distributed words

  10. Advantages of word lists * Various types of word lists can be created from the vocabulary database (VDRJ) • Reference for developing vocabulary tests = Checking learners’ vocabulary levels • Reference for checking the vocabulary level of material = Checking vocabulary levels of materials • ⇒ Specify vocabulary for learners to learn and for teachers to teach • For better choice of material, modification of text Cf. Nation (2011), Word profiler

  11. How to make VDRJ • Method • Classify all the texts into sub-corpora to see the range and dispersion cf. Nippon Decimal Classification, BCCWJ (NINJAL, 2009) • Parse (i.e. segment into words) all the texts with a morphological analyzer and a dictionary (if the text is not segmented by spaces between words) cf. MeCab, UniDic • Make word lists with AntConc and/or AntWordProfiler

  12. Content and construct of VDRJ • Vocabulary Database for Reading Japanese • The list is for reading as it is made from a written corpus of books and Internet forum sites • Written and spoken languages are different in word frequency, domain and required language processing skills ⇒ A good corpus of spoken language is necessary to develop a good word list for it (but there is no very good corpus of spoken Japanese…)

  13. Content of the sub corpora

  14. Different word rankings • The word ranking problem mainly exists in Basic Words • This is mainly due to lack of good spoken corpora • Compromise: frequency weighted to limited domains which seem to reflect basic daily needs • For International Students • For General Learners • Non-weighted (ranking for overall written Japanese)

  15. Multidimensional scaling (MDS) [Figure: two MDS plots, one for the 10 domains and one for the 10 domains + word familiarity]

  16. 4. Tiers of Japanese vocabulary (1) The concept of “word tiers” • Domain / Level • Level = general importance = frequency × dispersion • Some words are frequent only in a particular domain, e.g. 発送 (shipping), 振り込み (paying by bank transfer), 古墳 (tumulus / burial mound)

  17. Assumed word tiers for students Level • Basic: Top 1288 = Former JLPT Level 4 & 3 • Intermediate: Ranked 1289-5000 • Advanced 1: 6K-10K • Advanced 2: 11K-15K • Super-Advanced: 15K-20K • 21K+ • Assumed Known Words (AKW) Domain * General / Academic / Literary

  18. 4. Tiers of Japanese vocabulary (2) Basic words (BW) • Feature of the corpus: formal written language similar to BNC (Nation, 2004) • No good spoken corpus for vocabulary studies • Compromise • For the learners’ and teachers’ lists, the former JLPT Level 4 & 3 vocabulary is put at the top of the list as basic words To order the basic words • Identify the domains closer to word familiarity (basic needs) by Multidimensional Scaling (MDS) • Frequency in literary works and the Internet forum sites (Yahoo Chiebukuro) is weighted

  19. 4. Tiers of Japanese vocabulary (3) Academic domain words Extracting academic domain words • Log-likelihood ratio (LLR) (Dunning, 1993) • Target texts: Technical texts • Classified into four large academic domains • Total number of tokens: approx. 2.9 million • Reference texts: General texts in BCCWJ 2009 • Total number of tokens: approx. 29.9 million • Extract keywords shared by 4 to 1 domains • Cut off point: higher for more narrowly distributed words
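As an illustration of the keyness measure named on this slide, here is a minimal sketch of a log-likelihood ratio for one word in a target corpus against a reference corpus. It uses the two-term form often applied in keyword analysis and adds a sign (positive when the word is relatively more frequent in the target), which is an assumption made here so that cut-offs such as "LLR > 0" are meaningful; the slides do not say how the sign was handled. The function name and all counts are hypothetical.

```python
import math

def signed_llr(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of a word, signed by direction of over-use.

    freq_* are the word's token counts, size_* the corpus sizes in tokens.
    """
    total = freq_target + freq_ref
    # Expected counts if the word were equally likely in both corpora.
    exp_target = size_target * total / (size_target + size_ref)
    exp_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / exp_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / exp_ref)
    g2 *= 2
    # Assumed convention: negative score when the word is under-used in the target.
    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2

# Hypothetical example: one academic domain of 700,000 tokens against
# a general reference corpus of 29,900,000 tokens.
print(signed_llr(300, 700_000, 1_000, 29_900_000))   # over-represented in the target
print(signed_llr(5, 700_000, 3_000, 29_900_000))     # under-represented in the target
```

A score computed this way for each of the four academic domains gives the per-domain values that the cut-off points are then applied to.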

  20. 4. (3) Academic domain words • Academic words (AW): high specificity in 3+ academic domains • 4-domain words (cut off point: LLR > 0) • 3-domain words (cut off point: LLR > 0) • Limited-academic-domain words (LAD) • 2-domain words (cut off point: LLR > 1) • 1-domain words (cut off point: LLR > average value) • Eliminate the former JLPT Level 4 vocabulary (Top 700 words) • Eliminate the words ranked at 20001 or lower • Classify all the AW and LAD by word ranking levels for International Students (U=Usage Coefficient): • 5 levels: Basic / Inter. / Adv. 1 / Adv. 2 / Super-adv.
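The cut-off scheme above can be sketched as a small classification function. This is only one possible reading of the slide: it assumes a word is counted for a domain when its LLR there exceeds the stated cut-off, that the "average value" is a per-domain average LLR supplied by the caller, and that the two elimination steps are applied first. The function name, the domain labels (Ah, Ss, Tn, Bn) as dictionary keys, and all scores are hypothetical.

```python
def classify_word(llr_by_domain, avg_llr_by_domain, rank, is_jlpt4=False):
    """Assign a word to AW, LAD or neither from its per-domain LLR scores."""
    # Elimination steps: former JLPT Level 4 words and words ranked 20001 or lower.
    if is_jlpt4 or rank >= 20001:
        return "excluded"
    # AW: specific (LLR > 0) in three or four academic domains.
    positive = [d for d, v in llr_by_domain.items() if v > 0]
    if len(positive) >= 3:
        return f"AW ({len(positive)}-domain)"
    # LAD, 2-domain: stricter cut-off (LLR > 1) met in exactly two domains.
    above_one = [d for d, v in llr_by_domain.items() if v > 1]
    if len(above_one) == 2:
        return "LAD (2-domain)"
    # LAD, 1-domain: strictest cut-off (LLR above that domain's average).
    above_avg = [d for d, v in llr_by_domain.items() if v > avg_llr_by_domain[d]]
    if above_avg:
        return "LAD (1-domain)"
    return "general"

# Hypothetical per-domain averages and scores.
avg = {"Ah": 20.0, "Ss": 20.0, "Tn": 20.0, "Bn": 20.0}
print(classify_word({"Ah": 5.0, "Ss": 3.0, "Tn": 2.0, "Bn": 0.5}, avg, rank=3000))
print(classify_word({"Ah": 5.0, "Ss": 3.0, "Tn": -2.0, "Bn": -0.5}, avg, rank=8000))
print(classify_word({"Ah": -1.0, "Ss": 40.0, "Tn": -2.0, "Bn": -0.5}, avg, rank=12000))
```

The rising cut-off for narrower words reflects the slide's point that evidence from a single domain is weaker than agreement across several.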

  21. 4. Tiers of Japanese vocabulary (3) -1 Academic words (AW) • JAWL = Japanese Academic Word List • High specificity in 3 or 4 academic domains • 4-domain words (cut off point: LLR > 0) • 3-domain words (cut off point: LLR > 0) • Level 0 - VIII: 9 levels, 2590 words in total • JAWL I (Intermediate): most essential for learning • Basic words contain far fewer academic words • JAWL I: 559 words Close to AWL in number and text coverage • Coverage in the academic corpus used for extracting AW: AWL 10.0%, JAWL I 11.1%

  22. Distribution and examples of JAWL

  23. 4. (3) -1 Academic words (AW) Semantic features of AW (1) • Highly abstract, essential for operating logic i.e. • Range: 占める (occupy, account for), 特殊 (special, particular) • Relation: 属する (belong to), 依存 (rely/reliance) • Comparison/Evaluation: 後者 (the latter), 優れる (superior) • Quantitative change: 減少 (decrease), 強化 (reinforce) • Stage: 当初 (beginning), 現状 (present condition) • Development of enunciation: 取り上げる (take up [an issue]), まとめる (summarize) • Cause-effect, degree, agent, action, object, direction, goal, instrument, time etc.

  24. 4. Tiers of Japanese vocabulary (3) -1 Academic words (AW) Semantic features of AW (2) The most frequent Kanji used for AW: 合 (combine, together), 定 (fix, certain), 分 (divide, minute), 一 (one), 同 (same), 数 (number), 上 (up), 体 (body), 出 (out), 大 (large) • 3-domain words: Some words have concrete meanings e.g. 署名 (signature), 保健 (health, hygiene) • 4-domain words: Few words have concrete meanings • The nature of the words is the same at all levels

  25. POS of Japanese AW (1) • Common noun: 1072 words (41.4 %) e.g. 背景 (background) • Verbal noun: 882 words (34.0 %) e.g. 連続 (establish/-ment) ⇒ Adding other types of nouns together, 2104 words (81.2 %) can be a noun • Verb (excluding verbal nouns): 225 words (8.7 %) e.g. 認める (recognize/approve), 述べる (describe/mention) ⇒ Adding other types of verbs together, 1107 words (42.7 %) can be a verb • Adjectival noun: 95 words (3.7 %) e.g. 詳細 (detail/-ed), 平等 (equal/-ity) • Adjective: Only 9 words (0.3 %) e.g. 著しい (remarkable)

  26. POS of Japanese AW (2) • Affix: 106 words (4.1 %) e.g. -期 (period), -種 (type); substantial in Japanese academic words • Adverb: 34 words (1.3 %) e.g. しばしば (frequently) • Other (particle, auxiliary verb etc.): 22 words (0.8 %) • Remarkably many archaic words e.g. のみ (only), つつ (while doing), べし (ought to), あらゆる (every), いかなる (any), 我が (my), 漠然 (vague) • れる/られる (Passive/Potential/Spontaneous) specific in academic texts

  27. 4. (3) -2 Limited-academic-domain words (LAD) • Limited-academic-domain words (LAD) • High specificity in 2 or 1 domain(s) • 2-domain words (cut off point: LLR > 1) • 1-domain words (cut off point: LLR > average value) • Something between “academic” and “technical” • The “scams” from extracting AW? • Tiers of curriculum cf. Tajino et al. (2007) • Words correspondent to the curriculum • Basic: all the learners • Academic words: prep. to first year • Limited-academic-domain words (?): prep. to major • Technical words: major to postgrad.

  28. 4. (3) -2 Limited-academic-domain words (LAD) 2 domain words

  29. 4. (3) -2 Limited-academic-domain words (LAD) 2 domain words

  30. 4. (3) -2 Limited-academic-domain words (LAD) 2 domain words

  31. Examples of 2 domain words: Words which are shared by only 2 main academic domains

  32. 4. (3) -2 Limited-academic-domain words (LAD) 2 domain words • Semantic features • More concrete and specific than academic words • Ah & Ss: Social, overlap in history and ethnology • Ss & Tn: Industrial • Ss & Bn: Social security, medical and nursing service • Tn & Bn: Scientific • Ah & Tn, Ah & Bn: not clear

  33. 4. (3) -2 Limited-academic-domain words (LAD) 1 domain words • It is merely a trial • The corpus is not the best for academic purposes, especially for natural sciences • Extracting something common across domains is much easier, while extracting words from only one target corpus requires a more complete target corpus • Therefore, AW (4 domain words and 3 domain words) will be more reliable than LAD (2 domain words and 1 domain words)

  34. 4. (3) -2 Limited-academic-domain words (LAD) 1 domain words

  35. 4. (3) -2 Limited-academic-domain words (LAD) 1 domain words • Semantic features are much clearer than 2 domain words

  36. 4. (3) -2 Limited-academic-domain words (LAD) 1 domain words • Semantic features are much clearer than 2 domain words

  37. POS of Japanese LAD (1) • Common noun: 1605 words (63.1 %) – more than AW (41.4%) • Verbal noun: 633 words (24.9 %) e.g. 融資 (finance) cf. AW (34.0%) ⇒ Adding other types of nouns together, 2104 words (87.9 %) can be a noun – more than AW (81.2%) • Verb (excl. verbal nouns): 81 words (3.2 %) cf. AW (8.7%) e.g. 訳す (translate), 向き合う (face (v.)) ⇒ Adding other types of verbs together, 714 words (28.1%) can be a verb – less than AW (42.7%) • Adjectival noun: 88 words (3.5 %) cf. AW (3.7%) e.g. フル (full), 偉大 (great) • Adjective: Only 3 words (0.1 %) cf. AW (0.3%) e.g. 硬い (stiff)

  38. POS of Japanese LAD (2) • Affix: 109 words (4.3%) cf. AW (4.1%) e.g. –犯 (offense) substantial in Japanese academic domain words • Adverb: 15 words (0.6 %) cf. AW (1.3%) e.g. 現に (surely) • Other (particle, auxiliary verb etc.): 9 words (0.8 %) cf. AW (0.8%) • Remarkably many archaic words – similar to AW e.g. なり [affirmative aux.], とも (even though), たり [affirmative aux.], ごとし (as/like), 単なる (mere), しめる(=しむ) [causative aux.], かかる (such)

  39. 4. Tiers of Japanese vocabulary (4) Literary words (LW) Extracting literary words: Words for reading literary works • Log-likelihood ratio (Keyness in AntConc) • Target corpus: literary works (identified by NDC and C-code) in BCCWJ 2009 (NINJAL, 2009) – Over 8 million tokens • 4 different reference corpora: Technical texts, general texts in arts and humanities, general texts in the other 3 academic domains, Internet forum texts (Yahoo Chiebukuro) • Extract keywords shared by the four results (Cut off point: average value) • Eliminate the former JLPT Level 4 vocabulary (Top 700 words) • Eliminate the words ranked at 20001 or lower • Classify all the LW by word ranking levels for International Students (U=Usage Coefficient)

  40. 4. (4) Literary words (LW) Distribution and examples

  41. 4. (4) Literary words (LW) POS of LW • More verbs, adverbs and interjections than AW and LAD • Fewer verbal nouns and adjectival nouns • This inevitably means LW have fewer loanwords but more Japanese-origin words.

  42. 4. (4) Literary words (LW) Q. How many LW overlap with AW and LAD? • Only 27 words (0.5% of academic domain words, 1.7% of LW) overlap • Most of the overlapping words (24/27) overlap with 1 domain words (17 words overlap with words in biological natural science) • Many physical words such as words for body parts e.g. 左手 (left hand), こぶし (fist), 血 (blood), 頭上 (overhead) • No LW overlap with 4 domain words • Overlapping words are mainly at the intermediate level • No overlapping words at 11K or above • Some examples of overlapping words: 音 (sound), 光 (light), 棚 (shelf), 組 (class), 岩 (rock), ひざ (knee), 興奮 (excite/-ment), 全身 (whole body), 帝 (emperor), ネズミ (mouse), 帆 (sail)

  43. Word tiers: In what order should students learn them? • Basic: General, AW/LAD, LW • Intermediate: General, AW/LAD, LW • Advanced: General, AW/LAD, LW • Highly Advanced: General, AW/LAD, LW • Super-Advanced: General, AW/LAD, LW • Assumed known words: Proper names; Fillers, Signs; (Transparent compounds *); Others

  44. 5. Text coverage by word tier • The word tier analyser: An Excel sheet where the word profile of a text can be checked automatically by cutting and pasting the result of AntWordProfiler run with the word tier base word list. • Text covering efficiency: High efficiency in vocabulary learning = Fewer unique lexemes cover more texts (Reciprocal Type/Token Ratio = Token/Type Ratio?) * Comparison should be made between equally sized texts
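The two quantities on this slide, text coverage per word tier and the token/type ratio, can be sketched in a few lines. This is not the Excel-based word tier analyser itself, only an illustration under simple assumptions: the text is already segmented into lexemes, and each lexeme maps to at most one tier. The function names, the tiny tier dictionary and the sample tokens are hypothetical; the tier labels for 減少, 融資 and こぶし follow the examples given in earlier slides.

```python
from collections import Counter

def tier_coverage(tokens, tier_of):
    """Share of running tokens covered by each tier (unknown words -> 'Other')."""
    if not tokens:
        return {}
    counts = Counter(tier_of.get(tok, "Other") for tok in tokens)
    return {tier: n / len(tokens) for tier, n in counts.items()}

def token_type_ratio(tokens):
    """Tokens per unique lexeme; higher means fewer lexemes cover more text."""
    if not tokens:
        return 0.0
    return len(tokens) / len(set(tokens))

# Hypothetical tier assignments and a hypothetical pre-segmented text.
tier_of = {"が": "Basic", "する": "Basic", "減少": "AW", "融資": "LAD", "こぶし": "LW"}
tokens = ["融資", "が", "減少", "する", "こぶし", "が", "減少", "する"]

print(tier_coverage(tokens, tier_of))
print(token_type_ratio(tokens))
```

Because the token/type ratio grows with text length, the ratio is only comparable between texts of equal size, as the slide notes.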
