
ESP Materials Derived from a Web-based Corpus

Research Seminar. Pisamai Supatranont, Ph.D. Rajamangala University of Technology Lanna Tak, Thailand thing_p@hotmail.com or supatranont@yahoo.com. Wednesday, June 25, 2008. 3.00 – 3.50 pm Building W5C, Room 221, Macquarie University . ESP Materials Derived from a Web-based Corpus.

Presentation Transcript


  1. Research Seminar Pisamai Supatranont, Ph.D. Rajamangala University of Technology Lanna Tak, Thailand thing_p@hotmail.com or supatranont@yahoo.com Wednesday, June 25, 2008. 3.00 – 3.50 pm Building W5C, Room 221, Macquarie University ESP Materials Derived from a Web-based Corpus

  2. Presentation Outline • Background and rationale • Research questions • Research methodology • Data analysis and findings • Discussion

  3. Background of the Study The researcher is from RMUTL Tak, Thailand. The study is: • Funded by [logo] • Conducted in February – July 2008 at [logo] • Under the supervision of Assoc. Prof. David Hall • With consultation of Prof. Pam Peters

  4. Rationale of the Study Cause: Influence of Information and Communication Technology (ICT) in academic and professional settings. Effect: To get good jobs, university students in both ICT and non-ICT fields need English to communicate in ICT working environments.

  5. 4 Main Considerations for ESP Materials Development • 1. Limitation of relevant ESP textbooks • Although specialized texts in ICT are abundant, they are not suitable for unmodified and unsupported use directly in ESP classes because of their difficulty for EFL students. • Need for teacher-designed materials in ESP teaching.

  6. 4 Main Considerations for ESP Materials Development • 2. Difference of students’ background knowledge • ICT students: • possess some specialized knowledge and skills to design hardware and software. • need English to communicate their knowledge in academic and professional contexts. • Non-ICT students: • have little knowledge of ICT. • need ICT knowledge as computer users. • need to learn both basic ICT concepts and English to communicate in business companies or organizations. • Different learning needs = same level of English + different levels of specialized knowledge • Need for different specialized contents to facilitate ESP learning

  7. 4 Main Considerations for ESP Materials Development • 3. Insufficiency of EFL students’ lexical knowledge • It was found that undergraduate students in EFL countries, e.g. in Thailand (Supatranont, 2005), Oman (Cobb and Horst, 2001), and Indonesia (Nurweni and Read, 1999), have limited lexical knowledge and are less proficient in English than is expected of students at university level. • In Supatranont’s study (2005), the lexical knowledge of RMUTL students was found to be below the lexical threshold for academic study. With a limited academic vocabulary, students cannot cope well with specialized texts because the most frequent words in these texts are academic and sub-technical words (Mudraya, 2006). • Academic and technical words should be integrated as main vocabulary components of language input.

  8. 4 Main Considerations for ESP Materials Development • To read academic texts with adequate comprehension, 95% coverage of the words in a text is the minimum (Laufer, 1989). • The lexical threshold for academic study is composed of two wordlists (Nation, 2001; Coxhead & Nation, 2001; Cobb & Horst, 2001; Nation & Waring, 1997): the General Service List (GSL) = 2,000 high-frequency words (West, 1953) and the Academic Word List (AWL) = 570 academic words (Coxhead, 1998). • Knowledge of these two wordlists is estimated to provide over 90% coverage of academic texts in all disciplines. • Academic vocabulary in this study is based on the GSL and AWL (downloaded from http://www.uefap.com/vocab/vocfram.htm)
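
The coverage figures on this slide can be illustrated with a minimal sketch. It assumes coverage means the share of running words in a text that appear in a known-word list; the tiny `known_words` set is a hypothetical stand-in for the GSL and AWL, and the function matches exact word forms rather than word families, so it is a simplification and not the procedure used in the study.

```python
import re

def coverage(text, known_words):
    """Return the share of running words in `text` found in `known_words`."""
    # Lowercase and keep alphabetic tokens (with an optional apostrophe part).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

# Hypothetical stand-in for the GSL + AWL headwords (not the real lists).
known_words = {"the", "a", "of", "is", "that", "all", "on",
               "group", "computer", "manage", "system"}

sample = ("An operating system is a group of programs that manage "
          "all activities on the computer.")
ratio = coverage(sample, known_words)
print(f"Coverage: {ratio:.1%}")
```

With real wordlists, a result below 0.95 for a given text would suggest that the text falls short of the 95% minimum cited above for a reader who knows only those lists.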

  9. 4 Main Considerations for ESP Materials Development • 4. Typical problems of Thai students Differences in verb use between English and Thai: English: * time-oriented language * signifying time concepts with various tense forms * different forms of verb inflections and auxiliaries Thai: * ‘tenseless’ language (Baker, 2002) * without verb inflections and auxiliaries * using context or adverbs of time to convey the time concept * less common use of the passive form • Frequent errors in using verb tenses and passive forms in English

  10. Objectives of the Study • To identify high-frequency language items in ICT specialized texts by focusing on five lexical and syntactic areas: • academic words • technical words • collocations • verb tenses • verb usage in the passive form • To obtain a set of language input for designing course materials for teaching English for ICT to non-ICT EFL students, using a corpus-based method.

  11. Research Questions 1. What are high-frequency academic words in ICT specialized texts? 2. What are high-frequency technical words in ICT specialized texts? 3. What are high-frequency collocations in ICT specialized texts? 4. What are high-frequency verb tenses in ICT specialized texts? 5. What are high-frequency usages of verbs in a passive form in ICT specialized texts?

  12. Research Methodology The methodology is divided into three main steps: 1. Text selection 2. Corpus compilation 3. Corpus-based analysis (studying the corpus with text-analysis software)

  13. Research Methodology Text Selection • Texts selected exclusively from web-based tutorials in ICT • Authors: mostly lecturers in universities and tutorial centers. • 5 topics concerning fundamental ICT knowledge: • Computer hardware • Operating systems and graphical user interfaces (OS and GUIs) • Basic application software • Multimedia software • Internet software • 3 text types: articles, manuals and advertisements (of hardware)

  14. Research Methodology Number of texts selected: total files = 230

  15. Research Methodology Number of words • 1,500–2,000 words per article • 700–1,000 words per manual • 200–500 words per advertisement • Total words = 287,478

  16. Design of the EICT Corpus Research Methodology

  17. The EICT Corpus Corpus compilation Research Methodology To compile the corpus, the selected texts are: • Converted into text files: .txt extension • Marked up with text documentation: topic, author, text type, word number, and source • Annotated with POS tagging, using the trial service of CLAWS, developed by UCREL at Lancaster University, UK (Available at http://ucrel.lancs.ac.uk/annotation.html#POS)

  18. The EICT Corpus Research Methodology Sample texts with markup and POS tagging <fileDesc> <topic> <OS&GUIs1> </topic> <expert><Research and IT, Calvin College, MI, USA> </expert> <texttype> <article></texttype> <wordsnum> <1574words> </wordsnum> <source> <http://www.calvin.edu/~rbobeldy/tutorials/os/basi> </source> </fileDesc> What_DTQ is_VBZ an_AT0 Operating_NN1 System_NN1 ?_? An_AT0 operating_NN1 system_NN1 is_VBZ a_AT0 group_NN1 of_PRF programs_NN2 that_CJT manage_VVB all_DT0 activities_NN2 on_PRP the_AT0 computer_NN1 ._. When_CJS you_PNP turn_VVB on_PRP a_AT0 computer_NN1 ,_, the_AT0 operating_NN1 system_NN1 programs_NN2 run_VVB and_CJC check_VVB to_TO0 be_VBI sure_AJ0 all_DT0 the_AT0 parts_NN2 of_PRF the_AT0 computer_NN1 are_VBB functioning_VVG properly_AV0 ._. Once_AV0 loaded_VVN ,_, the_AT0 operating_NN1 system_NN1 manages_VVZ all_DT0 activities_NN2 on_PRP the_AT0 computer_NN1 and_CJC the_AT0 interactions_NN2 with_PRP input_NN1 (_( keyboard_NN1 ,_, mouse_NN1 ,_, etc._AV0 )_) and_CJC output_NN1 devices_NN2 (_( printers_NN2 ,_, monitors_NN2 ,_, etc_AV0 ._. )_) ._. If_CJS you_PNP run_VVB a_AT0 program_NN1 like_PRP Microsoft_NP0 Word_NN1 ,_, the_AT0 operating_NN1 system_NN1 is_VBZ actually_AV0 managing_VVG how_AVQ you_PNP interact_VVB with_PRP Word_NN1 :_: how_AVQ you_PNP tell_VVB it_PNP what_DTQ font_NN1 to_TO0 use_VVI ,_, what_DTQ margins_NN2 you_PNP want_VVB ,_, and_CJC how_AVQ Word_NN1 prints_NN2 to_PRP the_AT0 printer_NN1 ._. An_AT0 operating_NN1 system_NN1 manages_VVZ all_DT0 :_: input_NN1 -_- getting_VVG information_NN1 into_PRP the_AT0 computer_NN1 from_PRP an_AT0 external_AJ0 source_NN1 such_PRP21 as_PRP22 the_AT0 keyboard_NN1 ,_, a_AT0 mouse_NN1 ,_, a_AT0 scanner_NN1 ,_, or_CJC a_AT0 disk_NN1 ._. processing_NN1 -_- after_PRP receiving_VVG input_NN1 ,_, the_AT0 computer_NN1 manipulates_VVZ or_CJC alters_VVZ the_AT0 data_NN0 ._.

  19. Text-analysis Software: WordSmith Tools Research Methodology • WordSmith Tools version 5.0 • Developed by Mike Scott (2007) • University of Liverpool, UK • www.lexically.net/wordsmith/index.html

  20. Reference Corpus Research Methodology According to Bowker and Pearson (2002), Hunston (2002), and Scott (2001): • To establish a word’s ‘keyness’, the frequency wordlist of a corpus should be compared with that of a larger reference corpus. • With the Log Likelihood formula, unusually frequent or infrequent words can be identified by their ‘keyness’ and its statistical significance (p value), i.e.: • Words with positive keyness occur unusually often. • Words with negative keyness occur unusually rarely.
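
As a rough illustration of the keyness idea on this slide, the sketch below computes a two-corpus log-likelihood score in the commonly used form that compares observed with expected frequencies. It is an assumption that this matches the software's calculation in every detail; WordSmith Tools may differ. The word counts in the example are hypothetical, while the corpus sizes are the ones given in this presentation. The p value is derived from a chi-square distribution with one degree of freedom.

```python
import math

def keyness(freq_study, size_study, freq_ref, size_ref):
    """Return (signed log-likelihood score, p value) for one word.

    The sign is positive when the word is relatively more frequent in the
    study corpus than in the reference corpus, and negative otherwise.
    """
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)

    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll = max(2 * ll, 0.0)  # guard against tiny negative rounding errors

    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(ll / 2))
    sign = 1 if freq_study / size_study >= freq_ref / size_ref else -1
    return sign * ll, p_value

# Hypothetical counts for one word; corpus sizes as stated in the slides.
score, p = keyness(freq_study=500, size_study=287_478,
                   freq_ref=2_000, size_ref=100_000_000)
is_key = score > 0 and p < 0.000001
print(score, p, is_key)
```

A word would count as a positive key word under the settings described later in the slides only if its score is positive and its p value falls below 0.000001.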

  21. Reference Corpus: BNC Research Methodology • British National Corpus (BNC) • A general corpus of 100 million words • Samples of written and spoken language from a wide range of sources • The BNC website is http://www.natcorp.ox.ac.uk • In the present study, the BNC wordlist is taken from WordSmith Tools

  22. Data Analysis and Findings The method of analysis is adapted from the suggestions of Bowker and Pearson (2002), and Scott (2001). • The method and findings are described according to the research questions. • What are high-frequency academic words in ICT specialized texts? • What are high-frequency technical words in ICT specialized texts? • What are high-frequency collocations in ICT specialized texts? • What are high-frequency verb tenses in ICT specialized texts? • What are high-frequency usages of verbs in a passive form in ICT specialized texts?

  23. Data Analysis and Findings Question 1: What are high-frequency academic words in ICT specialized texts? 1.1 Download GSL and AWL wordlists from the website of the University of Hertfordshire, UK at http://www.uefap.com/vocab/vocfram.htm. Use these words as academic word candidates (1,937 GSL headwords; 570 AWL headwords).

  24. Data Analysis and Findings 1.2 Build a wordlist of the EICT Corpus, yielding a total of 6,064 word types. 1.3 Use the academic word candidates to mark all GSL and AWL words in the corpus. Lemmatize them, resulting in 941 headwords of academic word candidates with ≥ 5 occurrences. Sort in alphabetical order

  25. Data Analysis and Findings 1.4 Compare the list of academic word candidates with the BNC wordlist, using the Log Likelihood formula at a p value of 0.000001. • The software is set: • To process with full lemma • To display only words with positive keyness

  26. Data Analysis and Findings Finding 1 From 941 words, 343 words with ≥ 5 occurrences, positive keyness, and a significant difference emerged as high-frequency academic words, excluding function words. Sort in alphabetical order Sort according to keyness

  27. Data Analysis and Findings Finding 1 • All 343 high-frequency academic words can be classified into 2 groups. • 1. 246 academic words: e.g. access, compute, illustrate, indicate, identify, manipulate, term, category, feature, occurrence, symbol etc. • 2. 97 semi-technical words: • 2.1 Words with technical senses or particular meanings e.g. burn, drive, refresh, card, domain, engine, memory, field, application, character, Word, document, window etc. • 2.2 Words in mathematics, geometric shapes and diagrams e.g. add, multiply, divide, axis, table, row, degree etc. • 2.3 Simple words frequently used as commands or methods e.g. edit, enable, paste, shift, help, enter, drag, drop etc.

  28. Data Analysis and Findings Question 2: What are high-frequency technical words in ICT specialized texts? Similarly to the method in Question 1: 2.1 Build a word frequency list of the whole EICT Corpus. 2.2 Exclude all function words and the high-frequency words from Finding 1. 2.3 Lemmatize the remaining words, resulting in 938 headwords. 2.4 Keep only words with ≥ 5 occurrences and technical meanings. 2.5 Compare the resulting wordlist with the BNC wordlist, using Log Likelihood at a p value of 0.000001.

  29. Data Analysis and Findings Finding 2 From 938 words, 267 words with ≥ 5 occurrences, positive keyness, and a significant difference are selected as high-frequency technical words. Sort according to keyness

  30. Data Analysis and Findings Finding 2 All 267 resulting words are classified into 5 groups: 1. 106 words with particular meanings (different from general meaning) e.g. cache, cookies, bus, port, bitmap, chip, cursor, pixel etc. 2. 61 abbreviations, acronyms, and extensions e.g. ASCII, WYSIWYG, ALU, ROM, RAM, OS, RGB, ESC, ALT, txt, doc, gif, wav, http, html, www etc. 3. 70 words concerning programs, commands and keys e.g. spreadsheet, database, notepad, wizard, Telnet, Apple, backspace, alternate, tab, deselect, browse, redo etc. 4. 17 words in mathematics, geometric shapes and diagrams e.g. equation, ellipse, polygon, cell, column, intersection etc.

  31. Data Analysis and Findings Question 3: What are high-frequency collocations in ICT specialized texts? 3.1 Set the software: • To produce concordances. • To display 2-5 word clusters with ≥ 5 co-occurrences • To compute the strength of relation between words, using Mutual Information (MI) ≥ 5.000

  32. Data Analysis and Findings 3.2 On the cluster tab, select only the 2-5 word clusters with technical meaning and frequent use.

  33. Data Analysis and Findings 3.3 Compute the relation value, on the collocate tab. Sort according to the relation value

  34. Data Analysis and Findings 3.4 Select only the collocations with ≥ 5 occurrences, MI scores ≥ 5.000, and distribution in ≥ 3 text files. • Collocations with technical meanings e.g. QWERTY layout, recycle bin, peripheral device, operating system (OS), uniform resource locator (URL), hypertext markup language (html) • Collocations with frequent use e.g. refer to, (be) referred to as, (be) concerned with, consist of, conform to, such as, in order, in order to (Note: on-going analysis)
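
The MI threshold used in steps 3.1 to 3.4 can be sketched as follows. This assumes the basic pointwise form of Mutual Information for adjacent word pairs, log2 of (joint frequency × corpus size) divided by the product of the two word frequencies; the exact computation in WordSmith Tools may differ (for example in how the collocation span is handled). The token list is synthetic, and the check for distribution in ≥ 3 text files is left out for brevity.

```python
import math
from collections import Counter

def bigram_mi(tokens, min_count=5, min_mi=5.0):
    """Return {(w1, w2): MI} for adjacent pairs meeting both thresholds."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    results = {}
    for (w1, w2), joint in bigrams.items():
        if joint < min_count:
            continue  # too rare to count as a collocation candidate
        mi = math.log2(joint * n / (unigrams[w1] * unigrams[w2]))
        if mi >= min_mi:
            results[(w1, w2)] = mi
    return results

# Synthetic corpus: "operating system" occurs 5 times among unique fillers.
tokens = []
for i in range(5):
    tokens += ["operating", "system"]
    tokens += [f"filler{i}_{j}" for j in range(100)]

scores = bigram_mi(tokens)
print(scores)
```

Extending the sketch to 3-5 word clusters would need a different association measure or a chained definition, which the slides do not specify.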

  35. Data Analysis and Findings Question 4: What are high-frequency verb tenses in ICT specialized texts? 4.1 Using the tagged corpus, produce the concordances of verbs, e.g. For simple present: VVB = base form of lexical verb (except infinitive), e.g. take, leave; VVZ = -s form of lexical verb, e.g. takes, leaves. For continuous tense: VVG = -ing form of lexical verb, e.g. taking, leaving.
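
A minimal sketch of how the verb tags could be counted in a tagged file, assuming the `word_TAG` format shown in the sample text on slide 18. The sample sentence is taken from that slide; the function names are hypothetical, and the study itself used concordances in WordSmith Tools rather than a script like this.

```python
from collections import Counter

def parse_tagged(text):
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the last underscore so punctuation such as '._.' works.
        word, sep, tag = item.rpartition("_")
        if sep:
            pairs.append((word, tag))
    return pairs

def tag_frequencies(pairs, tags=("VVB", "VVZ", "VVG")):
    """Count how often each of the selected tags occurs."""
    return Counter(tag for _, tag in pairs if tag in tags)

sample = ("When_CJS you_PNP turn_VVB on_PRP a_AT0 computer_NN1 ,_, "
          "the_AT0 operating_NN1 system_NN1 programs_NN2 run_VVB "
          "and_CJC check_VVB to_TO0 be_VBI sure_AJ0 all_DT0 the_AT0 "
          "parts_NN2 of_PRF the_AT0 computer_NN1 are_VBB "
          "functioning_VVG properly_AV0 ._.")

counts = tag_frequencies(parse_tagged(sample))
print(counts)
```

Raw tag counts only give candidates for each tense. A VVG form, for instance, needs a preceding form of "be" to count as continuous, which is why each concordance line still has to be checked by hand.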

  36. Data Analysis and Findings Samples of tagged concordances of ‘allow’ in simple present

  37. Data Analysis and Findings 4.2 Before counting the frequency, check that each concordance line belongs to the tense being studied, e.g. ‘Accept’ is a button. = noun; ‘Access’ is a name of a database program. = noun

  38. Data Analysis and Findings 4.3 Compare the frequency and the dispersion value of each tense. Sample dispersion of the simple present, which is spread uniformly across all subcorpora. Note: on-going analysis

  39. Data Analysis and Findings Question 5: What are high-frequency usages of verbs in a passive form in ICT specialized texts? Similarly to the method in Question 4: 5.1 Using the tagged corpus, produce the concordances of past participles, i.e. VVN = past participle of lexical verb e.g. taken, left 5.2 Classify the usage of past participles as parts of tenses or as modifiers. 5.3 Compare the frequency and dispersion of each usage. (Note: on-going analysis)
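
Step 5.2 could be approximated with a simple rule of thumb, sketched below. This is an assumed heuristic and not the study's procedure: a VVN preceded by a form of "be" (tags starting with VB) is labelled passive, one preceded by a form of "have" (tags starting with VH) is labelled perfect, and anything else is treated as a modifier or other use. Intervening adverbs (AV0) and "not" (XX0) are skipped. The first two sample sentences are invented for illustration; the third begins like a sentence on slide 18.

```python
SKIPPABLE = {"AV0", "XX0"}  # adverbs and "not" may sit before the participle

def parse_tagged(text):
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, sep, tag = item.rpartition("_")
        if sep:
            pairs.append((word, tag))
    return pairs

def classify_participles(pairs):
    """Label each VVN as 'passive', 'perfect', or 'modifier_or_other'."""
    labels = []
    for i, (word, tag) in enumerate(pairs):
        if tag != "VVN":
            continue
        j = i - 1
        while j >= 0 and pairs[j][1] in SKIPPABLE:
            j -= 1  # look past adverbs and negation
        prev_tag = pairs[j][1] if j >= 0 else ""
        if prev_tag.startswith("VB"):
            label = "passive"
        elif prev_tag.startswith("VH"):
            label = "perfect"
        else:
            label = "modifier_or_other"
        labels.append((word, label))
    return labels

sample = ("The_AT0 file_NN1 is_VBZ automatically_AV0 saved_VVN ._. "
          "The_AT0 user_NN1 has_VHZ saved_VVN the_AT0 file_NN1 ._. "
          "Once_AV0 loaded_VVN ,_, the_AT0 operating_NN1 system_NN1 "
          "manages_VVZ all_DT0 activities_NN2 ._.")

labels = classify_participles(parse_tagged(sample))
print(labels)
```

The rule will mislabel some cases, such as adjectival participles after "be", so its output would still need the manual check described in step 4.2.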

  40. Discussion • Significance of the study: • Provides an overall language description of English for ICT. • Provides a clear goal of language learning that serves particular learning needs. In materials design, the teacher knows which language items should be focused on in designing lessons and which ones are already known by the students. Apart from typical teaching materials, a corpus itself can also be a great source of learning. Giving students direct access to the corpus can promote data-driven learning.

  41. References Bowker, L. and Pearson, J. (2002). Working with Specialized Language: A Practical Guide to Using Corpora. USA and UK: Routledge. Cobb, T. and Horst, M. (2001). Reading academic English: Carrying learners across the lexical threshold. In Flowerdew, J. and Peacock, M. (eds.) Research Perspectives on English for Academic Purposes. pp. 315-329. UK: Cambridge University Press. Coxhead, A. (1998). An Academic Word List. ELI occasional publication. No.18. Victoria University of Wellington, New Zealand. Coxhead, A. and Nation, P. (2001). The specialized vocabulary of English for academic purposes. In Flowerdew, J. and Peacock, M. (eds.) Research Perspectives on English for Academic Purposes. pp. 252-267. UK: Cambridge University Press. Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Laufer, B. (1989). What percentage of text-lexis is essential for comprehension? Cited in Cobb, T. and Horst, M. (2001). Reading academic English: Carrying learners across the lexical threshold. In Flowerdew, J. and Peacock, M. (eds.) Research Perspectives on English for Academic Purposes. pp. 315-329. UK: Cambridge University Press. Mudraya, O. (2006). Engineering English: A lexical frequency instructional model. English for Specific Purposes. Volume 25 (2) pp. 235-256. Elsevier Science.

  42. References Nation, P. (2001). Learning Vocabulary in Another Language. Cambridge: Cambridge University Press. Nation, P. and Waring, R. (1997). Vocabulary size, text coverage and word lists. In Schmitt, N. and McCarthy, M. (eds.) Vocabulary: Description, Acquisition and Pedagogy. pp. 6-19. Cambridge: Cambridge University Press. Nurweni, A. and Read, J. (1999). The English vocabulary knowledge of Indonesian university students. English for Specific Purposes. Volume 18 (2) pp. 161-175. Elsevier Science. Scott, M. (2001). Comparing corpora and identifying key words, collocations, frequency distributions through the WordSmith Tools suite of computer programs. In Ghadessy, M., Henry, A., and Roseberry, R.L. (eds.) Small Corpus Studies and ELT: Theory and Practice. pp. 47-67. US: John Benjamins Publishing. Scott, M. (2007). WordSmith Tools version 5.0. Oxford University Press. Available at http://www.lexically.net/wordsmith/index.html. Supatranont, P. (2005a). Classroom concordancing: Increasing vocabulary size for academic reading. KOTESOL Proceedings 2005. pp. 35-44. South Korea. Supatranont, P. (2005b). A Comparison of the Effects of the Concordance-based and the Conventional Teaching Methods on Engineering Students’ English Vocabulary Learning. Online Ph.D. Dissertation, Program of English as an International Language, Chulalongkorn University, Thailand. Available at http://www.arts.chula.ac.th/~ling/thesis/Pisamai2548.pdf West, M. (1953). A General Service List of English Words. London: Longman, Green and Company.

  43. Thank you for your attention. Any Questions?
