Statistical NLP: Lecture 6


Presentation Transcript


  1. Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

  2. Corpus-Based Work • Text corpora are usually big. They also need to be representative samples of the population of interest. • Corpus-based work involves collecting a large number of counts from corpora, and these counts need to be accessed quickly. • Some software exists for processing corpora (see the useful links on the course homepage).
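
A minimal sketch of the "large number of counts, accessed quickly" point, in Python: counts are kept in a hash-backed Counter so each update and lookup is a constant-time dictionary operation. The count_tokens helper and the tiny in-memory sample are made up for illustration, not taken from the lecture.

    from collections import Counter

    def count_tokens(lines):
        """Accumulate token counts in a hash table so lookups stay fast."""
        counts = Counter()
        for line in lines:
            counts.update(line.split())   # naive whitespace split; proper tokenization comes later
        return counts

    sample = ["the cat sat on the mat", "the dog sat"]
    counts = count_tokens(sample)
    print(counts.most_common(3))          # [('the', 3), ('sat', 2), ('cat', 1)]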

  3. Corpora • Linguistically marked up or not • Representative sample of the population of interest • American English vs. British English • Written vs. spoken • Subject areas • The performance of a system depends heavily on the entropy of the text • Text categorization • Balanced corpus vs. all available text
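
To make the entropy bullet concrete, here is a small hedged sketch in Python that estimates the unigram entropy of a word list; lower entropy (more repetitive, narrower text) generally means an easier modeling task. The unigram_entropy name and the toy sample are illustrative assumptions, not course material.

    import math
    from collections import Counter

    def unigram_entropy(tokens):
        """H = -sum over word types of p(w) * log2 p(w), with p estimated by relative frequency."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # A narrow, repetitive sample tends to show lower entropy than balanced text.
    narrow = "buy buy sell buy sell hold".split()
    print(round(unigram_entropy(narrow), 2))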

  4. Software/Coding • Software • Text editors • Regular expressions • Programming languages • C/C++, Perl, awk, Python, Prolog, Java • Coding • Mapping words to numbers • Hashing • CMU-Cambridge Statistical Language Modeling Toolkit
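
A minimal sketch of "mapping words to numbers" via hashing: a Python dict assigns each new word type the next free integer ID, so later lookups are constant time. The build_vocab helper is a made-up name for illustration.

    def build_vocab(tokens):
        """Assign each word type a small integer ID in order of first appearance."""
        word2id = {}
        for tok in tokens:
            if tok not in word2id:
                word2id[tok] = len(word2id)
        return word2id

    tokens = "the cat sat on the mat".split()
    vocab = build_vocab(tokens)
    print(vocab)                       # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
    print([vocab[t] for t in tokens])  # [0, 1, 2, 3, 0, 4]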

  5. Looking at Text (I): Low-Level Formatting Issues • Mark-up of a text • Formatting mark-up or explicit mark-up • Junk formatting/content. Examples: document headers and separators, typesetter codes, tables and diagrams, garbled data in the computer file. There are also other problems if the data was retrieved through OCR (unrecognized words). Often one needs a filter to remove junk content before any processing begins. • Uppercase and lowercase: should we keep the case or not? The, the and THE should all be treated the same, but “brown” in “George Brown” and “brown dog” should be treated separately.
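
One possible heuristic for the case question above, sketched in Python: lowercase only the sentence-initial token, so that The and the merge while a mid-sentence capital such as Brown in George Brown is left alone. The normalize_case helper is illustrative and will of course mishandle sentences that begin with a proper noun.

    import re

    def normalize_case(sentence):
        """Lowercase only the sentence-initial token; keep later capitalized tokens untouched."""
        tokens = sentence.split()
        if tokens and re.match(r"^[A-Z][a-z]*$", tokens[0]):
            tokens[0] = tokens[0].lower()
        return tokens

    print(normalize_case("The brown dog saw George Brown"))
    # ['the', 'brown', 'dog', 'saw', 'George', 'Brown']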

  6. Looking at Text (II): Tokenization. What is a Word? • An early step of processing is to divide the input text into units called tokens, where each is either a word or something else like a number or a punctuation mark. • Periods: haplologies or end of sentence? • White spaces • Periods: etc., 먹었다 하였다. (a Korean sentence ending in a period), 6.7, 3.1절 (Section 3.1) • Single apostrophes: isn’t, I’ll → 2 words or 1 word? • Hyphenation: text-based, co-operation, e-mail, A-1-plus paper, “take-it-or-leave-it”, the 90-cent-an-hour raise, mark up → mark-up → mark(ed) up • Homographs → two lexemes: “saw” • 26.3$, www.hyowon.pusan.ac.kr, MicroSoft, :-), “책, ‘그’ 책” (Korean: “book, ‘that’ book”)
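
A rough regular-expression tokenizer in the spirit of this slide; the pattern below (abbreviations with internal periods, decimal numbers, hyphenated and apostrophe-bearing words, single punctuation marks) is an illustrative assumption rather than a complete answer to the questions raised above.

    import re

    TOKEN_RE = re.compile(r"""
        [A-Za-z]\.(?:[A-Za-z]\.)+     # abbreviations such as U.S.
      | \d+(?:\.\d+)?                 # numbers such as 6.7
      | \w+(?:[-']\w+)*               # words, including co-operation and isn't
      | [^\w\s]                       # any other single non-space character
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("It isn't a text-based e-mail, is it? See section 6.7."))
    # ['It', "isn't", 'a', 'text-based', 'e-mail', ',', 'is', 'it', '?', 'See', 'section', '6.7', '.']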

  7. Looking at Text (III): Tokenization. What is a Word (Cont’d)? • Word segmentation in other languages: no whitespace ==> word segmentation is hard • Whitespace not indicating a word break: • New York, data base • the New York-New Haven railroad • Variant coding of information of a certain semantic type: • +45 43 48 60 60, (202) 522-2230, 33 1 34 43 32 26, (44.171) 830 1007 • Speech corpora: • er, um
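
Where whitespace is absent, one classic baseline is greedy maximum matching against a lexicon, sketched below. The toy lexicon and the unspaced input string are made up; real segmenters use large dictionaries plus statistical models.

    def max_match(text, lexicon):
        """Greedy longest-match segmentation; a single character is the fallback unit."""
        tokens = []
        i = 0
        while i < len(text):
            for j in range(len(text), i, -1):           # try the longest remaining span first
                if text[i:j] in lexicon or j == i + 1:  # fall back to one character
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens

    lexicon = {"in", "new", "york", "newyork"}
    print(max_match("innewyork", lexicon))   # ['in', 'newyork']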

  8. Morphology • Stemming: strips off affixes • sit, sits, sat • Lemmatization: transforms a word into its base form (lemma, lexeme) • Disambiguation • Not always helpful in English (from an IR point of view), which has very little morphology • !! Stemming does not help the performance of classical IR • business → busy • Perhaps more useful in other contexts • Multiple words → a morpheme ??? • Richer inflectional and derivational systems • Bantu language: KiHaya • akabimu’ha (a-ka-bi-mu’-ha, 1SG-PAST-3PL-3SG-give) • “I gave them to him.” • Finnish • Millions of inflected forms for each verb
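
To see why stemming can hurt classical IR, the deliberately crude suffix stripper below (a toy, not the Porter algorithm) collapses business and busy onto the same stem, mirroring the business → busy bullet above.

    SUFFIXES = ["iness", "ness", "ing", "ed", "es", "s", "y"]

    def crude_stem(word):
        """Strip the first matching suffix, keeping at least three characters of stem."""
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= 3:
                return word[: -len(suf)]
        return word

    for w in ["sits", "sitting", "business", "busy"]:
        print(w, "->", crude_stem(w))
    # sits -> sit, sitting -> sitt, business -> bus, busy -> bus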

  9. Sentences: What is a sentence? • Something ending with a ‘.’, ‘?’ or ‘!’. True in about 90% of the cases. • Sometimes, however, sentences are split up by other punctuation marks or quotes. • Often, solutions involve heuristic methods; however, these solutions are hand-coded. Some efforts have also been made to automate sentence-boundary detection. • “You remind me,” she remarked, “of your mother.” • Korean is even more difficult!!! • Sometimes there is no period → put the boundary after a sentence-final ending? • Endings that are both connective and sentence-final • Quotation marks

  10. End-of-Sentence Detection (I) • Place EOS after all . ? ! (and maybe ; : -) • Move EOS after quotation marks, if any • Disqualify a period boundary if: – it is preceded by a known abbreviation that is commonly followed by an upper-case name and is not normally sentence-final, e.g., Prof., vs., Mr.

  11. End-of-Sentence Detection (II) – it is preceded by a known abbreviation that is not followed by an upper-case letter, e.g., Jr., etc. (abbreviations that can be sentence-final or sentence-medial) • Disqualify a sentence boundary with ? or ! if it is followed by a lower-case letter (or a known name) • Keep all the rest as EOS
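
A simplified Python sketch of the heuristic on the last two slides; quotation-mark movement is omitted and the abbreviation lists are illustrative stand-ins for a real lexicon.

    ABBREV_NONFINAL = {"Prof.", "Mr.", "Dr."}   # usually followed by a name, rarely sentence-final
    ABBREV_EITHER = {"Jr.", "etc."}             # may be sentence-final or sentence-medial

    def detect_sentence_ends(tokens):
        """Return indices of tokens treated as end-of-sentence boundaries."""
        ends = []
        for i, tok in enumerate(tokens):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if tok in {"?", "!"}:
                if not nxt[:1].islower():       # disqualified when a lower-case word follows
                    ends.append(i)
            elif tok.endswith("."):
                if tok in ABBREV_NONFINAL and nxt[:1].isupper():
                    continue                    # e.g. "Prof. Smith" stays inside the sentence
                if tok in ABBREV_EITHER and not nxt[:1].isupper():
                    continue                    # e.g. "etc. and so on" is sentence-medial
                ends.append(i)
        return ends

    tokens = "Prof. Smith met Mr. Brown . They left .".split()
    print(detect_sentence_ends(tokens))         # [5, 8]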

  12. Marked-Up Data I: Mark-up Schemes • Schemes developed to mark up the structure of text • Different mark-up schemes: – COCOA format (older and rather ad hoc) – SGML [other related encodings: HTML, TEI, XML] • DTD, XML Schema
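
For illustration only, here is a made-up XML fragment in a TEI-like spirit, parsed with Python's standard xml.etree module; the element and attribute names are assumptions, and a real mark-up scheme would fix the legal tags in a DTD or XML Schema.

    import xml.etree.ElementTree as ET

    doc = """<text>
      <s id="s1"><w pos="AT">The</w> <w pos="NN">dog</w> <w pos="VBD">barked</w></s>
    </text>"""

    root = ET.fromstring(doc)
    for sent in root.iter("s"):                 # iterate over sentence elements
        words = [(w.text, w.get("pos")) for w in sent.iter("w")]
        print(sent.get("id"), words)
    # s1 [('The', 'AT'), ('dog', 'NN'), ('barked', 'VBD')]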

  13. Marked-Up Data II: Grammatical Coding • Tagging indicates the conventional parts of speech. Tagging can be done automatically (we will talk about that in Week 9). • Different tag sets have been used: e.g., the Brown tag set, the Penn Treebank tag set. • Discussion of Tables 4.4 and 4.5 • The design of a tag set: target features versus predictive features. • Discussion of Korean tag sets • Explained with the example of distinguishing auxiliary verbs from main verbs • ETRI, KAIST, …
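
Tagged corpora are often distributed in a simple word/TAG notation; the sketch below reads one such line into (word, tag) pairs. The tags follow the Penn Treebank names mentioned above, and the read_tagged helper is a made-up name for illustration.

    tagged = "The/DT dog/NN barked/VBD ./."

    def read_tagged(line):
        """Split word/TAG pairs; rsplit keeps tokens like './.' intact."""
        return [tuple(pair.rsplit("/", 1)) for pair in line.split()]

    print(read_tagged(tagged))
    # [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('.', '.')]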
