Content Analysis and Statistical Properties of Text

Ray Larson & Marti Hearst, University of California, Berkeley, School of Information Management and Systems. SIMS 202: Information Organization and Retrieval.

Presentation Transcript


  1. Content Analysis and Statistical Properties of Text • Ray Larson & Marti Hearst • University of California, Berkeley, School of Information Management and Systems • SIMS 202: Information Organization and Retrieval

  2. Today • Overview of Content Analysis • Text Representation • Statistical Characteristics of Text Collections • Zipf distribution • Statistical dependence

  3. Content Analysis • Automated transformation of raw text into a form that represents some aspect(s) of its meaning • Including, but not limited to: • Automated Thesaurus Generation • Phrase Detection • Categorization • Clustering • Summarization

  4. Techniques for Content Analysis • Statistical • Single Document • Full Collection • Linguistic • Syntactic • Semantic • Pragmatic • Knowledge-Based (Artificial Intelligence) • Hybrid (Combinations)

  5. Text Processing • Standard Steps: • Recognize document structure • titles, sections, paragraphs, etc. • Break into tokens • usually delimited by spaces and punctuation • special issues with Asian languages • Stemming/morphological analysis • Store in inverted index (to be discussed later)
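
The token-breaking step can be sketched as follows. This is a minimal illustration, assuming that tokens are runs of letters, digits, or apostrophes and that lowercasing is wanted; the function name `tokenize` is hypothetical, and the approach does not address languages written without spaces.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens on whitespace and punctuation (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Now is the time for all good men.")
```

A real pipeline would recognize document structure first and apply stemming afterwards; this sketch covers only the tokenization step.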

  6. [Diagram with labels: Index, Query, Parse, Rank, Pre-process, Information need, Collections, text input] • How is the query constructed? • How is the text processed?

  7. Document Processing Steps

  8. Stemming and Morphological Analysis • Goal: “normalize” similar words • Morphology (“form” of words) • Inflectional Morphology • E.g., inflect verb endings and noun number • Never change grammatical class • dog, dogs • tengo, tienes, tiene, tenemos, tienen • Derivational Morphology • Derive one word from another • Often change grammatical class • build, building; health, healthy

  9. Automated Methods • Powerful multilingual tools exist for morphological analysis • PCKimmo, Xerox Lexical technology • Require a grammar and dictionary • Use “two-level” automata • Stemmers: • Very dumb rules work well (for English) • Porter Stemmer: Iteratively remove suffixes • Improvement: pass results through a lexicon
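
To illustrate the idea of iteratively removing suffixes, here is a toy sketch. It is not the Porter algorithm: the suffix list and the minimum stem length are assumptions chosen for illustration, and `strip_suffixes` is a hypothetical name.

```python
# Toy suffix stripper; NOT the Porter stemmer. The suffix list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "es", "s", "y")

def strip_suffixes(word, min_stem=3):
    """Repeatedly remove a known suffix while the remaining stem stays long enough."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word
```

Even this crude rule set maps "dogs" to "dog" and "building" to "build", but it will also conflate unrelated words, which is why passing the results through a lexicon helps.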

  10. Errors Generated by Porter Stemmer (Krovetz 93)

  11. Statistical Properties of Text • Token occurrences in text are not uniformly distributed • They are also not normally distributed • They do exhibit a Zipf distribution

  12. A More Standard Collection • Government documents, 157734 tokens, 32259 unique • Most frequent tokens (count token): 8164 the; 4771 of; 4005 to; 2834 a; 2827 and; 2802 in; 1592 The; 1370 for; 1326 is; 1324 s; 1194 that; 973 by; 969 on; 915 FT; 883 Mr; 860 was; 855 be; 849 Pounds; 798 TEXT; 798 PUB; 798 PROFILE; 798 PAGE; 798 HEADLINE; 798 DOCNO • Tokens occurring once (count token): 1 ABC; 1 ABFT; 1 ABOUT; 1 ACFT; 1 ACI; 1 ACQUI; 1 ACQUISITIONS; 1 ACSIS; 1 ADFT; 1 ADVISERS; 1 AE

  13. Plotting Word Frequency by Rank • Main idea: count • How many times tokens occur in the text • Over all texts in the collection • Now order the tokens by how often they occur; a token’s position in this ordering is its rank.
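
The counting and ranking described here can be sketched as below. This is a minimal illustration, assuming whitespace tokenization and lowercasing; `rank_frequencies` is a hypothetical name, and ties are left in the order the tokens were first seen.

```python
from collections import Counter

def rank_frequencies(texts):
    """Count tokens over all texts and return (rank, token, frequency) rows, most frequent first."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [(rank, token, freq)
            for rank, (token, freq) in enumerate(counts.most_common(), start=1)]

rows = rank_frequencies(["the cat saw the dog", "the dog ran"])
```

Plotting the frequency column against the rank column of `rows` gives the kind of curve the following slides discuss.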

  14. Most and Least Frequent Terms • Most frequent (rank freq term): 1 37 system; 2 32 knowledg; 3 24 base; 4 20 problem; 5 18 abstract; 6 15 model; 7 15 languag; 8 15 implem; 9 13 reason; 10 13 inform; 11 11 expert; 12 11 analysi; 13 10 rule; 14 10 program; 15 10 oper; 16 10 evalu; 17 10 comput; 18 10 case; 19 9 gener; 20 9 form • Least frequent (rank freq term): 150 2 enhanc; 151 2 energi; 152 2 emphasi; 153 2 detect; 154 2 desir; 155 2 date; 156 2 critic; 157 2 content; 158 2 consider; 159 2 concern; 160 2 compon; 161 2 compar; 162 2 commerci; 163 2 clause; 164 2 aspect; 165 2 area; 166 2 aim; 167 2 affect

  15. The Corresponding Zipf Curve • Rank freq term: 1 37 system; 2 32 knowledg; 3 24 base; 4 20 problem; 5 18 abstract; 6 15 model; 7 15 languag; 8 15 implem; 9 13 reason; 10 13 inform; 11 11 expert; 12 11 analysi; 13 10 rule; 14 10 program; 15 10 oper; 16 10 evalu; 17 10 comput; 18 10 case; 19 9 gener; 20 9 form

  16. Zoom in on the Knee of the Curve • Rank freq term: 43 6 approach; 44 5 work; 45 5 variabl; 46 5 theori; 47 5 specif; 48 5 softwar; 49 5 requir; 50 5 potenti; 51 5 method; 52 5 mean; 53 5 inher; 54 5 data; 55 5 commit; 56 5 applic; 57 4 tool; 58 4 technolog; 59 4 techniqu

  17. Zipf Distribution • The Important Points: • a few elements occur very frequently • a medium number of elements have medium frequency • many elements occur very infrequently

  18. Zipf Distribution • The product of the frequency of words (f) and their rank (r) is approximately constant • Rank = order of words’ frequency of occurrence • Another way to state this is with an approximately correct rule of thumb: • Say the most common term occurs C times • The second most common occurs C/2 times • The third most common occurs C/3 times • …
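
The rule of thumb can be checked numerically by multiplying each frequency by its rank. The sketch below uses made-up counts that follow C/r exactly with C = 60 (not data from a real corpus); `rank_times_frequency` is a hypothetical helper name.

```python
def rank_times_frequency(frequencies):
    """Return r * f for each frequency, with ranks assigned from most to least frequent."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]

# Idealized counts following C/r with C = 60 (illustrative numbers only).
ideal = [60, 30, 20, 15, 12]
products = rank_times_frequency(ideal)
```

For the idealized counts every product equals C. On real collections the products only hover around a constant, since the relationship is approximate.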

  19. Zipf Distribution (linear and log scale)

  20. What Kinds of Data Exhibit a Zipf Distribution? • Words in a text collection • Virtually any language usage • Library book checkout patterns • Incoming Web Page Requests (Nielsen) • Outgoing Web Page Requests (Cunha & Crovella) • Document Size on Web (Cunha & Crovella)

  21. Related Distributions/“Laws” • Bradford’s Law of Scattering • Lotka’s Law of Productivity • De Solla Price’s Urn Model for “Cumulative Advantage Processes” • [Diagram: pick, replace +1, pick, replace +1; proportions 1/2 = 50%, 2/3 ≈ 67%, 3/4 = 75%]

  22. Very frequent word stems (Cha-Cha Web Index)

  23. Words that occur few times (Cha-Cha Web Index)

  24. Consequences of Zipf • There are always a few very frequent tokens that are not good discriminators. • Called “stop words” in IR • Usually correspond to the linguistic notion of “closed-class” words • Grammatical classes that don’t take on new members. • English examples: to, from, on, and, the, ... • There are always a large number of tokens that occur once and can mess up algorithms. • Medium-frequency words are the most descriptive
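
One way to act on these consequences is to partition the vocabulary by raw count. The sketch below is an assumption-laden illustration: the cutoffs are arbitrary parameters the caller must choose (the slides give none), and `split_by_frequency` is a hypothetical name.

```python
from collections import Counter

def split_by_frequency(tokens, high_cutoff, low_cutoff=1):
    """Partition the vocabulary into very frequent, medium, and rare tokens by raw count."""
    counts = Counter(tokens)
    frequent = {t for t, c in counts.items() if c >= high_cutoff}
    rare = {t for t, c in counts.items() if c <= low_cutoff}
    medium = set(counts) - frequent - rare
    return frequent, medium, rare

frequent, medium, rare = split_by_frequency("the the the the cat cat dog".split(), high_cutoff=4)
```

The `frequent` set approximates a stop-word list, `rare` holds the tokens that occur once, and `medium` keeps the candidates for descriptive index terms.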

  25. Word Frequency vs. Resolving Power (from van Rijsbergen 79) • The most frequent words are not the most descriptive.

  26. Statistical Independence vs. Statistical Dependence • How likely is a red car to drive by given we’ve seen a black one? • How likely is the word “ambulance” to appear, given that we’ve seen “car accident”? • Colors of cars driving by are independent (although more frequent colors are more likely) • Words in text are not independent (although again more frequent words are more likely)

  27. Statistical Independence • Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of their happening together.
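
The definition amounts to checking whether P(x) · P(y) = P(x, y). The sketch below tests that equality with exact fractions; the word probabilities are hypothetical values chosen for illustration, and `is_independent` is a made-up helper name.

```python
from fractions import Fraction

def is_independent(p_x, p_y, p_xy):
    """True when P(x) * P(y) equals P(x, y) exactly."""
    return p_x * p_y == p_xy

# Two fair coin flips: each comes up heads with probability 1/2, both with 1/4.
coins = is_independent(Fraction(1, 2), Fraction(1, 2), Fraction(1, 4))
# Hypothetical word probabilities where the pair co-occurs more often than chance.
words = is_independent(Fraction(1, 100), Fraction(1, 100), Fraction(1, 1000))
```

With probabilities estimated from data, exact equality is too strict; in practice one would compare the two sides within a tolerance or use a statistical test.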

  28. Statistical Independence and Dependence • What are examples of things that are statistically independent? • What are examples of things that are statistically dependent?

  29. Lexical Associations • Subjects write first word that comes to mind • doctor/nurse; black/white (Palermo & Jenkins 64) • Text Corpora yield similar associations • One measure: Mutual Information (Church and Hanks 89) • If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
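
The mutual information measure is usually written I(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below estimates it from raw counts, assuming each probability is a count divided by the corpus size N; the function and parameter names are hypothetical.

```python
import math

def mutual_information(count_x, count_y, count_xy, n):
    """Estimate log2(P(x,y) / (P(x) P(y))) from counts, with each P taken as count / n."""
    # Algebraically: (count_xy / n) / ((count_x / n) * (count_y / n)) = count_xy * n / (count_x * count_y)
    return math.log2(count_xy * n / (count_x * count_y))

independent = mutual_information(10, 10, 1, 100)  # joint count matches chance
associated = mutual_information(10, 10, 4, 100)   # joint count is 4x chance
```

A score near zero means the pair co-occurs about as often as independence predicts; a large positive score signals an association.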

  30. Statistical Independence • Compute for a window of words • [Diagram: word sequence a b c d e f g h i j k l m n o p, with positions w1, w11, w21 marked]
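
Counting co-occurrences within a window can be sketched as follows. This is one possible reading, assuming ordered pairs and a window that covers a word plus the words immediately after it; the window size and the name `cooccurrences` are assumptions.

```python
from collections import Counter

def cooccurrences(tokens, window=5):
    """Count ordered pairs (x, y) where y appears after x within a window of `window` words starting at x."""
    pairs = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pairs[(x, y)] += 1
    return pairs

pairs = cooccurrences(list("abcd"), window=2)
```

The resulting pair counts, together with the single-word counts, are the inputs an association measure needs.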

  31. Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

  32. Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) • These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

  33. Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • A vector is like an array of floating-point numbers • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse
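
A bag-of-words vector with one position per collection term can be sketched like this. It assumes whitespace tokenization and raw counts as the vector entries; `document_vectors` and the two tiny documents are hypothetical.

```python
from collections import Counter

def document_vectors(docs):
    """Map each document id to a count vector with one position per vocabulary term."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {doc_id: [c[term] for term in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, vectors

vocabulary, vectors = document_vectors({"A": "nova galaxy nova", "I": "film fur film"})
```

Even in this two-document example half of each vector is zero, which is the sparsity the slide refers to; real systems store only the non-zero entries.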

  34. Document Vectors • One location for each word. • [Table: document ids A B C D E F G H I; terms nova, galaxy, heat, h’wood, film, role, diet, fur; counts as extracted, with blank cells lost: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3] • “Nova” occurs 10 times in text A • “Galaxy” occurs 5 times in text A • “Heat” occurs 3 times in text A • (Blank means 0 occurrences.)

  35. Document Vectors • One location for each word. • [Table: document ids A B C D E F G H I; terms nova, galaxy, heat, h’wood, film, role, diet, fur; counts as extracted, with blank cells lost: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3] • “Hollywood” occurs 7 times in text I • “Film” occurs 5 times in text I • “Diet” occurs 1 time in text I • “Fur” occurs 3 times in text I

  36. Document Vectors • [Table: document ids A B C D E F G H I; terms nova, galaxy, heat, h’wood, film, role, diet, fur; counts as extracted, with blank cells lost: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3]

  37. We Can Plot the Vectors • [Plot: axes “Star” and “Diet”; points labeled Doc about movie stars, Doc about astronomy, Doc about mammal behavior]

  38. Documents in 3D Space

  39. Content Analysis Summary • Content Analysis: transforming raw text into more computationally useful forms • Words in text collections exhibit interesting statistical properties • Word frequencies have a Zipf distribution • Word co-occurrences exhibit dependencies • Text documents are transformed to vectors • Pre-processing includes tokenization, stemming, collocations/phrases • Documents occupy multi-dimensional space.

  40. [Diagram with labels: Index, Query, Parse, Rank, Pre-process, Information need, Collections, text input] • How is the index constructed?

  41. Inverted Index • This is the primary data structure for text indexes • Main Idea: • Invert documents into a big index • Basic steps: • Make a “dictionary” of all the tokens in the collection • For each token, list all the docs it occurs in. • Do a few things to reduce redundancy in the data structure
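
The first two basic steps can be sketched with an in-memory mapping from token to document ids. This is a minimal illustration, assuming whitespace tokenization and integer document ids; `build_inverted_index` is a hypothetical name, and the redundancy-reduction step is not shown.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

docs = {1: "now is the time", 2: "the time was past midnight"}
index = build_inverted_index(docs)
```

Using a set per token means a word repeated within one document is listed once, which is one simple form of the redundancy reduction the slide mentions.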

  42. Inverted Indexes • We have seen “Vector files” conceptually. • An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

  43. How Are Inverted Files Created • Documents are parsed to extract tokens. These are saved with the Document ID. • Doc 1: Now is the time for all good men to come to the aid of their country • Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

  44. How Inverted Files are Created • After all documents have been parsed, the inverted file is sorted alphabetically.

  45. How Inverted Files are Created • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled.
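
The parse, sort, and merge steps can be sketched in a few lines. This is an illustrative sketch, assuming whitespace tokenization and in-memory sorting; `merged_postings` and the shortened example documents are hypothetical.

```python
from itertools import groupby

def merged_postings(docs):
    """Emit (term, doc_id) pairs, sort them, then merge repeats into (term, doc_id, frequency)."""
    pairs = sorted((token, doc_id)
                   for doc_id, text in docs.items()
                   for token in text.lower().split())
    return [(term, doc_id, len(list(group)))
            for (term, doc_id), group in groupby(pairs)]

rows = merged_postings({1: "to come to the aid", 2: "the country manor"})
```

Because the pairs are sorted first, identical (term, document) entries sit next to each other, so a single pass is enough to merge them and count the within-document frequency.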

  46. How Inverted Files are Created • Then the file can be split into • A Dictionary file and • A Postings file
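
One possible layout for the split is sketched below. The layout is an assumption, not taken from the slides: the dictionary maps each term to its document count and an offset into a flat postings list, and `split_index` is a hypothetical name.

```python
def split_index(rows):
    """Split sorted (term, doc_id, freq) rows into a dictionary and a postings list.

    Assumed layout: dictionary[term] = [number of docs, offset into postings];
    postings holds (doc_id, freq) entries grouped by term.
    """
    dictionary, postings = {}, []
    for term, doc_id, freq in rows:
        if term not in dictionary:
            dictionary[term] = [0, len(postings)]
        dictionary[term][0] += 1
        postings.append((doc_id, freq))
    return dictionary, postings

rows = [("country", 1, 1), ("country", 2, 1), ("manor", 2, 1), ("time", 1, 1), ("time", 2, 1)]
dictionary, postings = split_index(rows)
```

To look up a term, read its count and offset from the dictionary and slice that many entries out of the postings list, for example `postings[3:5]` for "time".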

  47. How Inverted Files are Created • [Tables: Dictionary and Postings]

  48. Inverted indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • These lists can be used to solve Boolean queries: • country -> d1, d2 • manor -> d2 • country AND manor -> d2 • Also used for statistical ranking algorithms

  49. How Inverted Files are Used • Query on “time” AND “dark” • 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file • 1 doc with “dark” in dictionary -> ID 2 from posting file • Therefore, only doc 2 satisfies the query. • [Tables: Dictionary and Postings]
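
A Boolean AND over two posting lists can be sketched as a merge-style intersection. The sketch assumes each posting list is a sorted list of document ids; `intersect` and the small `index` mapping are hypothetical stand-ins for the dictionary and postings files.

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted lists of document ids."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

index = {"country": [1, 2], "manor": [2], "time": [1, 2], "dark": [2]}
hits = intersect(index["time"], index["dark"])
```

Walking both lists in step keeps the cost proportional to their combined length, which is one reason posting lists are kept sorted by document id.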

  50. Next Time • Term weighting • Statistical ranking