
Representing the Meaning of Documents




  1. Representing the Meaning of Documents LBSC 796/CMSC 838o Session 2, February 2, 2004 Philip Resnik

  2. Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course

  3. What do We Mean by “Information?” • How is it different from “data”? • Information is data in context • Databases contain data and produce information • IR systems contain and provide information • How is it different from “knowledge”? • Knowledge is a basis for making decisions • Many “knowledge bases” contain decision rules

  4. What Do We Mean by “Retrieval?” • Find something that you want • The information need may or may not be explicit • Known item search • Find the class home page • Answer seeking • Is Lexington or Louisville the capital of Kentucky? • Directed exploration • Who makes videoconferencing systems?

  5. Global Internet User Population • [Chart: Internet user populations by language, 2000 vs. 2005, contrasting English and Chinese] • Source: Global Reach

  6. Supporting the Search Process • [Diagram: Source Selection (choose / nominate / predict) feeds the IR system loop — Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Delivery — with feedback paths for Query Reformulation and Relevance Feedback, and for Source Reselection]

  7. Supporting the Search Process • [Diagram: the system side of the same loop — Document Acquisition → Collection → Indexing → Index — supporting Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Examination → Document Delivery]

  8. Representing Electronic Texts • A character set specifies semantic units • Characters are the smallest units of meaning • Abstract entities, separate from their representation • A font specifies the printed representation • What each character will look like on the page • Different characters might be depicted identically • An encoding is the electronic representation • What each character will look like in a file • One character may have several representations • An input method is a keyboard representation

  9. Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course

  10. The character ‘A’ • ASCII encoding: 7 bits used per character • 0 1 0 0 0 0 0 1 = 65 DEC (decimal) • Number of representable characters: 2^7 = 128 distinct characters, including 0 (NUL) • Some character codes are used for non-visible characters, e.g. 7 = control-G = BEL
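
The slide’s encoding of ‘A’ can be checked directly with Python’s built-ins (a quick illustration added here, not part of the original slides):

```python
# The ASCII code for 'A' and its 7-bit binary representation.
print(ord('A'))                 # 65
print(format(ord('A'), '07b'))  # 1000001 (7 bits, as in ASCII)
print(chr(65))                  # A
print(chr(7) == '\a')           # True: code 7 is control-G, the BEL character
```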

  11. ASCII • American Standard Code for Information Interchange • ANSI X3.4-1968 • Widely used in the U.S.

| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |

  12. Geeky Joke for the Day • Why do computer geeks confuse Halloween and Christmas? • Because 31 OCT = 25 DEC! • 031 OCT = 0*8^2 + 3*8^1 + 1*8^0 (octal) = 0*10^2 + 2*10^1 + 5*10^0 (decimal)

  13. The Latin-1 Character Set • ISO 8859-1: 8-bit characters for Western Europe • French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English • [Code chart: the printable 7-bit ASCII characters, plus the additional characters defined by ISO 8859-1]

  14. Other ISO-8859 Character Sets • [Map showing the regions covered by ISO 8859 parts -2 through -9]

  15. East Asian Character Sets • More than 256 characters are needed • Two-byte encoding schemes (e.g., EUC) are used • Several countries have unique character sets • GB in the People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam • Many characters appear in several languages • Research Libraries Group developed EACC • Unified “CJK” character set for USMARC records

  16. Unicode • Goal is to unify the world’s character sets • ISO Standard 10646 • Character set and encoding scheme separated • Full “code space” is used by character codes • Extends Latin-1 • UTF-7 encoding will pass through email • Originally designed for 64 printable ASCII characters • UTF-8 encoding works with disk file systems
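
The separation of character set from encoding can be seen in Python, which exposes both the UTF-8 and Latin-1 codecs (an illustrative sketch added here, not from the slides):

```python
# One character, several byte-level representations.
print('A'.encode('utf-8'))    # b'A' -- ASCII characters keep their single byte
print('é'.encode('utf-8'))    # two bytes in UTF-8 (0xC3 0xA9)
print('é'.encode('latin-1'))  # one byte in ISO 8859-1 (0xE9)
```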

  17. Limitations of Unicode • Produces much larger files than Latin-1 • Fonts are hard to obtain for many characters • Some characters have multiple representations • e.g., accents can be part of a character or separate • Some characters look identical when printed • But they come from unrelated languages • The sort order may not be appropriate
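
The “multiple representations” bullet — accents as part of a character or separate — corresponds to Unicode composed vs. decomposed forms, which the standard-library unicodedata module makes visible (illustration ours):

```python
import unicodedata

composed = '\u00e9'                                  # é as a single code point
decomposed = unicodedata.normalize('NFD', composed)  # 'e' + combining accent
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False, though they print identically
print(unicodedata.normalize('NFC', decomposed) == composed)  # True after normalization
```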

  18. Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course

  19. Strings and Segments • Retrieval is (often) a search for concepts • But what we index are character strings • What strings best represent concepts? • In English, words are often a good choice • But well chosen phrases can be even better • In German, compounds may need to be split • Otherwise queries using constituent words would fail • In Chinese, word boundaries are not marked • Thissegmentationproblemissimilartothatofspeech • This segmentation problem is similar to that of speech

  20. Longest Substring Segmentation • A greedy segmentation algorithm • Based solely on lexical information • Start with a list of every possible term • Dictionaries are a handy source for term lists • For each unsegmented string • Remove the longest single substring in the list • Repeat until no substrings are found in the list • Can be extended to explore alternatives

  21. Longest Substring Example • Possible German compound term: • washington • List of German words: • ach, hin, hing, sei, ton, was, wasch • Longest substring segmentation • was-hing-ton • A language model might see this as bad • Roughly translates to “What tone is attached?”
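
The greedy procedure above fits in a few lines of Python; the function name is ours, and the word list and input come straight from the slide:

```python
def segment(text, terms):
    """Greedy longest-substring segmentation: remove the longest listed
    substring, then recurse on the pieces to its left and right."""
    if not text:
        return []
    best = max((t for t in terms if t in text), key=len, default=None)
    if best is None:
        return [text]  # no listed substring remains: keep the chunk whole
    i = text.index(best)
    return segment(text[:i], terms) + [best] + segment(text[i + len(best):], terms)

german_words = ['ach', 'hin', 'hing', 'sei', 'ton', 'was', 'wasch']
print(segment('washington', german_words))  # ['was', 'hing', 'ton']
```

This reproduces the slide’s was-hing-ton segmentation: “hing” is the longest matching substring, so a purely lexical greedy match commits to it before ever considering whether the result is plausible German.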

  22. Probabilistic Segmentation • For an input word c1 c2 c3 … cn • Try all possible partitions into w1 w2 w3 … • c1 | c2 c3 … cn • c1 c2 | c3 … cn • c1 c2 c3 | … cn, etc. • Choose the highest-probability partition • E.g., compute Pr(w1 w2 w3) using a language model • Challenges: search, probability estimation
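
A minimal sketch of this idea: dynamic programming over all partitions into known words, scoring each with a toy unigram language model (the probabilities below are invented for illustration):

```python
import math

def best_segmentation(text, word_prob):
    """Return the known-word partition of text with the highest
    unigram log-probability, or None if no partition exists."""
    # best[i] holds (log-probability, segmentation) for text[:i]
    best = {0: (0.0, [])}
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if j in best and word in word_prob:
                score = best[j][0] + math.log(word_prob[word])
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1] if len(text) in best else None

# Toy probabilities, invented for this sketch.
word_prob = {'was': 0.05, 'hing': 0.001, 'ton': 0.01,
             'wash': 0.03, 'ing': 0.04, 'washing': 0.02}
print(best_segmentation('washington', word_prob))  # ['washing', 'ton']
```

Unlike the greedy longest-substring match, the language model prefers washing + ton here, because Pr(hing) is tiny.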

  23. Non-Segmentation: N-gram Indexing • Consider a Chinese document c1 c2 c3 … cn • Don’t segment (you could be wrong!) • Instead, treat every character bigram as a term • _c1, c1c2, c2c3, c3c4, …, cn-1cn • Break up queries the same way
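
A sketch of the bigram indexing step, with an underscore as the boundary marker as on the slide (the function name is ours):

```python
def char_bigrams(text, boundary='_'):
    """Index every overlapping character bigram, prefixing a boundary mark."""
    padded = boundary + text
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

# With 'abcd' standing in for a four-character document c1 c2 c3 c4:
print(char_bigrams('abcd'))  # ['_a', 'ab', 'bc', 'cd']
```

Queries are broken up the same way, so matching never depends on a segmentation decision.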

  24. Tokens and Words • What is a word? • Kindergarten • Aux armes! • Doug’s running • Realistic, review, resubmit • Morphology: • How morphemes combine to make words • Morphemes are units of meaning • Remember antidisestablishmentarianism? • Anti (disestablishmentarian) ism

  25. Morphemes and Roots • Inflectional morphology • Preserves part of speech • Destructions = Destruction+PLURAL • Destroyed = Destroy+PAST • Derivational morphology • Relates parts of speech • Destructor = AGENTIVE(destroy) • Can help IR performance, but expensive • Getting derivational morphology right is hard • {peninsula,insulate}:insula (Lat. “island”) ???

  26. Stemming • Stem: in IR, a word equivalence class that preserves the main concept. • Often obtained by affix-stripping (Porter, 1980) • {destroy, destroyed, destruction}: destr • Inexpensive to compute • Usually helps IR performance • Can make mistakes! (over-/understemming) • {centennial,century,center}: cent • {acquire,acquiring,acquired}: acquir {acquisition}: acquis
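
A toy affix stripper in the spirit of Porter (1980) — these few rules are our own simplification, not the real Porter algorithm, but they reproduce the conflations on the slide, mistakes included:

```python
# Suffixes tried longest-first; a stem must keep at least three characters.
SUFFIXES = ['uctions', 'uction', 'ition', 'oyed', 'ing', 'oy', 'ed', 'e', 's']

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print({stem(w) for w in ['destroy', 'destroyed', 'destruction']})  # {'destr'}
print({stem(w) for w in ['acquire', 'acquiring', 'acquired']})     # {'acquir'}
print(stem('acquisition'))  # acquis -- the understemming mistake noted above
```

Because “acquisition” stems to acquis rather than acquir, it lands in a different equivalence class from “acquire” — exactly the kind of understemming the slide warns about.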

  27. Roots and Stems: beyond English • Arabic: alselam • Stem: selam • Root: SLM (peace) • Semantic families: altaliban • Stem: taliban (student) • Root: TLB (question) • Current research on best level of analysis

  28. Phrases and Entities • Multi-word combinations identify entities • The president, Dubya, George W. Bush • Can also identify relationships of interest • Derek Jones, CEO of SadAndBankrupt.com,… • Entity roles, filling slots in templates

  29. Named Entity Identification • Major categories of named entities • Influenced by text genres of interest… mostly news • Person, organization, location, date, money, … • Decent algorithms based on finite automata • Best algorithms based on supervised learning • Annotate a corpus identifying entities and types • Train a probabilistic model • Apply the model to new text

  30. Example: Predictive Annotation for Question Answering • Annotated text: “In reality, at the time of Edison’s [PERSON] 1879 [TIME] patent, the light bulb had been in existence for some five decades ….” • Who patented the light bulb? → patent, light bulb, PERSON • When was the light bulb patented? → patent, light bulb, TIME • In what year was the light bulb patented? → ??? • What did Thomas Edison patent?

  31. General Phrase Identification • Two types of phrases • Compositional: meaning derived from parts • Noncompositional: idiomatic expressions • e.g., “kick the bucket” or “buy the farm” • Three sources of evidence • Dictionary lookup • Parsing • Co-occurrence

  32. Known Phrases • Same idea as longest substring match • But look for word (not character) sequences • Compile a term list that includes phrases • Technical terminology can be very helpful • Index any phrase that occurs in the list • Most effective in a limited domain • Otherwise hard to capture most useful phrases

  33. Syntactic Phrases • Automatically construct sentence diagrams • Fairly good parsers are available • Index the noun phrases • Assumes that queries will focus on objects • [Parse tree for “The quick brown fox jumped over the lazy dog’s back”: a Sentence containing a Noun Phrase (Det Adj Adj Noun), a Verb, and a Prepositional Phrase (Prep plus a Noun Phrase: Det Adj Adj Noun)]

  34. Syntactic Variations • The “paraphrase problem” • Prof. Douglas Oard studies information access patterns. • Doug studies patterns of user access to different kinds of information. • Transformational variants (Jacquemin) • Coordinations • lung and breast cancer → lung cancer • Substitutions • inflammatory sinonasal disease → inflammatory disease • Permutations • addition of calcium → calcium addition

  35. Phrase Discovery: Collocations • Compute observed occurrence probability • For each single word and each word n-gram • “buy” 10 times in 1000 words yields 0.01 • “the” 100 times in 1000 words yields 0.10 • “farm” 5 times in 1000 words yields 0.005 • “buy the farm” 4 times in 1000 words yields 0.004 • Compute n-gram probability if truly independent • 0.01*0.10*0.005=0.000005 • Compare with observed probability • Record phrases that occur more often than expected
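
The arithmetic on the slide, spelled out (the counts come from the slide; the ratio threshold for “more often than expected” is left to the implementer):

```python
n = 1000  # corpus size in words
observed = {'buy': 10 / n, 'the': 100 / n, 'farm': 5 / n,
            'buy the farm': 4 / n}

# Probability of the trigram if its words were truly independent.
expected = observed['buy'] * observed['the'] * observed['farm']

print(expected)                             # about 5e-06
print(observed['buy the farm'])             # 0.004
print(observed['buy the farm'] / expected)  # about 800x more than chance
```

An observed/expected ratio this far above 1 is the signal to record “buy the farm” as a phrase.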

  36. Phrase Indexing Lessons • Poorly chosen phrases hurt effectiveness • And some techniques can be slow (e.g., parsing) • Better to index phrases and words • Want to find constituents of compositional phrases • Better weighting schemes → less benefit • Negligible improvement in some TREC systems • Very helpful for cross-language retrieval • Noncompositional translation, reduced ambiguity

  37. Cross-Language IR and Phrases • Poser: quite ambiguous (Langenscheidt) • Place, put (a question, a motion) • Lay down (a principle) • Hang (curtains) • Set (a problem) • Poser une question: meaning is clear! • Ask a question • In this case, better to use the phrase • But is this really about phrases?

  38. Senses and Concepts • What is a word sense? • Entry in a dictionary or thesaurus • Position or cluster in a semantic space • What is word sense disambiguation? • Identifying intended sense(s) from context • Goal for IR • Match on the intended concept, not just the words

  39. Problems With Word Matching • Word matching suffers from two problems • Synonymy: paper vs. article • Homonymy: bank (river) vs. bank (financial) • Disambiguation in IR: seek to resolve homonymy • Index word senses rather than words • Synonymy usually addressed by • Thesaurus-based query expansion • Latent semantic indexing

  40. Word Sense Disambiguation • Context provides clues to word meaning • “The doctor removed the appendix.” • For each occurrence, note surrounding words • Typically +/- 5 non-stopwords • Group similar contexts into clusters • Based on overlaps in the words that they contain • Separate clusters represent different senses

  41. Disambiguation Example • Consider four example sentences • The doctor removed the appendix • The appendix was incomprehensible • The doctor examined the appendix • The appendix was removed • What clusters can you find? • Can you find enough word senses this way? • Might you find too many word senses?
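
The clustering exercise can be made concrete; the stopword list and the one-shared-word merging rule below are our own choices:

```python
sentences = [
    'The doctor removed the appendix',
    'The appendix was incomprehensible',
    'The doctor examined the appendix',
    'The appendix was removed',
]
STOPWORDS = {'the', 'was'}

# Represent each occurrence of "appendix" by its surrounding non-stopwords.
contexts = [set(s.lower().split()) - STOPWORDS - {'appendix'}
            for s in sentences]

# Greedy single-pass clustering: a context joins the first cluster it
# shares any word with, otherwise it starts a new one.
clusters = []
for i, ctx in enumerate(contexts):
    for cluster in clusters:
        if any(ctx & contexts[j] for j in cluster):
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)  # [[0, 2, 3], [1]]
```

The shared word “removed” pulls the ambiguous fourth sentence into the surgical cluster even though it may describe a book’s appendix — a concrete instance of finding too few (or the wrong) senses this way.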

  42. Why Disambiguation Hurts • Bag-of-words techniques already disambiguate • When more words are present, documents rank higher • So a context for each term is established in the query • Formal disambiguation tries to improve precision • But incorrect sense assignments would hurt recall • Hard to distinguish homonymy from fine-grained polysemy • Average precision balances recall and precision • But the possible precision gains are small • And current techniques substantially hurt recall

  43. Where Could Disambiguation Help? • Categorization of whole documents • Identifying location(s) in a topic hierarchy • Visualization • People are good at seeing signal amidst noise • Probabilistic models • Combining different sources of evidence • (Requires n-best rather than 1-best responses)

  44. Summary • The goal is to index the right meaning units • Start by finding fundamental features • Characters or shape codes (for OCR) etc. • Combine them into easily recognized units • Words where possible, character n-grams otherwise • Consider alternatives to splitting or forming phrases • But stemming is generally a good idea • Usually best to match those units directly • Disambiguation strategies hurt more than they help

  45. Agenda • The structure of interactive IR systems • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • A few words about the course

  46. Course Goals • Appreciate IR system capabilities and limitations • Understand IR system design & implementation • For a broad range of applications and media • Evaluate IR system performance • Identify current IR research problems

  47. Course Design • Text/readings provide background and detail • At least one recommended reading is required • Class provides organization and direction • We will not cover every important detail • Assignments and project provide experience • The TA can help CLIS students with the project • Final exam helps focus your effort

  48. Grading • Assignments (15%) • Mastery of concepts and experience using tools • 796: “homework,” 838o: “programming” • Term project (796: 50%, 838o: 30%) • Options are described on course Web page • Final exam (796: 35%, 838o: 55%) • Two different in-class exams

  49. Handy Things to Know • Classes will be videotaped • Available in the CLIS library if you miss class • Office hours are by appointment • Send an email, or ask after class • Everything is on the Web • At http://www.glue.umd.edu/~oard/teaching.html • Doug is most easily reached by email • oard@umd.edu

  50. Some Things to Do This Week • At least skim the readings before class • Don’t fall behind! • Look at assignment 1 • Due in 2 weeks! • Explore the Web site • Start thinking about the term project
