Presentation Transcript


  1. Special Topics in Computer Science. Advanced Topics in Information Retrieval. Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming. Alexander Gelbukh, www.Gelbukh.com

  2. Previous Chapter: Conclusions • Parallel computing can improve • response time for each query and/or • throughput: the number of queries processed at the same speed • Document partitioning is simple • good for distributed computing • Term partitioning is good for some data structures • Distributed computing is MIMD computing with slow communication • SIMD machines are good for signature files • Both are out of favor now

  3. Previous Chapter: Research topics • How to evaluate the speedup • New algorithms • Adaptation of existing algorithms • Merging the results is a bottleneck • Meta search engines • Creating large collections with judgements • Is recall important?

  4. Problem • Recall image retrieval: • Find images similar in color, size, ... • Find photos of the Korean President ? • Find nice girls ? (Don’t show ugly ones!) • Looks very stupid • Lacks understanding • Too difficult • Text retrieval is no exception • Find stories with a sad beginning and a happy end ? • Lacks understanding • Difficult but possible

  5. Possible? • Text is intended to facilitate understanding • Supposedly, even partial understanding should help • Degrees of understanding: • Character strings (what is used now): well, geese, him • Words (often used now): goose, he • Concepts: hole in the ground (well), Roh Moo-Hyun • Complex concepts: oil well, hot dog • Situations (sentences, paragraphs) • The story (direct meaning) • The message (pragmatics, intended impact)

  6. Easy? • Main problems: • Multiple ways to say the same • Query does not match the doc • Difficult to specify all variants • Ambiguity of the text • False alarms in matching • Lack of implicit knowledge of the computer • The computer “does not understand” the message • Difficult to make inferences • Natural Language Processing tries to solve them

  7. Solutions • Multiple ways to say the same? • Normalizing: transforming to a “standard” variant • Ambiguity of the text? • Ambiguity resolution • Normalizing to one of the variants • Perhaps the main problem in natural language processing • Lack of implicit knowledge of the computer? • Dictionaries, grammars • Knowledge of language structure is needed in all tasks • Knowledge of the world is useful for advanced tasks • Knowledge of language use is a substitute

  8. Synonymy • Multiple ways to say the same • Or at least when the difference does not matter • Can be substituted in any (many?) context • Lexical synonymy • Woman / female, professor / teacher • Dictionaries • Phrase-level or sentence-level synonymy • They gave me a book / I was given a book by them • Syntactic analyzers • Semantic-level synonymy • Reasoning

  9. Not only synonymy • Multiple ways to say • the same (synonymy) • less: more general (hypernymy) • more: more specific (hyponymy) • Complete synonyms are rare • professor ≠ teacher • Abbreviations are usually (almost) complete synonyms • When the differences do not matter, can be treated as synonymy • But: different data structures and methods

  10. Lexical-level synonymy • Lexical synonymy • Woman / female • Mixed-type synonymy: USA / United States • Morphology is a kind of synonymy (actually hyponymy) • ‘geese’ = ‘goose’ + ‘many’ • Russian ‘knigu’ = ‘kniga’ + ‘accusative role’ • The “second” part of the meaning is either not important or is another term • Morphology is a very common problem in IR

  11. Lexical synonymy • Woman / female • Dictionaries • Synonym dictionaries • WordNet • Automatic learning of synonymy • Clustering of contexts • If the contexts are very similar, the words are possible synonyms • Problem: is meaning preserved? Monday / Tuesday • An interesting solution: compare dictionary definitions
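As an illustration of the clustering-of-contexts idea, here is a minimal Python sketch, not from the original slides; the toy sentences, window size, and cosine similarity measure are arbitrary choices. Words whose co-occurrence contexts are very similar become synonym candidates, which also exposes the Monday / Tuesday caveat above.

```python
# Sketch: context vectors from co-occurrence counts, compared by cosine similarity.
from collections import Counter
from math import sqrt

def context_vector(word, sentences, window=2):
    """Count the words co-occurring with `word` within +/- `window` positions."""
    ctx = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        ctx[tokens[j]] += 1
    return ctx

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

sentences = ["the woman reads a book", "the female reads a book",
             "meeting on Monday morning", "meeting on Tuesday morning"]
print(cosine(context_vector("woman", sentences), context_vector("female", sentences)))
print(cosine(context_vector("monday", sentences), context_vector("tuesday", sentences)))
# Both pairs come out highly similar -- similar contexts do not guarantee
# preserved meaning (the Monday / Tuesday problem).
```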

  12. Uses in IR • Query expansion • Add synonyms of the word to the query and process normally • Flexible, slow • Best for lexical synonymy: few synonyms, doubtful • Reducing at index time • When reading the documents, reduce each word to a “standard” synonym • Fast, rigid • Best for morphology: many synonyms, less doubtful • Hierarchical indexing
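A small Python sketch, not from the slides, contrasting the two strategies just listed; the synonym dictionary and documents are hypothetical toy data.

```python
# Query expansion (query time) vs. reduction to a standard synonym (index time).
SYNONYMS = {"woman": {"woman", "female"}, "female": {"woman", "female"}}

def expand_query(terms):
    """Query expansion: add synonyms at query time (flexible, slower)."""
    expanded = set()
    for t in terms:
        expanded |= SYNONYMS.get(t, {t})
    return expanded

def normalize(terms):
    """Index-time reduction: map each word to one canonical synonym (fast, rigid)."""
    return {min(SYNONYMS.get(t, {t})) for t in terms}

doc = "the female professor".split()
query = ["woman"]
print(expand_query(query) & set(doc))      # match found via the expanded query
print(normalize(query) & normalize(doc))   # match found via the normalized index
```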

  13. Hierarchical indexing (Gelbukh, Sidorov, Guzman-Arenas 2002) • Tree of concepts • Living things • Animals • a. Cat, b. cats • a. Dog, b. dogs • Persons • a. Professor, b. professors • a. Student, b. students • Order the vocabulary by the order of the leaves of the tree • Query expansion is done by ranges: • cat: 1, living things: 1-4
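A minimal Python sketch of the range idea on this slide, using the slide's own toy tree; the data structures are illustrative, not the published implementation. Each concept maps to the contiguous range of leaf numbers beneath it, so expanding a query to a whole subtree is a single range lookup.

```python
# Number the leaves of a concept tree; every inner node covers a leaf range.
TREE = {
    "living things": {
        "animals": {"cat": None, "dog": None},
        "persons": {"professor": None, "student": None},
    }
}

def number_leaves(tree, counter=None, ranges=None):
    """Assign consecutive numbers to leaves; inner nodes get (first, last)."""
    if counter is None:
        counter, ranges = [0], {}
    for name, children in tree.items():
        if children is None:                     # leaf concept
            counter[0] += 1
            ranges[name] = (counter[0], counter[0])
        else:
            start = counter[0] + 1
            number_leaves(children, counter, ranges)
            ranges[name] = (start, counter[0])   # covers all descendant leaves
    return ranges

ranges = number_leaves(TREE)
print(ranges["cat"])            # (1, 1)
print(ranges["living things"])  # (1, 4) -- the whole subtree as one range
```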

  14. Morphology • One of the large concerns in IR • Can be done • precisely • approximately (quick-and-dirty) • Level of generalization • inflection: student – students • derivation: study – student • Ambiguity • all variants • one variant

  15. ... morphology • Result is • The unique ID • The dictionary form • A “stem”: part of the same string

  16. Morphological analyzers • Precise analysis • Ambiguous • Give all variants • Tables: to table or the table? • Spanish charlas: charla ‘talk’ or charlar ‘to talk’ • Russian dush: dush ‘shower’ or dusha ‘soul’ • Common in languages with developed morphology • For short words, some 3 – 5 – 10 variants • Dictionaries are used

  17. Morphological system • Dictionary specifies: • Stem: bak-, ask- • POS (part of speech): verb • Inflection class (what endings it accepts): 1, 2 • Tables of endings specify • Paradigms: • -e -es -ed -ed -ing • -, -s -ed -ed -ing • Meanings: participle, ...

  18. ... morphological system • Algorithm • Decompose the word into an existing stem and ending • Check compatibility of stem and ending • Give the stem ID and ending meaning • Ambiguous • Many variants of decompositions • Many stems with different IDs • Many endings with different meaning • -ed: past or participle • Problem: words absent in dictionary
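A minimal Python sketch of the analysis algorithm just described: split the word into a known stem plus ending, check that the ending is compatible with the stem's inflection class, and return all compatible readings. The tiny stem dictionary and ending tables below are hypothetical toy data, not a real lexicon.

```python
# Dictionary-based morphological analysis: stem + ending decomposition.
STEMS = {
    "bak": {"pos": "verb", "class": 1},   # bake, bakes, baked, baking
    "ask": {"pos": "verb", "class": 2},   # ask, asks, asked, asking
}
ENDINGS = {                                # ending -> {inflection class: meaning}
    "e":   {1: "infinitive"},
    "es":  {1: "3rd person singular"},
    "":    {2: "infinitive"},
    "s":   {2: "3rd person singular"},
    "ed":  {1: "past/participle", 2: "past/participle"},   # ambiguous meaning
    "ing": {1: "gerund", 2: "gerund"},
}

def analyze(word):
    readings = []
    for i in range(len(word) + 1):
        stem, ending = word[:i], word[i:]
        if stem in STEMS and ending in ENDINGS:
            cls = STEMS[stem]["class"]
            if cls in ENDINGS[ending]:     # compatibility check
                readings.append((stem, STEMS[stem]["pos"], ENDINGS[ending][cls]))
    return readings                        # may hold several variants (ambiguity)

print(analyze("baked"))    # [('bak', 'verb', 'past/participle')]
print(analyze("asking"))   # [('ask', 'verb', 'gerund')]
```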

  19. Stemming • Substitute for real analysis • Both inflection and derivation • Quick-and-dirty • Only one variant • Result: a part of the string • gene, genial → gen- • Cheap development • Bad results • Simple description; standard • Often used in academic research • Used to be used in real systems, but now less so

  20. Porter stemmer • Martin Porter, 1980 • Standard stemmer • Provides an equal basis for evaluation of different IR programs • Uses “measure” m: • [C](VC){m}[V]. • m=0 TR, EE, TREE, Y, BY. • m=1 TROUBLE, OATS, TREES, IVY. • m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
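A small Python sketch of computing the measure m for the pattern [C](VC){m}[V]; it simplifies Porter's vowel rule by treating 'y' as a vowel when it follows a consonant, which matches the examples on the slide (IVY, ORRERY).

```python
# Map each letter to C or V, collapse runs, and count the VC pairs.
def measure(word):
    word = word.lower()
    vowels = "aeiou"
    shape = []
    for i, ch in enumerate(word):
        is_vowel = ch in vowels or (ch == "y" and i > 0 and word[i - 1] not in vowels)
        tag = "V" if is_vowel else "C"
        if not shape or shape[-1] != tag:   # collapse runs of the same type
            shape.append(tag)
    return "".join(shape).count("VC")       # each VC pair adds 1 to m

for w in ["tree", "trouble", "oats", "ivy", "private", "orrery"]:
    print(w, measure(w))   # tree 0, trouble 1, oats 1, ivy 1, private 2, orrery 2
```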

  21. ... Porter stemmer • Step 1a • SSES -> SS caresses -> caress • IES -> I ponies -> poni ties -> ti • SS -> SS caress -> caress • S -> cats -> cat
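A Python sketch, for illustration only, of how a step's rules are applied: try the suffixes in the order listed above and rewrite with the first (longest) one that matches.

```python
# Step 1a as an ordered suffix-rewrite list.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))   # caress, poni, ti, caress, cat
```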

  22. ... Porter stemmer • Step 1b • (m>0) EED -> EE feed -> feed agreed -> agree • (*v*) ED -> plastered -> plaster bled -> bled • (*v*) ING -> motoring -> motor sing -> sing

  23. ... Porter stemmer • If 2nd or 3rd rule successful • AT -> ATE conflat(ed) -> conflate • BL -> BLE troubl(ed) -> trouble • IZ -> IZE siz(ed) -> size • (*d and not (*L or *S or *Z)) -> single letter • hopp(ing) -> hop • tann(ed) -> tan • fall(ing) -> fall • hiss(ing) -> hiss • fizz(ed) -> fizz • (m=1 and *o) -> E • fail(ing) -> fail • fil(ing) -> file

  24. ... Porter stemmer • Step 1c • (*v*) Y -> I • happy -> happi • sky -> sky

  25. ... Porter stemmer • Step 2 • (m>0) ATIONAL -> ATE relational -> relate • (m>0) TIONAL -> TION conditional -> condition rational -> rational • (m>0) ENCI -> ENCE valenci -> valence • (m>0) ANCI -> ANCE hesitanci -> hesitance • (m>0) IZER -> IZE digitizer -> digitize • (m>0) ABLI -> ABLE conformabli -> conformable • (m>0) ALLI -> AL radicalli -> radical • (m>0) ENTLI -> ENT differentli -> different • (m>0) ELI -> E vileli -> vile • (m>0) OUSLI -> OUS analogousli -> analogous • (m>0) IZATION -> IZE vietnamization -> vietnamize • (m>0) ATION -> ATE predication -> predicate • (m>0) ATOR -> ATE operator -> operate • (m>0) ALISM -> AL feudalism -> feudal • (m>0) IVENESS -> IVE decisiveness -> decisive • (m>0) FULNESS -> FUL hopefulness -> hopeful • (m>0) OUSNESS -> OUS callousness -> callous • (m>0) ALITI -> AL formaliti -> formal • (m>0) IVITI -> IVE sensitiviti -> sensitive • (m>0) BILITI -> BLE sensibiliti -> sensible

  26. ... Porter stemmer • Step 3 • (m>0) ICATE -> IC triplicate -> triplic • (m>0) ATIVE -> formative -> form • (m>0) ALIZE -> AL formalize -> formal • (m>0) ICITI -> IC electriciti -> electric • (m>0) ICAL -> IC electrical -> electric • (m>0) FUL -> hopeful -> hope • (m>0) NESS -> goodness -> good

  27. ... Porter stemmer • Step 4 • (m>1) AL -> revival -> reviv • (m>1) ANCE -> allowance -> allow • (m>1) ENCE -> inference -> infer • (m>1) ER -> airliner -> airlin • (m>1) IC -> gyroscopic -> gyroscop • (m>1) ABLE -> adjustable -> adjust • (m>1) IBLE -> defensible -> defens • (m>1) ANT -> irritant -> irrit • (m>1) EMENT -> replacement -> replac • (m>1) MENT -> adjustment -> adjust • (m>1) ENT -> dependent -> depend • (m>1 and (*S or *T)) ION -> adoption -> adopt • (m>1) OU -> homologou -> homolog • (m>1) ISM -> communism -> commun • (m>1) ATE -> activate -> activ • (m>1) ITI -> angulariti -> angular • (m>1) OUS -> homologous -> homolog • (m>1) IVE -> effective -> effect • (m>1) IZE -> bowdlerize -> bowdler

  28. ... Porter stemmer • Step 5a • (m>1) E -> probate -> probat rate -> rate • (m=1 and not *o) E -> cease -> ceas • Step 5b • (m > 1 and *d and *L) -> single letter • controll -> control • roll -> roll
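In practice there is no need to re-implement every step: existing implementations can be used. For example, NLTK ships a Porter stemmer (this assumes the nltk package is installed; its default mode includes later revisions by Porter, so a few outputs may differ slightly from the 1980 rules listed above).

```python
# Using an off-the-shelf Porter stemmer instead of hand-coding the rule steps.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for w in ["caresses", "ponies", "relational", "hopefulness", "allowance", "rate"]:
    print(w, "->", stemmer.stem(w))
```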

  29. Statistical stemmers • Take a list of words • Construct a model of language that “generates” it • The “best” one • The simplest one? How to find? • List of stems, list of endings • Determine their probabilities • Usage statistics • Decompose any input string into a stem and an ending • Take the most probable variant
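A toy Python sketch of the idea on this slide: given stem and ending probabilities estimated from a word list (here hard-coded, purely hypothetical numbers), decompose an input string into its most probable stem + ending split.

```python
# Choose the decomposition that maximizes P(stem) * P(ending).
STEM_P   = {"motor": 0.03, "motori": 0.001, "relat": 0.02}
ENDING_P = {"": 0.40, "s": 0.20, "ing": 0.15, "ng": 0.001, "ional": 0.05}

def best_split(word):
    candidates = []
    for i in range(1, len(word) + 1):
        stem, ending = word[:i], word[i:]
        if stem in STEM_P and ending in ENDING_P:
            candidates.append((STEM_P[stem] * ENDING_P[ending], stem, ending))
    return max(candidates) if candidates else None   # most probable variant

print(best_split("motoring"))    # ('motor', 'ing') beats ('motori', 'ng')
print(best_split("relational"))  # ('relat', 'ional')
```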

  30. Research topics • Construction and application of ontologies • Building of morphological dictionaries • Treatment of unknown words with morphological analyzers • Development of better stemmers • Statistical stemmers?

  31. Conclusions • Reducing synonyms can help IR • Better matching • Ontologies are used: WordNet • Morphology is a variant of synonymy widely used in IR systems • Precise analysis: dictionary-based analyzers • Quick-and-dirty analysis: stemmers • Rule-based stemmers: Porter stemmer • Statistical stemmers

  32. Thank you! Till May 24? 25?, 6 pm
