
SET (4)

SET (4). Prof. Dragomir R. Radev radev@cs.columbia.edu. SET Fall 2013. … 6. Automated indexing/labeling Compression …. Indexing methods. Manual: e.g., Library of Congress subject headings, MeSH Automatic: e.g., TF*IDF based. LOC subject headings.


Presentation Transcript


  1. SET(4) Prof. Dragomir R. Radev radev@cs.columbia.edu

  2. SET Fall 2013 … 6. Automated indexing/labeling Compression …

  3. Indexing methods • Manual: e.g., Library of Congress subject headings, MeSH • Automatic: e.g., TF*IDF based

  4. LOC subject headings
     A -- GENERAL WORKS
     B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
     C -- AUXILIARY SCIENCES OF HISTORY
     D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
     E -- HISTORY: AMERICA
     F -- HISTORY: AMERICA
     G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
     H -- SOCIAL SCIENCES
     J -- POLITICAL SCIENCE
     K -- LAW
     L -- EDUCATION
     M -- MUSIC AND BOOKS ON MUSIC
     N -- FINE ARTS
     P -- LANGUAGE AND LITERATURE
     Q -- SCIENCE
     R -- MEDICINE
     S -- AGRICULTURE
     T -- TECHNOLOGY
     U -- MILITARY SCIENCE
     V -- NAVAL SCIENCE
     Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
     http://www.loc.gov/catdir/cpso/lcco/lcco.html

  5. Medicine
     CLASS R - MEDICINE
     Subclass R
     R5-920        Medicine (General)
     R5-130.5      General works
     R131-687      History of medicine. Medical expeditions
     R690-697      Medicine as a profession. Physicians
     R702-703      Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
     R711-713.97   Directories
     R722-722.32   Missionary medicine. Medical missionaries
     R723-726      Medical philosophy. Medical ethics
     R726.5-726.8  Medicine and disease in relation to psychology. Terminal care. Dying
     R727-727.5    Medical personnel and the public. Physician and the public
     R728-733      Practice of medicine. Medical practice economics
     R735-854      Medical education. Medical schools. Research
     R855-855.5    Medical technology
     R856-857      Biomedical engineering. Electronics. Instrumentation
     R858-859.7    Computer applications to medicine. Medical informatics
     R864          Medical records
     R895-920      Medical physics. Medical radiology. Nuclear medicine

  6. Automatic methods • TF*IDF: pick terms with the highest TF*IDF scores • Centroid-based: pick terms that appear in the centroid with high scores • The maximal marginal relevance principle (MMR) • Related to summarization, snippet generation
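
The TF*IDF bullet above can be illustrated with a minimal sketch. The function name `top_tfidf_terms`, the whitespace tokenizer, the `log(N/df)` form of IDF, and the three toy documents are all assumptions made for illustration; the slides do not specify any of them.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, doc_index, k=3):
    """Return the k terms of docs[doc_index] with the highest TF*IDF scores."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency within the document being indexed.
    tf = Counter(tokenized[doc_index])
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Highest score first; ties broken alphabetically so the result is stable.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy collection.
docs = [
    "huffman coding compresses text",
    "edit distance compares text strings",
    "relevance feedback expands a query",
]
print(top_tfidf_terms(docs, 0, k=3))
```

Terms that occur in many documents (here "text") receive a lower IDF and drop out of the selected index terms.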

  7. Compression • Methods • Fixed length codes • Huffman coding • Ziv-Lempel codes

  8. Fixed length codes • Binary representations • ASCII • Representational power (2^k symbols where k is the number of bits)
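
As a small sketch of the representational-power point: k bits distinguish at most 2^k symbols, so an alphabet of n symbols needs the smallest k with 2^k >= n. The helper names `bits_needed` and `fixed_length_encode` are hypothetical and not taken from the slides.

```python
def bits_needed(n_symbols):
    """Smallest k such that 2**k >= n_symbols (at least 1 bit)."""
    return max(1, (n_symbols - 1).bit_length())

def fixed_length_encode(text, alphabet):
    """Encode text with the same number of bits for every symbol."""
    k = bits_needed(len(alphabet))
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    # format(i, "0kb") writes i in binary, left-padded with zeros to k digits.
    return "".join(format(index[ch], f"0{k}b") for ch in text)

print(bits_needed(26))    # bits for the 26 letters of the English alphabet
print(bits_needed(128))   # bits for the 128 ASCII code points
print(fixed_length_encode("cab", "abcd"))
```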

  9. Variable length codes • Alphabet:
     A .-      N -.      0 -----
     B -...    O ---     1 .----
     C -.-.    P .--.    2 ..---
     D -..     Q --.-    3 ...--
     E .       R .-.     4 ....-
     F ..-.    S ...     5 .....
     G --.     T -       6 -....
     H ....    U ..-     7 --...
     I ..      V ...-    8 ---..
     J .---    W .--     9 ----.
     K -.-     X -..-
     L .-..    Y -.--
     M --      Z --..
     • Demo: • http://www.scphillips.com/morse/

  10. Most frequent letters in English • Most frequent letters: • E T A O I N S H R D L U • Demo: • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm • Also: bigrams: • TH HE IN ER AN RE ND AT ON NT

  11. Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character (37.5% compression relative to 8 bits per character) • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
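
The tree-building step can be sketched as follows, using a priority queue to repeatedly merge the two least frequent subtrees. This is one common way to implement the algorithm, not necessarily the one used in the course; the function names and the tie-breaking order are assumptions, so the exact codes may differ from the tree on the next slide.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code {symbol: bit string} from symbol frequencies."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol input
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:                # prefix-free: first match is the symbol
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

text = "peter_piper_picked_a_peck_of_pickled_peppers"
codes = huffman_code(text)
bits = encode(text, codes)
print(len(bits), "bits instead of", 8 * len(text))
print(decode(bits, codes) == text)
```

Frequent symbols end up near the root and receive short codes; rare symbols receive long ones.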

  12. [Figure: example Huffman code tree with leaves a through j; branches are labeled 0 and 1]

  13. Exercise • Consider the bit string: 01101101111000100110001110100111000110101101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.

  14. Extensions • Word-based • Domain/genre dependent models

  15. Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a,b,c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment

  16. Example
      <0,0,p>   p
      <0,0,e>   pe
      <0,0,t>   pet
      <2,1,r>   peter
      <0,0,_>   peter_
      <6,1,i>   peter_pi
      <8,2,r>   peter_piper
      <6,3,c>   peter_piper_pic
      <0,0,k>   peter_piper_pick
      <7,1,d>   peter_piper_picked
      <7,1,a>   peter_piper_picked_a
      <9,2,e>   peter_piper_picked_a_pe
      <9,2,_>   peter_piper_picked_a_peck_
      <0,0,o>   peter_piper_picked_a_peck_o
      <0,0,f>   peter_piper_picked_a_peck_of
      <17,5,l>  peter_piper_picked_a_peck_of_pickl
      <12,1,d>  peter_piper_picked_a_peck_of_pickled
      <16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
      <3,2,r>   peter_piper_picked_a_peck_of_pickled_pepper
      <0,0,s>   peter_piper_picked_a_peck_of_pickled_peppers
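
A decoder for the <a,b,c> triples can be sketched in a few lines. The function name `lz77_decode` is hypothetical; the triples are the ones listed in the example, and decoding them reproduces the full phrase.

```python
def lz77_decode(triples):
    """Decode (a, b, c) triples: look back a characters, copy b, then append c."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        # Copy one character at a time so a copy may overlap its own output.
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return "".join(out)

triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
print(lz77_decode(triples))  # peter_piper_picked_a_peck_of_pickled_peppers
```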

  17. Links on text compression • Data compression: • http://www.data-compression.info/ • Calgary corpus: • http://en.wikipedia.org/wiki/Calgary_Corpus • Huffman coding: • http://www.compressconsult.com/huffman/ • http://en.wikipedia.org/wiki/Huffman_coding • LZ • http://en.wikipedia.org/wiki/LZ77

  18. SIDEBAR: 100 alternative search engines • http://www.readwriteweb.com/archives/top_100_alternative_search_engines.php

  19. SET Fall 2013 … 7. Approximate string matching …

  20. Levenshtein edit distance • Examples: • Theatre-> theater • Ghaddafi->Qadafi • Computer->counter • Edit distance (inserts, deletes, substitutions) • Edit transcript • Done through dynamic programming

  21. Recurrence relation • Three dependencies • D(i,0)=i • D(0,j)=j • D(i,j)=min[D(i-1,j)+1,D(i,j-1)+1,D(i-1,j-1)+t(i,j)] • Simple edit distance: • t(i,j) = 0 iff S1(i)=S2(j), and 1 otherwise
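
The dynamic program can be sketched directly from the base cases and the three dependencies: deletion from D(i-1,j), insertion from D(i,j-1), and match or substitution from D(i-1,j-1). The function name `edit_distance` is an assumption; the unit costs follow the simple edit distance on this slide.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost inserts, deletes, and substitutions."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j of s2.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                          # delete all i characters
    for j in range(m + 1):
        D[0][j] = j                          # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,             # deletion
                D[i][j - 1] + 1,             # insertion
                D[i - 1][j - 1] + t,         # match (t = 0) or substitution (t = 1)
            )
    return D[n][m]

print(edit_distance("theatre", "theater"))
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the running time is O(nm).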

  22. Example Gusfield 1997

  23. Example (cont’d) Gusfield 1997

  24. Tracebacks Gusfield 1997

  25. Weighted edit distance • Used to emphasize the relative cost of different edit operations • Useful in bioinformatics • Homology information • BLAST • Blosum

  26. Links • Web site: • http://odur.let.rug.nl/~kleiweg/lev/ • Demo: • /home/cs6998/tools/editDistance/dp/l.pl theater theatre

  27. Other methods • Cosine • Generation probabilities (language modeling) • (exp)KL-divergence

  28. SET Fall 2013 … 8. Query expansion Relevance feedback …

  29. Query expansion

  30. Query expansion • Corpus-based: mine query logs • NLP-based • Vector-space relevance feedback

  31. Relevance feedback • Problem: initial query may not be the most appropriate to satisfy a given information need. • Idea: modify the original query so that it gets closer to the right documents in the vector space

  32. Relevance feedback • Automatic • Manual • Method: identifying feedback terms • Q’ = a1Q + a2R - a3N, where R and N are built from the relevant and non-relevant documents • Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|

  33. Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • R = ? • N = ? • Q’ = ?
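
One way to work through this example is sketched below. It assumes raw term counts as vector weights, takes R and N to be the sums of the relevant and non-relevant document vectors, and uses a1 = 1, a2 = 1/|R|, a3 = 1/|N| with |R| and |N| read as the number of documents in each set. The function name `rocchio` and these weighting choices are assumptions; other term weightings (for example TF*IDF) would give different numbers.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, a1=1.0):
    """Return the modified query Q' = a1*Q + a2*R - a3*N as {term: weight}."""
    Q = Counter(query.split())
    R = Counter()                            # sum of relevant document vectors
    for doc in relevant:
        R.update(doc.split())
    N = Counter()                            # sum of non-relevant document vectors
    for doc in nonrelevant:
        N.update(doc.split())
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(Q) | set(R) | set(N)
    # Counter returns 0 for missing terms, so every term gets a weight.
    return {t: a1 * Q[t] + a2 * R[t] - a3 * N[t] for t in terms}

new_query = rocchio(
    "safety minivans",
    relevant=["car safety minivans tests injury statistics",
              "liability tests safety"],
    nonrelevant=["car passengers injury reviews"],
)
for term, weight in sorted(new_query.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" and "minivans" keep the highest weights, "tests" enters the query from the relevant documents, and terms found only in the non-relevant document get negative weights, which implementations often clip to zero.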

  34. Pseudo relevance feedback • Automatic query expansion • Thesaurus-based expansion (e.g., using latent semantic indexing – later…) • Distributional similarity • Query log mining

  35. Examples
      Lexical semantics (Hypernymy):
      Book: publication, product, fact, dramatic composition, record
      Computer: machine, expert, calculator, reckoner, figurer
      Fruit: reproductive structure, consequence, product, bear
      Politician: leader, schemer
      Newspaper: press, publisher, product, paper, newsprint
      Distributional clustering:
      Book: autobiography, essay, biography, memoirs, novels
      Computer: adobe, computing, computers, developed, hardware
      Fruit: leafy, canned, fruits, flowers, grapes
      Politician: activist, campaigner, politicians, intellectuals, journalist
      Newspaper: daily, globe, newspapers, newsday, paper

  36. Examples (query logs) • Book: booksellers, bookmark, blue • Computer: sales, notebook, stores, shop • Fruit: recipes cake salad basket company • Games: online play gameboy free video • Politician: careers federal office history • Newspaper: online website college information • Schools: elementary high ranked yearbook • California: berkeley san francisco southern • French: embassy dictionary learn

  37. [Otterbacher et al. HLT EMNLP 2005]

  38. Final projects • Two formats: • A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community. • A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences. • Deliverables: • System (code + documentation + examples) or Paper (+ code, data) • Poster (to be presented in class) • Web page that describes the project.

  39. Readings • 4: MRS15, MRS16 • 5: MRS17 • 6: MRS18, MRS19
