SET(4) Prof. Dragomir R. Radev radev@cs.columbia.edu
SET Fall 2013 … 6. Automated indexing/labeling Compression …
Indexing methods • Manual: e.g., Library of Congress subject headings, MeSH • Automatic: e.g., TF*IDF based
LOC subject headings
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
http://www.loc.gov/catdir/cpso/lcco/lcco.html
Medicine
CLASS R - MEDICINE
Subclass R
R5-920          Medicine (General)
R5-130.5        General works
R131-687        History of medicine. Medical expeditions
R690-697        Medicine as a profession. Physicians
R702-703        Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97     Directories
R722-722.32     Missionary medicine. Medical missionaries
R723-726        Medical philosophy. Medical ethics
R726.5-726.8    Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5      Medical personnel and the public. Physician and the public
R728-733        Practice of medicine. Medical practice economics
R735-854        Medical education. Medical schools. Research
R855-855.5      Medical technology
R856-857        Biomedical engineering. Electronics. Instrumentation
R858-859.7      Computer applications to medicine. Medical informatics
R864            Medical records
R895-920        Medical physics. Medical radiology. Nuclear medicine
Automatic methods • TF*IDF: pick terms with the highest TF*IDF scores • Centroid-based: pick terms that appear in the centroid with high scores • The maximal marginal relevance principle (MMR) • Related to summarization, snippet generation
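A minimal sketch of the TF*IDF bullet above, assuming whitespace tokenization, raw term counts for TF, and log(N/df) for IDF. The slides do not fix a particular weighting, so these choices, the function name `top_tfidf_terms`, and the toy documents are illustrative assumptions only.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, doc_index, k=3):
    """Return the k terms of docs[doc_index] with the highest TF*IDF score."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency: raw counts inside the chosen document (an assumption).
    tf = Counter(tokenized[doc_index])
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy collection, not taken from the slides.
docs = [
    "huffman tree symbols huffman frequencies",
    "edit distance dynamic programming",
    "frequencies symbols compression",
]
print(top_tfidf_terms(docs, 0, k=2))
```

Terms that occur often in the document but in few other documents ("huffman", "tree") come out ahead of terms shared across the collection ("symbols", "frequencies").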
Compression • Methods • Fixed length codes • Huffman coding • Ziv-Lempel codes
Fixed length codes • Binary representations • ASCII • Representational power (2^k symbols, where k is the number of bits)
Variable length codes
• Alphabet:
A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..
• Demo:
• http://www.scphillips.com/morse/
Most frequent letters in English • Most frequent letters: • E T A O I N S H R D L U • Demo: • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm • Also: bigrams: • TH HE IN ER AN RE ND AT ON NT
Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character (37.5% compression) • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
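The tree-building step can be sketched as follows. This is one possible implementation using Python's `heapq`, not necessarily the construction used for the tree in the slides: ties between equal frequencies are broken arbitrarily here, so the resulting bit strings may differ from the lecture example. The helper names `huffman_code`, `encode`, and `decode` are assumptions.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single symbol still needs one bit.
        return {sym: "0" for sym in freq}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, as the algorithm bullet says.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    """Decode by reading bits until the buffer matches a code word."""
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

text = "peter_piper_picked_a_peck_of_pickled_peppers"
codes = huffman_code(text)
bits = encode(text, codes)
print(len(bits), "bits versus", 8 * len(text), "bits at 8 bits per character")
```

Because no code word is a prefix of another, the greedy `decode` loop is unambiguous. Flipping or deleting a single bit in `bits` and decoding again shows how errors propagate in a variable-length code.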
[Figure: example Huffman code tree over the symbols a through j; branches labeled 0 and 1]
Exercise • Consider the bit string: 01101101111000100110001110100111000110101101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.
Extensions • Word-based • Domain/genre dependent models
Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a,b,c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment
<0,0,p>    p
<0,0,e>    pe
<0,0,t>    pet
<2,1,r>    peter
<0,0,_>    peter_
<6,1,i>    peter_pi
<8,2,r>    peter_piper
<6,3,c>    peter_piper_pic
<0,0,k>    peter_piper_pick
<7,1,d>    peter_piper_picked
<7,1,a>    peter_piper_picked_a
<9,2,e>    peter_piper_picked_a_pe
<9,2,_>    peter_piper_picked_a_peck_
<0,0,o>    peter_piper_picked_a_peck_o
<0,0,f>    peter_piper_picked_a_peck_of
<17,5,l>   peter_piper_picked_a_peck_of_pickl
<12,1,d>   peter_piper_picked_a_peck_of_pickled
<16,3,p>   peter_piper_picked_a_peck_of_pickled_pep
<3,2,r>    peter_piper_picked_a_peck_of_pickled_pepper
<0,0,s>    peter_piper_picked_a_peck_of_pickled_peppers
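The triples above can be replayed with a small decoder. This is a sketch of the decoding side only, assuming the <a,b,c> convention from the slides (look back a characters, copy b characters, then append c); the function name `lz77_decode` is an assumption, and real LZ77 implementations such as the one in GZIP use a different, bit-packed format.

```python
def lz77_decode(triples):
    """Rebuild text from (back, length, char) triples."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        # Copy one character at a time so a copy may overlap its own output.
        for i in range(length):
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)

# The triples from the example, in order.
triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
print(lz77_decode(triples))
```

Decoding the first eight triples alone gives `peter_piper_pic`, matching the intermediate state shown in the example.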
Links on text compression • Data compression: • http://www.data-compression.info/ • Calgary corpus: • http://en.wikipedia.org/wiki/Calgary_Corpus • Huffman coding: • http://www.compressconsult.com/huffman/ • http://en.wikipedia.org/wiki/Huffman_coding • LZ • http://en.wikipedia.org/wiki/LZ77
SIDEBAR: 100 alternative search engines • http://www.readwriteweb.com/archives/top_100_alternative_search_engines.php
SET Fall 2013 … 7. Approximate string matching …
Levenshtein edit distance • Examples: • Theatre -> theater • Ghaddafi -> Qadafi • Computer -> counter • Edit distance (inserts, deletes, substitutions) • Edit transcript • Done through dynamic programming
Recurrence relation • Three dependencies • D(i,0) = i • D(0,j) = j • D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)] • Simple edit distance: • t(i,j) = 0 iff S1(i) = S2(j), and 1 otherwise
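The recurrence translates almost directly into a table-filling routine. The sketch below assumes the simple edit distance, where t(i,j) is 0 for matching characters and 1 otherwise; the function name `edit_distance` is an assumption, and strings are 0-indexed in Python, so S1(i) corresponds to `s1[i - 1]`.

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the D(i, j) dynamic-programming table."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: D(i, 0) = i and D(0, j) = j.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # delete a character of s1
                D[i][j - 1] + 1,      # insert a character of s2
                D[i - 1][j - 1] + t,  # match (t = 0) or substitute (t = 1)
            )
    return D[n][m]

print(edit_distance("theatre", "theater"))
```

Keeping the full table, rather than only the last two rows, is what makes the traceback that recovers the edit transcript possible.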
Example Gusfield 1997
Example (cont’d) Gusfield 1997
Tracebacks Gusfield 1997
Weighted edit distance • Used to emphasize the relative cost of different edit operations • Useful in bioinformatics • Homology information • BLAST • Blosum
Links • Web site: • http://odur.let.rug.nl/~kleiweg/lev/ • Demo: • /home/cs6998/tools/editDistance/dp/l.pl theater theatre
Other methods • Cosine • Generation probabilities (language modeling) • (exp)KL-divergence
SET Fall 2013 … 8. Query expansion Relevance feedback …
Query expansion • Corpus-based: mine query logs • NLP-based • Vector-space relevance feedback
Relevance feedback • Problem: initial query may not be the most appropriate to satisfy a given information need. • Idea: modify the original query so that it gets closer to the right documents in the vector space
Relevance feedback • Automatic • Manual • Method: identifying feedback terms • Q’ = a1Q + a2R - a3N, where R and N are built from the relevant and non-relevant documents • Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • R = ? • N = ? • Q’ = ?
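One way to work through this example is sketched below. It assumes raw term counts as vector weights, takes R and N to be the summed vectors of the relevant and non-relevant documents, and uses the weights a1 = 1, a2 = 1/|R|, a3 = 1/|N| with |R| and |N| read as the number of documents in each set. The slides do not state these details, so they are assumptions, as is the function name `rocchio`.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant):
    """Return Q' = a1*Q + a2*R - a3*N as a term -> weight dictionary."""
    q = Counter(query.split())
    r = Counter(term for doc in relevant for term in doc.split())
    n = Counter(term for doc in nonrelevant for term in doc.split())
    a1 = 1.0
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(q) | set(r) | set(n)
    # Counter returns 0 for missing terms, so every term gets a weight.
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}

new_query = rocchio(
    "safety minivans",
    relevant=["car safety minivans tests injury statistics",
              "liability tests safety"],
    nonrelevant=["car passengers injury reviews"],
)
for term, weight in sorted(new_query.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" ends up with the largest weight, while terms that appear only in the non-relevant document ("passengers", "reviews") become negative. Many systems clip negative weights to zero, which this sketch does not do.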
Pseudo relevance feedback • Automatic query expansion • Thesaurus-based expansion (e.g., using latent semantic indexing – later…) • Distributional similarity • Query log mining
Examples
Lexical semantics (Hypernymy):
• Book: publication, product, fact, dramatic composition, record
• Computer: machine, expert, calculator, reckoner, figurer
• Fruit: reproductive structure, consequence, product, bear
• Politician: leader, schemer
• Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
• Book: autobiography, essay, biography, memoirs, novels
• Computer: adobe, computing, computers, developed, hardware
• Fruit: leafy, canned, fruits, flowers, grapes
• Politician: activist, campaigner, politicians, intellectuals, journalist
• Newspaper: daily, globe, newspapers, newsday, paper
Examples (query logs) • Book: booksellers, bookmark, blue • Computer: sales, notebook, stores, shop • Fruit: recipes cake salad basket company • Games: online play gameboy free video • Politician: careers federal office history • Newspaper: online website college information • Schools: elementary high ranked yearbook • California: berkeley san francisco southern • French: embassy dictionary learn
Final projects
• Two formats:
  • A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community.
  • A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
• Deliverables:
  • System (code + documentation + examples) or Paper (+ code, data)
  • Poster (to be presented in class)
  • Web page that describes the project.
Readings • 4: MRS15, MRS16 • 5: MRS17 • 6: MRS18, MRS19