Search Engine Technology (1)
Prof. Dragomir R. Radev, email@example.com
SET Fall 2013
Introduction
Examples of search engines
Conventional (library catalog): search by keyword, title, author, etc.
AOL query logs
Political science corpus
News data: AQUAINT, TDT, NANTC, Reuters, SETimes, TREC, TIPSTER
US congressional data
Twitter hits 400 million tweets per day (June 2012; Dick Costolo, CEO of Twitter)
2. Models of Information retrieval
The Vector model
The Boolean model
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
where can I find a chines rosewood
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
From Robert Korfhage’s book
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
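The two identities above (De Morgan's laws) can be checked by brute force over every truth assignment. The following is a minimal sketch in Python; the function name `de_morgan_holds` is made up for illustration and is not from the slides.

```python
from itertools import product


def de_morgan_holds():
    """Check both De Morgan identities for every combination of A and B."""
    for a, b in product([False, True], repeat=2):
        # NOT (A AND B) = (NOT A) OR (NOT B)
        if (not (a and b)) != ((not a) or (not b)):
            return False
        # NOT (A OR B) = (NOT A) AND (NOT B)
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True


print(de_morgan_holds())
```

Because there are only four assignments of A and B, an exhaustive check like this amounts to a proof by truth table.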
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
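In the Boolean model, a query like the one above can be evaluated with set operations on the lists of documents that contain each term. The sketch below assumes a tiny, invented index (document IDs 1 to 7 and the postings in `INDEX` are hypothetical, not data from the slides); AND maps to set intersection, OR to union, and NOT to the complement with respect to the whole collection.

```python
# Hypothetical toy collection: document IDs 1..7.
ALL_DOCS = {1, 2, 3, 4, 5, 6, 7}

# Hypothetical inverted index: term -> set of documents containing it.
INDEX = {
    "chaucer": {1, 2, 5},
    "milton": {2, 3},
    "swift": {3, 4, 5},
    "shakespeare": {4, 6},
}


def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return INDEX.get(term, set())


def negate(docs):
    """NOT: every document in the collection that is not in docs."""
    return ALL_DOCS - docs


def run_query():
    chaucer = postings("chaucer")
    milton = postings("milton")
    swift = postings("swift")
    shakespeare = postings("shakespeare")
    # ((chaucer OR milton) AND (NOT swift))
    left = (chaucer | milton) & negate(swift)
    # ((NOT chaucer) AND (swift OR shakespeare))
    right = negate(chaucer) & (swift | shakespeare)
    return left | right


print(sorted(run_query()))  # [1, 2, 3, 4, 6]
```

With this toy index, document 5 (which mentions both chaucer and swift) and document 7 (which mentions none of the terms) are not retrieved.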
3. Document preprocessing.
The Porter algorithm.
Storing, indexing and searching text.
Example: the word “duplicatable”
duplicat (rule 4)
duplicate (rule 1b1)
duplic (rule 3)
Another rule in step 4, removing “ic,” cannot be applied, since only one rule from each step may be applied.
SSES -> SS: caresses -> caress
IES -> I: ponies -> poni
SS -> SS: caress -> caress
S -> [blank]: cats -> cat
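The four suffix rules above (SSES, IES, SS, S) can be sketched as a small lookup that applies the first matching suffix, tried from longest to shortest. This is only an illustration of this one group of rules, not the full Porter algorithm; the names `STEP_1A_RULES` and `porter_step_1a` are chosen here for the example.

```python
# Suffix rules in the order they are tried: longest suffix first.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def porter_step_1a(word):
    """Apply the first matching suffix rule; at most one rule fires."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The SS -> SS rule looks like a no-op, but it matters: it matches before the S rule and so prevents "caress" from being cut down to "cares".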
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
2. Assign the following numbers to the remaining letters after the first:
b,f,p,v : 1
c,g,j,k,q,s,x,z : 2
d,t : 3
l : 4
m,n : 5
r : 6
3. If two or more letters with the same code were adjacent in the original name, omit all but the first.
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits.
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
same as Ellery, Ghosh, Heilbronn, Kant, and Ladd
Some problems: Rogers and Rodgers, Sinclair and StClair
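The four Soundex steps listed above can be sketched in Python as follows. This is a minimal reading of those steps as stated, under one assumption: "adjacent in the original name" is taken literally, so any intervening letter (including h or w) separates two letters with the same code. Other Soundex variants treat h and w differently. The names `CODES` and `soundex` are chosen for this example.

```python
# Step 2: letter groups and their digits.
CODES = {}
for group, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in group:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character code (letter + three digits) for a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    # Code of the previous letter in the ORIGINAL name (None for a, e, h, ...).
    prev = CODES.get(letters[0])
    for ch in letters[1:]:
        code = CODES.get(ch)  # Step 1: uncoded letters are simply dropped.
        # Step 3: skip a letter whose code repeats that of its left neighbor.
        if code is not None and code != prev:
            digits.append(code)
        prev = code
    # Step 4: pad with zeros or truncate to the form LDDD.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for n in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Rogers", "Rodgers"]:
    print(n, soundex(n))
```

Under this sketch, Rogers and Rodgers receive different codes (R262 and R326), as do Sinclair and StClair (S524 and S324), which is the kind of problem the last line above points to: names that sound alike are not matched.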