Explore the fundamentals of search engines, user interaction, indexing, and measures of success in information retrieval. Delve into precision, recall, timeliness, scalability, and the challenges faced. Learn about structured and unstructured information, text processing, and search engine politeness policies.
Rensselaer Polytechnic Institute CSCI-4220 – Network Programming David Goldschmidt, Ph.D. What are we searching for? {week 9} from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
What is search? • What is search? • What are we searching for? • How many searches are processed per day? • What is the average number of words in text-based searches?
Finding things • Applications and varieties of search: • Web search • Site search • Vertical search • Enterprise search • Desktop search • As-you-type search • Proximity search
Measures of success (i) • Relevance • Search results contain information the searcher was looking for • Problems with vocabulary mismatch • Homonyms (e.g. “Jersey shore”) • User relevance • Search results relevant to one user may be completely irrelevant to another user
Measures of success (ii) http://trec.nist.gov • Precision • Proportion of retrieved documents that are relevant • How precise were the results? • Recall (and coverage) • Proportion of relevant documents that were actually retrieved • Did we retrieve all of the relevant documents?
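The two proportions above can be computed directly from a set of retrieved document IDs and a set of relevant document IDs. The following is a minimal sketch; the function name `precision_recall` and the document IDs are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: iterable of document IDs the engine returned
    relevant:  iterable of document IDs judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Guard against empty sets so we never divide by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents returned, 3 relevant documents exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were retrieved (recall 2/3).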
Measures of success (iii) • Timeliness and freshness • Search results contain information that is current and up-to-date • Performance • Users expect subsecond response times • Media • User devices are constantly changing (cellphones, mobile devices, tablets, etc.)
Measures of success (iv) • Scalability • Designs that perform equally well as the system grows and expands • Increased number of documents, number of users, etc. • Flexibility (or adaptability) • Tune search engine components to keep up with changing landscape • Spam-resistance
Information retrieval (IR) • Gerard Salton (1927-1995) • Pioneer in information retrieval • Defined information retrieval as: • “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information” • This was 1968 (before the Internet and Web!)
(Un)structured information • Structured information: • Often stored in a database • Organized via predefined tables, columns, etc. • Select all accounts with balances less than $200 • Unstructured information: • Document text (headings, words, phrases) • Images, audio, video (often relies on textual tags)
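The structured query on this slide ("select all accounts with balances less than $200") maps onto a predefined table and column. As a rough illustration, here is a sketch using Python's built-in `sqlite3` module; the `accounts` table, its columns, and the sample rows are all assumptions made up for this example.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("alice", 150.0), ("bob", 950.0), ("carol", 42.5)],
)

# The schema tells the database exactly which field to compare,
# so the query is unambiguous.
low = conn.execute(
    "SELECT owner, balance FROM accounts WHERE balance < ? ORDER BY owner",
    (200,),
).fetchall()
print(low)  # [('alice', 150.0), ('carol', 42.5)]
conn.close()
```

No equivalent exact query exists for unstructured text, which is why search engines fall back on the statistical properties of the text itself.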
Processing text • Search and IR have largely focused on text processing and documents • Search typically uses the statistical properties of text • Word counts • Word frequencies • But it typically ignores linguistic features (noun, verb, etc.)
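Word counts and word frequencies are simple to compute. The sketch below assumes a deliberately naive tokenizer (lowercase, runs of letters and digits); the function name `word_frequencies` is hypothetical, and real search engines use more careful tokenization.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (counts, freqs): raw word counts and relative frequencies."""
    # Naive tokenizer: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # Relative frequency = count / total number of tokens.
    freqs = {word: n / total for word, n in counts.items()} if total else {}
    return counts, freqs


counts, freqs = word_frequencies("To be, or not to be")
print(counts.most_common(2))  # [('to', 2), ('be', 2)]
```

Note that nothing here knows whether a token is a noun or a verb; only the statistics of the text are used.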
Politeness and robots.txt • Web crawlers adhere to a politeness policy: • GET requests sent every few seconds or minutes • A robots.txt file specifies what crawlers are allowed to crawl
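To illustrate how a crawler might honor these rules, here is a sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the crawler names `ExampleCrawler` and `BadBot`, and the example.com URLs are all invented for this example; a real crawler would fetch the file from the site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone may crawl except under /private/,
# with a 5-second delay between requests; BadBot is excluded entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory lines, no network

# A polite crawler checks each URL before sending a GET request.
for agent, url in [
    ("ExampleCrawler", "https://example.com/index.html"),
    ("ExampleCrawler", "https://example.com/private/a.html"),
    ("BadBot", "https://example.com/index.html"),
]:
    print(agent, url, parser.can_fetch(agent, url))

# Crawl-delay is a widely used extension; returns None if not specified.
print(parser.crawl_delay("ExampleCrawler"))
```

The crawl delay gives the crawler a concrete interval to wait between GET requests to the same host, which is the "every few seconds or minutes" part of the politeness policy.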
Sitemaps • Default priority is 0.5 • Some URLs might not be discovered by the crawler
A day in the life of a crawler • What about checking for updated pages?
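A sitemap is an XML file listing URLs, optionally with hints such as a priority. The sketch below parses a small, made-up sitemap with the standard library and applies the default priority of 0.5 when the element is missing; the URLs, dates, and the function name `parse_sitemap` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap: the second URL has no <priority> element.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2010-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://example.com/archive/old.html</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return (url, priority) pairs, highest priority first."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        # Fall back to the default priority of 0.5 when none is given.
        priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        entries.append((loc, priority))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


for loc, priority in parse_sitemap(SITEMAP_XML):
    print(priority, loc)
```

A page such as the archive URL above may have no inbound links, so listing it in the sitemap is one way for the crawler to learn that it exists.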
Freshness vs. age • Freshness is essentially a Boolean value (a crawled page is either up to date or it is not) • Age measures the degree to which a crawled page is out of date
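The difference between the two measures can be sketched as follows. This assumes one common definition of age (zero while the stored copy is current, otherwise the time elapsed since the live page changed) and represents times as plain numbers; the function names and parameters are hypothetical.

```python
def is_fresh(last_crawled, last_modified):
    """Freshness is Boolean: the stored copy is fresh if the live page
    has not changed since it was last crawled."""
    return last_modified <= last_crawled


def age(now, last_crawled, last_modified):
    """Age is a degree: how long the stored copy has been out of date.

    Assumes last_modified is the time the live page first changed
    after the most recent crawl (or earlier, if it has not changed).
    """
    if is_fresh(last_crawled, last_modified):
        return 0
    return now - last_modified


# Hypothetical timeline (in hours): crawled at t=10, page changed at t=12.
print(is_fresh(10, 12))  # False
print(age(20, 10, 12))   # 8
```

Two stale pages are equally "not fresh", but the one that changed earlier has the greater age, which makes age the more useful signal for deciding which page to recrawl first.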