
Search Engines



Presentation Transcript


  1. Search Engines

  2. What Are They? • Four Components • A database of references to webpages • An indexing robot that crawls the WWW • An interface • Enables users to submit queries • Displays results • Information retrieval system • Each engine is unique, but most work in much the same way

  3. Database • Where user's query is matched • Contains only essential parts of pages • Only includes pages that were indexed • Search engines are always out of date
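
The database described above is commonly organized as an inverted index. The following is a minimal sketch under that assumption; the URLs, page texts, and the names `build_index` and `lookup` are hypothetical and are not from the slides.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, word):
    """Return the indexed pages that contain the word, in sorted order."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical pages the crawler has already visited.
pages = {
    "http://example.org/a": "search engines crawl the web",
    "http://example.org/b": "a crawler follows links",
}
index = build_index(pages)

hits = lookup(index, "crawl")       # found only on page a
misses = lookup(index, "podcast")   # never indexed, so no match
print(hits, misses)
```

A query can only match what the index already holds, which is why a page published after the last crawl stays invisible until the robot returns.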

  4. Web Crawler • A robot that follows links • Records data it finds • Words in the webpage • Metadata • ALT attributes in IMG tags • Robot Exclusion Protocol
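
As a rough illustration of the crawler slide, the sketch below records words, metadata, ALT text, and links from one page, and checks a robots.txt rule before fetching. It uses only the Python standard library; the sample HTML, the `PageRecorder` class, and the `ExampleBot` user agent are made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class PageRecorder(HTMLParser):
    """Collect the kinds of data the slide lists for one page."""

    def __init__(self):
        super().__init__()
        self.links, self.words, self.alts, self.meta = [], [], [], {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])        # links to follow later
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])          # ALT text of images

    def handle_data(self, data):
        self.words.extend(data.split())             # visible words

# Hypothetical page content.
SAMPLE = """<html><head><title>Home</title>
<meta name="description" content="A sample page">
</head><body><p>Welcome to the site</p>
<img src="logo.png" alt="Company logo">
<a href="/about.html">About</a></body></html>"""

recorder = PageRecorder()
recorder.feed(SAMPLE)

# Robot Exclusion Protocol: honor the site's robots.txt rules.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
blocked = robots.can_fetch("ExampleBot", "http://example.org/private/x.html")
allowed = robots.can_fetch("ExampleBot", "http://example.org/about.html")
print(recorder.links, recorder.meta, recorder.alts, blocked, allowed)
```

A real crawler would fetch robots.txt from each host and queue the recorded links for later visits; both steps are left out here to keep the sketch short.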

  5. Search Engine Interfaces • Gathers input from users • Presents results from the IR system • Often in ranked order

  6. Search Engine Interfaces • Input • User requirements • Search expression, search limits • Presentation style • Presentation format, search type

  7. Search Engine Interfaces • Output • Results • Descriptions • Clusters

  8. Search Term Matching • Trying to find a match in the database • Two main methods • Keyword searching • Matching single terms, computing cosine similarity • Concept-based searching • Examining clusters of words • Attempting to determine the meaning of the query and find records related to that meaning
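
The keyword-searching method above can be sketched as cosine similarity between term-count vectors. This is a simplified illustration: the documents and query are invented, raw term counts stand in for whatever weighting a real engine applies, and concept-based searching is not covered.

```python
import math
from collections import Counter

def cosine(query, document):
    """Cosine of the angle between two term-count vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(c * c for c in q.values()))
            * math.sqrt(sum(c * c for c in d.values())))
    return dot / norm if norm else 0.0

# Hypothetical database records.
docs = [
    "web crawler follows links",
    "boolean search operators",
    "the crawler records words",
]
query = "web crawler"

# Score every record, then list the best match first.
scores = [cosine(query, doc) for doc in docs]
ranked = sorted(zip(scores, docs), reverse=True)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

A score of 1 means the two texts use the same terms in the same proportions, and 0 means they share no terms at all.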

  9. Basic IR Features • Boolean operators • AND, OR, NOT, grouping • Extended operators • NEAR, ADJACENT, quoted phrases (") • Stop word deletion • Stemming • Searching in fields (e.g. host)
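
Several of these features map directly onto set operations over an index. The sketch below assumes a tiny hand-picked stop word list and a deliberately crude suffix-stripping stemmer (production systems use something like the Porter stemmer); the documents and helper names are hypothetical.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}  # assumed list

def stem(word):
    """Crude stemming: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Lowercase, delete stop words, and stem what is left."""
    return {stem(w) for w in text.lower().split() if w not in STOP_WORDS}

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical document collection.
docs = {1: "crawling the web", 2: "the crawler indexes pages", 3: "searching indexes"}
index = build_index(docs)
all_docs = set(docs)

def posting(word):
    """Documents containing the (stemmed) word."""
    return index.get(stem(word.lower()), set())

# Boolean operators as set algebra.
and_result = posting("indexes") & posting("crawler")                        # AND
or_result = posting("crawling") | posting("searching")                      # OR
not_result = posting("indexes") & (all_docs - posting("crawler"))           # AND NOT
grouped = (posting("crawling") | posting("searching")) - posting("indexes") # (A OR B) AND NOT C
print(and_result, or_result, not_result, grouped)
```

Because of stemming, a query for "index" would match the same documents as "indexes", and the stop word "the" never reaches the index at all.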

  10. Ranked Output • Most SEs produce ranked lists by applying simple rules: • Early words are more important • Title is very important • Frequency of occurrence matters for some • Infrequent words matter more • Modification date • Google is different: • PageRank™ method based on popularity • Links as money
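
The "links as money" idea can be illustrated with the published PageRank iteration: each page splits its current rank evenly among the pages it links to. This is a sketch of that textbook formulation only, not of Google's production ranking; the four-page link graph is invented, and the damping factor 0.85 is the commonly cited value.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along links; every target must be a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                # A page "spends" its rank in equal shares, one per outgoing link.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
best = max(rank, key=rank.get)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C ends up on top because three pages link to it, while D, which nobody links to, keeps only the small baseline share.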

  11. Googlebombing • Google spoofed from the lecture list • first hit from 1992 • Official GoogleBlog explanation

  12. What about the Invisible Web? • Also known as the Deep Web • Documents that are on the WWW but not indexed by Search Engines • Some are available only by submitting forms • Some are not generally accessible (in subnets) • Some are not in (X)HTML format

  13. The Invisible Web Isn't So Invisible Anymore… • More search engines parse non-(X)HTML now than before • Because they are aware of the problem, companies are making more content available using • Stable URLs • Robot-friendly sitemaps • But much content is still not indexed

  14. But there are still plenty of important yet invisible docs • How to find them? • Many of them are in databases • No one search engine covers everything • Use database tools from the university's library • Especially for research articles • Use multiple search engines or a meta-crawler • Dogpile is the most famous

  15. Search Engines: A Summary of Practical Advice

  16. How To Succeed With SEs • As a surfer: • If you don't know what you are looking for • Use multiple SEs, or a meta-crawler • Search within results • If you do know what you are looking for • Use multiple SEs, or a meta-crawler • Use Boolean expressions or search within results • Consider specialized engines

  17. How To Succeed With SEs • As a creator: • HTML level • Always use ALT attributes with <IMG>, etc. • Avoid frames • Make it easier to index • Don't expect SEs to find your pages • Make links between your pages • Use metadata • Informal: <meta name="description" …> • Formal: Dublin Core and others • Increase your pages' popularity • Don’t use systematic reciprocal linking: rings, exchanges, lists • The PageRank™ a page passes along each link is inversely proportional to its outdegree

  18. How To Succeed With SEs • As a creator (cont.) • For surfers: • Use <meta name="description" …> • Don't expect surfers to start at top of your hierarchy • Don't rely on a hierarchy • Include a context map near the top of each page • Don't use frames • Think through dynamic content implications • Stickiness… is for another day
