
Search Engines & Question Answering



  1. Search Engines & Question Answering Giuseppe Attardi Università di Pisa

  2. Topics • Web Search • Search engines • Architecture • Crawling: parallel/distributed, focused • Link analysis (Google PageRank) • Scaling

  3. Top Online Activities Source: Jupiter Communications, 2000

  4. Pew Study (US users, July 2002) • Total Internet users = 111 M • Do a search on any given day = 33 M • Have used the Internet to search = 85% • http://www.pewinternet.org/reports/toc.asp?Report=64

  5. Search on the Web • Corpus: the publicly accessible Web: static + dynamic • Goal: retrieve high-quality results relevant to the user’s need (not docs!) • Needs • Informational – want to learn about something (~40%) • Navigational – want to go to that page (~25%) • Transactional – want to do something (web-mediated) (~35%) • Access a service • Downloads • Shop • Gray areas • Find a good hub • Exploratory search “see what’s there” • Example queries (callouts on the slide): Tampere weather, Mars surface images, Nikon CoolPix, Low hemoglobin, United Airlines, Car rental Finland

  6. Results • Static pages (documents) • text, mp3, images, video, ... • Dynamic pages = generated on request • database access • “the invisible web” • proprietary content, etc.

  7. Terminology • URL = Uniform Resource Locator • Example: http://www.cism.it/cism/hotels_2001.htm • Access method: http • Host name: www.cism.it • Page name: /cism/hotels_2001.htm
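A minimal sketch of splitting that example URL into the three parts named on the slide, using Python's standard urllib.parse (the comments show the expected values):

```python
from urllib.parse import urlparse

url = "http://www.cism.it/cism/hotels_2001.htm"
parts = urlparse(url)

print(parts.scheme)   # access method: 'http'
print(parts.netloc)   # host name:     'www.cism.it'
print(parts.path)     # page name:     '/cism/hotels_2001.htm'
```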

  8. Scale • Immense amount of content • 2-10B static pages, doubling every 8-12 months • Lexicon Size: 10s-100s of millions of words • Authors galore (1 in 4 hosts run a web server) http://www.netcraft.com/Survey

  9. Diversity • Languages/Encodings • Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 01) [W3C01] • Home pages (1997): English 82%, next 15 languages: 13% [Babe97] • Google (mid 2001): English: 53%, JGCFSKRIP: 30% • Document & query topic • Popular Query Topics (from 1 million Google queries, Apr 2000):
      Arts 14.6%          Arts: Music 6.1%
      Computers 13.8%     Regional: North America 5.3%
      Regional 10.3%      Adult: Image Galleries 4.4%
      Society 8.7%        Computers: Software 3.4%
      Adult 8%            Computers: Internet 3.2%
      Recreation 7.3%     Business: Industries 2.3%
      Business 7.2%       Regional: Europe 1.8%
      …                   …

  10. Rate of change • [Cho00]: 720K pages from 270 popular sites, sampled daily from Feb 17 – Jun 14, 1999 • Mathematically, what does this seem to be?

  11. Web idiosyncrasies • Distributed authorship • Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods … • Not all have the purest motives in providing high-quality information – commercial motives drive “spamming” of 100s of millions of pages • The open web is largely a marketing tool • IBM’s home page does not contain the word “computer”

  12. Other characteristics • Significant duplication • Syntactic – 30%-40% (near) duplicates [Brod97, Shiv99b] • Semantic – ??? • High linkage • ~8 links/page on average • Complex graph topology • Not a small world; bow-tie structure [Brod00] • More on these corpus characteristics later • how do we measure them?

  13. Web search users • Ill-defined queries • Short (AV 2001: 2.54 terms avg, 80% < 3 words) • Imprecise terms • Sub-optimal syntax (80% of queries without operators) • Low effort • Wide variance in • Needs • Expectations • Knowledge • Bandwidth • Specific behavior • 85% look over one result screen only (mostly above the fold) • 78% of queries are not modified (one query/session) • Follow links – “the scent of information” ...

  14. Evolution of search engines • First generation – use only “on page”, text data • Word frequency, language • (1995–1997: AV, Excite, Lycos, etc.) • Second generation – use off-page, web-specific data • Link (or connectivity) analysis • Click-through data (what results people click on) • Anchor text (how people refer to this page) • (from 1998; made popular by Google, but used by everyone now) • Third generation – answer “the need behind the query” • Semantic analysis – what is this about? • Focus on user need, rather than on query • Context determination • Helping the user • Integration of search and text analysis • (still experimental)

  15. Third generation search engine: answering “the need behind the query” • Query language determination • Different ranking • (if query Japanese do not return English) • Hard & soft matches • Personalities (triggered on names) • Cities (travel info, maps) • Medical info (triggered on names and/or results) • Stock quotes, news (triggered on stock symbol) • Company info, … • Integration of Search and Text Analysis

  16. Answering “the need behind the query”: context determination • Context determination • spatial (user location / target location) • query stream (previous queries) • personal (user profile) • explicit (vertical search, family friendly) • implicit (use AltaVista from AltaVista France) • Context use • Result restriction • Ranking modulation

  17. The spatial context - geo-search • Two aspects • Geo-coding • encode geographic coordinates to make search effective • Geo-parsing • the process of identifying geographic context. • Geo-coding • Geometrical hierarchy (squares) • Natural hierarchy (country, state, county, city, zip-codes, etc) • Geo-parsing • Pages (infer from phone nos, zip, etc). About 10% feasible. • Queries (use dictionary of place names) • Users • From IP data • Mobile phones • In its infancy, many issues (display size, privacy, etc)
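As a rough illustration of the dictionary-of-place-names approach to geo-parsing queries, a toy sketch; the place entries and coordinates below are made up for the example, not taken from the slides:

```python
# Hypothetical place-name dictionary (illustrative entries only)
PLACES = {
    "palo alto": ("US", 37.44, -122.14),
    "pisa":      ("IT", 43.72, 10.40),
    "tampere":   ("FI", 61.50, 23.76),
}

def geo_parse(query: str):
    """Return the place names found in the query, with their country and coordinates."""
    text = query.lower()
    return [(place, meta) for place, meta in PLACES.items() if place in text]

print(geo_parse("car rental Tampere"))   # [('tampere', ('FI', 61.5, 23.76))]
```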

  18. Example: AV query “barry bonds” (screenshot of AltaVista results)

  19. Example: Lycos query “palo alto” (screenshot of results)

  20. Geo-search example - Northern Light (Now Divine Inc)

  21. Helping the user • UI • spell checking • query refinement • query suggestion • context transfer …

  22. Context sensitive spell check

  23. Search Engine Architecture (diagram) • Components: Crawlers, Crawl Control, Web Page Repository / Document Store, Indexer (Text and Structure indexes), Link Analysis, Snippet Extraction, Ranking, Query Engine • Queries enter through the Query Engine; Results are returned to the user

  24. Terms • Crawler • Crawler control • Indexes – text, structure, utility • Page repository • Indexer • Collection analysis module • Query engine • Ranking module

  25. Repository “Hidden Treasures”

  26. Storage • The page repository is a scalable storage system for web pages • Allows the Crawler to store pages • Allows the Indexer and Collection Analysis to retrieve them • Similar to other data storage systems – DBs or file systems • Does not have to provide some of the other systems’ features: transactions, logging, directories

  27. Storage Issues • Scalability and seamless load distribution • Dual access modes • Random access (used by the query engine for cached pages) • Streaming access (used by the Indexer and Collection Analysis) • Large bulk update – reclaim old space, avoid access/update conflicts • Obsolete pages - remove pages no longer on the web

  28. Designing a Distributed Web Repository • Repository designed to work over a cluster of interconnected nodes • Page distribution across nodes • Physical organization within a node • Update strategy

  29. Page Distribution • How to choose a node to store a page • Uniform distribution – any page can be sent to any node • Hash distribution policy – hash page ID space into node ID space
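A minimal sketch of the hash distribution policy, assuming the page ID is its URL and using an MD5 hash to map the page ID space onto the node ID space (the 16-node cluster size is an arbitrary assumption):

```python
import hashlib

NUM_NODES = 16  # assumed cluster size, for illustration only

def node_for_page(page_id: str) -> int:
    """Hash the page ID space into the node ID space."""
    digest = hashlib.md5(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# Any crawler can route a page without consulting a central directory
print(node_for_page("http://www.cism.it/cism/hotels_2001.htm"))
```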

  30. Organization Within a Node • Several operations required • Add / remove a page • High speed streaming • Random page access • Hashed organization • Treat each disk as a hash bucket • Assign according to a page’s ID • Log organization • Treat the disk as one file, and add the page at the end • Support random access using a B-tree • Hybrid • Hash map a page to an extent and use log structure within an extent.
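A small sketch of the log organization within a node: pages are appended to one file, and an in-memory map from page ID to byte offset stands in for the B-tree that supports random access (the file layout and header format are assumptions for the example):

```python
class LogRepository:
    """Append-only page store: sequential writes, random reads via an offset index."""

    def __init__(self, path):
        self.path = path
        self.offsets = {}                 # page_id -> byte offset (stand-in for a B-tree)

    def add(self, page_id, content):
        with open(self.path, "ab") as f:  # treat the disk as one file: append at the end
            self.offsets[page_id] = f.tell()
            header = f"{page_id}\t{len(content)}\n".encode("utf-8")
            f.write(header + content)

    def get(self, page_id):
        with open(self.path, "rb") as f:  # random access via the recorded offset
            f.seek(self.offsets[page_id])
            header = f.readline().decode("utf-8")
            length = int(header.rsplit("\t", 1)[1])
            return f.read(length)

repo = LogRepository("pages.log")
repo.add("page-1", b"<html>example</html>")
print(repo.get("page-1"))                 # b'<html>example</html>'
```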

  31. Distribution Performance

  32. Update Strategies • Updates are generated by the crawler • Several characteristics • Time in which the crawl occurs and the repository receives information • Whether the crawl’s information replaces the entire database or modifies parts of it

  33. Batch vs. Steady • Batch mode • Periodically executed • Allocated a certain amount of time • Steady mode • Run all the time • Always send results back to the repository

  34. Partial vs. Complete Crawls • A batch-mode crawler can • Do a complete crawl every run, and replace the entire collection • Recrawl only a specific subset, and apply updates to the existing collection – a partial crawl • The repository can implement • In-place update • Quickly refreshes pages • Shadowing – updates applied as a separate stage • Avoids refresh-access conflicts

  35. Partial vs. Complete Crawls • Shadowing resolves the conflicts between updates and query-time reads • Batch-mode crawling fits well with shadowing • A steady crawler fits well with in-place updates
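A rough sketch of shadowing for a batch crawl, under the assumption that the collection is simply a directory of page files: the new crawl is written into a shadow copy and then swapped in, so query-time readers never see a half-applied update (the paths and the rename-based swap are illustrative choices, not the slides' design):

```python
import os, shutil

LIVE = "repo/live"      # collection currently served to the query engine (assumed layout)
SHADOW = "repo/shadow"  # copy being rebuilt by the batch crawl

def apply_batch_crawl(new_pages):
    """new_pages: {page_id: bytes}. Rebuild the shadow copy, then swap it in."""
    shutil.rmtree(SHADOW, ignore_errors=True)
    os.makedirs(SHADOW)
    for page_id, content in new_pages.items():   # page_id assumed to be a plain file name
        with open(os.path.join(SHADOW, page_id), "wb") as f:
            f.write(content)
    old = LIVE + ".old"
    if os.path.exists(LIVE):
        os.rename(LIVE, old)                      # readers on the old copy keep working
    os.rename(SHADOW, LIVE)                       # new crawl becomes the live collection
    shutil.rmtree(old, ignore_errors=True)
```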

  36. Indexing

  37. The Indexer Module creates two indexes: • Text (content) index: uses “traditional” indexing methods such as inverted indexing • Structure (links) index: uses a directed graph of pages and links; sometimes an inverted graph is also created

  38. The Link Analysis Module uses the two basic indexes created by the indexer module to assemble “utility indexes”, e.g. a site index.

  39. Inverted Index • A set of inverted lists, one per index term (word) • Inverted list of a term: a sorted list of the locations in which the term appears • Posting: a pair (w, l) where w is a word and l is one of its locations • Lexicon: holds all the index’s terms, with statistics about each term (not about its postings)
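A minimal sketch of the structures named above: inverted lists of postings and a lexicon holding per-term statistics. The whitespace tokenizer and the particular statistics kept (document frequency, collection frequency) are assumptions for the example:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns (inverted_lists, lexicon)."""
    inverted = defaultdict(list)   # term -> sorted list of postings (doc_id, position)
    lexicon = {}                   # term -> {'df': docs containing term, 'cf': total occurrences}
    for doc_id, text in docs.items():
        seen = set()
        for pos, word in enumerate(text.lower().split()):
            inverted[word].append((doc_id, pos))
            stats = lexicon.setdefault(word, {"df": 0, "cf": 0})
            stats["cf"] += 1
            if word not in seen:
                stats["df"] += 1
                seen.add(word)
    for word in inverted:
        inverted[word].sort()
    return inverted, lexicon

inv, lex = build_inverted_index({1: "web search engines", 2: "search the web"})
print(inv["search"])   # [(1, 1), (2, 0)]
print(lex["web"])      # {'df': 2, 'cf': 2}
```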

  40. Challenges • Index build must be: • Fast • Economical (unlike traditional index building) • Incremental indexing must be supported • Storage trade-off: compression vs. speed

  41. Index Partitioning Distributed text indexing can be done with: • Local inverted files (IFL) • Each node contains a disjoint, random subset of pages • The query is broadcast to all nodes • The result is the join of the per-node answers • Global inverted file (IFG) • Each node is responsible only for a subset of the terms in the collection • The query is sent only to the appropriate nodes
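A toy sketch contrasting query routing under the two schemes: with local inverted files each node indexes its own disjoint pages, so the query is broadcast and the per-node answers are joined; with a global inverted file each term is owned by one node, located here with a stable hash. The node count, the CRC32 routing, and the toy data are assumptions:

```python
import zlib

NUM_NODES = 3  # assumed number of index nodes

def node_for_term(term: str) -> int:
    """IFG: each node owns a subset of the terms; route a term by a stable hash."""
    return zlib.crc32(term.encode("utf-8")) % NUM_NODES

def query_local(local_nodes, terms):
    """IFL: each node holds disjoint pages, so broadcast the query and join the answers."""
    answers = set()
    for node in local_nodes:                       # node: {term: set(page_ids)}
        postings = [node.get(t, set()) for t in terms]
        answers |= set.intersection(*postings)
    return answers

# Toy IFL data: two nodes holding disjoint pages (illustrative)
local_nodes = [{"web": {1}, "search": {1}}, {"web": {2}, "search": set()}]
print(query_local(local_nodes, ["web", "search"]))   # {1}
print(node_for_term("web"))                           # which node to contact under IFG
```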

  42. Indexing, Conclusion • Web page indexing is complicated due to its scale (millions of pages, hundreds of gigabytes) • Challenges: incremental indexing and personalization

  43. Scaling

  44. Scaling • Google (Nov 2002): • Number of pages: 3 billion • Refresh interval: 1 month (≈1200 pages/sec) • Queries/day: 150 million ≈ 1700 q/s • Avg page size: 10 KB • Avg query size: 40 B • Avg result size: 5 KB • Avg links/page: 8

  45. Size of Dataset • Total raw HTML data size: 3 G pages × 10 KB = 30 TB • Inverted index ≈ corpus = 30 TB • Using 3:1 compression: ~20 TB of data on disk
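The slide's estimate as a back-of-the-envelope calculation (same figures as above; the decimal TB conversion is a simplification):

```python
pages = 3e9                                    # 3 billion static pages
avg_page_bytes = 10e3                          # 10 KB average page size
raw_html_tb = pages * avg_page_bytes / 1e12    # 30 TB of raw HTML
index_tb = raw_html_tb                         # inverted index ~ corpus size
on_disk_tb = (raw_html_tb + index_tb) / 3      # 3:1 compression
print(raw_html_tb, index_tb, on_disk_tb)       # 30.0 30.0 20.0
```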

  46. Single copy of index • Index: (10 TB) / (100 GB per disk) = 100 disks • Documents: (10 TB) / (100 GB per disk) = 100 disks

  47. Query Load • 1700 queries/sec • Rule of thumb: 20 q/s per CPU • 85 clusters to answer queries • Cluster: 100 machines • Total = 85 × 100 = 8500 machines • Document servers • Snippet generation: 1000 snippets/s per server • (1700 × 10 / 1000) × 100 = 1700 document servers
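The same back-of-the-envelope arithmetic for the query-load figures above; the reading that a 100-machine cluster serves ~20 q/s (because each query fans out to every index shard in the cluster) and the 10-snippets-per-query factor are my interpretation of the slide's formulas, not stated explicitly on it:

```python
queries_per_sec = 150e6 / 86400              # 150 M queries/day ≈ 1736 q/s (slide rounds to 1700)

cluster_qps = 20                             # rule of thumb: 20 q/s per CPU; a query touches all
                                             # shard machines, so a cluster serves ~20 q/s
clusters = 1700 / cluster_qps                # 85 clusters
index_machines = clusters * 100              # 8500 machines serving the index

snippets_per_query = 10                      # assumed: ~10 snippets per result page
snippets_per_server = 1000                   # one document server: 1000 snippets/s
doc_servers = (1700 * snippets_per_query / snippets_per_server) * 100   # 1700
print(round(queries_per_sec), clusters, index_machines, doc_servers)
```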

  48. Limits • Redirector: 4000 req/sec • Bandwidth: 1100 req/sec • Server: 22 q/s each • Cluster: 50 nodes = 1100 q/s = 95 million q/day

  49. Scaling the Index (diagram) • Queries arrive at a hardware-based load balancer, which distributes them across replicated Google Web Servers • Each Google Web Server consults the index servers and document servers, plus an ad server and a spell checker, to assemble the result page

  50. Pooled Shard Architecture (diagram) • The web server connects (1 Gb/s and 100 Mb/s links) to an index load balancer on the index server network • Intermediate load balancers 1 … N each front a pool of index servers for one shard (pool for shard 1 … pool for shard N) • Within each pool, replicated index servers (S1 … SN in the figure) hold copies of that shard
