search and query an over re view l.
Skip this Video
Loading SlideShow in 5 Seconds..
Search and Query: An {Over, Re}view PowerPoint Presentation
Download Presentation
Search and Query: An {Over, Re}view

Loading in 2 Seconds...

play fullscreen
1 / 25

Search and Query: An {Over, Re}view - PowerPoint PPT Presentation

  • Uploaded on

Search and Query: An {Over, Re}view. Joseph M. Hellerstein Computer Science Division UC Berkeley Background. My interests: Database & Information Systems Tech: Indexing, Query Processing, Dataflow, Parallel/Distributed/Federated Systems

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

Search and Query: An {Over, Re}view

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
search and query an over re view

Search and Query: An {Over, Re}view

Joseph M. Hellerstein

Computer Science Division

UC Berkeley

  • My interests: Database & Information Systems
    • Tech: Indexing, Query Processing, Dataflow, Parallel/Distributed/Federated Systems
    • Apps: Interactive Data Analysis, eMarkets, eCatalogs, Structured Querying on Internet, Querying Sensors
  • My (lack of) agenda for today
    • Examples to defamiliarize today’s modes of operation
    • Technical breakdown/review of system components
      • Core/common vs. Value-Added
    • Natural policy discussion
      • Core = “free” and “fair”?
      • Value-Added = “expensive” and “market-driven”?
      • Conflicting needs of core scaling and policy?
a note of caution
A Note of Caution
  • Contemplating a redesign? Fine. But…
  • Familiarity often trumps other desiderata
    • Memories are short, cross-fertilization is poor
    • Even in the technical community
    • Trap: common culture/techniques:
      • “What’s hot”: everybody knows web searching
      • “What’s packaged”: Database technology = Oracle
      • Rather than “What’s core technology”
  • In Expanding DNS, be sure to review known technology
    • What’s core under the packaging of related technologies?
    • Reuse/generalize common ideas
    • Leave specialized ideas above the core (minimize core constraints)
    • Good way to evaluate “brand new” ideas, too
      • How do they challenge the core? Is the core “right”?
beware of today s common culture
Beware of Today’s Common Culture
  • Rise of the web => Rise of Full Text Search
    • Used to be a library app (yawn)
    • Suddenly the dominant paradigm (??)
  • Beware: Internet  Web!
    • DNS is not document-centric
    • Neither are Peer-to-Peer filesharing or “Deep Web” Facts and Figures
    • Keyword search may not be core functionality
  • If DNS is to serve many needs…
    • Must separate core search services from varying apps!
      • Try to accommodate what’s hot tomorrow
    • What’s a core service? What’s an app?
road map
Road Map
  • Defamiliarization
    • Some new Internet “search” apps
      • Provided by Telegraph
  • An {over, re}view
    • Some basics of querying
    • Some separation of the issues
  • Discussion…
peer to peer filesharing
Peer-to-Peer Filesharing
  • Napster, Gnutella, Aimster, etc.
    • And a pile of ill-defined startups
    • This is not the web, but it is part of the Internet
    • Lessons already!
  • Distributed files and search by substring
    • Isn’t that heaven?!
    • Would like to handle spelling, language translation (pig latin), etc.
      • A layer of canonicalization (“nameprep”)
  • Well, what about these translations…
    • I post my name mapping scheme on my website
    • Download the Billboard Top Ten
    • Download tunes I’ve paid for
  • No problem…
telenap amazon meets napster
TeleNap: Amazon Meets Napster

OpenNap Servers

Album Information

search vs query
“Search” vs. Query
  • “Search” can return only what’s been stored
  • E.g., best match at iWon, Google, AskJeeves top ten:
search vs query10
“Search” vs. Query
  • “Search” can return only what’s been stored
  • E.g., best match at iWon, Google, AskJeeves top ten:
query search http fff cs berkeley edu
Query >> Search:
  • “Federated Facts and Figures”
    • Yahoo join FECInfo
  • Imagine DNS analogies:
    • Find IP addresses of hosts registered by companies that donated to GW Bush
      • FECInfo join WHOIS
  • Note: errors!
    • Can do fuzzy matching and ranking, too
query search http fff cs berkeley edu13
Query >> Search:
  • “Federated Facts and Figures”
    • APBNews join FECInfo
  • Imagine DNS analogies:
    • Show # of live nodes, Roll Up By Domain
    • Show Avg Ping time, Roll Up By Domain
    • Show Best Ping Time per Group, Roll Up By Domain
slippery slope of mapping
Slippery Slope of “Mapping”
  • Step 1: Exact Match Lookups
  • Step 2: Simple canonicalization
    • E.g. toupper()
  • Step 3: Complex canonicalization
    • E.g. “nameprep”
  • Step 4: Join w/Arbitrary Mapping Tables
    • Centralized “keywords” (RealNames)
    • Personalized mapping, “bookmarks”
    • Geo-located mapping tables for Localization
    • Load-balancing mapping via net status queries
    • Ad-hoc mapping: ad hoc database join queries!
      • E.g. “The lightest-loaded machine containing >5 of the Billboard Top Ten Songs”
  • Q: How much of this belongs in DNS??
slippery slope of querying
Slippery Slope of Querying
  • Step 1: Single collection, exact lookup
  • Step 2: Single collection, fancy lookup
    • “Search”
  • Step 2: Combine multiple collections
    • E.g. Mapping (join), “Metasearch” (union)
  • Step 3: Roll-up/drill-down
  • Step 4: Statistical analyses
  • Step 5: “Turing-complete” queries
    • Prolog/Datalog
  • How much of this belongs in DNS??
    • Note: Step 5 was even rejected in databases!
where does it end
Where Does it End??
  • All this functionality seems plausibly useful
    • Some of it even necessary!
  • But what should be in network infrastructure?
    • Vs. centralized, replicated services
    • Who pays for these services?
querying 101

e.g. Rewriting in canonical form, Choosing AMs and Processing Options.

Very different techniques in Databases,

Search Engines, though not mutually


Query Optimization

e.g. Joins, Intersect, Union, Rank/Sort, Group/Aggregate (Stat. Summaries)

Used in Databases, Search Engines

Parallelized on Well-Maintained Clusters

Bulk Data Processing

Access Methods

e.g. B-trees, Hash Indexes

Used in Directories, Databases,

Search Engines. Can be distributed.

Querying 101

This is Core and Common to

All Query systems.

access methods
Access Methods
  • Local search:
    • Main-memory data structures
      • binary trees, hashtables, skip lists, etc.
    • Disk-based data structures
      • B-trees, linear hash indexes, etc.
    • Typically equality and/or range lookups
  • Distributed search
    • Flat partitioning (hash, range), w/replication
    • Hierarchical partitioning
    • More recent multi-hop search & replication
      • E.g. CAN, Chord, PAST, Tapestry, Pastry
      • Equality lookups only (so far), no need for hierarchies
      • Proposed as a DNS replacement by networking researchers
data processing
Data Processing
  • Flow the data through processing code
  • Absent in Directory Services
  • Used in database systems (in the box)
    • Ad hoc combinations of operators possible
  • Constrained use in text search engines (in the box)
    • 1 collection: (word, docID, position, score, …)
      • Indexed by stemmed word
    • OR, AND, NOT: Union/Intersect/Subtract
    • Map docIDs to URLs, snippets, etc.
    • Sort by a (magic) function of position, score, etc.
    • Text search is one (highly tuned!) database query
  • Fun research: genericize this dataflow technology
    • Goal of Telegraph project: adaptive dataflow (out of the box)
      • Cluster-based implementation
      • Distributed (P2P) implementation
query optimization
Query Optimization
  • Text Search
    • Query rewrite: stemming, stop words, thesaurus
    • Scheduling: which machine(s) on cluster
    • Based on data partitioning, load & data statistics
  • Database Systems
    • Query rewrite: authorization, “views”
    • Choices among (redundant) access methods
    • Choices among data processing algorithms (joins)
    • Choices of reorderings for these algorithms
    • Scheduling: which machines(s) on cluster
    • Based on data partitioning, load & data statistics
  • Lots of fancy tricks here!
and what about storage semantics
And what about storage semantics??
  • So far we only looked at the query side
    • Query results only as good as the data!
    • Again, varying solutions here…
replication data consistency
Replication & Data Consistency
  • Databases do Transactions
    • Atomic, durable updates across multiple records
    • Data consistency guaranteed
    • Distributed transactions possible, but slow
      • Two-Phase Commit
    • Most people do “warm” replication
      • Log-shipping w/xactional networking -- MQ
    • Heavyweight technology!
      • Brewer’s CAP “theorem”
  • Directories tend to use “leases” (TTL)
    • Tend to be per record or collection
      • Cross-object consistency not guaranteed
    • A little drift is often OK
    • In a scenario with mapping, may need to think about atomicity across records/tables
one view core vs apps
One View: Core vs. Apps
  • Ensure that core services:
    • Scale to large # of machines (~size of Internet)
    • Work in presence of failures
    • Well-defined standard results: no surprises/ambiguity!
      • E.g. SQL subsets, Boolean text search
      • Need not be a user in the loop!
    • Unanticipated scenarios in future
      • Support ad hoc queries -- think cross-paradigm
  • Apps:
    • Scale to clusters, need not scale to size of Internet
    • Allow for mapping/customization
      • Results can be preference-dependent
      • What about time-/geo-dependent??
    • Allow result browsing, analysis
      • Fuzzy results, roll-ups, summaries all OK: user in the loop!
    • Impose a query paradigm
  • Telegraph:
  • Federated Facts and Figures: