
Federated Facts and Figures


Presentation Transcript


  1. Federated Facts and Figures Joseph M. Hellerstein UC Berkeley

  2. Road Map • The Deep Web and the FFF • An Overview of Telegraph • Demo: Election 2000 • From Tapping to Trawling • A Taste of Policy and Countermeasures • Delicious Snacks

  3. Meet the “Deep Web” • Available in your browser, but not via hyperlinks • Accessed via forms (press the “submit” button) • Typically runs some code to generate data • E.g. call out to a database, or run some “servlet” • Pretty-print results in HTML • Dynamic HTML • Estimated to be >400x larger than the “surface web” • Not accessible in the search engines • Typically crawl hyperlinks only

  4. Federated Facts and Figures • One part of the deep web: more full-text documents • E.g. archived newspaper articles, legal documents, etc. • Figure out how to fetch these, then add to search engine • Various people working on this (e.g. CompletePlanet) • Another part: Facts and Figures • I.e. structured database data • Fetch is only the first challenge • Want to combine (“federate”) these databases • Want to search by criteria other than keywords • Want to analyze the data en masse • I.e. want full query power, not just search • Search was always easy • Ranking not clearly appropriate here

  5. Meet the FFF

  6. Meet the FFF

  7. Meet the FFF

  8. Meet the FFF

  9. http://telegraph.cs.berkeley.edu Telegraph • An adaptive dataflow system • Dataflow • siphon data from the deep web and other data pools • harness data streaming from sensors and traces • flow these data streams through code • Adaptive • sensor nets & wide area networks: volatile! • like Telegraph Avenue • needs to “be cool” with volatile mix from all over the world • adaptive techniques route data to machines and code • marriage of queries, app-level route/filter, machine learning • First apps • Facts and Figures Federation: Election 2000 • Continuous queries on sensor nets • Rich queries on Peer-to-Peer • Joe Hellerstein, Mike Franklin, & co.

  10. Dataflow Commonalities • Dataflow at the heart of queries and networks • Query engines move records through operators • Networks move packets through routers • Networked data-intensive apps an emerging middle ground • Database Systems: • High-function, high integrity, carefully administered. Compile intelligent query plans based on data models and statistical properties, query semantics. • Networks: • Low-function, high availability, federated administration. Adapt to performance variabilities, treat data and code as opaque for loose coupling.

  11. Long-Running Dataflows on the FFF • Not precomputed like web indexes • Need online systems & apps for online performance goals • Subject of prior work in CONTROL project • Combo of query processing, sampling/estimation, HCI [Figure: progress toward 100% over Time, curves labeled Online and Traditional]

  12. Telegraph Architecture • Telegraph executes Dataflow Graphs • Extensible set of operators • With extensible optimization rules • Data access operators • TeSS: the Telegraph Screen Scraper • Napster/Gnutella readers • File readers • Data processing operators • Selections (filters), Joins, Drill-Down/Roll-Up, Aggregation • Adaptivity Operators • Eddies, STeMs, FLuX, etc.

  13. Screen Scraping: TeSS • Screen scrapers do two things: • Fetch: emulate a web user clicking • Parse: extract info from resulting HTML/XML • Somebody has to train the screen scraper • Need a separate wrapper for each site • Some research work on making this process semi-automatic • TeSS is an open-source screen-scraper • Available at http://telegraph.cs.berkeley.edu/tess • Written by a (superstar) sophomore! • Simple scripting interface targeted today • Moving towards GUI for non-technical users (“by example”)
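
The fetch/parse split described on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not TeSS itself or its scripting interface: the function names, the `example.com` URL, the form field names, and the sample HTML are all hypothetical, and the "fetch" step only builds the request a browser would send rather than contacting a server.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_form_request(action_url, fields):
    """Fetch step: encode form fields the way a browser does for a GET submit."""
    return action_url + "?" + urlencode(fields)


class TableCellParser(HTMLParser):
    """Parse step: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside an open <td> is part of a record.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_rows(html):
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical result page; a real wrapper has to be written per site.
SAMPLE_HTML = """
<table>
  <tr><td>Smith, A.</td><td>12 Oak St</td></tr>
  <tr><td>Smith, B.</td><td>9 Elm Ave</td></tr>
</table>
"""

url = build_form_request("http://example.com/lookup", {"last": "Smith", "state": "CA"})
rows = parse_rows(SAMPLE_HTML)
```

The parser is tied to one page layout, which is the "separate wrapper for each site" cost the slide mentions.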

  14. First Demo: Election 2000

  15. From Tapping to Trawling • Telegraph allows users to pose rich queries over the deep web • But sometimes would like to be more aggressive: • Preload a Telegraph cache • Access a variety of data for offline mining • More (we’ll see soon!) • Want something like a webcrawler for FFF • But FFF is too big. • Want to “trawl” for interesting stuff hidden there.

  16. From Tapping to Trawling

  17. From Tapping to Trawling

  18. From Tapping to Trawling [Figure: dataflow diagram; labels: “Smith”, “1600 Pennsylvania Avenue, DC”, Name, Address, Anywho Name, Infospace Name, Infospace Street, Yahoo Maps, Eddy, DupElim]

  19. API Challenges in Trawling • Load APIs on the web today: service and silence • Various policies at the servers, hard to learn • No analogy to robots.txt (which is too limiting anyhow) • Feedback can be delayed, painful • Solutions • Be very conservative • Make out-of-band (human) arrangements • Both seem inefficient • Finding new sites to trawl is hard • Have to wrap them: fetch is easyish, parse hardish • XML will help a little here • Query? Or Update? Again, an API problem! • Imagine we auto-trawled AnyWho and WeSpamYou.com

  20. Trawling “Domains” • Can now collect lists of: • Names (First, Last), Addresses, Companies, Cities, States, etc. etc. • Can keep lists organized by site and in toto • Allows for offline mining, etc. • Q: Do webgraph mining techniques apply to facts and figures?

  21. Exploiting Enumerated Domains I • Can trawl any site on known domains! • Suddenly the deep web is not so hidden. • In essence, we expand our trawl • Can use pre-existing domains to trawl further • Or, can add new sites to the trawl process
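
One way to picture "use pre-existing domains to trawl further" is a bounded worklist loop: values already collected for one domain are submitted to any site that accepts that domain as input, and whatever comes back extends the other domains. The sketch below is an assumption-laden illustration, not Telegraph code: the two in-memory "sites", their data, and the `trawl` function are hypothetical, and the request `budget` stands in for the conservative load policy discussed on the API slide.

```python
from collections import deque

# Hypothetical sites: each accepts values from one domain and returns another.
SITES = {
    "phonebook": {
        "input": "name", "output": "address",
        "data": {"Smith": ["12 Oak St"], "Jones": ["9 Elm Ave"]},
    },
    "reverse_lookup": {
        "input": "address", "output": "name",
        "data": {"12 Oak St": ["Smith", "Lee"], "9 Elm Ave": ["Jones"]},
    },
}


def trawl(seed_domains, sites, budget):
    """Expand enumerated domains by querying sites, issuing at most `budget` requests."""
    domains = {d: set(vals) for d, vals in seed_domains.items()}
    queue = deque((d, v) for d, vals in seed_domains.items() for v in sorted(vals))
    requests = 0
    while queue and requests < budget:
        domain, value = queue.popleft()
        for site in sites.values():
            if site["input"] != domain:
                continue
            if requests >= budget:
                break
            requests += 1  # one simulated form submission
            for found in site["data"].get(value, []):
                known = domains.setdefault(site["output"], set())
                if found not in known:
                    known.add(found)
                    queue.append((site["output"], found))  # new value: trawl further
    return domains, requests


domains, requests = trawl({"name": ["Smith"]}, SITES, budget=10)
```

Starting from the single name "Smith", the loop discovers one address and then a second name through the reverse lookup. "Jones" is never reached, which matches the slide's point that a trawl yields a sample of a site rather than its full content.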

  22. Exploiting Enumerated Domains II • Trawling gets a sample (signature) of a site’s content • Analogous to a random walk, but needs to be characterized better • Can identify that 2 sites have related subsets of domains • Helps with the query composition problem • Rich query interfaces tend to be non-trivial • What sites to use? How to combine them? • Imagine: • Traditional search engine experience to pick some sites • System suggests how to join the sites in a meaningful way • As you build the query, you always see incremental results • Refine query as the data pours in • Berkeley CONTROL project has been working on incremental queries • Blends search, query, browse and mine

  23. A Sampler of FFF Policy Issues • Statistical DB Security Issues • Facing the Power of the FFF • “False” combinations • Combination strength • What is trawling? • Copying? So what? • Akamai for the deep web? • Cracking?

  24. Sampler of Countermeasures • Trawl detection • And Distributed Trawl Detection • Metadata Watermarking • Provenance, Lineage, Disclaimers • Stockpiling Spam

  25. Delicious Snacks “Concepts are delicious snacks with which we try to alleviate our amazement” -- A. J. Heschel, Man Is Not Alone

  26. Technical Snacks • Adaptive Dataflow • Systems + Learning • Incremental & continuous querying • And online, bounded trawling • Adds an HCI component to the above • FFF APIs, standards • The wrapper-writing bottleneck: XML? • Backoff APIs? • Search vs. Update • Mining trawls

  27. More Technical Snacks • Tie-ins with Security • Applications beyond FFF • Sensors • P2P • Overlay Networks

  28. Policy Questions • Presenting & Interpreting Data • Not just search • Privacy: What is it, what’s it for? • Leading Indicators from the FFF

  29. More? • http://telegraph.cs.berkeley.edu • jmh@cs.berkeley.edu • Collaborators: • Mike Franklin, Hal Varian -- UCB • Lisa Hellerstein & Torsten Suel -- Polytechnic • Sirish Chandrasekaran, Amol Deshpande, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah -- UCB

  30. Backup Slides

  31. Telegraph: Adaptive Dataflow • Mixed design philosophy: • Tolerate loose coupling and partial failure • Adapt online and provide best-effort results • Learn statistical properties online • Exploit knowledge of semantics via extensible optimization infrastructures • Target new networked, data-intensive applications

  32. Adaptive Systems: General Flavor Repeat: • Observe (model) environment • Use observation to choose behavior • Take action

  33. Adaptive Dataflow in DBs: History • Rich But Unacknowledged History • Codd's data independence predicated on adaptivity! • adapt opaquely to changing schema and storage • Query optimization does it! • statistics-driven optimization • key differentiator between DBMSs and other systems

  34. Adaptivity in Current DBs • Limited & coarse grain Repeat: • Observe (model) environment • runstats (once per week!!): model changes in data • Use observation to choose behavior • query optimization: fixes a single static query plan • Take action • query execution: blindly follow plan

  35. What’s So Hard Here? • Volatile regime • Data flows unpredictably from sources • Code performs unpredictably along flows • Continuous volatility due to many decentralized systems • Lots of choices • Choice of services • Choice of machines • Choice of info: sensor fusion, data reduction, etc. • Order of operation • Maintenance • Federated world • Partial failure is the common case • Adaptivity required!

  36. Adaptive Query Processing Work • Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97] • Per Query: Mariposa [SA+96], ASE [CR94] • Competition: RDB [AZ96] • Inter-Op: [KD98], Tukwila [IF+99] • Query Scrambling: [AF+96,UFA98] • Survey: Hellerstein, Franklin, et al., DE Bulletin 2000 [Figure: axis “Frequency of Adaptivity”; labels: Competition & Sampling, Query Scrambling, Ingres DECOMP, Inter-Operator, Late Binding, Future Work, Per Query, System R, Eddies]

  37. A Networking Problem!? • Networks do dataflow! • Significant history of adaptive techniques • E.g. TCP congestion control • E.g. routing • But traditionally much lower function • Ship bitstreams • Minimal, fixed code • Lately, moving up the foodchain? • app-level routing • active networks • politics of growth • assumption of complexity = assumption of liability

  38. Networking Code as Dataflow? • States & Events, Not Threads • Asynchronous events natural to networks • State machines in protocol specification and system code • Low-overhead, spreading to big systems • Totally different programming style • remaining area of hacker machismo • Eventflow optimization • Can’t eventflow be adaptively optimized like dataflow? • Why didn’t that happen years ago? • Hold this thought

  39. Query Plans are Dataflow Too • Programming model: iterators • old idea, widely used in DB query processing • object with three methods: Init(), GetNext(), Close() • input/output types • query plan: graph of iterators • pipelining: iterators that return results before children Close()
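
The iterator model on this slide can be shown with a small Python sketch. The three methods mirror Init(), GetNext(), and Close(); everything else is an assumption made for illustration: the `Scan` and `Select` class names, the use of `None` as the end-of-stream marker, and the `run` driver.

```python
class Scan:
    """Leaf iterator over an in-memory list (stands in for a data access operator)."""

    def __init__(self, records):
        self.records = records
        self.pos = None

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.records):
            return None  # assumed end-of-stream marker
        rec = self.records[self.pos]
        self.pos += 1
        return rec

    def close(self):
        self.pos = None


class Select:
    """Filter iterator: pulls from its child and passes on matching records."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        # Pipelined: returns one result as soon as the child yields a match,
        # long before the child is exhausted or closed.
        while True:
            rec = self.child.get_next()
            if rec is None or self.predicate(rec):
                return rec

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan (a graph of iterators) from its root and collect the output."""
    plan.init()
    out = []
    rec = plan.get_next()
    while rec is not None:
        out.append(rec)
        rec = plan.get_next()
    plan.close()
    return out


plan = Select(Scan([1, 2, 3, 4, 5, 6]), lambda r: r % 2 == 0)
result = run(plan)
```

Because every operator exposes the same three methods, a plan is built by nesting constructors, and the consumer at the root never needs to know what sits below it.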

  40. Clever Dataflow Tricks • Volcano: “exchange” iterator [Graefe] • encapsulate exchange logic in an iterator • not in the dataflow system • Box-and-arrow programming can ignore parallelism

  41. Some Solutions We’re Focusing On • Rivers • Adaptive partitioning of work across machines • Eddies • Adaptive ordering of pipelined operations • Quality of Service • Online aggregation & data reduction: CONTROL • MUST have app-semantics • Often may want user interaction • UI models of temporal interest • Data Dissemination • Adaptively choosing what to send, what to cache

  42. River • Berkeley built the world-record sorting machine • On the NOW: 100 Sun workstations + SAN • Only beat the record under ideal conditions • No such thing in practice! • (Arpaci-Dusseau)² • with Culler, Hellerstein, Patterson • River: adaptive dataflow on clusters • One main idea: Distributed Queues • adaptive exchange operator • Simplifies management and programming • Remzi Arpaci-Dusseau, Eric Anderson, Noah Treuhaft • w/Culler, Hellerstein, Patterson, Yelick
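
The intuition behind a distributed queue can be illustrated with a tiny simulation: work goes to whichever consumer is free next, so a slow machine takes fewer items instead of stalling the whole flow. This is a simplified sketch under stated assumptions, not River's implementation: the function name, the fixed per-item service times, and the two-consumer setup are hypothetical.

```python
import heapq


def distributed_queue(items, service_times):
    """Hand each item to the consumer that becomes free first.

    service_times maps a consumer name to its (assumed constant) time per item.
    """
    free_at = [(0.0, name) for name in sorted(service_times)]
    heapq.heapify(free_at)
    assignment = {name: [] for name in service_times}
    for item in items:
        t, name = heapq.heappop(free_at)  # earliest-free consumer
        assignment[name].append(item)
        heapq.heappush(free_at, (t + service_times[name], name))
    return assignment


# One consumer is three times slower than the other.
assignment = distributed_queue(range(8), {"fast": 1.0, "slow": 3.0})
```

A static round-robin split would give each consumer four items and leave the fast one idle while the slow one finishes. Here the split follows the observed rates without any tuning.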

  43. Q River

  44. Multi-Operator Query Plans • Deal with pipelines of commutative operators • Adapt at finer granularity than current DBMSs

  45. Continuous Adaptivity: Eddies • A pipelining tuple-routing iterator • just like join or sort or exchange • Works best with other pipelining operators • like Ripple Joins, online reordering, etc. • Ron Avnur & Joe Hellerstein [Figure: Eddy]

  46. Continuous Adaptivity: Eddies • How to order and reorder operators over time • based on performance, economic/admin feedback • Vs. River: • River optimizes each operator “horizontally” • Eddies optimize a pipeline “vertically” [Figure: Eddy]

  47. Continuous Adaptivity: Eddies • Adjusts flow adaptively • Tuples routed through ops in different orders • Visit each op once before output • Naïve routing policy: • All ops fetch from eddy as fast as possible • A la River • Turns out, doesn’t quite work • Only measures rate of work, not benefit • Lottery-based routing • Uses “lottery scheduling” to address a bandit problem • Kris Hildrum, et al. looking at formalizing this • Various AI students looking at Reinforcement Learning • Competitive Eddies • Throw in redundant data access and code modules!
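
A rough sketch of lottery-based routing for commutative filters follows. The slide does not spell out the ticket accounting, so the rule used here is an assumption: an operator gains a ticket when it receives a tuple and gives one back when it returns the tuple to the eddy, so operators that drop many tuples hold more tickets and tend to be visited earlier. The `Eddy` class, its method names, and the seeded random generator are hypothetical.

```python
import random


class Eddy:
    """Routes each tuple through every filter once, in a lottery-chosen order."""

    def __init__(self, ops, seed=0):
        self.ops = ops  # name -> predicate; filters are assumed commutative
        self.tickets = {name: 1 for name in ops}
        self.rng = random.Random(seed)

    def _pick(self, candidates):
        # Lottery: chance of winning is proportional to tickets held.
        weights = [self.tickets[name] for name in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def route(self, tup):
        remaining = list(self.ops)
        while remaining:
            name = self._pick(remaining)
            remaining.remove(name)
            self.tickets[name] += 1  # credit: operator consumed a tuple
            if not self.ops[name](tup):
                return False  # tuple dropped; operator keeps the ticket
            # debit: operator returned the tuple (floor of 1 keeps weights positive)
            self.tickets[name] = max(1, self.tickets[name] - 1)
        return True  # visited every operator once, so the tuple is output

    def run(self, tuples):
        return [t for t in tuples if self.route(t)]


eddy = Eddy({"even": lambda x: x % 2 == 0, "small": lambda x: x < 10})
out = eddy.run(range(20))
```

The output is the same whatever order the lottery picks, because the filters commute; only the amount of work done per tuple changes. Unlike the naive policy, the ticket counts reflect how often an operator removes tuples, not merely how fast it runs.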
