Why Search Engines are used increasingly to Offload Queries from Databases. Bjørn Olstad CTO FAST Search & Transfer Adjunct Prof. The Norwegian University of Science & Technology Email: bjorn.olstad@fast.no Cell: +47 48011157. The Typo Problem. Talent Offloading .
 
                
Why Search Engines are used increasingly to Offload Queries from Databases • Bjørn Olstad • CTO FAST Search & Transfer • Adjunct Prof. The Norwegian University of Science & Technology • Email: bjorn.olstad@fast.no • Cell: +47 48011157
The RDBMS Experience • High input barrier • “You are viewing 5 random jobs out of 2461 jobs in total....”
1 CareerBuilder: Use scenario, part 1 • 30956 jobs
2 CareerBuilder: Use scenario, part 2 • 1084 jobs
3 CareerBuilder: Use scenario, part 3 • 30 jobs
CareerBuilder: Use scenario, part 4 • 5 jobs • 30956 → 5 targeted jobs in 3 steps
Challenger Shuttle Launch Fax to NASA from contractor with O-ring concern
ESP: Cleansing, Mining, Relevance and Discovery • IYP: A Disruptive Change • Taylor or Gibson guitar? • Good local offers? • Compare offerings • Phone / Directions • BTW: I’m using my iPAQ • What is the phone number to Will’s Barber shop? • Product & Services • Blogs++ • Company web site
Search ISVs: A Disruptive Change • Siebel 2000: “my” CRM Application; search is a tactical afterthought • Siebel 2005: “my” CRM Application, Information Access Layer, 3rd party content; search is a strategic enabler
Revisit the Assumptions … • Relational algebra: large, but “finite” data sets; structured data • Search & Explore focused: “infinite” data sets; unstructured & structured • 80% unstructured • [Figure: data volume (GIGABYTES) over time. Milestones: cave paintings, bone tools 40,000 BCE; writing 3500 BCE; paper 105 C.E.; printing 1450; electricity, telephone 1870; transistor 1947; computing 1950; Internet (DARPA) late 1960s; the Web 1993. Data labels: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B. Database labels: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03]
Extreme Capabilities? • Feeding/streaming, transaction, retrieval or analytics centric? • Content size: M, L, VL, VVVL or Vn∞ L? • Schema centric, Semi-structured XML, Text, Agnostic? • Fuzzy & Value vs. Binary & Completeness? • Discovery primitives? • User interaction part of design target?
Query Latency: RDBMS vs ESP • Test data: structured data, 5 million records, 13 fields per record • Structured queries: 22 SQL queries (representative in ERP) • The result: • #1: FAST ESP w/ disk: mean = 99 ms, st.dev. = 36 ms • #2: Oracle w/ memory mapping: mean = 4 057 ms, st.dev. = 9 368 ms
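The latency figures on this slide can be put side by side with a little arithmetic. The sketch below only reuses the four numbers quoted on the slide; the variable names and the choice of the coefficient of variation as a spread measure are illustrative assumptions, not part of the original benchmark.

```python
# Figures quoted on the slide, in milliseconds.
esp_mean_ms, esp_std_ms = 99, 36
rdbms_mean_ms, rdbms_std_ms = 4057, 9368


def coefficient_of_variation(mean, std):
    """Relative spread: standard deviation expressed as a fraction of the mean."""
    return std / mean


# Ratio of the mean latencies (RDBMS over ESP).
speedup = rdbms_mean_ms / esp_mean_ms

# Relative spread for each system (assumed metric, chosen for illustration).
esp_cv = coefficient_of_variation(esp_mean_ms, esp_std_ms)
rdbms_cv = coefficient_of_variation(rdbms_mean_ms, rdbms_std_ms)

print(f"Mean latency ratio (RDBMS / ESP): {speedup:.1f}x")
print(f"Relative spread, ESP:   {esp_cv:.2f}")
print(f"Relative spread, RDBMS: {rdbms_cv:.2f}")
```

A standard deviation larger than the mean, as in the RDBMS row, suggests a heavy tail of slow queries, which is consistent with the deck's emphasis on predictable response times.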
Queries Per Second (QPS): RDBMS vs ESP • Identical HW: single node, 2 CPU, 4GB RAM, 3 SCSI disks • Identical data: auction data from eBay, 3.6 million docs • Identical queries: 200 queries defined by Oracle
Relational Model • Disruptive Change • Star, snowflake schemas++ • Cubes / datamarts++ • Incremental fixes to painful shortcomings • Adds complexity • Queries that fit The Model • Queries that don’t fit The Model • Alternative I • Alternative II • Schema agnostic • Scalable ad-hoc querying • BLOBS → Contextual Insight • Real-time fusion of disparate data models • Massive fault tolerant scalability
Extreme Capabilities: ESP Design Targets • Contextual Insight • Value/Noise SNR • User Interaction • Contextual Refinement • Powering Search Derivative Applications (SDAs) • Game Changer driven by Extreme Retrieval and on-the-fly Analytics
Database Query Offloading Example: AutoTrader.com (Car Dealers - Product Supply) • RDBMS: • HW-cost: $320K (32 CPU on 4 Sun servers) • 90% sub-second query response; average = 12 s for the rest …. • Relevance = Sorting • 5 FTE to maintain • ESP: • HW-cost: $90K • 100% sub-second query response • Flexible relevance and discovery • 0.5 FTE to maintain
Content Scalability: RDBMS vs ESP • Examples of ESP deployments • Compliance case: • 50B documents @ 80k average • → 4 PB (around 100 web indexes) • Storage: • Intelligent content addressable storage • XML metadata and full content • EMC Centera: N * 256TB (N=1..400) • Webmining – Webfountain: • 60.000 : 1 in query capacity (ESP : DB)
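The 4 PB figure for the compliance case follows from the document count and average size. The sketch below works through it, assuming "80k" means 80 kilobytes in decimal units and reading "N * 256TB" as N storage units of 256 TB each; both readings are assumptions, since the slide does not spell them out.

```python
# Decimal storage units.
TB = 10**12
PB = 10**15

# Figures from the slide.
documents = 50_000_000_000   # 50B documents
avg_doc_bytes = 80_000       # assumption: "80k average" = 80 kB per document
unit_tb = 256                # capacity step quoted as N * 256TB
max_units = 400              # N = 1..400

total_bytes = documents * avg_doc_bytes
total_pb = total_bytes / PB

# Smallest N whose capacity covers the corpus (ceiling division on integers).
units_needed = -(-total_bytes // (unit_tb * TB))

# Capacity at the largest quoted N.
max_capacity_pb = max_units * unit_tb * TB / PB

print(f"Corpus size: {total_pb:.1f} PB")
print(f"256 TB units needed: {units_needed}")
print(f"Capacity at N={max_units}: {max_capacity_pb:.1f} PB")
```

Under these assumptions the corpus fits well inside the quoted N range, which is the point the slide makes about content scalability.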
Intelligent Storage: Storage and Search Unite • Discover • Simple • Scalable • Secure
From ACCESS To INSIGHT • Contextual Search • “Best of Web”: Recommender / Authority • “Best of Enterprise”: Linguistic / Statistic • Any new suspicious financial transaction patterns? • Where is the email from Peter about ROI analysis? • FIND • EXPLORE • Contextual Relevance • Contextual Navigation • Contextual fact discovery • On-the-fly meta-data analysis
Turning around the Pyramid: HBZ.de – Leading German Library Service Center • From: Librarians • To: Researchers • Single Field Search • Querying • FAST ESP • WWW (HTML, XML, WML, JavaScript) • SQL • LIB … • DB DB DB DB DB • STRUCTURED
ESP @ SCOPUS • >200M articles / 180M citations • 180TB capacity / 14000 journals David Goodman standing up and declaring in public, that Scopus is the best-designed database he's ever seen …
Relevance Drives Revenue • Search Reduces Clicks to Purchase and Browsing … and Drives Revenue • Reduced # of clicks to buy content from > 4 to < 2 • 50% reduction in ringtone browsing • 100% increase in search • 20% increase in ringtone revenue • [Charts, Week 1 to Week 10, with “Launched search” marked: clicks to purchase and search page views per sale; revenue and browsing as % change]
ØKOKRIM Business Analytics: Processing of real-time streams • Example: Norwegian Customs Foreign Exchange Transaction Monitoring • SECURITY ACCESS MODULE • ACL Monitor • User Monitor • Real-time Registration • Queries • Message Queue • Results • Alerts • Database connector • Transaction Log • Data Validation • Firewall • Firewall
Business Intelligence: ESP vs. RDBMS Technology • OBSERVATION: The Enterprise Search Platform (ESP), a relatively new concept, integrating advanced technologies typically associated with search engines, database tools, and analytical systems, is fast becoming able to solve modern business intelligence problems (using both structured and unstructured data) in a way that is fundamentally different from, and ultimately superior to, that of other currently available analytical or database software. • PREDICTION: Enterprise Search Platform and search centric application technology represents a true paradigm shift in the way data will be stored, analyzed and reported on in the future. Resulting realignments in the marketplace may be both rapid and tumultuous. - Chief strategist, leading BI vendor
If your only tool is a hammer .... ... every problem looks like a nail
Text → Structure • <Category>FINANCIAL</Category> <Author>George Stein</Author> • BC-dynegy-enron-offer-update5 • Dynegy May Offer at Least $8 Bln to Acquire Enron (Update5) • By George Stein • SOURCE c.2001 Bloomberg News • BODY • <Company>Dynegy Inc</Company> <Person>Roger Hamilton</Person> <Company>John Hancock Advisers Inc.</Company> <PersonPositionCompany> <OFFLEN OFFSET="3576" LENGTH="63" /> <Person>Roger Hamilton</Person> <Position>money manager</Position> <Company>John Hancock Advisers Inc.</Company> </PersonPositionCompany> ……. ``Dynegy has to act fast,'' said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. ``If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes.'' Moody's Investors Service lowered its rating on Enron's bonds to ``Baa2'' and Standard & Poor's cut the debt to ``BBB.'' in the past two weeks. …… Fact <Company>Enron Corp</Company> <Company>Moody's Investors Service</Company> <CreditRating> <OFFLEN OFFSET="3814" LENGTH="61" /> <Company_Source>Moody's Investors Service</Company_Source> <Company_Rated>Enron Corp</Company_Rated> <Trend>downgraded</Trend> <Rank_New>Baa2</Rank_New> <__Type>bonds</__Type> </CreditRating> Event
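The Text → Structure idea on this slide, wrapping recognized names in typed tags, can be illustrated with a very small dictionary-based tagger. This is only a sketch: the `GAZETTEER` table and the `tag_entities` helper are hypothetical, and a dictionary lookup stands in for whatever linguistic extraction ESP actually performs, which the slide does not describe.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical lookup table of known names and their entity types.
GAZETTEER = {
    "John Hancock Advisers Inc.": "Company",
    "Roger Hamilton": "Person",
    "Dynegy": "Company",
    "Enron": "Company",
}


def tag_entities(text, gazetteer):
    """Wrap every known name in <Type>...</Type>, XML-escaping all other text."""
    if not gazetteer:
        return escape(text)
    # Try longer names first so they win over shorter names at the same position.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    pieces, pos = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[pos:match.start()]))
        tag = gazetteer[match.group(0)]
        pieces.append(f"<{tag}>{escape(match.group(0))}</{tag}>")
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


sentence = (
    "said Roger Hamilton, a money manager with "
    "John Hancock Advisers Inc., which sold its Enron shares"
)
print(tag_entities(sentence, GAZETTEER))
```

Relations such as PersonPositionCompany or CreditRating need more than name lookup, since they link several entities found in the same passage; the sketch covers only the first, entity-level step.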
The BI “hammer” Approach • Document • Vector: Antibiotics, Peptidyl, Eubacteria, RNA, Mg, … • SVD Analysis • ( λ1, λ2, ..., λn ) • { λ1, λ2, ..., λn, Structured attributes }
Contextual Refinement: ETL and Semantic understanding unite • Direct access to RDBMSs for info from some Telcos • XML feed from other Telcos • Flat files (CSV or fixed) from the ’laggards’ • Logic for cleansing • ESP lookup • Ordered hits (by quality) • Cleansed data to ESP (XML) • Clean data • Ambiguous data (close hits or unidentified) • ’Error’ database for manual inspection, correction, storage/learning • Master database for persistent storage
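The cleansing flow on this slide (look up each incoming record, order the hits by quality, then route clean data to the master database and ambiguous data to the error database) can be sketched with Python's standard `difflib`. Everything here is an assumption made for illustration: the similarity measure, the `HIGH` and `LOW` thresholds, the sample names, and the `route` helper. The slide does not say how ESP scores its hits.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds separating clean hits, close hits and misses.
HIGH, LOW = 0.9, 0.6


def normalize(name):
    """Lower-case and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())


def ranked_hits(candidate, master):
    """Return (score, master_name) pairs ordered by quality, best first."""
    cand = normalize(candidate)
    scored = [
        (SequenceMatcher(None, cand, normalize(name)).ratio(), name)
        for name in master
    ]
    return sorted(scored, reverse=True)


def route(candidate, master):
    """Classify a record as 'clean', 'close_hit' or 'unidentified'."""
    hits = ranked_hits(candidate, master)
    if not hits:
        return "unidentified", None
    score, best = hits[0]
    if score >= HIGH:
        return "clean", best        # goes to the master database
    if score >= LOW:
        return "close_hit", best    # goes to the error database for inspection
    return "unidentified", None     # also goes to the error database


master_names = ["Will's Barber Shop", "Gibson Guitars"]
for record in ["Will's  Barber shop", "Will Barber", "zzzz"]:
    print(record, "->", route(record, master_names))
```

Keeping the best candidate alongside a close hit gives the manual inspection step a suggested correction, which matches the "storage/learning" role the slide assigns to the error database.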
Contextual Insight: Query-time fact analysis @ sub-document level • “…entry probe carried to [Saturn]’s moon Titan as part of the…” • Intent • Concepts
Contextual Navigation: ThisIsTravel • Automated visitor ratings
Revisit the Assumptions … • Scalable Search • 80% unstructured • [Figure: data volume (GIGABYTES) over time. Milestones: cave paintings, bone tools 40,000 BCE; writing 3500 BCE; paper 105 C.E.; printing 1450; electricity, telephone 1870; transistor 1947; computing 1950; Internet (DARPA) late 1960s; the Web 1993. Data labels: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B. Database labels: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03]