
Key Challenges in Information Processing

James Hamilton, JamesRH@microsoft.com, Microsoft SQL Server, 2002.03.01


Presentation Transcript


  1. Key Challenges in Information Processing
     James Hamilton, JamesRH@microsoft.com
     Microsoft SQL Server
     2002.03.01

  2. Unsolved Challenges
     • Availability shows only incremental progress
     • Security broken & too hard to manage
     • Weakly structured data poorly supported or exploited
     • Writing multi-tiered apps too hard
     • Data intensive mid-tiers need more DB help
     • Scalability over perf & big-iron

  3. Availability: Largely unsolved problem
     • 1985 Tandem study (Gray):
     • Administration: 42% downtime
     • Software: 25% downtime
     • Hardware: 18% downtime
     • 1990 Tandem study (Gray):
     • Software: 62%
     • Administration: 15%
     • Most studies have admin contribution much higher
     • Observations:
     • H/W downtime contribution trending to zero
     • Software & admin costs dominate & growing
     • We’re still looking at 10 to 15 year-old research

  4. Availability: Cost in dollars/hour
     • Brokerage operations: $6,450,000
     • Credit card authorization: $2,600,000
     • eBay (1 outage 22 hours): $225,000
     • Amazon.com: $180,000
     • Package shipping services: $150,000
     • Home shopping channel: $113,000
     • Catalog sales center: $90,000
     • Airline reservation center: $89,000
     • Cellular service activation: $41,000
     • On-line network fees: $25,000
     • ATM service fees: $14,000
     From Dave Patterson talk at HPTS 2001 -- Sources: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. "... survey done by Contingency Planning Research."
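
The per-hour figures above can be turned into a cost per outage or per year with simple arithmetic. The sketch below is illustrative only: it assumes the eBay figure is a per-hour cost that applies uniformly across the 22-hour outage, and the 99.9% availability level is an assumed example, not a number from the slides. The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def outage_cost(cost_per_hour: float, hours: float) -> float:
    """Cost of an outage, assuming a constant cost per hour."""
    return cost_per_hour * hours


# Figures taken from the slide: $225,000 per hour, one 22-hour outage.
EBAY_COST_PER_HOUR = 225_000
EBAY_OUTAGE_HOURS = 22

print(outage_cost(EBAY_COST_PER_HOUR, EBAY_OUTAGE_HOURS))  # 4950000

# Assumed example: "three nines" availability (not from the slides).
print(round(annual_downtime_hours(0.999), 2))  # 8.76
```

Under these assumptions, even a service with 99.9% availability accumulates several hours of downtime a year, which is why the per-hour figures matter.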

  5. Availability: Admin still the problem
     • Administrators expensive
     • Admin costs dominate H/W & S/W costs (5x or more)
     • Administrators make mistakes
     • Admin #1 or #2 cause of downtime
     • Big problem yet little research focus:
     • Still few data points available:
     • Most systems houses won’t publish ... need research
     • No benchmarks:
     • Benchmarks drive industry & systems research
     • Goal: Server appliance model:
     • Auto-tuning, pluggable server-side resources
     • IBM SMART, Microsoft index tuning wizard, etc.
     • Dave Patterson, Aaron Brown, Armando Fox, ...
     • More help needed

  6. Availability: the S/W is broken
     • Even server-side software is BIG:
     • Windows 2000: over 50 mloc
     • DB: 1.5+ mloc
     • SAP: 37 mloc (4,200 S/W engineers)
     • Tester to developer ratios above 1:1
     • Quality per unit line only incrementally improving
     • Current massive testing investment not solving problem
     • New approach needed:
     • Assume S/W failure inevitable
     • Redundant, self-healing systems right approach
     • Tandem process-pair work good but getting fairly old ... progress?

  7. Security: Securing systems too hard
     • “Less than 0.0025% of corp revenue invested in security” – Richard Clarke, Special security advisor to president
     • Data loss, intentional data & systems corruption
     • Clearly under-reported problem
     • S/W vulnerabilities rampant:
     • Buffer overruns, stack smashing, code insertion, SQL insertion, elevation of privs, ...
     • Programmers being more careful doesn’t solve problem
     • Most systems misconfigured:
     • Security systems too complex & hard to admin
     • Research needed: Autonomous threat detection
     • Better tools to detect, correct, & prevent S/W security vulnerabilities
     • Monitor all measurable system metrics:
     • Detecting new threats & misconfigurations
     • Track execution profiles: detect changes: drive alerts, auto-config, reports to vendor, upgrade s/w, ...
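
As an illustration of the "SQL insertion" vulnerability listed above, here is a minimal sketch using Python's standard `sqlite3` module. The table, column names, and sample data are hypothetical and not from the talk. It contrasts a query built by string concatenation with one that uses a bound parameter, the kind of mechanical safeguard that does not rely on programmers simply being more careful.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the input is spliced directly into the SQL text.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn: sqlite3.Connection, name: str) -> list:
    # The input is passed as a bound parameter and is never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = make_db()
attack = "x' OR '1'='1"
print(sorted(lookup_unsafe(conn, attack)))  # every secret leaks
print(lookup_safe(conn, attack))            # no row matches the literal string
```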

  8. Unstructured Data: Mostly not stored in DB
     • All data has some schema but not always fully known nor affordable to pre-declare:
     • Most data in unstructured stores with text search
     • DB community is losing
     • Much research work on XML focused upon:
     • Mapping XML to relational schemas
     • Leverages existing relational IQ but not as flexible
     • New, non-relational (native XML) stores
     • Storing natively doesn’t leverage DB investment
     • Mostly mid-tier data integration servers
     • Research potential:
     • Native stores leveraging existing infrastructure esp. cost-based optimizers, storage engines, & utilities
     • IR work progressing but little integration into DB
     • Integrating IR work into DB W/O required schema, ability to exploit if there, ability to discover/infer if not
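
The idea of discovering or inferring a schema when none is declared can be sketched in a few lines. This is a toy illustration, not a method described in the talk: it assumes records arrive as flat Python dictionaries, and the function name and sample documents are hypothetical. It reports, per field, which value types were seen and what fraction of records carry the field.

```python
from collections import defaultdict


def infer_schema(records: list) -> dict:
    """Summarize field names, observed value types, and coverage."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for field, value in record.items():
            types[field].add(type(value).__name__)
            counts[field] += 1
    total = len(records)
    return {
        field: {"types": sorted(types[field]), "coverage": counts[field] / total}
        for field in types
    }


# Hypothetical weakly structured documents: "year" is optional and
# inconsistently typed.
docs = [
    {"title": "a", "year": 2001},
    {"title": "b"},
    {"title": "c", "year": "2002"},
]
schema = infer_schema(docs)
print(schema["title"])
print(schema["year"])
```

A store could use a summary like this to decide which fields are regular enough to index or optimize over, while still accepting records that do not conform.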

  9. Multi-tiered apps: we’re not helping
     • Many high scale multi-tiered apps still hand crafted
     • Needed: Object access layer, data cache, queuing, query compiler & optimizer, data directed routing, security, ...
     • Problem not adequately solved by industry
     • Integration with server-tier DB advantages:
     • ACID relaxation driven by attributes on apps or data
     • Relaxed models with auto-cache population & mgmt
     • Query parsing for data directed routing
     • Want to parse once & accept same lang as backend
     • Exploit optimizer: model full mid-tier to back-end costs
     • Where to run joins, functions, aggs, etc.
     • Need security integration W/O fully provisioning backend
     • Data intensive mid-tiers are a DB & TP problem:
     • Solve with DB tech & integrate with backend DB
     • Componentized DB for mid-tier use one approach
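
Two of the mid-tier pieces named above, data directed routing and a data cache, can be sketched together. This is a simplified illustration under stated assumptions: keys are integers partitioned by range, the backend names and the `fetch` callback are hypothetical, and the cache is a plain read-through dictionary with no invalidation or consistency handling, which a real mid-tier would need.

```python
import bisect


class RangeRouter:
    """Route a key to the backend that owns its key range."""

    def __init__(self, boundaries: list, backends: list):
        # boundaries[i] is the first key owned by backends[i + 1].
        if len(backends) != len(boundaries) + 1:
            raise ValueError("need exactly one more backend than boundaries")
        self.boundaries = sorted(boundaries)
        self.backends = list(backends)

    def route(self, key: int) -> str:
        return self.backends[bisect.bisect_right(self.boundaries, key)]


class CachingMidTier:
    """Read-through cache in front of routed backend fetches."""

    def __init__(self, router: RangeRouter, fetch):
        self.router = router
        self.fetch = fetch  # hypothetical callable: (backend, key) -> value
        self.cache = {}

    def get(self, key: int):
        if key not in self.cache:
            backend = self.router.route(key)
            self.cache[key] = self.fetch(backend, key)
        return self.cache[key]


calls = []


def fake_fetch(backend: str, key: int) -> str:
    # Stand-in for a real backend query; records each call for inspection.
    calls.append((backend, key))
    return f"{backend}:{key}"


router = RangeRouter([1000, 2000], ["db0", "db1", "db2"])
tier = CachingMidTier(router, fake_fetch)
print(tier.get(999))   # served by db0
print(tier.get(1000))  # served by db1
print(tier.get(999))   # served from the cache, no backend call
print(len(calls))      # 2
```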

  10. Scalability: perf not the problem
     • Focus still on performance rather than scalability:
     • Clusters only “nearly” work
     • Must buy biggest iron & get most from it
     • Research goal: Server appliances
     • Gray’s servers by the brick
     • Brick includes disk, memory, & CPU resources
     • Only admin actions required:
     • Add brick to, or defect from, cluster
     • Data redundancy (potentially) on geo-scale:
     • Adapts to access patterns & available bandwidth
     • If zero-admin clusters actually worked & scaled:
     • Performance would be a secondary issue
     • The admin problem would nearly go away
     • The S/W quality problem greatly simplified
     • Heisenbugs solved via retry and redundancy
     • Would shift investment dollars from H/W & admin to S/W (where it belongs)
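
The claim that Heisenbugs can be handled by retry and redundancy can be made concrete with a small sketch. It assumes each replica is a callable that either returns a result or raises on a transient fault. The `FlakyReplica` class is a hypothetical stand-in that fails a fixed number of times, so the example is deterministic. This illustrates the general idea only, not the Tandem process-pair design.

```python
def call_with_redundancy(replicas, request, attempts_per_replica=2):
    """Try each replica in turn, retrying transient failures on each."""
    last_error = None
    for replica in replicas:
        for _ in range(attempts_per_replica):
            try:
                return replica(request)
            except Exception as exc:  # broad on purpose for this sketch
                last_error = exc
    raise RuntimeError("all replicas failed") from last_error


class FlakyReplica:
    """Hypothetical replica that fails its first `failures` calls."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self, request: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("transient fault")
        return request.upper()


# A transient fault on one replica is masked by a retry.
print(call_with_redundancy([FlakyReplica(1)], "ok"))  # OK

# A replica that keeps failing is masked by falling over to a second one.
print(call_with_redundancy([FlakyReplica(5), FlakyReplica(0)], "ok"))  # OK
```

Retry only helps when the fault is transient and the request is safe to repeat; the sketch assumes both.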
