
Top Five Data Challenges for the Next Decade

Top Five Data Challenges for the Next Decade. Dr. Pat Selinger IBM Fellow and VP, Area Strategist. The World of Data is Changing. Hardware gives us more choices than ever before Cost of labor is rising Data isn’t all (or even mostly) in the database Data access paradigms evolving


Presentation Transcript


  1. Top Five Data Challenges for the Next Decade Dr. Pat Selinger IBM Fellow and VP, Area Strategist

  2. The World of Data is Changing • Hardware gives us more choices than ever before • Cost of labor is rising • Data isn’t all (or even mostly) in the database • Data access paradigms evolving • Customers want integration and FAST access to the data they want

  3. Research Challenges - Examples: keyword-based search engines; spam

  4. Research Challenges – Examples

  5. Research Challenge – Examples Q: Can you spell your name, please? A: P.A.T. Q: One more time, please… A: P… A… T… Q: Sorry… connecting you to a live operator… one moment, please.

  6. The World of Data is Changing • Hardware gives us more choices than ever before • Cost of labor is rising • Data isn’t all (or even mostly) in the database • Data access paradigms evolving • Customers want integration and FAST access to the data they want

  7. 1975: • 1 MIPS processor • Mainframe uniprocessor • 14 inch disks • 24 bit addresses • 256K real memory • Channel to channel connections • Strings and numbers. Today: • 2+ GigaHertz processors • 32 and 64-way SMPs • RAID disks, logical volume managers • 64 bit addresses • 100+ GB real memory • Gigabit Ethernet, Infiniband supporting clusters of systems • Rich data (audio, documents, XML, …). Issue: HW and SW systems have changed since RDB was invented. Information mgmt architecture hasn’t kept pace.

  8. Issue: Data Volumes Exploding. Common database sizes by workload: • Transactions: 100-500 GB • Warehouses: 100s of GB to 10s of TB • Marts: 1-50 GB • Mobile: 100s of MB • Pervasive: 100s of KB. (Chart: common database sizes by workload, 2005 versus 2010, with growth multipliers ranging from 10X to 10,000X.) The world produces 250MB of information every year for every man, woman and child on earth. 85% of the data is unstructured.

  9. Storage Trends Aid this Data Explosion • Storage areal density CGR (compound growth rate) continues at 100% per year to >100 Gbit/in². • Storage is now significantly cheaper than paper.

  10. Issue: CPU performance is growing by 100% and I/O performance by 5% every year. (Chart: CPU versus disk performance.)
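
To make the size of this gap concrete, the following is a minimal arithmetic sketch. It assumes both figures are annual rates that compound year over year, which the slide does not state explicitly; the function name and default values are illustrative only.

```python
def performance_gap(years, cpu_growth=1.00, io_growth=0.05):
    """Ratio of cumulative CPU improvement to cumulative I/O improvement.

    Assumes both rates compound annually (an assumption, not stated on the slide).
    """
    cpu = (1 + cpu_growth) ** years  # CPU doubles each year at 100% growth
    io = (1 + io_growth) ** years    # I/O improves by 5% each year
    return cpu / io


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the CPU/I-O gap is about {performance_gap(years):.1f}x")
```

Under these assumptions the gap widens by a factor of roughly 1.9 each year, which is why techniques that hide or avoid I/O matter more over time.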

  11. Solution: Overlapping, deferring or avoiding I/O. Examples: • Multi-dimensional Clustering • Multiple bufferpools • Prefetching into Bufferpools • Page Cleaners use Async I/O • Indexes with added columns or tables in indexes • Index ANDing and ORing • Pushdown predicates • Function-shipping on clusters • Materialized Query Tables • Compression • And much more…
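
As an illustration of one item in this list, index ANDing and ORing can be sketched as set operations on row identifiers (RIDs): each single-column index yields the RIDs matching one predicate, and the RID sets are combined before any data page is read. This is a toy in-memory sketch, not DB2's implementation; the table contents and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical table; the row's position in the list serves as its RID.
ROWS = [
    {"year": 1997, "nation": "Canada", "color": "yellow"},
    {"year": 1997, "nation": "Mexico", "color": "yellow"},
    {"year": 1998, "nation": "Canada", "color": "yellow"},
    {"year": 1998, "nation": "Mexico", "color": "blue"},
]


def build_index(rows, column):
    """Map each distinct column value to the set of RIDs holding it."""
    index = defaultdict(set)
    for rid, row in enumerate(rows):
        index[row[column]].add(rid)
    return index


def index_and(*rid_sets):
    """RIDs satisfying every predicate (intersection)."""
    return set.intersection(*rid_sets)


def index_or(*rid_sets):
    """RIDs satisfying at least one predicate (union)."""
    return set.union(*rid_sets)


year_idx = build_index(ROWS, "year")
nation_idx = build_index(ROWS, "nation")

# WHERE year = 1997 AND nation = 'Canada'
and_rids = index_and(year_idx.get(1997, set()), nation_idx.get("Canada", set()))
# WHERE year = 1997 OR nation = 'Canada'
or_rids = index_or(year_idx.get(1997, set()), nation_idx.get("Canada", set()))

print(sorted(and_rids), sorted(or_rids))
```

Only the rows whose RIDs survive the combination need to be fetched, which is how the technique avoids I/O.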

  12. Multi-Dimensional Clustering via Cells and blocks. (Figure: a grid with a year dimension, a nation dimension, and a color dimension; rows such as (1997, Canada, yellow), (1997, Mexico, yellow), (1998, Canada, yellow), and (1998, Mexico, blue) are grouped by their dimension values, with one cell highlighted as the cell for (1997, Canada, yellow).) Each cell contains one or more blocks.
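
The cell-and-block idea can be sketched as follows, assuming a cell is identified by the tuple of a row's dimension values and a block is a fixed-capacity group of rows within one cell. The block size, function names, and sample rows are hypothetical and chosen only for illustration; this is not DB2's storage layout.

```python
from collections import defaultdict

BLOCK_SIZE = 2  # rows per block; an arbitrary illustrative value


def cluster(rows, dimensions, block_size=BLOCK_SIZE):
    """Group rows into cells keyed by dimension values; each cell is a list of blocks."""
    cells = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        blocks = cells[key]
        # Start a new block when the cell is empty or its last block is full.
        if not blocks or len(blocks[-1]) >= block_size:
            blocks.append([])
        blocks[-1].append(row)
    return cells


def scan(cells, dimensions, **wanted):
    """Yield rows from only those cells whose dimension values match the predicates."""
    for key, blocks in cells.items():
        cell = dict(zip(dimensions, key))
        if all(cell[d] == v for d, v in wanted.items()):
            for block in blocks:
                yield from block


DIMENSIONS = ("year", "nation", "color")
rows = [
    {"year": 1997, "nation": "Canada", "color": "yellow", "sale": 10},
    {"year": 1997, "nation": "Canada", "color": "yellow", "sale": 20},
    {"year": 1997, "nation": "Canada", "color": "yellow", "sale": 30},
    {"year": 1998, "nation": "Mexico", "color": "blue", "sale": 40},
]
cells = cluster(rows, DIMENSIONS)
canada_sales = [r["sale"] for r in scan(cells, DIMENSIONS, nation="Canada")]
print(len(cells[(1997, "Canada", "yellow")]), canada_sales)
```

Because rows with the same dimension values sit together, a query on any combination of dimensions touches whole blocks rather than scattered pages.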

  13. Research Challenge #1: Scalability: Massive Growth in Multiple Dimensions • Scaling directions: • Petabytes of storage • “Fire hose” of data continuously loading • Millions of users • Millions of processors • Larger and more complex data objects • Systems being only partly online • Partial answers, relevancy ranking. (Chart: from 1000 processors to 10**6 processors, toward unlimited CPUs.)

  14. Research Challenge #1 • Design our DBMSs to keep pace with HW, SW, data changes • Scale without sacrificing user-visible availability or performance. • While always inventing new techniques to “cover up” the ever-increasing gap between processor speeds and disk speeds, e.g. exploit large memories

  15. The World of Data is Changing • Hardware gives us more choices than ever before • Cost of labor is rising • Data isn’t all (or even mostly) in the database • Data access paradigms evolving • Customers want integration and FAST access to the data they want

  16. Cost of Labor Increasing While Demands Rising. Changing Ecosystem: • Lower Cost Storage • Tighter Integration • Structured and Unstructured Data • Low Cost Clusters • Higher Availability • More Dynamic Workloads. Rising Costs: • Labor-intensive management effort • Scarcity of skilled DBAs

  17. Autonomic Computing: Deliver significantly lower total cost of ownership • Cost of application development, time to solution delivery • Labor cost and skills availability for database administration and management

  18. Autonomic Capabilities Available Today in DB2 for Linux, Unix, Windows
  Available in v8.1.x: • Up and running • Configuration advisor: sets dozens of the most critical parameters in seconds (heaps, process model, optimizer, and more). • Automated physical database design • Design Advisor: automated index selection. • Runtime • Industry leading query optimizer, automatic high quality plan selection. • Query Patroller workload manager: policy controlled management of SQL/ODBC; query throughput control with QP query classes; usage trending reports with QP Historical Analysis; real-time monitoring and control of current running queries. • Self tuning LOAD • Adaptive utility throttling for Backup: allows maintenance to consume as much resource as possible without impacting the user workload throughput beyond policy specified constraint. • Control Center scheduler: Task Center (within CC) can schedule/automate execution of OS or DB2 scripts. • Self healing, availability and diagnostics • Health Monitor: ensures proper database operation by constantly monitoring key indicators; notification of alerts by e-mail, page, CLP, GUI, SQL. • Health Center tooling provides graphical tools to drill down on details. • Fault monitor: automatically restarts DB2. • Automatic Index Reorganization: automatically defragments leaf pages. • Automatic continual I/O consistency checking
  New in DB2 UDB v8.2: • Automated physical database design • Design Advisor extensions • Combined (or individual) recommendations for indexes, MQTs, MDC, and DPF partitioning. • Automatic workload compression. • 4 workload capture techniques (package cache, Query Patroller, event monitor, text file). • Exploits sampling and multi-query optimization • Runtime • Automated database maintenance • Automation of Backup, Runstats, Reorg • Statistics collection is online, throttled, with new locking protocols for non-intrusive collection. • Policy expression lets users select subset of schema, and available times of day. • Advanced algorithms detect “when” maintenance is really needed. • Automatic statistics profiling • Determines what statistics should be collected. • Automatic detection of column groups allows query optimizer to model correlation. • First “industrial” version of “LEO” technology. • eWLM integration • Performance analysis for the IBM stack. • Utility throttling for Backup, Runstats, Rebalance. • The v8.1.2 BACKUP throttling technology is extended to a broader set of administrative utilities. • Self tuning BACKUP • Up to 4x faster than v8.1.x defaults • Simplified memory management • Heaps automatically grow when constrained • Self healing, availability and diagnostics • Common Logging across IBM software products. • HADR with automatic client reroute • Extensions to Health Monitor • Increase recommendations for user response to alerts. • Self protecting • Data Encryption • Common Criteria Certification • Enhanced Security for Windows users

  19. Example: DB2 Design Advisor • Makes recommendations for: • Indexes on the base tables • Materialized Query Tables • Indexes on the Materialized Query Tables • Converting non Multi-Dimensional Clustering tables to Multi-Dimensional Clustering tables • Partitioning existing tables

  20. Research Challenge #2: Examine radically simpler architectures and address total cost of ownership. Research Challenge: Zero Admin. For Complex Apps. (Chart: business segment, from small businesses to enterprise, plotted against application characteristics, from simple and understood to complex and unknown; labels shown: Small DB engines, Open Source, Current Product Autonomic Efforts, High End DBMS, Enterprise Class Scale and Performance.)

  21. The World of Data is Changing • Hardware gives us more choices than ever before • Cost of labor is rising • Data isn’t all (or even mostly) in the database • Data access paradigms evolving • Customers want integration and FAST access to the data they want

  22. Nature of “Interesting” Data is Changing: How do we process these in an integrated way? • Classic Information Management -- relational databases (Employee, Department, Product, Inventory, Sales Data, Bank Accounts, Warehouses, ...) • Unstructured Information Management: 85% unstructured and not in DBMS • Information from Multi-Modal Interactions, e.g. speech • Autonomous?

  23. Addressing the Changing Characteristics of Data: Increasing need to manage and analyze new data types. (Chart: data types plotted along three axes, Actionability, Heterogeneity, and Scale; data types shown: Transactions, Text and Web, Images and Video, Gene Sequences, Protein Folding, Satellite & Surveillance.)

  24. Changing Characteristics of Data: Transactions and structured data. Actionability: High; Heterogeneity: Low; Scale: Low. Volume growth versus semantics per unit of data. Seat on an airplane: easy to find, structured data

  25. Changing Characteristics of Data: Text and other human data. Actionability: Medium; Heterogeneity: Medium; Scale: Medium. Hard work to extract the pearl, but you know where to look

  26. Changing Characteristics of Data Machine-generated data Actionability -High Low - Heterogeneity - High Scale There is gold somewhere in the pile, and you need to keep sifting

  27. Extending “Mission-Critical” to Unstructured Data XML has become the “data interchange” format. • XML View Of Relational Data • SQL data viewed and updated as XML • Done via document shredding and composition • DTD and Schema Validation • XML Documents As Monolithic Entities • Atomic Storage And Retrieval • Search Capabilities Next: XML As A Rich Datatype • Full storage and indexing • Powerful querying capabilities
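
The "document shredding and composition" approach mentioned above can be sketched with Python's standard library: shredding decomposes an XML document into relational rows, and composition rebuilds XML from those rows. The element names, table schema, and sample document are hypothetical; this is an illustration of the idea under those assumptions, not the DB2 mechanism.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order document used only for illustration.
ORDER_XML = "<order id='17'><item sku='A1' qty='2'/><item sku='B7' qty='1'/></order>"


def shred(conn, xml_text):
    """Decompose an <order> document into one relational row per <item>."""
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    conn.executemany(
        "INSERT INTO order_items (order_id, sku, qty) VALUES (?, ?, ?)",
        [(order_id, item.get("sku"), int(item.get("qty"))) for item in root.findall("item")],
    )


def compose(conn, order_id):
    """Rebuild an <order> document from the relational rows."""
    root = ET.Element("order", id=str(order_id))
    query = "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid"
    for sku, qty in conn.execute(query, (order_id,)):
        ET.SubElement(root, "item", sku=sku, qty=str(qty))
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
shred(conn, ORDER_XML)
xml_out = compose(conn, 17)
print(xml_out)
```

The round trip preserves the data but not the document as a unit, which is one motivation for the slide's next step of treating XML as a rich datatype with its own storage and indexing.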

  28. Example: XML Strategy for DB2 UDB: Native XML capabilities inside the engine. (Figure: on the client side, a customer client application reaches the DB2 client through SQL(X) and XQuery; on the server side, the DB2 server provides a relational interface over relational storage and an XML interface over XML storage, under common data management.)

  29. Content Management Solutions - Capability. IBM Content Management Portfolio: • Content Solutions • Information Integration • Workflow/Business Process Management/Collaboration • Document Management • Web Content Management • Output/Report Management • Archiving • Multimedia Management • Imaging • Content Integration • Digital Asset Management • Regulatory Compliance / Records Management • Digital Rights Management

  30. Enterprise Content Management: Transforming Processes with Digital Content. …Content-enabled Business Processes…Electronic Statements…e-Mail Management…e-Records Management… Cross Industry: • Customer Service • Human Resources • Accounts Payable • Records Management • Marketing Communications • Online Report Viewing • E-mail Archival • Business Continuity. Financial: • Loan Origination, Signature Verification • Credit Card Dispute Handling • Retirement Account Management • Mutual Fund Processing • Leasing and Contract Management. Retail/Distribution: • Vendor Management • Claims and Loyalty Management Programs • Web Site Content Mgmt. • Digital Content Commerce. Insurance: • Claims, Underwriting, Policy Service • Agent Management. Transportation: • Proof of Deliveries, Service • Driver Management. Government: • Law Enforcement and Land Records • Permits, Licensing, Vital Records • Constituent Correspondence & Services • Tax Form Capture. Manufacturing: • Engineering Documentation, Change Management and ISO 9000 Cert. • Product Management • Customer and Channel Service • SAP Data Archiving and Document Management

  31. Classic Data and Content Management Converging • Content Manager provides more “Data Management” services • Transactional and referential integrity • Optimized query • Scalable storage • RDB users want more “Content Management” services • Check-in, check-out and versioning • Integrated hierarchical storage management • Non-normal (i.e. hierarchical) metamodel • XML is accelerating this convergence • Sometimes it’s data – other times it’s content

  32. So, are we done? No!

  33. Research Challenge #3 • Every one of us should know Content APIs as well as we do SQL • Content Management has VERY different requirements than: • Short atomic transactions with two phase commit • Two phase locking • B-tree indexing • Cursors • …

  34. Research Challenge #3 • We need to learn what managing content is all about, what is needed and forge new models: • Query and client interaction • Versioning • Foldering • Sub-document authorization • Sub-document checkin/out • Text search and analytics

  35. The World of Data is Changing • Hardware gives us more choices than ever before • Cost of labor is rising • Data isn’t all (or even mostly) in the database • Data access paradigms evolving • Customers want integration and FAST access to the data they want

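  36. Research Challenge #4: Data Interaction Paradigms – What’s Next? (Chart: ease of data access plotted against richness of data. Richness of data runs from strings and numbers, through text, to audio, video, and sensor data; access paradigms shown are programs, relational DB, spreadsheets, the Web, and search engines, with “Speech enhanced with semantics?” as the open question.)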

  37. Embracing richer data types and functionality in information management middleware: Speech Technology will Enable New and Easier Applications. (Figure: customers and workforce reach business processes through integrated interaction channels and contact points, including face to face, voice, IM, Web, branch office, call center, IVR, kiosks, email, SMS, mail, and fax, on top of shared infrastructure and business logic with scheduling and coordination. Web logs, speech transcriptions, call logs, … feed analytics across data types and business intelligence.)

  38. Towards SuperHuman Speech Recognition. Goal: Surpass human ability to accurately transcribe speech across multiple domains and environments. IBM Value: This level of performance is required to achieve truly pervasive conversational technologies. Basic Principles and New Techniques: • Data driven with careful statistical modeling • Wide variety of test data • Regular benchmarks of human performance. (Figure: Speech Recognition Technology Evolution from 1997-2001 to 2007-2010; labels shown: Transparent to user, Cooperative User, No feedback, Immediate Feedback, High Bandwidth Microphone. Graded Challenges across channel, domain, environment; labels shown: Domains, Channels, Noise, Talkers, Accented Speech, with qualifiers Variable, Overlapping, Multiple. Components shown: System 1, System 2, System 3, Recognizer, Discrimination, Adaptation, Fusion.)

  39. Text To Speech Generation Technology Impressive quality Can you guess what is TTS and what is recorded speech?

  40. UIMA - The Big Picture: Analytics bridge the Unstructured & Structured worlds. Unstructured Information (Text, Chat, Email, Audio, Video): • High-Value • Most Current Content • Fastest Growing • BUT ... Buried in Huge Volumes – Lots of Noise • Implicit Semantics • Inefficient Search. UIMA: Identify Semantic Entities, Induce Structure • Chats, Phone Calls, Transfers • People, Places, Org, Events • Times, Topics, Opinions, Relationships • Threats, Plots, etc. Structured Information (Indices, DBs, KBs): • Explicit Structure • Explicit Semantics • Efficient Search • Focused Content

  41. Unstructured Information Management Architecture • Common Research infrastructure for advancing Text Analysis and NLP capability • Promotes re-use of best-of-breed components • Promotes combination hypothesis through ease of integration. Components shown: • UIM Solutions: Question Answering, e-Commerce, National & Intelligence, Business, Bioinformatics, Technical Support • Application Libraries: Specialized Application Libraries; UIMA Standard Application Libraries provide basic functions common to a broad class of application libraries & applications (e.g. Glossary Extraction, Taxonomy Generation, Classification, Translation, etc.) • Semantic Search Engine: Token and Concept Indexing; Query: key words, concepts, spans, ranges -> Ranked Hit List • Unstructured Information: Collection Processing Manager • Document & Meta Data Store: documents with meta data based on key-value pairs; enables view & collection management • (Text) Analysis Engines (TAEs): combination of analysis engines employing a variety of analytical techniques and strategies; Relevant Application Knowledge • Structured Knowledge Access: Knowledge Source Adapters (KSAs) deliver content from many structured knowledge sources according to central ontologies • KSA Directory Service: dynamic query & delivery of KSAs • TAE Directory Service: dynamic query & delivery of TAEs • Structured Data

  42. UIMA Component Architecture from “Source to Sink”. (Figure: a Collection Reader takes in text, chat, email, audio, and video; a CAS Initializer populates each CAS; inside the Collection Processing Engine, an Aggregate Analysis Engine made up of Analysis Engines and Annotators processes the CAS; CAS Consumers deliver results to ontologies, indices, DBs, and knowledge bases.)

  43. What can analytics do? • Language, Speaker Identifiers • Part of Speech Detectors • Document Structure Detectors • Tokenizers, Parsers, Translators • Named-Entity Detectors • Sentiment Detectors • Face Recognizers • Relationship Detectors • Classifiers

  44. Basic Building Blocks: Annotators. Iterate over a document to discover new annotations based on existing ones and update the Common Analysis Structure (CAS). (Figure: an example sentence with the tokens Governor, Jones, visits, embassy, in, Japan, annotated in layers: a Parser marks NP, VP, PP; a Named Entity Annotator marks Gov Title, Person, Gov Official, and Country; a Relationship Annotator adds a Located In relation with Arg1:Entity and Arg2:Location.)
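
The annotator idea can be sketched in a few lines: each annotator reads the shared analysis structure, adds annotations, and later annotators build on what earlier ones produced. The classes below are a simplified stand-in written for illustration and are not the UIMA API; the word lists, type names, and the sentence order "Governor Jones visits embassy in Japan" are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class CAS:
    """Toy stand-in for a Common Analysis Structure: text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


TITLES = {"Governor"}   # hypothetical dictionaries
COUNTRIES = {"Japan"}


def named_entity_annotator(cas):
    """Dictionary-based toy annotator: titles, the word after a title, countries."""
    offset, after_title = 0, False
    for token in cas.text.split():
        begin = cas.text.index(token, offset)
        offset = end = begin + len(token)
        if token in TITLES:
            cas.annotations.append(Annotation("GovTitle", begin, end))
        elif after_title:
            cas.annotations.append(Annotation("Person", begin, end))
        elif token in COUNTRIES:
            cas.annotations.append(Annotation("Country", begin, end))
        after_title = token in TITLES


def located_in_annotator(cas):
    """Builds on existing annotations: relates each Person to each Country."""
    for person in cas.select("Person"):
        for country in cas.select("Country"):
            cas.annotations.append(Annotation(
                "LocatedIn", person.begin, country.end,
                {"arg1": cas.covered_text(person), "arg2": cas.covered_text(country)},
            ))


cas = CAS("Governor Jones visits embassy in Japan")
for annotator in (named_entity_annotator, located_in_annotator):
    annotator(cas)
relations = [a.features for a in cas.select("LocatedIn")]
print(relations)
```

The relationship annotator never looks at raw text on its own; it depends entirely on the Person and Country annotations already in the structure, which is the layering the slide describes.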

  45. Research: The Combination Hypothesis. If intimately integrated, various KM technologies will provide higher quality results (accuracy, recall, etc.). Technologies brought together in the Unstructured Information Management Architecture: • Information Retrieval • String & Graph Algorithms • Data Mining • Text analytics & NLP • Machine Learning • Privacy & Security • UI / Human Factors. (Figure: Independent Analyzers versus Combined Analyzers via Common Annotation Structure (UIMA).)

  46. Research Challenge #4 • Include speech and text data and derived text analytics and context in our scope of data research work. How does that change: • Access techniques, • Search and optimization algorithms • Result sets and interaction mechanisms • Storage and indexing • Models of data • Framework for derived information, ways to query and search it • System architecture

  47. The World of Data is Changing • Hardware gives us more choices than ever before • Cost of labor is rising • Data isn’t all (or even mostly) in the database • Data access paradigms evolving • Customers want integration and FAST access to the data they want

  48. Source Data Heterogeneity in Enterprises: Today data is in disparate locations; it is not easily accessible nor harnessed for key information. • What data do we have? • Where is it? • How can I find it? • What format is it in? • Is it searchable? • Does “customer” mean the same in each system? • How do I reconcile differences? • What applications feed data to other applications? • If I change something, what breaks? (Figure: client data (demographics, configurations, current costs, financial, legal, existing contracts, RFI, etc.) and information on competitors, pricing, demand, offerings, proposals, contracts, and negotiations are scattered across .XLS, .123, .DOC, .LWP, and Notes files; labels shown: Acct. Teams, SD, Marketing, Delivery, Engagement Workbook, Engagement Sage, CCMT, Intellectual Capital, Ledger, Historical, Claim 123 654…, Lessons Learned.)

  49. Metadata: Today and Tomorrow. Current Focus: Identifying • Store • Search. Current Challenge: Integrating • Definitions • Taxonomies. Opportunity: Understanding • Complex relationships • Sophisticated semantics. Discover: • Linkages within domains • Linkages across domains

  50. Metadata: Spectrum. Metadata describes and adds meaning to data and business process: • Vocabularies & Concepts • Information Structures • About Applications, Processes, Resources
