1 / 35

COMP 3503 Data Warehousing

COMP 3503 Data Warehousing. with Daniel L. Silver. Agenda. Beginning at the End Data Consolidation and DW Overview Technology of Data Warehousing Management Issues Consolidation and Cleaning Process. The KDD Process. Interpretation and Evaluation. Data Mining. Knowledge.

Download Presentation

COMP 3503 Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. COMP 3503Data Warehousing with Daniel L. Silver

  2. Agenda • Beginning at the End • Data Consolidation and DW Overview • Technology of Data Warehousing • Management Issues • Consolidation and Cleaning Process

  3. The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources

  4. Let’s Begin at the End of Data Preparation • DM methods require data arranged into a row and column table structure similar to an Excel spreadsheet Columns = Variables Rows = Units of Analysis

  5. Data Consolidation and Data Warehousing

  6. The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Warehouse

  7. Data Consolidation Garbage in Garbage out • The quality of results relates directly to quality of the data • 50%-70% of KDD process effort will be spent on data consolidation, cleansing and preprocessing • Major justification for a corporate Data Warehouse

  8. Data Consolidation & Warehousing From data sources to consolidated data repository Analysis and Info Sharing RDBMS Inflow Data Consolidation and Cleansing Legacy DBMS Warehouse or Datamart Flat Files Upflow Downflow Outflow Metaflow External

  9. Data Consolidation –The Process • Collect & Consolidate • Define requirements - Generate data model • Identify authoritative sources (internal/external) • Extract required data (ETL) • Load/Integrate into working database (ODS) • Generate meta-data = data about the data (MDR) • Clean - Measure data quality at the source • Completeness - Accuracy - Integrity • Load only clean data into warehouse • Schedule periodic source checking/cleansing

  10. Data Warehousing The strategic collection, cleansing, and consolidation of organizational data to meet operational, analytical, and communication needs.

  11. Operational DB Data from transactions Application oriented Specific Current Details Changes continually Data Warehouse Data from internal systems Subject Oriented Holistic integrated Current + historical Details + Summaries Stable Data Warehousing Major DW Framework suppliers/consultants: Teradata, Oracle, SAP/Sybase, Microsoft, HP/Vertica

  12. The Technology of Data Warehousing

  13. Data Warehousing - Technology Choices for a Warehouse Repository • Relational DBMS (Oracle, IBM,Sybase, MS) • Supports very large, multipurpose databases • Multidimensional access via ROLAP methods • Slow for massive/complex data analysis • Multi-dimensionsal DBMS (MS, Oracle, IBM, SAP, MicroStrategy) • Fast, full feature OLAP • Size limitations - 5 GB of raw data (100 GB total) • Standards still evolving, proprietary systems

  14. Data Warehousing DW Data Models • RDBMS (DB2, Oracle, Sybase, Ingres) • Data stored in tables - select, join, project • Based on logical normalized data model • Think “independent spreadsheet tables with links to each other” • SQL = Structured Query Language • e.g. SELECT CUSTOMER-NAME FROM CUSTOMER-MASTER WHERE CITY = “HALIFAX” AND TOTAL-PURCHASES > “500.00”

  15. Data Warehousing DW Data Models • MDBMS (MS, Oracle, IBM, SAP) • Multi-dimensional matrix (spreadsheet) of many variables with various levels of abstraction • Think “Rubics Cube” • Rotate, select range, roll-up, drill-down through levels • 10% data, 90% indices - facilitates rapid numerical analysis and visualization • OLAP = On-Line Analytical Processing

  16. Data Warehousing Future DW Data Models • Object-Oriented (ODBMS = POET, ODE) • classes (includes data types and functions) • objects are instantiations of classes • inheritance of class properties • used for modeling complex and large objects • better suited for distributed computing • Object Relation (DBMS = Omniscience) • best of relation and object oriented: SQL • transactions + object classes • Deductive (DRBMS) • Extension of RDBMS, embeds logic within database

  17. Data Warehousing OLAP Knowledge Workers “The Ideal Picture” Stats IDT Data Marts & Analytical Pocessors ANN One or more central repositories Data Warehouse Operational feedback from analytics Extraction Transformation Load Operational Data Store (ODS) Source Systems and Operational Users

  18. Data Warehouse Enterprise Wide RDBMS Normalized Full Time Horizon Read-only /Non-Volatile No End-user Access Data Marts Functional Area Subsets RDBMS or MDBMS Aggregated / Summarized Data Limited Time Horizon Analysis / Responsive End-user Access Data Warehouse / Data Mart This slide courtesy Anders Stjarne

  19. Data Warehousing The Importance of Metadata • Data about the data • Provides the key link between business users and data – a schema for the data • Name, ID, description, source • Data type, size, range, default value • Metric scale (nominal, ordinal, interval, continuous) • A good DW and DM project should provide • Ability to browse accurate and accessible metadata encouraging use of the DW • Provides appropriate levels of security • Secure load and update

  20. Data Warehousing - Meta Data • What is Meta Data? • Any data about data • Describes / defines the meaning and representation form • Provides a context for understanding and interpreting • Why Meta Data? • To understand and track data – from its source & over time • Basis for integration and sharing • Who needs / uses Meta Data? • Users of the data – Mgr/DSS analyst starts here to plan query • Tools / IS apps which record, read, and process the data • Conflicting objectives: • Consistency, uniformity, standards across the organization • vs. End user autonomy, freedom to manipulate & analyze • Historical efforts – mainframe data catalogue or data dictionary This slide courtesy Anders Stjarne

  21. Data Warehousing Management Issues

  22. Data Warehousing Management Issues • Data Warehouse projects that have completed have enjoyed up to 400% ROI • Historically 75% of Data Warehouse projects did not complete as originally envisioned • projects too large, objectives not clear • 50% become Data Mart projects - scaled down: specific, achievable, measurable objectives • 25% die a slow death

  23. Data Warehousing • DW is not a project – it is an on-going set of organizational activities • Must be driven by business objectives • Rapid change in technology and business requirements => demands short cycles, flexibility • Keys to success are strategic planning with tactical deployment (informed bottom-up approach): • few data sources -> single Data Mart • multiple data sources -> multiple Data Marts • evolve to enterprise Data Warehouse • DW are now becoming cloud-based (distributed and web enabled)

  24. Relationship between DW and DM? Strategic Tactical Rationale for data consolidation Analysis Query/Reporting OLAP Data Mining Data Warehousing Source of consolidated data

  25. Data Consolidation and Cleaning Process Details

  26. Consolidating Internal Sources • Define business requirements, subject areas • Generate data model • Identify authoritative sources: • operational systems (process control, POS) • billing, accounting, administrative systems • branch offices, PCs, filing cabinets => paper! • Identify and involve data source stakeholders and operational system’s maintenance personnel

  27. Consolidating Internal Sources • ETL = Extraction, Transformation and Load [ref: http://en.wikipedia.org/wiki/Extract,_transform,_load] • A process in data warehousing that involves: • Extracting data from outside sources • Transforming it to fit business needs • Loading it into the data warehouse.

  28. Data Cleaning • Act of detecting and correcting (or removing) corrupt or inaccurate attributes or records

  29. Data Cleaning Process • Identify authoritative data sources • Measure data quality • Completeness - all of the data • Accuracy - correct • Integrity- consistent, no ambiguities • Identify inconsistent, missing, incomplete, redundant, and incorrect values • Clean data whenever possible at the source • Load only clean data into warehouse • Schedule periodic source checking/cleansing

  30. Data Cleaning and Preparation Missing or Invalid Data • Missing or Invalid data values may be due to • Equipment malfunction • Inconsistent with other recorded data and thus deleted • Data not entered due to misunderstanding • Not be considered important at the time of entry • Changes in the data not recorded • What to do? • Inferred (imputed) values • Remove the row or column

  31. Data Cleaning and Preparation Noisy Data • When two or more examples conflict in terms of the dependent variable • Incorrect attribute values may be due to • Faulty data collection instruments • Data entry problems • Data transmission problems • Technology limitations (cheap thermometer) • Inconsistency in naming convention • What to do? • Ignore • Remove the row

  32. Data Cleaning and Preparation • Data cleaning is part of the ETL and makes use of an Operation Data Store • Automated cleaning tools are available • Trillum Software (Harte-Hanks) www.trilliumsoft.com • Winpure http://www.winpure.com/Article--DataCleaningTool.html • Vality -> Ascential -> IBM Websphere Data Intergratorhttp://www-306.ibm.com/software/data/integration/ascential.html • Consolidate internal with external sources • May decide to prepare data for exploratory analysis using spreadsheets, OLAP, visualization software

  33. Data Enrichment Requirements • Behavioral • purchase from related businesses (Air Miles) • Eg. number of vehicles, travel frequency • Demographic • Eg. age, gender, marital status, children, income level • Psychographic • Eg. “risk taker”, “conservative”, “cultured“, “hi-tech adverse”, “credit worthy”, “trustworthy”

  34. Data Enrichment Sources • Statistics Canada (census, enumeration area) • Software for under $1000 • Data from $100 - $4000 (region - nation) • http://www.statcan.ca/ • Association joint surveys • Equifax (based on postal code segments) • Data from $500 - $50,000 • http://www.equifax.com/ • Loyalty management organizations • LMG - http://www.loyalty.co.uk • http://www.loyalty.ca • http://www.loyalty.com • Independent survey ($20,000 - $200,000)

  35. THE ENDdanny.silver@acadiau.ca

More Related