1 / 21

Dr. M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2010

COMP207: Data Mining. COMP207: Data Mining. Dr. M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2010. Data Warehousing. Today's Topics. COMP207: Data Mining. Data Warehouses Data Cubes Warehouse Schemas OLAP Materialisation.

Download Presentation

Dr. M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2010

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. COMP207: Data Mining COMP207: Data Mining Dr. M. Sulaiman Khan • (mskhan@liv.ac.uk)‏ • Dept. of Computer Science • University of Liverpool • 2010 Data Warehousing

  2. Today's Topics COMP207: Data Mining Data Warehouses Data Cubes Warehouse Schemas OLAP Materialisation Data Warehousing

  3. What is a Data Warehouse? COMP207: Data Mining Most common definition: • “A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.” - W. H. Inmon • Corporate focused, assumes a lot of data, and typically sales related • Data for “Decision Support System” or “Management Support System” • 1996 survey: Return on Investment of 400+% Data Warehousing: Process of constructing (and using) a data warehouse Data Warehousing

  4. Data Warehouse COMP207: Data Mining • Subject-oriented: • Focused on important subjects, not transactions • Concise view with only useful data for decision making • Integrated: • Constructed from multiple, heterogeneous data sources. Normally distributed relational databases, not necessarily same schema. • Cleaning, pre-processing techniques applied for missing data, noisy data, inconsistent data (sounds familiar, I hope)‏ Data Warehousing

  5. Data Warehouse COMP207: Data Mining • Time-variant: • Has different values for the same fields over time. • Operational database only has current value. Data Warehouse offers historical values. • Nonvolatile: • Physically separate store • Updates not online, but in offline batch mode only • Read only access required, so no concurrency issues Data Warehousing

  6. Data Warehouse COMP207: Data Mining Data Warehouses are distinct from: • Distributed DB: Integrated via wrappers/mediators. Far too slow, semantic integration much more complicated.Integration done before loading, not at run time. • Operational DB: Only records current value, lots of extra non useful information.Different schemas/models, access patterns, users, functions, even though the data is derived from an operational db. Data Warehousing

  7. OLAP vs OLTP COMP207: Data Mining OLAP: Online Analytical Processing (Data Warehouse)‏ OLTP: Online Transaction Processing (Traditional DBMS)‏ OLAP data typically: historical, consolidated, and multi-dimensional (eg: product, time, location). Involves lots of full database scans, across terabytes or more of data. Typically aggregation and summarisation functions. Distinctly different uses to OLTP on the operational database. Data Warehousing

  8. Data Cubes COMP207: Data Mining Data is normally Multi-Dimensional, and can be thought of as a cube. Often: 3 dimensions of time, location and product. No need to have just 3 dimensions -- could have one for cars with make, colour, price, location, and time for example. • Image courtesy of IBM OLAP Miner documentation Data Warehousing

  9. Data Cubes COMP207: Data Mining • Can construct many 'cuboids' from the full cube by excluding dimensions. • In an N dimensional data cube, the cuboid with N dimensions is the 'base cuboid'. A 0 dimensional cuboid (other than non existent!) is called the 'apex cuboid'. • Can think of this as a lattice of cuboids...(Following lattice courtesy of Han & Kamber) Data Warehousing

  10. Lattice of Cuboids COMP207: Data Mining all 0-D(apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier 2-D cuboids time,supplier item,supplier time,location,supplier time,item,location 3-D cuboids item,location,supplier time,item,supplier 4-D(base) cuboid time, item, location, supplier Data Warehousing

  11. Multi-dimensional Units COMP207: Data Mining Each dimension can also be thought of in terms of different units. • Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)‏ • Location: continent, country, state, city, store • Product: electronics, computer, laptop, dell, inspiron This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids. Data Warehousing

  12. Star-Net Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP DISTRICT SALES PERSON REGION DISTRICT COUNTRY DIVISION Geography Promotion Organization COMP207: Data Mining Data Warehousing

  13. Data Cube Operations COMP207: Data Mining • Roll Up: Summarise data by climbing up hierarchy.Eg. From monthly to quarterly, from Liverpool to England • Drill Down: Opposite of Roll UpEg. From computer to laptop, from £100-999 to £100-199 • Slice: Remove a dimension by setting a value for itEg. location/product where time is Q1,2007 • Dice: Restrict cube by setting values for multiple dimensionsEg. Q1,Q2 / North American cities / 3 products sub cube • Pivot: Rotate the cube (mostly for visualisation) Data Warehousing

  14. Data Cube Schemas COMP207: Data Mining • Star Schema: Single fact table in the middle, with connected set of dimension tables(Hence a star)‏ • Snowflake Schema: Some of the dimension tables further refined into smaller dimension tables(Hence looks like a snow flake)‏ • Fact Constellation: Multiple fact tables can share dimension tables(Hence looks like a collection of star schemas. Also called Galaxy Schema)‏ Data Warehousing

  15. Star Schema COMP207: Data Mining Time Dimension time_key day day_of_week month quarter year Item Dimension item_key name brand type supplier_type Sales Fact Table time_key item_key location_key Loc.n Dimension location_key street city state country continent units_sold Measure (value)‏ Data Warehousing

  16. Snowflake Schema COMP207: Data Mining Time Dimension time_key day day_of_week month quarter year Item Dimension item_key name brand type supplier_key Sales Fact Table time_key item_key location_key units_sold Loc Dimension location_key street city_key City Dimension city_key city state country Measure (value)‏ Data Warehousing

  17. Fact Constellation COMP207: Data Mining Time Dimension time_key day day_of_week month quarter year Item Dimension item_key name brand type supplier_key Shipping Table Sales Fact Table time_key time_key item_key item_key from_key location_key units_shipped units_sold Loc Dimension location_key street city_key City Dimension city_key city state country Measure (value)‏ Data Warehousing

  18. OLAP Server Types COMP207: Data Mining ROLAP: Relational OLAP • Uses relational DBMS to store and manage the warehouse data • Optimised for non traditional access patterns • Lots of research into RDBMS to make use of! MOLAP: Multidimensional OLAP • Sparse array based storage engine • Fast access to precomputed data HOLAP: Hybrid OLAP • Mixture of both MOLAP and ROLAP Data Warehousing

  19. Data Warehouse Architecture Other sources Operational DBs Extract Transform Load Refresh COMP207: Data Mining (also courtesy of Han & Kamber)‏ Monitor & Integrator OLAP Server Metadata Analysis Query Reports Data mining Serve Data Warehouse Data Marts Data Sources Data Storage OLAP Engine Front-End Tools Data Warehousing

  20. Materialisation COMP207: Data Mining In order to compute OLAP queries efficiently, need to materialise some of the cuboids from the data. • None: Very slow, as need to compute entire cube at run time • Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids • Partial: But which ones to materialise? Called an 'iceberg cube', as only partially materialised and the rest is "below water".Many cells in a cuboid will be empty, only materialise sections that contain more values than a minimum threshold. Data Warehousing

  21. Further Reading COMP207: Data Mining • http://en.wikipedia.org/wiki/Data_warehouseand subsequent links Data Warehousing

More Related