MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI

EXTENSIONS TO DSS • BI systems (aka EIS) • Geographical Information Systems (GIS) • Collaborative Systems (formerly GDSS) • Expert Systems • OLAP/Data mining/warehousing

DATA WAREHOUSES

DATA WAREHOUSE A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining)

DATA WAREHOUSING USERS APPLICATIONS WAREHOUSE User1 Prodn. S E R V E R Mktg. User2 Fin. User3 User4 Acctg.

THE DATA LIFE CYCLE Dashboard/BI Internal Data Data Visualization Data Marts OLAP External Data Data Warehouse Decisions Internal Data Data Mining Data Marts Knowledge The activities taking place with respect to data for warehouse/OLAP/mining

INTEGRATION OF DATA Integration Operational databases Data Warehouse application A – m,f M/F application B – 1,0 application C – x,y application D – male, female

CHARACTERISTICS OF WAREHOUSES Data warehouses have a number of characteristics • Subject-oriented • Integrated • Time-variant • Non-volatile

KEY FEATURES OF DATA WAREHOUSES… Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time Nonvolatile: The data in a warehouse is permanent.

DESIGN OF WAREHOUSE Design of warehouses is similar to databases: • Schema not radically different • W/H designed primarily for queries • Database will be converted to cube at run time.

ORGANIZING THE W/H STAR SCHEMA: Consists of a large central table and a set of smaller tables, one each for each dimension. SNOWFLAKE SCHEMA: A variant of the star schema, Where some dimension tables are normalized, thereby splitting the data into additional tables. CONSTELLATION SCHEMA: A collection of stars.

STAR SCHEMA OF A DATA WAREHOUSE (FYI) BRANCH ACCOUNTS TRANSACTIONS What are the dimensions here?

SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI) BRANCH ACCOUNTS TRANSACTIONS CREDIT DEBIT

CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING (FYI) DEPARTMENTS BRANCH ACCOUNTS EMPLOYEES TRANSACTIONS CREDIT DEBIT

EXAMPLE OF WAREHOUSE & MARTS Weekly sales by region 2007-2013 Sales by Product Line 2010-2013 Weekly sales by state 2005-2013 Weekly sales by product 2008-2013 Data Mart Data Mart Warehouse Sales Detail Sales Detail 2003-2013

OLAP & DATA WAREHOUSES

ONLINE ANALYTICAL PROCESSING OLAP: Tools to analyze data in a warehouse for decision support. How many light bulbs sold in December? • Aggregation (“data summarization”) • Dimension • Slice & Dice • Cube organization (Multi-dimensional database)

AGGREGATION & CUBE Region Units NE SE SW North. E 40 South E. 20 South W. 30 Midwest 20 South W. 50 North E. 65 65 20 30 40 50 sales in the Northern region?

DIMENSIONS & CONCEPT HIERARCHIES A dimension is an aspect of the data, it is a characteristic of a variable such as location, for sales variable. Dimensions can have hierarchies (or various levels of aggregations) A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

CONCEPT HIERARCHY Mfr. dimension iPhone5 iPhone4 150,600 GALII GAL FX G G2 13,600 30,000 15,067 18,240 51,326

MULTI-DIMENSIONAL ORGANIZATION Sales, costs etc. nw Products (tables, desks, lamps..) sw Regions Cube organization supports slice & dice

MULTI-DIMENSIONAL ORGANIZATION.. March sales February sales January sales shows multi-dimensional/cube organization

CUBE ORGANIZATION • Data from warehouse imported into memory • A sophisticated 3D representation is created • Referred to as “sparse matrix” • Sides of cube are dimensions • Allows “slice & dice” • Answers to high level queries/reports

DATA MINING

DATA MINING Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data. e.g. sequence/association, classification, and clustering • Buyer non-buyer • Success/failure of an operation • Customers who spend more • Preferences of target groups

SOME DATA MINING APPLICATIONS • Predicting the probability of default for consumer loans • Predicting audience response to TV advertisements • Predicting the probability that a cancer patient will respond to radiation therapy. • Predicting the probability that an offshore well will produce oil • A phone company mined data from 140 million households, each with as many as 10,000 attributes, including life-style and calling habits. Identified 22 profiles

DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY Evaluation and Presentation Knowledge Data Mining Selection and Transformation Patterns Data Warehouse Cleaning and Integration Database Flat files

DATA MINING PROCESS Cleaning & integration – data is brought in from multiple sources Selection & transformation – sometimes called dimensionality reduction, it is concerned with selection of dimensions and sometimes the raw data needs to be transformed to suit the problem e.g. calculate margin. Data mining - process of extracting data patterns, using statistical or AI techniques. Pattern evaluation - identifying patterns useful and relevant to the organizational context. Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user.

OLAP, MINING & WAREHOUSING... Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers. OLAP is used to perform high level analysis of data based on data summarization (aggregation) and slice and dice operations. For e.g. how many shoes sold in midwest in Feb? Data mining refers to identification of patterns from data.

DATA MINING TECHNIQUES

TYPES OF ANALYSES Sequence -- Activities occurring one after another e.g. loan after buying car, warranty. Association -- (AKA Market Basket Analysis) Activities which occur together (e.g. bread and meat) Classification -- Identifying profiles of data classified into pre-defined groups (frequent & infrequent shoppers) Clustering -- Identifying natural characteristics of data (what major areas are cust. coming from?)

SEQUENTIAL ANALYSIS

SEQUENTIAL ANALYSIS Predict purchasing • 40% of customers buy a gray skirt six months after buying a red jacket • Identified by time-series analysis, Neural networks or genetic algorithms

IDENTIFYING SEQUENTIAL PATTERNS.. applications in forecasting exchange rates, meat consumption , bankruptcies etc.

ASSOCIATIVE ANALYSIS

ASSOCIATION Identifies items purchased together • Customers who bought items A, B, C also buy Y • 85 percent of customers who buy a certain wine brand also buy a certain type of pasta • Many men buy a six-pack when they purchase diapers • Enhances product placement in stores

ANALYSIS TECHNIQUES ASSOCIATION Algorithm A-priori • list all items – 1 item set (C1) • filter by min. transaction support (L1) • identify 2-item sets (L1* L1) = (C2 ) • filter and generate L2 • repeat process... Database Min. transaction support is the number (sometimes given as %) of transactions in which the item must occur.

A-PRIORI ALGORITHM minimum Support: 2 Scan Database C1 L1 Scan Database minimum Support: 2 L2 C2 C2 Scan Database L3 C3 C3

DISCUSSION Apply associative rule mining (Use A-priori algorithm) to the following portfolios of clients of a brokerage company, to identify stocks that are purchased together. Use a minimum support of two.

CLASSIFICATION ANALYSIS

CLASSIFICATION A technique for grouping data into pre-defined classes using certain attributes of the data. E.g. defaulter or not, cruise customer, 4G subscriber or not etc. • Define classes • Identify classification criteria • Select suitable technique • applicationy

CLASSIFICATION TECHNIQUES • Discriminant analysis (categorical membership) • Bayesian classification • Logistic regression (probabilistic membership) • Neural networks

DISCRIMINANT ANALYSIS* A method of classification that uses a Discriminant Function to decide classes • DF based on attributes and weights. • Calculated as a weighted average of attributes and weights (of training data) (Zave score) E.g. (GMAT + 200 * UGPA) > 1200 • Given a new set of data, calculate Z score • Compare with Zav DF – Discriminant Function *a simplified version

CLASSIFICATION WITH BAYESIAN ANALYSIS

REFRESHER ON PROBABILITY Probability is the chance that an event/outcome will take place. • E.g. Cubs had 43 wins 36 losses in one season • Probability of winning?

PRIOR (CONDITIONAL) PROBABILITIES Prior probabilities are knowledge of other events which may help improve predictions • E.g. p(IPO success) = 0.33 • Suppose we know that a big company behind IPO • p(IPO success/big company) = 0.99 Which is higher? P(successful cellphone call) or P(successful call/subscriber in service area)

CONDITIONAL PROBABILITIES.. If we see a student in the union and he/she is a WIU student What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE?

BAYESIAN ANALYSIS Bayes theorem can be exploited for classification • Bayes theorem • P(A/B) = [P(B/A) x P(A)]/P(B) • Classify into class I or II based on conditional probability • P(class I/x) > P(class II/x)? * • E.g. wiu student in union, which college are they? A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events *x is some condition e.g. surgery or being a shopper in a retail chain

INTRODUCTION TO BAYESIAN ANALYSIS We are interested in p(person becoming a manager/mba)* How can we use Baye’s theorem? ASSUME • 300 m population; 100 m employees • 500,000 are managers • 10,000 managers go to college for an MBA • 20 m go to college • 100,000 do MBA *you need to write formula using terms from the problem

THIS SLIDE IS INTENTIONALLY BLANK

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI