slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI PowerPoint Presentation
Download Presentation
MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI

Loading in 2 Seconds...

play fullscreen
1 / 63

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI - PowerPoint PPT Presentation


  • 148 Views
  • Uploaded on

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI. EXTENSIONS TO DSS. BI systems (aka EIS) Geographical Information Systems (GIS) Collaborative Systems (formerly GDSS) Expert Systems OLAP/Data mining/warehousing. DATA WAREHOUSES. DATA WAREHOUSE.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI' - abla


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

MINING & WAREHOUSING (MSS2)

BY

CHANDRA S. AMARAVADI

slide2

EXTENSIONS TO DSS

  • BI systems (aka EIS)
  • Geographical Information Systems (GIS)
  • Collaborative Systems (formerly GDSS)
  • Expert Systems
  • OLAP/Data mining/warehousing
slide4

DATA WAREHOUSE

A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining)

slide5

DATA WAREHOUSING

USERS

APPLICATIONS

WAREHOUSE

User1

Prodn.

S

E

R

V

E

R

Mktg.

User2

Fin.

User3

User4

Acctg.

slide6

THE DATA LIFE CYCLE

Dashboard/BI

Internal

Data

Data

Visualization

Data

Marts

OLAP

External

Data

Data

Warehouse

Decisions

Internal

Data

Data Mining

Data

Marts

Knowledge

The activities taking place with respect to data for warehouse/OLAP/mining

slide7

INTEGRATION OF DATA

Integration

Operational databases

Data Warehouse

application A – m,f M/F

application B – 1,0

application C – x,y

application D – male, female

slide8

CHARACTERISTICS OF WAREHOUSES

Data warehouses have a number of characteristics

  • Subject-oriented
  • Integrated
  • Time-variant
  • Non-volatile
slide9

KEY FEATURES OF DATA WAREHOUSES…

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales.

Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records.

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time

Nonvolatile: The data in a warehouse is permanent.

slide10

DESIGN OF WAREHOUSE

Design of warehouses is similar to databases:

  • Schema not radically different
  • W/H designed primarily for queries
  • Database will be converted to cube at run time.
slide11

ORGANIZING THE W/H

STAR SCHEMA: Consists of a large central table and a set of smaller tables, one each for each dimension.

SNOWFLAKE SCHEMA: A variant of the star schema,

Where some dimension tables are normalized, thereby splitting the data into additional tables.

CONSTELLATION SCHEMA: A collection of stars.

slide12

STAR SCHEMA OF A DATA WAREHOUSE (FYI)

BRANCH

ACCOUNTS

TRANSACTIONS

What are the dimensions here?

slide13

SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI)

BRANCH

ACCOUNTS

TRANSACTIONS

CREDIT

DEBIT

slide14

CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING (FYI)

DEPARTMENTS

BRANCH

ACCOUNTS

EMPLOYEES

TRANSACTIONS

CREDIT

DEBIT

slide15

EXAMPLE OF WAREHOUSE & MARTS

Weekly sales by region

2007-2013

Sales by Product Line

2010-2013

Weekly sales by state

2005-2013

Weekly sales by product

2008-2013

Data Mart

Data Mart

Warehouse

Sales Detail

Sales Detail

2003-2013

slide17

ONLINE ANALYTICAL PROCESSING

OLAP: Tools to analyze data in a warehouse for decision support. How many light bulbs sold in December?

  • Aggregation (“data summarization”)
    • Dimension
  • Slice & Dice
    • Cube organization (Multi-dimensional database)
slide18

AGGREGATION & CUBE

Region

Units

NE

SE

SW

North. E 40

South E. 20

South W. 30

Midwest 20

South W. 50

North E. 65

65

20

30

40

50

sales in the Northern region?

slide19

DIMENSIONS & CONCEPT HIERARCHIES

A dimension is an aspect of the data, it is a characteristic of a variable such as location, for sales variable.

Dimensions can have hierarchies (or various levels of aggregations)

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

slide20

CONCEPT HIERARCHY

Mfr. dimension

iPhone5

iPhone4

150,600

GALII

GAL

FX

G

G2

13,600

30,000

15,067

18,240

51,326

slide21

MULTI-DIMENSIONAL ORGANIZATION

Sales, costs etc.

nw

Products

(tables, desks, lamps..)

sw

Regions

Cube organization supports slice & dice

slide22

MULTI-DIMENSIONAL ORGANIZATION..

March sales

February sales

January sales

shows multi-dimensional/cube organization

slide23

CUBE ORGANIZATION

  • Data from warehouse imported into memory
  • A sophisticated 3D representation is created
  • Referred to as “sparse matrix”
  • Sides of cube are dimensions
  • Allows “slice & dice”
  • Answers to high level queries/reports
slide25

DATA MINING

Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data.

e.g. sequence/association, classification, and

clustering

  • Buyer non-buyer
  • Success/failure of an operation
  • Customers who spend more
  • Preferences of target groups
some data mining applications
SOME DATA MINING APPLICATIONS
  • Predicting the probability of default for consumer loans
  • Predicting audience response to TV advertisements
  • Predicting the probability that a cancer patient will respond to radiation therapy.
  • Predicting the probability that an offshore well will produce oil
  • A phone company mined data from 140 million households, each with as many as 10,000 attributes, including life-style and calling habits. Identified 22 profiles
slide27

DATA MINING AS A STEP IN THE PROCESS

OF KNOWLEDGE DISCOVERY

Evaluation and

Presentation

Knowledge

Data Mining

Selection and

Transformation

Patterns

Data Warehouse

Cleaning and

Integration

Database

Flat files

slide28

DATA MINING PROCESS

Cleaning & integration – data is brought in from multiple sources

Selection & transformation – sometimes called dimensionality reduction, it is concerned with selection of dimensions and sometimes the raw data needs to be transformed to suit the problem e.g. calculate margin.

Data mining - process of extracting data patterns, using statistical or AI techniques.

Pattern evaluation - identifying patterns useful and relevant to the organizational context.

Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user.

slide29

OLAP, MINING & WAREHOUSING...

Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers.

OLAP is used to perform high level analysis of data based on data summarization (aggregation) and slice and dice operations. For e.g. how many shoes sold in midwest in Feb?

Data mining refers to identification of patterns from data.

slide31

TYPES OF ANALYSES

Sequence -- Activities occurring one after another

e.g. loan after buying car, warranty.

Association -- (AKA Market Basket Analysis) Activities

which occur together (e.g. bread and meat)

Classification -- Identifying profiles of data classified

into pre-defined groups (frequent & infrequent

shoppers)

Clustering -- Identifying natural characteristics of data

(what major areas are cust. coming from?)

sequential analysis
SEQUENTIAL ANALYSIS

Predict purchasing

  • 40% of customers buy a gray skirt six months after buying a red jacket
  • Identified by time-series analysis, Neural networks or genetic algorithms
slide34

IDENTIFYING SEQUENTIAL PATTERNS..

applications in forecasting exchange rates, meat consumption , bankruptcies etc.

slide36

ASSOCIATION

Identifies items purchased together

  • Customers who bought items A, B, C also buy Y
  • 85 percent of customers who buy a certain wine brand also buy a certain type of pasta
  • Many men buy a six-pack when they purchase diapers
  • Enhances product placement in stores
analysis techniques association
ANALYSIS TECHNIQUES ASSOCIATION

Algorithm A-priori

  • list all items – 1 item set (C1)
  • filter by min. transaction support (L1)
  • identify 2-item sets (L1* L1) = (C2 )
  • filter and generate L2
  • repeat process...

Database

Min. transaction support is the number (sometimes given as %) of transactions

in which the item must occur.

a priori algorithm
A-PRIORI ALGORITHM

minimum

Support: 2

Scan

Database

C1

L1

Scan

Database

minimum

Support: 2

L2

C2

C2

Scan

Database

L3

C3

C3

slide39

DISCUSSION

Apply associative rule mining (Use A-priori algorithm) to the following portfolios of clients of a brokerage company, to identify stocks that are purchased together. Use a minimum support of two.

classification
CLASSIFICATION

A technique for grouping data into pre-defined classes

using certain attributes of the data. E.g. defaulter or not,

cruise customer, 4G subscriber or not etc.

  • Define classes
  • Identify classification criteria
  • Select suitable technique
  • applicationy
slide42

CLASSIFICATION TECHNIQUES

  • Discriminant analysis (categorical membership)
  • Bayesian classification
  • Logistic regression (probabilistic membership)
  • Neural networks
slide43

DISCRIMINANT ANALYSIS*

A method of classification that uses a Discriminant Function

to decide classes

  • DF based on attributes and weights.
  • Calculated as a weighted average of attributes and weights (of training data) (Zave score)

E.g. (GMAT + 200 * UGPA) > 1200

  • Given a new set of data, calculate Z score
  • Compare with Zav

DF – Discriminant Function

*a simplified version

slide44

CLASSIFICATION WITH

BAYESIAN ANALYSIS

slide45

REFRESHER ON PROBABILITY

Probability is the chance that an event/outcome will

take place.

  • E.g. Cubs had 43 wins 36 losses in one season
  • Probability of winning?
slide46

PRIOR (CONDITIONAL) PROBABILITIES

Prior probabilities are knowledge of other

events which may help improve predictions

  • E.g. p(IPO success) = 0.33
  • Suppose we know that a big company behind IPO
  • p(IPO success/big company) = 0.99

Which is higher?

P(successful cellphone call) or

P(successful call/subscriber in service area)

slide47

CONDITIONAL PROBABILITIES..

If we see a student in the union and he/she is a WIU student

What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE?

slide48

BAYESIAN ANALYSIS

Bayes theorem can be exploited for classification

  • Bayes theorem
    • P(A/B) = [P(B/A) x P(A)]/P(B)
  • Classify into class I or II based on conditional probability
    • P(class I/x) > P(class II/x)? *
  • E.g. wiu student in union, which college are they?

A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events

*x is some condition e.g. surgery or being a shopper in a retail chain

slide49

INTRODUCTION TO

BAYESIAN ANALYSIS

We are interested in p(person becoming a manager/mba)*

How can we use Baye’s theorem?

ASSUME

  • 300 m population; 100 m employees
  • 500,000 are managers
  • 10,000 managers go to college for an MBA
  • 20 m go to college
  • 100,000 do MBA

*you need to write formula using terms from the problem

slide51

BAYESIAN ANALYSIS- EXAMPLE

PROBLEM:

An observer has collected information about

Eagles & Hawks for a long time. If a new bird

is spotted with a certain wingspan (x), need to know

whether eagle or hawk

From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski

slide52

BAYESIAN ANALYSIS- EXAMPLE

CLASSIFICATION PROBLEM:

  • Compare p(eagle/x) and p(hawk/x)
  • Whichever is higher  unknown bird
  • We don’t know p(eagle/x) or p(hawk/x)
slide53

BAYESIAN ANALYSIS- EXAMPLE

Bayes‘ theorem

Decision rule?

slide54

PROBABILITY DENSITY FUNCTION

Shows from observations of birds, the probability of

a bird having a particular wingspan

P(x|eagle),

P(x| hawk)

slide55

PROBABILITY OF EAGLE OR HAWK

N = Number of birds

= neagle + nhawk

= 0.8

= 0.2

slide56

USING BAYES THEOREM

New bird’s size = 45 cm

(from known probability

density functions)

p(45|eagle) = 2.22 x 10-2

p(45|hawk) = 1.10 x 10-2

2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2

0.01776 > 0.0021

Decision rule predicts eagle

data mining functions
DATA MINING FUNCTIONS

Clustering

Goal is to identify natural groupings of data. applicationications in market segmentation, discovering affinity groups, and defect analysis

  • based on euclidean distance
  • similar to centroid
  • trial & error approach
  • need to guess # of clusters
slide60

CUSTOMERS ARE CLUSTERED INTO FOUR SEGMENTS

Income: Medium Children: 2

Income: High

Children: 1

Car: Luxury

Income: Medium

Children:3

Car: Sedan and Car: Truck

Cluster 4

Cluster 1

Cluster 3

Income: Low

Children: 0

Car: Compact

Cluster 2

slide63

DISCUSSION QUESTIONS

  • What kind of data is stored in a datawarehouse?
  • What are the characteristics of a data warehouse?
  • How is data organized in a warehouse?
  • What tool can we use to analyze data in a warehouse?
  • What type of multidimensional organization of data is also called a “sparse matrix”?
  • What is a concept hierarchy?
  • What technique can you use to predict the probability that a cancer patient will respond to radiation therapy?
  • What type of mining identifies natural characteristics of data?
  • What technique is useful for predicting trends?
  • What is minimum transaction support?
  • How is Bayes theorem used for classification?