- By
**abla** - Follow User

- 148 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI' - abla

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

- BI systems (aka EIS)
- Geographical Information Systems (GIS)
- Collaborative Systems (formerly GDSS)
- Expert Systems
- OLAP/Data mining/warehousing

A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining)

Dashboard/BI

Internal

Data

Data

Visualization

Data

Marts

OLAP

External

Data

Data

Warehouse

Decisions

Internal

Data

Data Mining

Data

Marts

Knowledge

The activities taking place with respect to data for warehouse/OLAP/mining

Integration

Operational databases

Data Warehouse

application A – m,f M/F

application B – 1,0

application C – x,y

application D – male, female

Data warehouses have a number of characteristics

- Subject-oriented
- Integrated
- Time-variant
- Non-volatile

KEY FEATURES OF DATA WAREHOUSES…

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales.

Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records.

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time

Nonvolatile: The data in a warehouse is permanent.

Design of warehouses is similar to databases:

- Schema not radically different
- W/H designed primarily for queries
- Database will be converted to cube at run time.

STAR SCHEMA: Consists of a large central table and a set of smaller tables, one each for each dimension.

SNOWFLAKE SCHEMA: A variant of the star schema,

Where some dimension tables are normalized, thereby splitting the data into additional tables.

CONSTELLATION SCHEMA: A collection of stars.

CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING (FYI)

DEPARTMENTS

BRANCH

ACCOUNTS

EMPLOYEES

TRANSACTIONS

CREDIT

DEBIT

Weekly sales by region

2007-2013

Sales by Product Line

2010-2013

Weekly sales by state

2005-2013

Weekly sales by product

2008-2013

Data Mart

Data Mart

Warehouse

Sales Detail

Sales Detail

2003-2013

OLAP: Tools to analyze data in a warehouse for decision support. How many light bulbs sold in December?

- Aggregation (“data summarization”)
- Dimension
- Slice & Dice
- Cube organization (Multi-dimensional database)

Region

Units

NE

SE

SW

North. E 40

South E. 20

South W. 30

Midwest 20

South W. 50

North E. 65

65

20

30

40

50

sales in the Northern region?

DIMENSIONS & CONCEPT HIERARCHIES

A dimension is an aspect of the data, it is a characteristic of a variable such as location, for sales variable.

Dimensions can have hierarchies (or various levels of aggregations)

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

MULTI-DIMENSIONAL ORGANIZATION

Sales, costs etc.

nw

Products

(tables, desks, lamps..)

sw

Regions

Cube organization supports slice & dice

MULTI-DIMENSIONAL ORGANIZATION..

March sales

February sales

January sales

shows multi-dimensional/cube organization

- Data from warehouse imported into memory
- A sophisticated 3D representation is created
- Referred to as “sparse matrix”
- Sides of cube are dimensions
- Allows “slice & dice”
- Answers to high level queries/reports

Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data.

e.g. sequence/association, classification, and

clustering

- Buyer non-buyer
- Success/failure of an operation
- Customers who spend more
- Preferences of target groups

SOME DATA MINING APPLICATIONS

- Predicting the probability of default for consumer loans
- Predicting audience response to TV advertisements
- Predicting the probability that a cancer patient will respond to radiation therapy.
- Predicting the probability that an offshore well will produce oil
- A phone company mined data from 140 million households, each with as many as 10,000 attributes, including life-style and calling habits. Identified 22 profiles

DATA MINING AS A STEP IN THE PROCESS

OF KNOWLEDGE DISCOVERY

Evaluation and

Presentation

Knowledge

Data Mining

Selection and

Transformation

Patterns

Data Warehouse

Cleaning and

Integration

Database

Flat files

Cleaning & integration – data is brought in from multiple sources

Selection & transformation – sometimes called dimensionality reduction, it is concerned with selection of dimensions and sometimes the raw data needs to be transformed to suit the problem e.g. calculate margin.

Data mining - process of extracting data patterns, using statistical or AI techniques.

Pattern evaluation - identifying patterns useful and relevant to the organizational context.

Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user.

Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers.

OLAP is used to perform high level analysis of data based on data summarization (aggregation) and slice and dice operations. For e.g. how many shoes sold in midwest in Feb?

Data mining refers to identification of patterns from data.

Sequence -- Activities occurring one after another

e.g. loan after buying car, warranty.

Association -- (AKA Market Basket Analysis) Activities

which occur together (e.g. bread and meat)

Classification -- Identifying profiles of data classified

into pre-defined groups (frequent & infrequent

shoppers)

Clustering -- Identifying natural characteristics of data

(what major areas are cust. coming from?)

SEQUENTIAL ANALYSIS

Predict purchasing

- 40% of customers buy a gray skirt six months after buying a red jacket
- Identified by time-series analysis, Neural networks or genetic algorithms

IDENTIFYING SEQUENTIAL PATTERNS..

applications in forecasting exchange rates, meat consumption , bankruptcies etc.

Identifies items purchased together

- Customers who bought items A, B, C also buy Y
- 85 percent of customers who buy a certain wine brand also buy a certain type of pasta
- Many men buy a six-pack when they purchase diapers
- Enhances product placement in stores

ANALYSIS TECHNIQUES ASSOCIATION

Algorithm A-priori

- list all items – 1 item set (C1)
- filter by min. transaction support (L1)
- identify 2-item sets (L1* L1) = (C2 )
- filter and generate L2
- repeat process...

Database

Min. transaction support is the number (sometimes given as %) of transactions

in which the item must occur.

A-PRIORI ALGORITHM

minimum

Support: 2

Scan

Database

C1

L1

Scan

Database

minimum

Support: 2

L2

C2

C2

Scan

Database

L3

C3

C3

Apply associative rule mining (Use A-priori algorithm) to the following portfolios of clients of a brokerage company, to identify stocks that are purchased together. Use a minimum support of two.

CLASSIFICATION

A technique for grouping data into pre-defined classes

using certain attributes of the data. E.g. defaulter or not,

cruise customer, 4G subscriber or not etc.

- Define classes
- Identify classification criteria
- Select suitable technique
- applicationy

- Discriminant analysis (categorical membership)
- Bayesian classification
- Logistic regression (probabilistic membership)
- Neural networks

A method of classification that uses a Discriminant Function

to decide classes

- DF based on attributes and weights.
- Calculated as a weighted average of attributes and weights (of training data) (Zave score)

E.g. (GMAT + 200 * UGPA) > 1200

- Given a new set of data, calculate Z score
- Compare with Zav

DF – Discriminant Function

*a simplified version

BAYESIAN ANALYSIS

Probability is the chance that an event/outcome will

take place.

- E.g. Cubs had 43 wins 36 losses in one season
- Probability of winning?

PRIOR (CONDITIONAL) PROBABILITIES

Prior probabilities are knowledge of other

events which may help improve predictions

- E.g. p(IPO success) = 0.33
- Suppose we know that a big company behind IPO
- p(IPO success/big company) = 0.99

Which is higher?

P(successful cellphone call) or

P(successful call/subscriber in service area)

If we see a student in the union and he/she is a WIU student

What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE?

Bayes theorem can be exploited for classification

- Bayes theorem
- P(A/B) = [P(B/A) x P(A)]/P(B)
- Classify into class I or II based on conditional probability
- P(class I/x) > P(class II/x)? *
- E.g. wiu student in union, which college are they?

A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events

*x is some condition e.g. surgery or being a shopper in a retail chain

BAYESIAN ANALYSIS

We are interested in p(person becoming a manager/mba)*

How can we use Baye’s theorem?

ASSUME

- 300 m population; 100 m employees
- 500,000 are managers
- 10,000 managers go to college for an MBA
- 20 m go to college
- 100,000 do MBA

*you need to write formula using terms from the problem

PROBLEM:

An observer has collected information about

Eagles & Hawks for a long time. If a new bird

is spotted with a certain wingspan (x), need to know

whether eagle or hawk

From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski

CLASSIFICATION PROBLEM:

- Compare p(eagle/x) and p(hawk/x)
- Whichever is higher unknown bird
- We don’t know p(eagle/x) or p(hawk/x)

Shows from observations of birds, the probability of

a bird having a particular wingspan

P(x|eagle),

P(x| hawk)

New bird’s size = 45 cm

(from known probability

density functions)

p(45|eagle) = 2.22 x 10-2

p(45|hawk) = 1.10 x 10-2

2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2

0.01776 > 0.0021

Decision rule predicts eagle

DATA MINING FUNCTIONS

Clustering

Goal is to identify natural groupings of data. applicationications in market segmentation, discovering affinity groups, and defect analysis

- based on euclidean distance
- similar to centroid
- trial & error approach
- need to guess # of clusters

CUSTOMERS ARE CLUSTERED INTO FOUR SEGMENTS

Income: Medium Children: 2

Income: High

Children: 1

Car: Luxury

Income: Medium

Children:3

Car: Sedan and Car: Truck

Cluster 4

Cluster 1

Cluster 3

Income: Low

Children: 0

Car: Compact

Cluster 2

- What kind of data is stored in a datawarehouse?
- What are the characteristics of a data warehouse?
- How is data organized in a warehouse?
- What tool can we use to analyze data in a warehouse?
- What type of multidimensional organization of data is also called a “sparse matrix”?
- What is a concept hierarchy?
- What technique can you use to predict the probability that a cancer patient will respond to radiation therapy?
- What type of mining identifies natural characteristics of data?
- What technique is useful for predicting trends?
- What is minimum transaction support?
- How is Bayes theorem used for classification?

Download Presentation

Connecting to Server..