MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI
EXTENSIONS TO DSS
BI systems (aka EIS)
Geographical Information Systems (GIS)
Collaborative Systems (formerly GDSS)
Expert Systems
OLAP/Data mining/warehousing
DATA WAREHOUSES
DATA WAREHOUSE
A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining)
The activities taking place with respect to data for warehouse/OLAP/mining
application A – m,f M/F
application B – 1,0
application C – x,y
application D – male, female
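The four applications above store the same attribute under different codes, so integration has to translate them into one convention. A minimal sketch of that step follows; the target codes "M"/"F", the `CODE_MAPS` table, the `harmonize` helper, and the choice of which source code maps to which value are all assumptions made for illustration, not something the slides specify.

```python
# Hypothetical per-application mappings onto one warehouse convention ("M"/"F").
# Which source code stands for which value is an assumption for illustration.
CODE_MAPS = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}


def harmonize(application, raw_value):
    """Translate one application's code into the shared warehouse code."""
    key = str(raw_value).strip().lower()
    return CODE_MAPS[application][key]


# One sample record per source application: (application, raw code).
records = [("A", "m"), ("B", 0), ("C", "x"), ("D", "female")]
integrated = [harmonize(app, value) for app, value in records]
print(integrated)
```

An unknown code raises `KeyError` in this sketch, which surfaces dirty source data during loading instead of letting it pass silently into the warehouse.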
Data warehouses have a number of characteristics
Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile: The data in a warehouse is permanent.
Design of warehouses is similar to databases:
STAR SCHEMA: Consists of a large central table and a set of smaller tables, one for each dimension.
SNOWFLAKE SCHEMA: A variant of the star schema in which some dimension tables are normalized, thereby splitting the data into additional tables.
CONSTELLATION SCHEMA: A collection of stars.
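A star schema can be sketched with an in-memory SQLite database: one central fact table whose rows point at small dimension tables. Every table name, column name, and data value below is hypothetical and chosen only to illustrate the shape of the design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one small table per dimension.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, line TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, week INTEGER, year INTEGER);

-- Central fact table: one row per sale, keyed by the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    units_sold INTEGER
);

INSERT INTO dim_product VALUES (1, 'Desk lamp', 'Lamps'), (2, 'Oak desk', 'Desks');
INSERT INTO dim_location VALUES (1, 'Illinois', 'Midwest'), (2, 'Iowa', 'Midwest'),
                                (3, 'Texas', 'South');
INSERT INTO dim_time VALUES (1, 1, 2024), (2, 2, 2024);
INSERT INTO fact_sales VALUES (1, 1, 1, 10), (1, 2, 1, 5), (2, 3, 1, 7), (2, 1, 2, 3);
""")

# A typical warehouse query: join the fact table to two dimensions and aggregate.
weekly_sales_by_region = conn.execute("""
    SELECT t.week, l.region, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.location_id = l.location_id
    JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.week, l.region
    ORDER BY t.week, l.region
""").fetchall()
print(weekly_sales_by_region)
```

In a snowflake variant of this sketch, `region` would move out of `dim_location` into its own table referenced by a key.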
What are the dimensions here?
CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING (FYI)
Weekly sales by region
Sales by Product Line
Weekly sales by state
Weekly sales by product
OLAP: Tools to analyze data in a warehouse for decision support. E.g., how many light bulbs were sold in December?
North E. 40
South E. 20
South W. 30
South W. 50
North E. 65
sales in the Northern region?
A dimension is an aspect of the data: a characteristic of a variable, such as location for the sales variable.
Dimensions can have hierarchies (or various levels of aggregations)
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
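A concept hierarchy for the location dimension can be sketched as a chain of lookups, with aggregation performed by rolling values up one level at a time. The city, state, and region names, the sales figures, and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> state -> region.
CITY_TO_STATE = {
    "Macomb": "Illinois",
    "Chicago": "Illinois",
    "Des Moines": "Iowa",
    "Austin": "Texas",
}
STATE_TO_REGION = {"Illinois": "Midwest", "Iowa": "Midwest", "Texas": "South"}

# Hypothetical sales at the lowest level of the hierarchy.
sales_by_city = {"Macomb": 4, "Chicago": 20, "Des Moines": 6, "Austin": 9}


def roll_up(values, mapping):
    """Aggregate values keyed by a low-level concept to the next level up."""
    totals = defaultdict(int)
    for key, amount in values.items():
        totals[mapping[key]] += amount
    return dict(totals)


sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
sales_by_region = roll_up(sales_by_state, STATE_TO_REGION)
print(sales_by_state)
print(sales_by_region)
```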
Sales, costs etc.
(tables, desks, lamps..)
Cube organization supports slice & dice
shows multi-dimensional/cube organization
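Slice and dice operations can be sketched on a tiny cube stored as a dictionary keyed by (product, region, month). The cell values, the dimension names, and the `slice_` and `dice` helpers are hypothetical; real OLAP tools provide these operations through their own interfaces.

```python
# Hypothetical sales cube: (product, region, month) -> units sold.
DIMENSIONS = ("product", "region", "month")
cube = {
    ("lamps", "North", "Dec"): 40,
    ("lamps", "South", "Dec"): 20,
    ("desks", "North", "Dec"): 30,
    ("desks", "South", "Nov"): 50,
    ("lamps", "North", "Nov"): 65,
}


def dice(cube, **allowed):
    """Keep cells whose coordinate on each named dimension is in the allowed set."""
    kept = {}
    for coords, value in cube.items():
        cell = dict(zip(DIMENSIONS, coords))
        if all(cell[dim] in values for dim, values in allowed.items()):
            kept[coords] = value
    return kept


def slice_(cube, dimension, value):
    """Fix a single dimension to a single value (a dice with one constraint)."""
    return dice(cube, **{dimension: {value}})


december = slice_(cube, "month", "Dec")
lamps_north = dice(cube, product={"lamps"}, region={"North"})
print(sum(december.values()))
print(sum(lamps_north.values()))
```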
Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data.
e.g. sequence/association, classification, and clustering
OF KNOWLEDGE DISCOVERY
Cleaning & integration – data is brought in from multiple sources
Selection & transformation – sometimes called dimensionality reduction; it is concerned with selecting the relevant dimensions. Sometimes the raw data also needs to be transformed to suit the problem, e.g. calculating margin.
Data mining - process of extracting data patterns, using statistical or AI techniques.
Pattern evaluation - identifying patterns useful and relevant to the organizational context.
Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user.
Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers.
OLAP is used to perform high-level analysis of data based on data summarization (aggregation) and slice and dice operations. E.g., how many shoes were sold in the Midwest in February?
Data mining refers to identification of patterns from data.
Sequence -- Activities occurring one after another (e.g. loan after buying car, warranty).
Association -- (AKA Market Basket Analysis) Activities which occur together (e.g. bread and meat).
Classification -- Identifying profiles of data classified into pre-defined groups (frequent & infrequent
Clustering -- Identifying natural characteristics of data (e.g. what major areas are customers coming from?)
Applications in forecasting exchange rates, meat consumption, bankruptcies, etc.
Identifies items purchased together
Min. transaction support is the number (sometimes given as %) of transactions in which the item must occur.
Apply association rule mining (use the Apriori algorithm) to the following portfolios of clients of a brokerage company to identify stocks that are purchased together. Use a minimum support of two.
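The frequent-itemset part of the Apriori algorithm can be sketched as below. The portfolios in the original exercise are not reproduced in this text, so the `portfolios` list and the stock names AAA to DDD are hypothetical stand-ins; only the minimum support of two comes from the exercise.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical client portfolios (not the ones from the original exercise).
portfolios = [
    {"AAA", "BBB", "CCC"},
    {"AAA", "BBB"},
    {"BBB", "CCC", "DDD"},
    {"AAA", "BBB", "CCC"},
    {"DDD"},
]
frequent = apriori(portfolios, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets already met the minimum support.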
A technique for grouping data into pre-defined classes using certain attributes of the data. E.g. defaulter or not, cruise customer or not, 4G subscriber or not, etc.
A method of classification that uses a Discriminant Function to decide classes
E.g. (GMAT + 200 * UGPA) > 1200
DF – Discriminant Function
*a simplified version
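The simplified discriminant function above can be sketched in a few lines. The class labels "admit" and "do not admit" and the function names are assumptions; the slides give only the function and its threshold, not what the two classes are called.

```python
def discriminant_score(gmat, ugpa):
    """Simplified discriminant function from the example: GMAT + 200 * UGPA."""
    return gmat + 200 * ugpa


def classify(gmat, ugpa, threshold=1200):
    """Assign a class by comparing the score with the threshold.

    The labels are hypothetical; the source only gives the inequality.
    """
    return "admit" if discriminant_score(gmat, ugpa) > threshold else "do not admit"


print(classify(600, 3.2))
print(classify(500, 3.0))
```

A score exactly equal to the threshold falls in the second class here because the example uses a strict inequality.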
Probability is the chance that an event/outcome will occur.
Prior probabilities are knowledge of other events which may help improve predictions
Which is higher?
P(successful cellphone call) or
P(successful call | subscriber in service area)
If we see a student in the union and he/she is a WIU student, what is the probability that he/she is in a) CBT, b) COAS, c) COFAC, d) COE?
Bayes theorem can be exploited for classification
A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events
*x is some condition e.g. surgery or being a shopper in a retail chain
We are interested in p(person becoming a manager | MBA)*
How can we use Bayes' theorem?
*you need to write formula using terms from the problem
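For reference, Bayes' theorem written with the terms of this problem takes the following form. The event names "manager" and "MBA" are shorthand for "person becomes a manager" and "person holds an MBA"; the probabilities on the right-hand side would have to be estimated from data.

```latex
P(\text{manager} \mid \text{MBA})
  = \frac{P(\text{MBA} \mid \text{manager}) \, P(\text{manager})}{P(\text{MBA})}
```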
An observer has collected information about eagles and hawks for a long time. If a new bird is spotted with a certain wingspan (x), we need to know whether it is an eagle or a hawk.
From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski
Shows, from observations of birds, the probability of a bird having a particular wingspan
N = number of birds = n_eagle + n_hawk
New bird’s size = 45 cm
(from known probability
p(45|eagle) = 2.22 x 10^-2
p(45|hawk) = 1.10 x 10^-2
2.22 x 10^-2 x 0.8 vs 1.10 x 10^-2 x 0.2
0.01776 > 0.0022
Decision rule predicts eagle
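The decision rule can be sketched as a comparison of likelihood times prior for each class. The likelihoods are the two values quoted for a 45 cm wingspan; reading 0.8 and 0.2 as the prior probabilities of eagle and hawk is an interpretation of the comparison above, and the function name is hypothetical.

```python
# Likelihoods p(45 | class) as quoted for the 45 cm bird.
likelihood = {"eagle": 2.22e-2, "hawk": 1.10e-2}
# Assumed priors: 0.8 and 0.2 are read as P(eagle) and P(hawk).
prior = {"eagle": 0.8, "hawk": 0.2}


def bayes_decision(likelihood, prior):
    """Pick the class with the largest likelihood * prior; also return posteriors."""
    scores = {c: likelihood[c] * prior[c] for c in prior}
    evidence = sum(scores.values())  # p(x), the normalizing constant
    posteriors = {c: s / evidence for c, s in scores.items()}
    return max(scores, key=scores.get), scores, posteriors


predicted, scores, posteriors = bayes_decision(likelihood, prior)
print(predicted)
print(scores)
print(posteriors)
```

Dividing by the evidence does not change which class wins, which is why the comparison can be made on the unnormalized products alone.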
Goal is to identify natural groupings of data. Applications in market segmentation, discovering affinity groups, and defect analysis
Income: Medium Children: 2
Car: Sedan and Car: Truck