DATA MINING: AN INTRODUCTION Presented by SANDEEP PARAITHAM e-mail: firstname.lastname@example.org Ph: 800 888 5484 (Mobile) An Introduction to Data Mining
Fundamentals of Data Mining Fundamentals of DWH DWH + Data Mining Data Mining Functionalities Classification of Data Mining Systems Major Issues in Data Mining AGENDA An Introduction to Data Mining
What is DATA ?? • Collection of facts from which conclusions may be drawn • Data (Plural of Datum) are the components of information • They may be the raw facts and figures before interpretation • Chunks that represent the qualitative or quantitative attributes of a variable or set of variables • In computer science, • data is anything in a form suitable for use with a computer • Data is often distinguished from programs • A program is a set of instructions that detail a task for the computer to perform • In this sense, data is thus everything that is not program code • They may be 1's and 0's of computer memory, names and addresses in a file An Introduction to Data Mining
What is Information ??? • Knowledge acquired through data i.e. by study or experience or instruction • Things that can be known about a topic • Data that has/ have been processed into a format that is understandable by its intended audience • Meaningless data becomes information when it is processed and presented to the decision maker in a meaningful way An Introduction to Data Mining
Data Vs Information An Introduction to Data Mining
Data Everywhere An Introduction to Data Mining • Lots of data is being collected and warehoused • Web data, e-commerce • Purchases at department/ grocery stores • Bank/ Credit Card transactions • Hospitals/ Clinics (Diagnostic Centers) • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks • Science: Remote sensing, bioinformatics, scientific simulation • Society: news, digital cameras, YouTube
Data Everywhere (Contd…) • Data collected and stored at enormous speeds (GB/hour) • Remote sensors on a satellite • Telescopes scanning the skies • Microarrays generating gene expression data • Scientific simulations generating • terabytes of data An Introduction to Data Mining
Data Rich but Information Poor An Introduction to Data Mining • The Explosive Growth of Data: from Gigabytes to Terabytes to Petabytes and more • Data collection and data availability is on the rise • This may be due to Automated data collection tools, Database systems, Web, Computerized society • We are drowning in data, but starving for knowledge!
What Next after getting Data? Some Real time Scenarios An Introduction to Data Mining • Credit ratings/ targeted marketing: • Given a database of 1,00,000 names, which persons are the least likely to default on their credit cards? • Identify likely responders to sales promotions • Fraud detection • Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer? • Customer relationship management: • Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Why Data Mining? An Introduction to Data Mining • Information is often “hidden” in the data but is not readily evident • Human analysts may take weeks to discover useful information • Data volumes are too large for classical analysis approaches: • Large number of records • High dimensional data • Leverage organization’s data assets • Much of the data is never analyzed at all • Data that may never be analyzed continues to be collected, at a great expense, out of fear that something which may prove important in the future is missing.
Why Data Mining? (Contd…) An Introduction to Data Mining • As databases grow, the ability to support the decision support process using traditional query languages becomes infeasible. • Query formulation problem • Traditional techniques infeasible for raw data • Data Mining helps extract such information • “Necessity is the mother of invention” • Data mining - Automated analysis of massive data sets • Data mining may help scientists in • Classifying and segmenting data • Hypothesis Formation
What is Data Mining? An Introduction to Data Mining • Data mining (knowledge discovery from data) • Extraction & Analysis of interesting (non-trivial,implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data • Data mining refers to the discovery of new information in terms of patterns or rules from vast amounts of data • Process of semi-automatically analyzing large databases to find patterns that are: • valid: hold on new data with some certainty • novel: non-obvious to the system • useful: should be possible to act on the item • understandable: humans should be able to interpret the pattern • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
KNOWLEDGE DISCOVERY PROCESS Knowledge Pattern Evaluation Data Mining Task-relevant Data Selection Data Warehouse Data Cleaning Data Integration An Introduction to Data Mining Databases
STEPS IN KDD PROCESS • Data Mining • (An essential process where intelligent methods are applied in order to extract data patterns) • Pattern Evaluation • (To identify truly interesting patterns representing knowledge based on some interesting measures) • Knowledge Presentation • (Visualization, Knowledge representation techniques are used to present the mined knowledge to the user) An Introduction to Data Mining • Data Cleaning • (to remove noise and inconsistent data) • Data Integration • (Combining multiple data sources) • Data Selection • (Data relevant to the analysis are retrieved from the database) • Data Transformation • (Data is transformed into forms appropriate for the mining by performing few operations like aggregation)
Graphical User Interface Pattern Evaluation Knowledge-Base Data Mining Engine Database or Data Warehouse Server data cleaning, integration, and selection Other Info Repositories Data Warehouse World-Wide Web Database ARCHITECTURE: TYPICAL DATA MINING SYSTEM An Introduction to Data Mining
Data Mining on What Kind of Data? An Introduction to Data Mining • Data Mining should be applicable to any kind of data repositories and data streams • Data repositories • Transactional Databases • Data Warehouses • Advanced database systems • Data Streams, Flat files • World Wide Web (WWW) • Advanced database systems • Object – Relational database • Spatial databases • Time – Series databases • Multimedia databases
Transactional Databases An Introduction to Data Mining • A file in which each record represents a transaction • (Generally includes a trans_id and list of items which make a transaction) • Queries: • Show me all items purchased by ‘X’ • How many transactions include Item ‘XYZ’? • Advanced: • Which items sell good together? (Strategy to maximize sales) • Data Mining systems for transactional data retrieve such information by identifying frequent item sets
DATA WAREHOUSING An Introduction to Data Mining
What is an Operational DBMS? An Introduction to Data Mining They consist of Tables having attributes and are populated by tuples. They generally use the E-R data model. It is used to store transactional data. The information content is generally recent. These are thus called as OLTP systems. Their goals are data accuracy & consistency, Concurrency, Recoverability, Reliability (ACID Properties).
WHAT IS A DWH? An Introduction to Data Mining • Definition: • “A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management decision making process.” • Explanation: • Subject-Oriented: Stored data targets specific subjects. • Example: It may store data regarding total Sales, Number of Customers, etc. and not general data on everyday operations. • Integrated: Data may be distributed across heterogeneous sources which have to be integrated. • Example: Sales data may be on RDB, Customer information on Flat files, etc. • Time Variant: Data stored may not be current but varies with time and data have an element of time. • Example: Data of sales in last 5 years, etc. • Non-Volatile: It is separate from the Enterprise Operational Database and hence is not subject to frequent modification. It generally has only two operations performed on it: Loading of data and Access of data.
FEATURES OF DWH An Introduction to Data Mining It is separate from Operational Database. Integrates data from heterogeneous systems. Stores HUGE amount of data, more historical than current data. Does not require data to be highly accurate. Queries are generally complex. Goal is to execute statistical queries and provide results whichcan influence decision making in favor of the Enterprise. These systems are thus called Online Analytical Processing Systems (OLAP).
OPERATIONAL DB Vs DWH An Introduction to Data Mining
Why Separate DWH? An Introduction to Data Mining • In general OLTP systems provide good performance for short and simple OLTP queries. • Whereas, an OLAP query is very complex and does not require these properties (high concurrency, reliability, locking) • Use of OLAP query on OLTP system degrades its performance. • OLAP systems access historical data and not current volatile data while OLTP systems access current up-to-date data and do not need historical data. • Solution is to have a separate database system which supports primitives and structures suitable to store, access and process OLAP specific data • Hence a separate Data Warehouse.
DATA REPRESENTATION IN A DWH An Introduction to Data Mining • Data is stored as Data Cubes • In simple words: • Subject(s)/ Measure(s) per Dimension • Example: • Subject/ measure - quantity sold • Dimensions - Item Type, Location and Period • Data warehouse stores • The items sold per type, per geographical location during the particular period.
Cube: A Lattice of Cuboids all 0-D(apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier 2-D cuboids time,supplier item,supplier time,location,supplier time,item,location 3-D cuboids item,location,supplier time,item,supplier 4-D(base) cuboid time, item, location, supplier
Conceptual Modeling of Data Warehouses • Modeling data warehouses: dimensions & measures • Star schema: A fact table in the middle connected to a set of dimension tables • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
item time item_key item_name brand type supplier_type time_key day day_of_the_week month quarter year location branch location_key street city province_or_street country branch_key branch_name branch_type Example of Star Schema Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures
supplier item time item_key item_name brand type supplier_key supplier_key supplier_type time_key day day_of_the_week month quarter year city location branch city_key city province_or_street country location_key street city_key branch_key branch_name branch_type Example of Snowflake Schema Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures
item time item_key item_name brand type supplier_type time_key day day_of_the_week month quarter year location location_key street city province_or_street country shipper branch shipper_key shipper_name location_key shipper_type branch_key branch_name branch_type Example of Fact Constellation Shipping Fact Table time_key Sales Fact Table item_key time_key shipper_key item_key from_location branch_key to_location location_key dollars_cost units_sold units_shipped dollars_sold avg_sales Measures
A Data Mining Query Language, DMQL: Language Primitives • Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> • Dimension Definition ( Dimension Table ) define dimension <dimension_name> as (<attribute_or_subdimension_list>) • Special Case (Shared Dimension Tables) • First time as “cube definition” • define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country)
Defining a Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
Defining a Snowflake Schema in DMQL define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining a Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country)
Defining a Fact Constellation in DMQL define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales
Measures: Three Categories • distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. • E.g., count(), sum(), min(), max(). • algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. • E.g., avg(), min_N(), standard_deviation().
Measures: Three Categories • holistic: if there is no constant bound on the storage size needed to describe a sub aggregate. • E.g., median(), mode(), rank().
A Concept Hierarchy: Dimension (location) all all Europe ... North_America region Germany ... Spain Canada ... Mexico country Vancouver ... city Frankfurt ... Toronto L. Chan ... M. Wind office
Multidimensional Data • Sales volume as a function of product, month, and region Dimensions: Product, Location, Time Hierarchical summarization paths Region Industry Region Year Category Country Quarter Product City Month Week Office Day Product Month
Date 2Qtr 1Qtr sum 3Qtr 4Qtr TV Product U.S.A PC VCR sum Canada Country Mexico sum All, All, All A Sample Data Cube Total annual sales of TV in U.S.A.
Cuboids Corresponding to the Cube all 0-D(apex) cuboid country product date 1-D cuboids product,date product,country date, country 2-D cuboids 3-D(base) cuboid product, date, country
OLAP Operations • Roll up (drill-up): summarize data • by climbing up hierarchy or by dimension reduction • Drill down (roll down): reverse of roll-up • from higher level summary to lower level summary or detailed data, or introducing new dimensions • Slice and dice: • project and select
OLAP Operations • Pivot (rotate): • reorient the cube, visualization, 3D to series of 2D planes. • Other operations • drill across: involving (across) more than one fact table • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
Lecture-9 Data warehouse architecture
Steps for the Design and Construction of Data Warehouse • The design of a data warehouse: a business analysis framework • The process of data warehouse design • A three-tier data ware house architecture
Design of a Data Warehouse: A Business Analysis Framework • Four views regarding the design of a data warehouse • Top-down view • allows selection of the relevant information necessary for the data warehouse
Design of a Data Warehouse: A Business Analysis Framework • Data warehouse view • consists of fact tables and dimension tables • Data source view • exposes the information being captured, stored, and managed by operational systems • Business query view • sees the perspectives
Data Warehouse Design Process • Top-down, bottom-up approaches or a combination of both • Top-down: Starts with overall design and planning (mature) • Bottom-up: Starts with experiments and prototypes (rapid) • From software engineering point of view • Waterfall: structured and systematic analysis at each step before proceeding to the next • Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around