300 likes | 424 Views
A data warehouse is a large database that stores historical data from various sources, allowing for advanced analytics and reporting. It supports Online Analytical Processing (OLAP), focusing on complex queries as opposed to immediate transactional results. Data transformation and schema design, often utilizing star schemas, are essential for effective data warehousing. Data mining, on the other hand, involves discovering patterns within large datasets to inform business decisions. Techniques such as association rules help identify relationships between purchased items, driving strategies for sales and marketing.
E N D
Data Warehouse • A data warehouse is a huge database that stores historical data • Example: Store information about all sales of products in all branches of a supermarket • Example: Information about all telephone calls made through a certain company
OLTP vs. OLAP • The queries we have seen so far were Online Line Transaction Processing, i.e., many small queries with results needed immediately • Typically, Online Line Analytical Processing is done against a data warehouse. • A lot of aggregation • Results aren’t needed immediately (queries can run for hours) • Used for decision support (more about this later)
Creating a Data Warehouse • Data from different sources is combined • need to transform data to a single schema • need to transform data to have same type of measurements (e.g., same currency) • Occasionally, data is refreshed (get new data) • Occasionally, data is purged (throw out old data) • Need a meta-data repository to store information for the above
Design • Usually, a star schema: • large fact table that continuously grows • small dimension tables that remain basically static • Example: • Sales(pid, timeid, locid, amount) • Products(pid, pname, category, price) • Locations(locid, city, start, country) • Times(timeid, date, week, month, holiday_flag) Sales is the fact table, the rest are dimensions
Normal Forms • Fact Table: In BCNF • Dimension Tables: Generally not in BCNF • few updates, insert, deletes – few anomalies • data is relatively small so duplication doesn’t matter • Allow queries to be faster since less joins are needed
Data Mining • Data Mining is the process of finding interesting trends or patterns in large datasets in order to guide future decisions. • Related to exploratory data analysis (area of statistics) and knowledge discovery (area in artificial intelligence, machine learning). • Data Mining is characterized by having VERY LARGE datasets.
Some Real Problems • Associations between products bought in a store • milk • beers and diapers • Deciding on product discounts • Figuring out how to keep customer at lowest cost to company • Personal recommendation page at Amazon • Stock market trends
Data Mining vs. Machine Learning • Size: Databases are usually very large so algorithms must scale well • Design Purpose: Databases are not usually designed for data mining (but for other purposes), and thus, may not have convenient attributes • Errors and Noise: Databases almost always contain errors
Association Rules • A market basket is a collection of items bought by a single customer in a single customer transaction • Problem: Identify sets of items that are purchased together • Attempt to identify rules of the form {pen} {ink} Meaning: If a pen is bought in a transaction, it is likely that ink will also be bought
Measures for Association Rules • General form: LR, where L and R are sets of items • Support: The support for LR is the percentage of transactions containing all items from L and from R • Confidence: The confidence for LR are the percentage of transactions that contain R, from among the transactions that contain L
Examples • The rule {pen}{ink} has: • support: • confidence: • The rule {tissues}{ink} has: • support: • confidence: • The rule {pen}{soap} has: • support: • confidence:
Understanding the Measures: The Intuition • If a rule has low support then it might have arisen by chance. There is not enough evidence to draw a conclusion. • If a rule LR has low confidence then it is likely that there is no relationship between buying L and buying R • Note: LR and RL will always have the same support, but may have different confidence
Goal • Find association rules with high support and high confidence. Support and confidence values are specified by the user. • Remember: Finding such a rule does not mean that there must be a relationship between the left and right sides. A person must evaluate such rules by hand. • Example: {milk} {bread}
Algorithm • Given values s and c, find all association rules with support >= s and confidence >= c. • Can be done in 2 steps. • Step 1: Find frequent itemsets, i.e., sets of items that appear with support >= s • Step 2: For each frequent itemset F, try all ways to partition F to L and R and check if the rule generated will have confidence >= c.
Finding Frequent Itemsets • How would you do this? Think!
Finding Frequent Itemsets • The A Priori Property: Every subset of a frequent itemset must also be a frequent itemset. • This means that we can create frequent itemsets iteratively, by taking frequent itemsets of size n, and extending them to frequent itemsets of size n+1.
Finding Frequent Itemsets (2) Freq = {} scan all transactions once and add to Freq the items that have support > s k = 1 repeat foreachIkin Freq with k items generate all itemsets Ik+1with k+1 items, such that Ikis contained in Ik+1 scan all transactions once and add to Freq the k+1-itemsets that have support > s k++ until no new frequent itemsets are found
Example Run the algorithm with support = 0.7 • Level 1: Find frequent itemsets: {pen}, {ink}, {diary} • Level 2: Check each of: {pen, ink}, {pen, diary}, {pen, soap}, {pen, tissues}, {ink, diary}, {ink, soap}, {ink, tissues}, {diary, soap}, {diary, tissues} The frequent itemsets are: {pen, ink}, {pen, diary}
Example • Level 3: Check each of: {pen, ink, diary}, {pen, ink, soap}, {pen, ink, tissues}, {pen, diary, soap}, {pen, diary, tissues} No frequent itemsets! • To summarize, the frequent itemsets were: {pen}, {ink}, {diary} {pen, ink}, {pen, diary}
Refinements to the Algorithm • Note that we checked if {pen, tissues} is a frequent itemset even though we already knew that {tissues} is not a frequent itemset. Refinement: Try to extend frequent itemsets in a fashion that will ensure that all their subsets are frequent itemsets • Scanning the transactions can be expensive if the table does not fit in main memory. Refinement: Find rules over a sample of the database that fits in memory
Deriving Association Rules foreach frequent itemset I foreach partition of I to two sets L, R generate a candidate rule LR foreach transaction T in the database foreach candidate rule LR if L in T then lnum(LR)++ if R in T then rnum(LR)++ return all rules LR with rnum(LR)/lnum(LR) > c
Other Types of Rules • Sequential Patterns • Each transaction for each customer is a set of items • A sequence of transactions is a sequence of sets. For example: {pen, ink, soap}, {pen, ink diary, soap} • A subsequence is derived from a sequence by deleting some itemsets and some items in other itemsets
Sequence Patterns (2) • {pen}, {ink, diary}, {pen, soap} is a subsequence of {pen, ink}, {shirt}, {milk, ink, diary}, {soap, pen, diary} • {pen}, {ink, diary}, {pen, soap} is not a subsequence of {pen, ink}, {shirt}, {soap, pen, diary}, {milk, ink, diary}
Sequence Patterns (3) • The support for a sequence pattern S is the percentage of customer sequences that have S as a subsequence. • A sequence s1, s2, ..., sn with high support means that someone who buys s1 is likely to buy s2 later on, etc. • Finding sequence patterns: In a similar fashion to finding frequent itemsets. Start with small sequences and try to extend them
Classification Rules • Consider a table: BankInfo(custid, balance, age, returnLoan) • Would like to find rules of the type: “If age is in a certain range and balance is in a certain range, then a customer is likely to not return a loan.” • We can apply this rule to new customers who ask for a loan.
Classification Rules (2) • General Form: (l1<X1<h1) and ... and (lk < Xk < hk)Y=c • Use values for X1,...,Xk to “predict” value for Y • We can define support and confidence as before.