Tutorial on Data Mining
Sunita Sarawagi
School of IT, IIT Bombay
Workshop of the Indian Database Research Community

Data mining
Process of semi-automatically analyzing large databases to find interesting and useful patterns
Salary > 5 L
Prof. = Exec
New applicant’s data
Goal: predict class Ci = f(x1, x2, ..., xn)
Salary < 1 M
Prof = teacher
Age < 30
Gen_Tree (node, data)
    Make node a leaf? If yes, stop.
    Find the best attribute and the best split on that attribute
    Partition data on the split condition
    For each child j of node: Gen_Tree (node_j, data_j)
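The recursive procedure above can be sketched in Python. This is a minimal illustration rather than the tutorial's own implementation. The slides do not name a split criterion, so the Gini index is used here as an assumption, and the names gini, best_split, gen_tree, predict, the max_depth stopping rule, and the toy (age, salary) data are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold, weighted impurity), or None if no split exists."""
    best = None
    n = len(rows)
    for a in range(len(rows[0])):
        values = sorted(set(r[a] for r in rows))
        # Every distinct value except the largest is a candidate threshold,
        # so both sides of "value <= t" are guaranteed to be non-empty.
        for t in values[:-1]:
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (a, t, score)
    return best

def gen_tree(rows, labels, depth=0, max_depth=5):
    # Make the node a leaf if it is pure, too deep, or cannot be split.
    split = None
    if len(set(labels)) > 1 and depth < max_depth:
        split = best_split(rows, labels)
    if split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    a, t, _ = split
    # Partition the data on the split condition.
    left = [i for i, r in enumerate(rows) if r[a] <= t]
    right = [i for i, r in enumerate(rows) if r[a] > t]
    # Recurse on each child with its own share of the data.
    return {
        "attr": a,
        "threshold": t,
        "left": gen_tree([rows[i] for i in left], [labels[i] for i in left],
                         depth + 1, max_depth),
        "right": gen_tree([rows[i] for i in right], [labels[i] for i in right],
                          depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while "leaf" not in tree:
        side = "left" if row[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Hypothetical training data: (age, salary) rows with a class label each.
rows = [(25, 2), (28, 3), (35, 6), (45, 8), (50, 1), (23, 7)]
labels = ["reject", "reject", "approve", "approve", "reject", "approve"]
tree = gen_tree(rows, labels)
```

On this toy data the only pure split is on the second attribute, so the tree has a single internal node. Real tree builders also handle categorical attributes and pruning, which this sketch leaves out.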
r =1, k=2
Table columns: rid, A1, A2, A3, C
Per-attribute columns: (A1, C, rid), (A2, C, rid), (A3, C, rid)
rid to L/R hash in memory.
More information: http://www.stat.wisc.edu/~limt/treeprogs.html
Basic NN unit
A more typical NN
Conclusion: use neural nets only if decision trees or nearest-neighbor (NN) methods fail.
ad ad adad
0.1 0.2 0.3 0.4
Variable e is independent of d given b
0.3 0.2 0.1 0.5
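The conditional independence statement can be checked numerically on a small example. The sketch below assumes a three-variable chain d -> b -> e, which is only one structure that yields "e independent of d given b" and is not necessarily the network in the original figure. All probabilities and the names p_d, p_b_given_d, p_e_given_b, joint, prob, and cond_e are hypothetical.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain d -> b -> e.
p_d = {True: 0.3, False: 0.7}            # P(d)
p_b_given_d = {True: 0.8, False: 0.1}    # P(b = True | d)
p_e_given_b = {True: 0.6, False: 0.2}    # P(e = True | b)

def joint(d, b, e):
    """P(d, b, e) factorised along the chain."""
    pb = p_b_given_d[d] if b else 1 - p_b_given_d[d]
    pe = p_e_given_b[b] if e else 1 - p_e_given_b[b]
    return p_d[d] * pb * pe

def prob(**fixed):
    """Probability of a partial assignment, summing out the other variables."""
    total = 0.0
    for d, b, e in product([True, False], repeat=3):
        assignment = {"d": d, "b": b, "e": e}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(d, b, e)
    return total

def cond_e(**given):
    """P(e = True | given)."""
    return prob(e=True, **given) / prob(**given)

# Once b is known, d carries no further information about e.
with_d = cond_e(b=True, d=True)
without_d = cond_e(b=True, d=False)

# Without b, e does depend on d.
e_given_d_true = cond_e(d=True)
e_given_d_false = cond_e(d=False)
```

Here with_d and without_d agree, while e_given_d_true and e_given_d_false differ, which is exactly the difference between conditional and marginal independence.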
EM algorithm: K Gaussian mixtures
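A minimal sketch of EM for a mixture of K Gaussians, restricted to one dimension to keep it short. The slide gives only the algorithm's name, so the initialisation scheme, the fixed iteration count, the variance floor, and the names normal_pdf and em_gmm_1d are assumptions made for illustration. The data are synthetic.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k, iters=50):
    n = len(data)
    s = sorted(data)
    # Assumed initialisation: means spread over the sorted data,
    # equal weights, and the overall variance for every component.
    means = [s[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])

        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

# Synthetic data: two well-separated clusters around 0 and 10.
random.seed(0)
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(10, 1) for _ in range(200)])
weights, means, variances = em_gmm_1d(data, k=2)
```

With clusters this far apart the estimated means land close to 0 and 10. The sketch omits a convergence test and log-space arithmetic, both of which a practical implementation would want.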
Tea, rice, bread
What makes a rule surprising?
Correlation between milk and cereal remains roughly constant over time
Cannot be trivially derived from simpler rules
Milk 10%, cereal 10%
Milk and cereal 10% … surprising
Milk, cereal and eggs 0.1% … surprising!
Expected 1%
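The milk and cereal figures above can be worked through in a few lines. If the two items were independent, their expected joint support would be the product of the individual supports, 10% x 10% = 1%, so an observed 10% is ten times higher than expected. The function names expected_support and lift are illustrative choices, not taken from the slides.

```python
def expected_support(*supports):
    """Expected joint support if the items occurred independently."""
    result = 1.0
    for s in supports:
        result *= s
    return result

def lift(observed, *supports):
    """Ratio of observed joint support to the support expected under independence."""
    return observed / expected_support(*supports)

milk, cereal = 0.10, 0.10    # individual supports from the slide
observed = 0.10              # observed support of {milk, cereal}

expected = expected_support(milk, cereal)   # about 0.01, i.e. 1%
ratio = lift(observed, milk, cereal)        # about 10
```

A ratio far from 1 in either direction marks the itemset as a candidate for a surprising rule, since it cannot be derived from the simpler single-item supports.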
Find correlated events:
Identify complex operations with specific OLAP needs in mind (what does an analyst need?), rather than starting from existing mining operations and choosing whichever fits.
Need to build usable prototypes, not simply tweak algorithms for publications.