Part-2 Qualifying Exam Jiaqi Ge Department of Computer and Information Science Indiana University Purdue University Indianapolis June 20, 2011
Roadmap • D. Heckerman, “A Tutorial on Learning with Bayesian Networks.” • P. Sen, A. Deshpande, “Representing and Querying Correlated Tuples in Probabilistic Databases.” • H.-P. Kriegel, M. Pfeifle, “Density-Based Clustering of Uncertain Data.” • W. DuMouchel, D. Pregibon, “Empirical Bayes Screening for Multi-Item Associations.” • C. C. Aggarwal, P. S. Yu, “A Survey of Uncertain Data Algorithms and Applications.”
A Tutorial on Learning with Bayesian Networks David Heckerman Technical Report MSR-TR-95-06
Bayesian Networks • A joint probability distribution over a set of variables • Probabilistic inference • Learning parameters p(θ_s | D, S^h) from data • No missing data • Incomplete data
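For the no-missing-data case, parameter learning with a Dirichlet prior reduces to adding the observed counts to the prior hyperparameters. The sketch below shows this for a single discrete variable under one parent configuration. The function name, the uniform default prior, and the toy data are illustrative assumptions, not material from the tutorial.

```python
from collections import Counter


def dirichlet_posterior_mean(observations, states, prior_counts=None):
    """Posterior mean of a multinomial parameter under a Dirichlet prior.

    Assumes complete data: every observation is one of `states`.
    """
    if prior_counts is None:
        # Assumption for illustration: a uniform Dirichlet(1, ..., 1) prior.
        prior_counts = {s: 1.0 for s in states}
    counts = Counter(observations)
    # Conjugate update: posterior hyperparameter = prior + observed count.
    posterior = {s: prior_counts[s] + counts.get(s, 0) for s in states}
    total = sum(posterior.values())
    return {s: posterior[s] / total for s in states}


theta = dirichlet_posterior_mean(["yes", "yes", "no", "yes"], ["yes", "no"])
```

With incomplete data this closed form no longer applies, and approximate methods are needed instead.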
Bayesian Networks (cont.) • Learning structure from data • The number of possible structures S^h is more than exponential in the number of variables n • Criteria for model selection • Search methods • Structure and causal graphs
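To give a feel for how fast the space of candidate structures grows with n, the sketch below counts labeled directed acyclic graphs using Robinson's recurrence. It is an added illustration rather than part of the tutorial, and the function name is hypothetical.

```python
from math import comb


def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    counts = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            # k nodes without incoming edges; each may point to any of the other m - k nodes.
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]


growth = [count_dags(n) for n in range(1, 6)]
```

Exhaustive scoring is feasible only for very small n, which is why model-selection criteria are paired with search methods.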
Bayesian Networks (cont.) • Advantages • Complete analysis of correlations between variables • Combination of domain knowledge and data • Disadvantage • Learning a Bayesian network purely from data is expensive: the structure search space grows at least exponentially with the number of variables
Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen, Amol Deshpande ICDE, 2007
A model to capture tuple correlations • Existential probabilistic database • A probability distribution Pr(X) over all possible worlds • Query evaluation • Intermediate tuples • Inference in a probabilistic graphical model
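The possible-worlds view can be illustrated with a brute-force sketch: a query probability is the total probability of the worlds in which the query holds. The tuple names, the probabilities, and the helper below are hypothetical. Enumeration is exponential in the number of tuples, so this shows only the semantics, not the graphical-model inference the paper relies on.

```python
from itertools import product


def query_probability(tuples, joint, predicate):
    """Sum Pr(world) over all possible worlds in which predicate(world) holds."""
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, present in zip(tuples, bits) if present)
        if predicate(world):
            total += joint(world)
    return total


tuples = ["t1", "t2"]

# Correlated case (assumed for illustration): t1 and t2 are mutually exclusive.
world_probs = {frozenset(): 0.1, frozenset({"t1"}): 0.6, frozenset({"t2"}): 0.3}


def correlated(world):
    return world_probs.get(world, 0.0)


# Independent case with the same marginals: Pr(t1) = 0.6, Pr(t2) = 0.3.
marginals = {"t1": 0.6, "t2": 0.3}


def independent(world):
    p = 1.0
    for t in tuples:
        p *= marginals[t] if t in world else 1.0 - marginals[t]
    return p


def at_least_one(world):
    return len(world) > 0


p_correlated = query_probability(tuples, correlated, at_least_one)    # about 0.9
p_independent = query_probability(tuples, independent, at_least_one)  # about 0.72
```

The two answers differ even though the marginals agree, which is why ignoring tuple correlations can give wrong query probabilities.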
Advantages • Models tuple correlations in probabilistic databases • Casts query evaluation as probabilistic inference • Issues • Exact inference in a general probabilistic graphical model is NP-hard • Refine the graphical model beyond directly combining the operators • Use approximate inference
Density-Based Clustering of Uncertain Data Hans-Peter Kriegel, Martin Pfeifle SIGKDD 2005
FDBSCAN • Integrates distance probability distributions into clustering • Distance pdf: Pd(o1,o2) • Core Object Probability • Reachability Probability • Add p to the cluster if Preach(p,o) > 0.5
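As a rough illustration of the sampling idea, the probability that two uncertain objects lie within a distance eps of each other can be estimated from sample points drawn for each object. The sketch below covers only that distance probability. The core object and reachability probabilities build on it and are not reproduced here, and the function name and sample values are assumptions.

```python
from math import dist


def distance_probability(samples_a, samples_b, eps):
    """Estimate P(d(a, b) <= eps) as the fraction of sample pairs within eps."""
    close = sum(1 for a in samples_a for b in samples_b if dist(a, b) <= eps)
    return close / (len(samples_a) * len(samples_b))


# Hypothetical sample points for two uncertain 2-D objects.
samples_o1 = [(0.0, 0.0), (1.0, 0.0)]
samples_o2 = [(0.0, 1.0), (3.0, 0.0)]

p_close = distance_probability(samples_o1, samples_o2, eps=1.5)
```

Here two of the four sample pairs fall within eps, so the estimate is 0.5. The accuracy of such an estimate depends on the number of samples per object.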
FDBSCAN • Advantages • Integrates the distance pdf into the clustering of uncertain data • Experiments show that FDBSCAN outperforms other algorithms, with both high recall and precision • Issues • Distance pdf is approximated by sampling • No upper bound on the error of this approximation is stated • The global threshold 0.5 in Preach(p,o) > 0.5 is arbitrary
Empirical Bayes Screening for Multi-Item Associations William DuMouchel, Daryl Pregibon SIGKDD 2001
A criterion to assert association • A smoothed criterion to analyze correlation • R (lift): sufficient for itemsets with large support • Empirical Bayes estimation of λ • Works at lower support • Reduces the effect of noise • Given pairs (n, e) with n > n*, and n ~ Poisson(λe)
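The difference between the raw ratio R = n/e and a shrunken estimate of λ can be sketched with a conjugate Gamma prior. This is a simplification: the paper estimates its prior from the data, whereas the single Gamma(alpha, rate=beta) prior with fixed hyperparameters below is an assumption made to keep the example short, and the function names are hypothetical.

```python
def lift(n, e):
    """Raw ratio R = observed count / baseline expected count."""
    return n / e


def shrunken_ratio(n, e, alpha=1.0, beta=1.0):
    """Posterior mean of lambda when n ~ Poisson(lambda * e).

    Assumes lambda ~ Gamma(alpha, rate=beta), so the posterior is
    Gamma(alpha + n, rate=beta + e) with mean (alpha + n) / (beta + e).
    """
    return (alpha + n) / (beta + e)


# Two itemsets with the same lift but very different support.
rare = (2, 0.1)        # n = 2, e = 0.1
frequent = (200, 10.0)  # n = 200, e = 10

rare_lift, rare_shrunk = lift(*rare), shrunken_ratio(*rare)
freq_lift, freq_shrunk = lift(*frequent), shrunken_ratio(*frequent)
```

Both itemsets have a lift of 20, but the rare one is pulled strongly toward the prior while the frequent one stays close to its raw ratio, which is the noise-damping behavior the slide describes.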
EXCESS2 • Finds multi-item associations that cannot be explained by pairwise associations • New baseline expectation • eAll2F = frequency predicted by the all-two-factor model, based on the two-way distributions
Advantages • Measures association in itemsets that are not very frequent • Robust to noise
A Survey of Uncertain Data Algorithms and Applications Charu C. Aggarwal, Philip S. Yu TKDE, 2009
Uncertain Data Models • Possible worlds model • Probabilistic ?-table (tuple-level uncertainty) • Independence assumption (inconsistency) • Probabilistic or-set table (attribute-level uncertainty) • Attribute modeled by its pdf
Query Processing • Two semantics • Intensional semantics • Complex, accurate • Extensional semantics • Efficient, approximate • Queries with correlations
Indexing Uncertain Data • Nearest neighbor query • Probabilistic threshold query • Uncertain categorical data • Probabilistic equality query • Probabilistic equality threshold query • Distributional similarity threshold query • Join Processing • Probabilistic join query • Probabilistic similarity join
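For uncertain categorical data, the probability that two independent uncertain values are equal is the sum over categories of the product of their probabilities, and a threshold query keeps the records whose equality probability reaches a threshold. The sketch below is a linear scan rather than an index. The record keys, the distributions, and the use of "at least tau" as the comparison are assumptions of this example.

```python
def equality_probability(u, v):
    """P(u = v) for two independent uncertain categorical values given as {category: probability}."""
    return sum(p * v.get(category, 0.0) for category, p in u.items())


def equality_threshold_query(query, relation, tau):
    """Keys of records whose probability of equaling `query` is at least tau."""
    return [key for key, value in relation.items()
            if equality_probability(query, value) >= tau]


# Hypothetical relation with one uncertain categorical attribute per record.
relation = {
    "a": {"red": 1.0},
    "b": {"blue": 0.2, "green": 0.8},
    "c": {"red": 0.5, "blue": 0.5},
}
query = {"red": 0.5, "blue": 0.5}

matches = equality_threshold_query(query, relation, tau=0.4)
```

Records "a" and "c" each match with probability 0.5 and are returned, while "b" matches with probability 0.1 and is filtered out. An index aims to avoid evaluating every record this way.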
Data Mining Applications • Clustering • FDBSCAN • UK-means • Classification • SVM • Frequent Pattern Mining • U-Apriori • General density-based approach
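Frequent pattern mining over uncertain transactions is commonly based on expected support: with independent item probabilities, an itemset's expected support is the sum over transactions of the product of its items' existence probabilities. The sketch below computes only that quantity, without candidate generation or pruning, and the transactions shown are made up for illustration.

```python
from math import prod


def expected_support(itemset, transactions):
    """Expected number of transactions containing every item of `itemset`.

    Each transaction maps an item to its existence probability; items are
    assumed independent, and a missing item has probability 0.
    """
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in transactions)


# Hypothetical uncertain transactions.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4},
    {"a": 1.0, "b": 1.0},
]

support_a = expected_support(("a",), transactions)       # 0.9 + 0.4 + 1.0
support_ab = expected_support(("a", "b"), transactions)  # 0.45 + 0.0 + 1.0
```

An itemset would then be treated as frequent when its expected support meets a user-chosen minimum.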
Thanks! • Questions?