Data Mining

Data Mining Lecture 1 TIES445 Data mining Nov-Dec 2007 Sami Äyrämö

Data Mining
• 12-14 lectures (on weeks 44-50)
  • Mondays 12:15-14:00
  • Tuesdays 10:15-12:00
  • NOTE: No lectures on week 47
• 3 x 2h demonstrations (on weeks 48-50 in a computer classroom)
• Final exam in January 2008
• 3 cr without seminar work
• 5 cr with seminar work (the seminar will be held in January 2008)

About lectures
• The lectures are based on:
  • Han and Kamber (slides based on Data Mining: Concepts and Techniques)
    http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html
  • Tan, Steinbach and Kumar (slides based on Introduction to Data Mining)
    http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4
  • Some slides by the lecturer

Literature
• P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2005.
• J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.
• D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001.
• D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
• M. Berry, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Wiley, 2004.
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.
• U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, MIT Press, 1996.
• M.H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
• I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
• J.P. Bigus, Data Mining with Neural Networks, McGraw-Hill, 1996.
• J.-M. Adamo, Data Mining for Association Rules and Sequential Patterns: Sequential and Parallel Algorithms, Springer-Verlag, 2001.
• H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer, 1998.

Theses, publications etc.
• M. Pechenizkiy, Feature Extraction for Supervised Learning in Knowledge Discovery Systems, PhD thesis, University of Jyväskylä, 2005.
• S. Äyrämö, Knowledge Mining using Robust Clustering, PhD thesis, University of Jyväskylä, 2006.
• J. Mäkinen, Roskapostin älykäs suodattaminen [Intelligent filtering of spam], Master's thesis, University of Jyväskylä, 2003.
• M. Nurminen, Tiedonlouhinta rakenteisista dokumenteista [Data mining from structured documents], Master's thesis, University of Jyväskylä, 2005.
• K. Arkko, Assosiaatioiden ja sekvenssien louhinta suurista tietomassoista [Mining associations and sequences from large data sets], Master's thesis, University of Jyväskylä, 2006.
• J. Hänninen, Batch- ja online-hermoverkko-opetusalgoritmien ominaisuudet ja eroavaisuudet [Properties and differences of batch and online neural network training algorithms], Master's thesis, University of Jyväskylä, 2006.
• T. Kärkkäinen, MLP-network in a layer-wise form with applications to weight decay, Neural Computation, 14 (6), 1451-1480, 2002.
• T. Kärkkäinen and E. Heikkola, Robust Formulations for Training Multilayer Perceptrons, Neural Computation, 16 (4), 837-862, 2004.
• T. Kärkkäinen and S. Äyrämö, Robust Clustering Methods for Incomplete and Erroneous Data, in Data Mining V: Data Mining, Text Mining and their Business Applications, 2004.
• S. Äyrämö, T. Kärkkäinen, and K. Majava, Robust refinement of initial prototypes for partitioning-based clustering algorithms, in C. Skiadas (Ed.), Recent Advances in Stochastic Modeling and Data Analysis, pp. 473-482, World Scientific, 2007.
• ...many more!

Journals, conferences, ...
• Journals
  • Data Mining and Knowledge Discovery, Springer
  • ACM Transactions on Knowledge Discovery from Data (TKDD), ACM
  • IEEE Transactions on Knowledge and Data Engineering, IEEE
  • SIGKDD Explorations
  • Statistical Analysis and Data Mining, Wiley
  • Data & Knowledge Engineering, Elsevier
  • Computational Statistics & Data Analysis, Elsevier
• Conferences, seminars, workshops
  • ACM SIGKDD, PKDD, PAKDD, (IEEE) ICDM, SIAM Data Mining (SDM), DMIN, ...
  • ICTAI, IJCAI, VLDB, ICDE, ICML, CVPR, MSR, ...

Sample application
[Figure: diagram with the labels Operator, Laborant, Quality, Process data, Customer, Control data, Manager, and Feedback; one element is marked "?"]

Real-world data set
• 35% missing values!
• Unknown number of errors!
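To make a figure such as "35% missing values" concrete, the following is a minimal sketch of how the share of missing cells in a table could be counted. The function name `missing_fraction` and the toy table are hypothetical and not part of the lecture material; the sketch assumes that missing entries are encoded as `None` or NaN.

```python
import math


def missing_fraction(rows):
    """Return the share of cells that are missing (None or NaN)."""
    total = 0
    missing = 0
    for row in rows:
        for value in row:
            total += 1
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing += 1
    # An empty table has no cells, so report 0.0 instead of dividing by zero.
    return missing / total if total else 0.0


# Hypothetical toy table: 4 records x 3 attributes, 3 cells missing.
toy = [
    [1.2, None, 3.4],
    [0.7, 2.1, float("nan")],
    [None, 1.9, 2.8],
    [1.1, 2.0, 3.0],
]

print(f"{missing_fraction(toy):.0%} of values are missing")
```

Counting errors is a different matter: unlike missing cells, erroneous values are not marked as such in the data, which is why their number is unknown.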

Mining Large Data Sets - Motivation
R. Grossman (2001): “During the next decade, the amount of data will continue to explode, while the number of scientists and engineers available to analyze it will remain essentially constant.”
P.S. Bradley (2003): “The ability of organizations to effectively utilize this information for decision support typically lags behind their ability to collect and store it. But, organizations that can leverage their data for decision support are more likely to have a competitive edge in their sector of the market.”

Knowledge Mining (KM) process

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
[Figure: Data Mining shown together with its contributing fields: Statistics/Numerical optimization, Machine Learning/Pattern Recognition/Artificial Intelligence, Visualization, and Database systems]

Major Issues and Challenges in DM/KDD
• Mining methodology
  • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  • Algorithmic requirements: performance (efficiency, scalability, robustness, reliability)
  • High dimensionality, complex and heterogeneous data
  • Pattern evaluation: the interestingness problem
  • Incorporation of background knowledge
  • Data quality: handling noise and incomplete data (robustness, reliability)
  • Parallel, distributed and incremental mining methods
  • Integration of the discovered knowledge with existing knowledge: knowledge fusion
  • Data ownership and distribution
• User interaction
  • Expression and visualization of data mining results
  • Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  • Domain-specific data mining & invisible data mining
  • Protection of data security, integrity, and privacy
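The robustness requirement listed under data quality can be illustrated with a small sketch. It compares the sample mean with the median on a hypothetical measurement series before and after one gross error is added. The data and the helper name `location_estimates` are assumptions made for illustration only; the median is used here as one common example of a robust estimate, not as the specific method of any source cited in these slides.

```python
from statistics import mean, median


def location_estimates(values):
    """Return (mean, median) of a list of numbers."""
    return mean(values), median(values)


# Hypothetical measurements clustered around 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
# The same series with a single gross error appended.
noisy = clean + [1000.0]

clean_mean, clean_median = location_estimates(clean)
noisy_mean, noisy_median = location_estimates(noisy)

# The mean is dragged far away by one bad value; the median barely moves.
print(f"mean shift:   {abs(noisy_mean - clean_mean):.2f}")
print(f"median shift: {abs(noisy_median - clean_median):.2f}")
```

A single erroneous value is enough to move the mean arbitrarily far, while the median stays close to the bulk of the data. This is the kind of behavior the term robustness refers to when data contain noise.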
