CSE591 (575) Data Mining

Presentation Transcript


  1. CSE591 (575) Data Mining 1/21/2003 - 5/6/2003 Computer Science & Engineering ASU

  2. Introduction Introduction to this Course Introduction to Data Mining

  3. Introduction to the Course • First, about you - why take this course? • Your background and strength • AI, DBMS, Statistics, Biology, … • Your interests and requests • What is this course about? • Problem solving • Handling data • transform data to workable data • Mining data • turn data to knowledge • validation and presentation of knowledge

  4. This course • What can you expect from this course? • Knowledge and experience about DM • Problem solving and solution presentation • How is this course conducted? • Presentations • Individual projects • Course Format • Individual Projects 40% • Exams and/or quizzes 40% • Class participation 20% • off-campus students?

  5. Projects - Start NOW! • How to start? • Projects should be sufficiently challenging but reasonable, suitable for one semester • How to choose your individual project • Real-world problems • Problems that might make differences • Two types of projects • Available projects • Self-proposed projects (Approval’s needed)

  6. Some project ideas • Dealing with high dimensional data • Data of supervised, unsupervised learning • Image mining • Feature extraction, clustering of images • Active sampling • Various data structures (kd-trees, R-trees, Multi-Dimen Scaling) • Meta data (RDF, namespace) for mining • Ensemble learning • Sequence mining (HMM learning) • Bioinformatics and applications (feature selection) • Intelligent driving data analysis • Data integration, data reduction (random projection)

  7. How is a project evaluated? • It depends on • What do you want to achieve • Its impact • Your effort • The sooner you start, the better • The beginning is not easy

  8. Course Web Site • http://www.public.asu.edu/~huanliu/cse591.html • My office and office hours • GWC 342 • T 10:30 - 11:30am and Th 4:00-5:00pm • My email: hliu@asu.edu • Slides and relevant information will be made available at the course web site

  9. Any questions and suggestions? • Your feedback is most welcome! • I need it to adapt the course to your needs. • Please feel free to provide yours anytime. • Share your questions and concerns with the class – very likely others may have the same. • No pain no gain – no magic for data mining. • The more you put in, the more you get • Your grades are proportional to your efforts.

  10. Introduction to Data Mining Definitions Motivations of DM Interdisciplinary Links of DM

  11. What is DM? • Or more precisely KDD (knowledge discovery from databases)? • Many definitions • A process, not plug-and-play: raw data → transformed data → preprocessed data → data mining → post-processing → knowledge • One definition is • A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data
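
The staged process on this slide can be sketched as a chain of small functions. This is only an illustration: the function names, the toy comma-separated records, and the `min_support` cutoff are assumptions made for the example and do not come from the course.

```python
from collections import Counter

def transform(raw_rows):
    """Raw data -> transformed data: split comma-separated text into fields."""
    return [row.strip().split(",") for row in raw_rows]

def preprocess(rows):
    """Transformed data -> preprocessed data: drop rows with missing fields."""
    return [r for r in rows if len(r) == 2 and all(f.strip() for f in r)]

def mine(rows):
    """Data mining: count how often each (item, outcome) pair occurs."""
    return Counter((item.strip(), outcome.strip()) for item, outcome in rows)

def postprocess(counts, min_support=2):
    """Post-processing: keep only patterns seen at least min_support times."""
    return {pattern: n for pattern, n in counts.items() if n >= min_support}

def kdd(raw_rows, min_support=2):
    """Run the stages in order; the result stands in for 'knowledge'."""
    return postprocess(mine(preprocess(transform(raw_rows))), min_support)

# Hypothetical toy records; "milk," has a missing field and is filtered out.
raw = ["milk,buy", "milk,buy", "bread,skip", "milk,", "bread,buy", "milk,buy"]
knowledge = kdd(raw)
print(knowledge)
```

Each stage only consumes the output of the previous one, which is the sense in which KDD is a process and not a single plug-and-play step.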

  12. Need for Data Mining • Data accumulate and double every 9 months • There is a big gap from stored data to knowledge; and the transition won’t occur automatically. • Manual data analysis is not new but a bottleneck • Fast developing Computer Science and Engineering generates new demands • Seeking knowledge from massive data • Any personal experience?
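
To make the doubling claim concrete, the short sketch below computes the growth factor implied by a fixed doubling period. The 9-month period is taken from the slide; the function name and the assumption of steady exponential growth are illustrative only.

```python
def growth_factor(months, doubling_period=9.0):
    """Factor by which data volume grows after `months`,
    assuming it doubles once every `doubling_period` months."""
    return 2.0 ** (months / doubling_period)

# Three years is four doubling periods of 9 months each.
three_year_factor = growth_factor(36)
print(three_year_factor)
```

Under this assumption, three years of accumulation multiplies the stored volume sixteenfold, which is why manual analysis becomes a bottleneck.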

  13. When is DM useful • Data rich • Two invited talks so far have convincingly demonstrated it • Large data (dimensionality and size) • Image data (size) • Gene data (dimensionality) • Little knowledge about data (exploratory data analysis) • What if we have some knowledge?

  14. DM perspectives • Prediction, description, explanation, optimization, and exploration • Completion of knowledge (patterns vs. models) • Understandability and representation of knowledge • Some applications • Business intelligence (CRM) • Security (Info, Comp Systems, Networks, Data, Privacy) • Scientific discovery (bioinformatics)

  15. Challenges • Increasing data dimensionality and data size • Various data forms • New data types • Streaming data, multimedia data • Efficient search and data access • Intelligent update and integration

  16. Interdisciplinary Links of DM • Statistics • Databases • AI • Machine Learning • Visualization • High Performance Computing • supercomputers, distributed/parallel/cluster computing

  17. Statistics • Discovery of structures or patterns in data sets • hypothesis testing, parameter estimation • Optimal strategies for collecting data • efficient search of large databases • Static data • constantly evolving data • Models play a central role • algorithms are of a major concern • patterns are sought

  18. Relational Databases • A relational database can contain several tables • Tables and schemas • The goal in data organization is to maintain data and quickly locate the requested data • Queries and index structures • Query execution and optimization • Query optimization is to find the best possible evaluation method for a given query • Providing fast, reliable access to data for data mining
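
As a small illustration of tables, queries, and index structures, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table, its rows, and the index name are hypothetical; SQLite is used only because it needs no setup, not because the course prescribes it.

```python
import sqlite3

# In-memory database with one table (the schema is made up for this example).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ann", "Tempe"), ("Bob", "Phoenix"), ("Cho", "Tempe")],
)

# An index structure on the column used for lookups.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# A query that locates the requested rows.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Tempe",)
).fetchall()

# Ask the optimizer which evaluation method it chose for the lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM customers WHERE city = ?", ("Tempe",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

print(rows)
print(plan_text)
conn.close()
```

The query plan text is what shows the optimizer picking an index search over a full table scan; its exact wording varies between SQLite versions.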

  19. AI • Intelligent agents • Perception-Action-Goal-Environment • Search • uniform cost and informed search algorithms • Knowledge representation • FOL, production rules, frames with semantic networks • Knowledge acquisition • Knowledge maintenance and application
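
Uniform cost search, mentioned above, can be sketched with a priority queue ordered by path cost. The graph, node names, and edge costs below are invented for illustration; this is a minimal version, not a reference implementation.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None.

    `graph` maps each node to a list of (neighbor, step_cost) pairs with
    non-negative costs.
    """
    frontier = [(0, start, [start])]  # priority queue ordered by path cost
    expanded = {}                     # cheapest cost at which a node was expanded
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in expanded and expanded[node] <= cost:
            continue  # already expanded via a path that was at least as cheap
        expanded[node] = cost
        for neighbor, step in graph.get(node, []):
            heapq.heappush(frontier, (cost + step, neighbor, path + [neighbor]))
    return None

# Hypothetical weighted graph.
graph = {
    "A": [("B", 1), ("C", 5)],
    "B": [("C", 1), ("D", 7)],
    "C": [("D", 2)],
    "D": [],
}
result = uniform_cost_search(graph, "A", "D")
print(result)
```

An informed search such as A* would follow the same structure but order the queue by path cost plus a heuristic estimate of the remaining cost.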

  20. Machine Learning • Focusing on complex representations, data-intensive problems, and search-based methods • Flexibility with prior knowledge and collected data • Generalization from data and empirical validation • statistical soundness and computational efficiency • constrained by finite computing & data resources • Challenges from KDD • scaling up, cost info, auto data preprocessing
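
Generalization and empirical validation can be illustrated with a held-out test set: fit a model on one part of the data and measure accuracy on the part it has not seen. The one-dimensional threshold classifier and the toy data below are assumptions chosen to keep the sketch short.

```python
def fit_threshold(train):
    """Learn a threshold midway between the means of the two classes."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of examples where 'x > threshold' matches the label."""
    hits = sum((x > threshold) == (y == 1) for x, y in data)
    return hits / len(data)

# Hypothetical labeled data: (feature value, class label).
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

train = data[::2]   # every other example is used for fitting
test = data[1::2]   # the rest is held out for validation

threshold = fit_threshold(train)
held_out_accuracy = accuracy(threshold, test)
print(threshold, held_out_accuracy)
```

Accuracy on the held-out half, not on the training half, is the empirical evidence that the learned rule generalizes.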

  21. Visualization • Producing a visual display with insights into the structure of the data with interactive means • zoom in/out, rotating, displaying detailed info • Various branches of visualization methods • show summary properties and explore relationships between variables • investigate large databases and convey lots of information • analyze data with geographic/spatial location • A pre- and post-processing tool for KDD

  22. Bibliography • W. Klosgen & J.M. Zytkow (eds.), 2001, Handbook of Data Mining and Knowledge Discovery.