
Data mining: theory and applications


Presentation Transcript


  1. Data mining: theory and applications Heikki Mannila

  2. Data Mining: Theory and Applications • Data analysis becoming more important in other sciences and in industry • New measurement methods • Ability to store data • High-dimensional large data sets • Non-traditional forms (e.g., strings, trees, graphs) • Data analysis lags behind

  3. Data mining • Has emerged as a major research area at the interface of computer science and statistics • Machine learning, databases, algorithms • Data analysis questions are increasingly visible in database and algorithms research • Theory and practice interact

  4. Goals • Develop novel data analysis techniques for the use of other sciences and industry • How? • Look at data analysis problems arising in practice • Abstract new computational concepts from them • Analyse the concepts and develop new computational methods • Take the results into practice • Theoretical work in algorithms and foundations of data analysis can have fast impact in the application areas • The applications feed interesting novel questions to theoretical research

  5. Major themes in methods • Pattern discovery • Methods for sequence decomposition • Interplay of combinatorial and continuous methods in data mining • Techniques for the decomposition of large 0-1 data sets.

  6. Application areas • Genome structure • Gene expression data analysis • Palaeontology • Linguistic applications • Ubiquitous computing

  7. The people • Heikki Mannila, Hannu Toivonen, Jaakko Hollmen, Aristides Gionis, Floris Geerts, Bart Goethals • 6 Ph.D. students • A visible position in the international community

  8. Highlights • Finding recurrent sources in sequences • global structure in genomic sequences • recognizing recurrent contexts in mobile device usage • (k,h)-segmentation • Finding orderings of attributes from unordered binary data • Fragments of order • Spectral ordering techniques • Pattern discovery and mixture modelling techniques for onomastic data sets • Methods for finding topics in 0-1 datasets on the basis of co-occurrence information

  9. Finding recurrent sources in sequences • Sequences • DNA • Telecommunications • Etc. • How to find some global structure from a sequence? • Try to find homogeneous segments from the sequence

  10. Finding homogeneous segments • T = S1 S2 S3 S4 S5 S6 • Sequence T, integer k • Measure of homogeneity H for segments of T • E.g., H(S) = |S| Var(S) • Find the division T = S1, S2, …, Sk minimizing H(S1) + H(S2) + … + H(Sk) • Dynamic programming • (k,k)-segmentation: k segments with no relationship to each other; independent sources
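
The dynamic-programming step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-dimensional numeric sequence, takes H(S) = |S| Var(S) as on the slide, and assumes the objective is the sum of H over the k segments. The function name `segment` and its return format are hypothetical.

```python
from typing import List, Tuple

def segment(seq: List[float], k: int) -> Tuple[float, List[int]]:
    """Split seq into k contiguous segments minimizing the sum of |S| * Var(S).

    Returns (total cost, start index of each segment).
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(seq)")
    # Prefix sums give |S| * Var(S) = sum(x^2) - (sum x)^2 / |S| in O(1).
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, x in enumerate(seq):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def cost(i: int, j: int) -> float:
        """H(seq[i:j]): the sum of squared deviations from the segment mean."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[m][j]: minimum cost of splitting seq[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i
    # Follow the back-pointers from the end to recover the segment starts.
    starts = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts
```

The triple loop makes this O(k n²) in time, which is adequate for short sequences; calling `segment(values, 3)` returns the cost together with the three segment start positions.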

  11. (k,h)-segmentation • We want to limit the number of different types of segments • Only h<k different types are allowed • Find the best segmentation of T into k segments by using only h different types of segments

  12. Source 1 Source 3 Source 2 (6,3)-segmentation

  13. Data

  14. k = 3 and h = 3

  15. k = 3 and h = 2

  16. (k,h)-segmentation problem • Given sequence T • Find h sources w1, w2, …, wh • A decomposition of sequence T into k segments T = S1 S2 … Sk • Minimizing the sum of distances from each point t to the source wa(t) of the segment to which t belongs
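
To make the objective concrete, here is a small sketch of one part of the problem: given segment boundaries that are already fixed, choose which of h sources each segment uses. It is an illustrative assumption, not the algorithm behind the slides. It assumes one-dimensional data and squared L2 error, takes each source to be the mean of the points assigned to it, and searches all h^k labelings, so it is only usable for small k. The name `assign_sources` and the `starts` input format are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

def assign_sources(
    seq: Sequence[float], starts: Sequence[int], h: int
) -> Tuple[float, Tuple[int, ...], List[float]]:
    """Label each given segment with one of h sources, minimizing squared error.

    starts holds the start index of every segment (assumed input format).
    Returns (cost, label of each segment, value of each source).
    """
    bounds = list(starts) + [len(seq)]
    segments = [seq[bounds[i]:bounds[i + 1]] for i in range(len(starts))]
    best: Tuple[float, Tuple[int, ...], List[float]] = (float("inf"), (), [])
    for labels in product(range(h), repeat=len(segments)):
        # Each source is the mean of all points in the segments assigned to it.
        sources = []
        for a in range(h):
            pts = [x for seg, lab in zip(segments, labels) if lab == a for x in seg]
            sources.append(sum(pts) / len(pts) if pts else 0.0)
        # Objective: squared distance from every point to its segment's source.
        cost = sum((x - sources[lab]) ** 2
                   for seg, lab in zip(segments, labels) for x in seg)
        if cost < best[0]:
            best = (cost, labels, sources)
    return best
```

For a (6,3) instance such as `assign_sources([1, 1, 5, 5, 1, 1, 9, 9, 5, 5, 9, 9], [0, 2, 4, 6, 8, 10], 3)`, segments with the same level end up sharing a source. The full problem also has to choose the boundaries, which this sketch leaves out.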

  17. Results • (k,h)-segmentation problem is NP-hard for dimension d>1, for L1 and L2 metrics • Dimension d=1: complexity open • Simple approximation algorithms • d=1: 3-approximation for L1 • d=1: -approximation for L2 • d>1: (3+ε)-approximation for L1 for any ε>0 • d>1: (A+2)-approximation for L2, where A is the best approximation factor for k-means clustering • Very good performance in practice • The algorithms work for any generative model (not just reals with Lp metrics)

  18. Example: onomastic data • Names of lakes in Finland • About 150,000 lakes • What are the main trends? • High-dimensional marked point process • Collaboration with Research Center for the Languages of Finland (Kotus) • Similar data analysis problems arise also in environmental sciences

  19. Clustering on the basis of the names of lakes Similarity with the names of lakes in Kangasala

  20. Example: paleontological data • Given a matrix of occurrences of species in fossil sites • Ages of the fossil sites are not available • How to order the sites according to their age? • Background information: species arrive and vanish • Try to find ordering that minimizes Lazarus events • Figure: 0-1 occurrence matrix for species A, B, C against time (entries: 0 0 1 1 1 0 1 0 1 0 1 0), with Lazarus events marked
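
The ordering criterion can be illustrated with a short sketch. It assumes a Lazarus event is a 0 lying between two 1s in a species column once the sites are put in a given order, that is, a species missing from a site between its first and last occurrence. The function names and the example matrix are made up for illustration, and the exhaustive search is only feasible for a handful of sites.

```python
from itertools import permutations
from typing import Sequence, Tuple

def lazarus_count(rows: Sequence[Sequence[int]]) -> int:
    """Count 0s lying between the first and last 1 of each species column.

    rows[i][j] is 1 if species j occurs at the i-th site in the given order.
    """
    total = 0
    for col in zip(*rows):
        present = [i for i, v in enumerate(col) if v]
        if present:
            span = present[-1] - present[0] + 1
            total += span - len(present)
    return total

def best_order(sites: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Try every site ordering and return one with the fewest Lazarus events."""
    return min(
        permutations(range(len(sites))),
        key=lambda p: lazarus_count([sites[i] for i in p]),
    )

# Made-up example: four sites (rows) and three species (columns).
sites = [[0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
order = best_order(sites)
print(order, lazarus_count([sites[i] for i in order]))
```

An ordering and its reverse always give the same count, so the criterion can recover the sequence of sites but not the direction of time.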

  21. Methods • Spectral ordering: form a Laplacian of the co-occurrence matrix, look at eigenvectors • Fragments of order: find short segments of orders which are not violated by observations • Other applications: text analysis, telecommunications
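
A rough sketch of the spectral ordering idea follows. The slide does not say how the co-occurrence matrix is built, so this assumes the similarity of two sites is the number of species they share, forms the graph Laplacian L = D - W, and sorts the sites by the eigenvector of the second-smallest eigenvalue (the Fiedler vector). To stay dependency-free it approximates that eigenvector with a shifted power iteration; a linear-algebra library would normally be used instead. The function name and parameters are hypothetical.

```python
from typing import List, Sequence

def spectral_order(rows: Sequence[Sequence[int]], iters: int = 2000) -> List[int]:
    """Order sites by the Fiedler vector of their co-occurrence graph Laplacian."""
    n = len(rows)
    # Similarity of two sites: number of species they share (an assumed choice).
    w = [[sum(a * b for a, b in zip(rows[i], rows[j])) if i != j else 0
          for j in range(n)] for i in range(n)]
    deg = [sum(r) for r in w]
    # Laplacian L = D - W; its smallest eigenvalue is 0 with a constant eigenvector.
    lap = [[(deg[i] if i == j else 0) - w[i][j] for j in range(n)] for i in range(n)]
    shift = 2 * max(deg) + 1  # exceeds every eigenvalue of L (Gershgorin bound)
    v = [float(i) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iters):
        # Project out the constant vector, apply (shift*I - L), then normalize.
        mean = sum(v) / n
        v = [x - mean for x in v]
        v = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return sorted(range(n), key=lambda i: v[i])

# Made-up example: four sites (rows) and three species (columns).
sites = [[0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
print(spectral_order(sites))
```

The sign of an eigenvector is arbitrary, so the result is an ordering up to reversal. The sketch also assumes at least two sites and a connected co-occurrence graph.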

  22. Fortelius, Jernvall, Gionis, Mannila, in preparation

  23. Future research directions • Theory and practice • The combination of continuous and combinatorial methods • Concepts and algorithms for describing structure of sequences • Methods for pattern discovery in and modelling of spatiotemporal data • Theoretical models for data mining (such as inductive databases) • Foundational issues in pattern discovery (e.g., logical form of patterns and the difficulty in discovering them) • Publications, collaborations, software releases

  24. Applications in the future • Genome structure and its relation to function • Linguistic applications: spatial and temporal variation in language • Ubiquitous computing and telecommunications applications • Paleontological and ecological applications

  25. Mobile Computing Research at HIIT Kimmo Raatikainen Research Director Helsinki Institute for Information Technology kimmo.raatikainen@hiit.fi

  26. Fuego Mission • To address the research challenges arising in mobile computing systems and applications of tomorrow. Mobile computing will fulfil the vision of ubiquitous - invisible - computing providing access and services anytime, anywhere, and anyhow. The key research challenges are related to context-awareness, reconfigurability, adaptability, understanding user needs and experience, and personalization. • "Any technology distinguishable from magic is insufficiently advanced," Gregory Benford

  27. Present State • Some 20 researchers organised in two closely co-operating research groups • Mobile Computing Group (Prof. Kimmo Raatikainen) • User Experience Research Group (Prof. Martti Mäntylä) • Other senior researchers and post-docs: • Dr. Ken Rimey (software technologies, distributed computing) • Dr. Pekka Nikander, permanent visitor from Ericsson Research (security and privacy in Mobile Internet) • Dr. Timo Saari (user experience research, media science) • Dr. Jan Lindström (distributed data management, mobile data) • Other postdocs likely to be hired in 2004

  28. Current Research Topics • Middleware for Mobile Wireless Internet – Fuego Core project • Mobile distributed event system • Mobile (XML-based) file system with intelligent synchronization • SOAP messaging over wireless (W3C: XML Binary Infoset) • Mobile Presence • Host Identity Protocol • Personal Distributed Information Storage – PDIS project • Synchronization-based peer-to-peer infrastructure for storage of structured XML data: PIM data, metadata for digital media • Context Recognition by User Situation Data Analysis – CONTEXT project • Bridge between User Experience Group at ARU and Adaptive Computing Systems Group at BRU • Software Architectures for Configurable Ubiquitous Systems – Sarcous project by SoberIT at HUT • Managing the large variety of software products

  29. Targets to 2005-2010 – 1/3 • to enlarge and strengthen international co-operation • current: WWRF, UCB, Fraunhofer FOKUS • new: Japan, KCL/Mobile VCE, a European NoE, CMU, … • but not forgetting co-operation in Finland: • HUT, UHE, Tampere Univ Tech, Univ Oulu, UIAH, … • to contribute to software architecture for Wireless World • to address challenges due to personal networking • minimal differences between solution stacks for ad-hoc communities and networked infrastructure • peer-to-peer, device-to-device solutions

  30. Not in primary focus but perhaps later (and other smart places) Not in primary focus

  31. Targets to 2005-2010 – 2/3 • to put more focus on infrastructure for context-awareness and dynamic (end-user) systems • context modelling: presentation, maintenance, sharing, protection, reasoning, and queries • decision rules for reconfiguration • reflective (self-aware) middleware for personal networking • Fault tolerance in Wireless World • traditional exception will be the usual case • compensations, delayed/delegated actions, … • Trust and privacy in Wireless World

  32. Targets to 2005-2010 – 3/3 • user needs and novel application concepts • human factors of the Wireless World • basic psychosocial mechanisms • what makes a service use experience engaging and sustaining? • user-centric concept design (UCPCD) • process, methods, tools • novel application concepts based on context-awareness, other novel technologies • experience prototypes
