Data mining: theory and applications

Data mining: theory and applications Heikki Mannila

Data Mining: Theory and Applications • Data analysis becoming more important in other sciences and in industry • New measurement methods • Ability to store data • High-dimensional large data sets • Non-traditional forms (e.g., strings, trees, graphs) • Data analysis lags behind

Data mining • Has emerged as a major research area in the interface of computer science and statistics • Machine learning, databases, algorithms • Data analysis questions are increasingly visible in database and algorithms research • Theory and practice interact

Goals • Develop novel data analysis techniques for the use of other sciences and industry • How? • Look at data analysis problems arising in practice • Abstract new computational concepts from them • Analyse the concepts and develops new computational methods • Take the results into practice • Theoretical work in algorithms and foundations of data analysis can have fast impact in the application areas • The applications feed interesting novel questions to theoretical research

Major themes in methods • Pattern discovery • Methods for sequence decomposition • Interplay of combinatorial and continuous methods in data mining • Techniques for the decomposition of large 0-1 data sets.

Application areas • Genome structure • Gene expression data analysis • Palaeontology • Linguistic applications • Ubiquitous computing

The people • Heikki Mannila, Hannu Toivonen, Jaakko Hollmen, Aristides Gionis, Floris Geerts, Bart Goethals • 6 Ph.D. students • A visible position in the international community

Highlights • Finding recurrent sources in sequences • global structure in genomic sequences • recognizing recurrent contexts in mobile device usage • (k,h)-segmentation • Finding orderings of attributes from unordered binary data • Fragements of order • Spectral ordering techniques • Pattern discovery and mixture modelling techniques for onomastic data sets • Methods for finding topics in 0-1 datasets on the basis of co-occurrence information

Finding recurrent sources in sequences • Sequences • DNA • Telecommunications • Etc. • How to find some global structure from a sequence? • Try to find homogenous segments from the sequence

Finding homogenous segments T = S1 S2 S3S4 S5 S6 • Sequence T, integer k • Measure of homogeneity H for segments of T • E.g., H(S) = |S| Var(S) • Find the division T = S1,S2,…, Sk minimizing • Dynamic programming • (k,k)-segmentation: k-segments with no relationship to each other; independent sources

(k,h)-segmentation • We want to limit the number of different types of segments • Only h<k different types are allowed • Find the best segmentation of T into k segments by using only h different types of segments

Source 1 Source 3 Source 2 (6,3)-segmentation

Data

k = 3 and h = 3

k = 3 and h = 2

(k,h)-segmentation problem • Given sequence T • Find h sourcesw1,w2,…, wh • A decomposition of sequence T into k segments T = S1 S2 … Sk • Minimizing the sum of distances from each point t to the source wa(t) of the segment to which t belongs to

Results • (k,h)-segmentation problem is NP-hard for dimension d>1, for L1 and L2metrics • Dimension d=1: complexity open • Simple approximation algorithms • d=1: 3-approximation for L1 • d=1: -approximation for L2 • d>1: 3+e –approximation for L1 for any e>0 • d>1: A+2 –approximation for L2, where A is the best approximation factor for k-means clustering • Very good performance in practice • The algorithms work for any generative model (not just reals with Lp metrics)

Example: onomastic data • Names of lakes in Finland • About 150,000 lakes • What are the main trends? • High-dimensional marked point process • Collaboration with Research Center for the Languages of Finland (Kotus) • Similar data analysis problems arise also in environmental sciences

Clustering on the basis of the names of lakes Similarity with the names of lakes in Kangasala

Example: paleontological data • Given a matrix of occurrences of species in fossil sites • Ages of the fossil sites are not available • How to order the sites according to their age? • Background information: species arrive and vanish • Try to find ordering that minimizes Lazarus events species A B C 0 0 1 1 1 0 1 0 1 0 1 0 time Lazarus events

Methods • Spectral ordering: form a Laplacian of the co-occurrence matrix, look at eigenvectors • Fragments of order: find short segments of orders which are not violated by observations • Other applications: text analysis, telecommunications

Fortelius, Jernvall, Gionis, Mannila, in preparation

Future research directions • Theory and practice • The combination of continuous and combinatorial methods • Concepts and algorithms for describing structure of sequences • Methods for pattern discovery in and modelling of spatiotemporal data • Theoretical models for data mining (such as inductive databases) • Foundational issues in pattern discovery (e.g., logical form of patterns and the difficulty in discovering them) • Publications, collaborations, software releases

Applications in the future • Genome structure and its relation to function • Linguistic applications: spatial and temporal variation in language • Ubiquitous computing and telecommunications applications • Paleontological and ecological applications

Mobile Computing Research at HIIT Kimmo Raatikainen Research Director Helsinki Institute for Information Technology kimmo.raatikainen@hiit.fi

To address the research challenges arising in mobile computing systems and applications of tomorrow. Mobile computing will fulfil the vision of ubiquitous - invisible - computing providing access and services anytime, anywhere, and anyhow. The key research challenges are related to context-awareness, reconfigurability, adaptability, understanding user needs and experience, and personalization. Fuego Mission ”Any technology distinguishable from magic is insufficiently advanced,” Gregory Benford

Present State • Some 20 researchers organised in two closely co-operating research groups • Mobile Computing Group (Prof. Kimmo Raatikainen) • User Experience Research Group (Prof. Martti Mäntylä) • Other senior researchers and post-docs: • Dr. Ken Rimey (software technologies, distributed computing) • Dr. Pekka Nikander, permanent visitor from Ericsson Research (security and privacy in Mobile Internet) • Dr. Timo Saari (user experience research, media science) • Dr. Jan Lindström (distributed data management, mobile data) • Other postdocs likely to be hired 2004

Current Research Topics • Middleware for Mobile Wireless Internet – Fuego Core project • Mobile distributed event system • Mobile (XML-based) file system with intelligent synchronization • SOAP messaging over wireless (W3C: XML Binary Infoset) • Mobile Presence • Host Identity Protocol • Personal Distributed Information Storage – PDIS project • Synchronization-based peer-to-peer infrastructure for storage of structured XML data: PIM data, metadata for digital media • Context Recognition by User Situation Data Analysis – CONTEXT project • Bridge between User Experience Group at ARU and Adaptive Computing Systems Group at BRU • Software Architectures for Configurable Ubiquitous Systems – Sarcous project by SoberIT at HUT • Managing the large variety of software products

Targets to 2005-2010 – 1/3 • to enlarge and strengthen international co-operation • current: WWRF, UCB, Fraunhofer FOKUS • new: Japan, KCL/Mobile VCE, an European NoE, CMU, … • but not forgetting co-operation in Finland: • HUT, UHE, Tampere Univ Tech, Univ Oulu, UIAH, … • to contribute to software architecture for Wireless World • to address challenges due to personal networking • minimal differences between solution stacks for ad-hoc communities and networked infrastructure • peer-to-peer, device-to-device solutions

Not in primary focus but perhaps latter (and other smart places) Not in primary focus

Targets to 2005-2010 – 2/3 • to put more focus on infrastructure for context-awareness and dynamic (end-user) systems • context modelling: presentation, maintenance, sharing, protection, reasoning, and queries • decision rules for reconfiguration • reflective (self-aware) middleware for personal networking • Fault tolerance in Wireless World • traditional exception will be the usual case • compensations, delayed/delegated actions, … • Trust and privacy in Wireless World

Targets to 2005-2010 – 3/3 • user needs and novel application concepts • human factors of the Wireless World • basic psychosocial mechanisms • what makes a service use experience engaging and sustaining? • user-centric concept design (UCPCD) • process, methods, tools • novel application concepts based on context-awareness, other novel technologies • experience prototypes

Data mining: theory and applications

Data mining: theory and applications

Presentation Transcript

Data Mining with Clementine

Frequent Item Mining

Sequence Data Mining: Techniques and Applications

CS490D: Introduction to Data Mining Prof. Walid Aref

Drug Safety Assessment and Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

BUDT 725: Models and Applications in Operations Research

Advanced Topics in Data Mining: Association Rules

DATA WAREHOUSING AND DATA MINING

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

CSE 538 Web Search and Mining Web Crawling

Approximate Mining of Consensus Sequential Patterns

Data Mining Tools

Chapter 32

Data Mining with DB

Data Mining using Fractals and Power laws

Data Mining with CANape 9.0

Predictive Learning from Data