Ph. D Student: TA Minh Thuy : USTH 2010 Director of thesis: Prof. LE Thi Hoai An

Optimization and Operations Research Techniques for Mining Evolving and Temporal Data
(original French title: Techniques d’optimisation et de recherche opérationnelle en fouille de données évolutives et temporelles)
Ph.D. student: TA Minh Thuy (USTH 2010)
Thesis director: Prof. LE Thi Hoai An
Co-director: Dr. Lydia Boudjeloud-Assala
LITA, EA3097 - UFR MIM, University Paul Verlaine - Metz, France

About me
• Objective:
- Develop new models
- Develop new optimization methods
• Problems: unsupervised classification and variable selection for mining evolving and temporal data (data streams)
• Start date: 1 Dec 2010
• Team: Algorithms and Optimization
• Category: Information Technology
• Fields of research: Data Mining, Data Stream, Clustering, Classification, Feature Selection

Context
• For many recent applications, the concept of a data stream is more appropriate than that of a data set.
• The volume of such data is so large that it may be impossible to store the data on disk. Even when the data can be stored, the volume of incoming data may be so large that no record can be processed more than once.
• Data in a stream exhibit temporal correlations. These correlations can help detect the important characteristics of how the data evolve, and can be used to develop efficient mining algorithms.

Context
• The stream model is motivated by emerging applications involving massive data sets.
• Examples: telephone records, customer click streams, multimedia data, financial transactions, ...
• In these cases, the data evolve continuously.
• Examples of what drives this evolution:
- the dynamism of the services: content, structure, promotions, ...
- changes in user behavior and client interest, ...
- time: time of the day, day of the week, ...
- events: summer vacations, new year, ...
• Therefore, data streams pose special challenges for data mining algorithms. It is necessary to design mining algorithms that account effectively for changes in the underlying structure of the data stream.

Problems:
• Problem 1: Clustering data streams.
• Existing methods for mining data streams focus on the whole period of data.
• Consequently, they detect only the behaviors that are predominant over the entire period of analysis. Behaviors occurring in short periods of time are not detected.
• Model for the data stream clustering problem: fixed windows
• The analyzed time period is divided into more significant sub-periods, with the aim of detecting the evolution of old patterns or the emergence of new ones, which would not have been revealed by a global analysis over the whole time period.
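
The fixed-window model can be illustrated with a minimal sketch. It assumes numeric records and uses plain k-means (Lloyd's algorithm) as a stand-in for the clustering step; it is not the DC programming approach of this thesis. The names `fixed_windows`, `kmeans`, and `window_centers` are hypothetical and chosen for illustration only.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def fixed_windows(stream: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive, non-overlapping windows of at most `size` records."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield window
            window = []
    if window:  # last, possibly shorter, window
        yield window


def sq_dist(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: List[Point], k: int, iters: int = 20, seed: int = 0) -> List[Point]:
    """Plain Lloyd's k-means (illustrative stand-in, not DCA)."""
    k = min(k, len(points))
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def window_centers(stream: Iterable[Point], size: int, k: int) -> List[List[Point]]:
    """Cluster each fixed window independently and return its centers."""
    return [kmeans(w, k) for w in fixed_windows(stream, size)]


if __name__ == "__main__":
    # Toy stream: the second cluster moves from around 10 to around 50.
    stream = [(0.0,), (1.0,), (10.0,), (11.0,), (0.0,), (1.0,), (50.0,), (51.0,)]
    for i, centers in enumerate(window_centers(stream, size=4, k=2)):
        print(i, sorted(centers))
```

Clustering each window separately exposes the move of the second group, which a single clustering over the whole stream would tend to blur into one averaged pattern.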

Problems:
• Problem 2: Detecting changes in data streams.
• In a data stream, the data patterns may evolve over time. How do the data change over time?
- A cluster of behavior disappears
- A cluster of behavior appears
- A cluster of behavior splits
- Two or more clusters of behavior merge
- No change
• Model for the change detection problem: sliding windows
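
The five kinds of change listed above can be sketched as a comparison between the clusterings of two consecutive sliding windows. The sketch assumes that overlapping windows share some objects, so clusters can be matched through their common members. The function `detect_changes`, the overlap rule, and the 0.5 threshold are illustrative assumptions, not the detection model of this thesis.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, List[str], List[str]]  # (kind, old cluster names, new cluster names)


def detect_changes(
    old: Dict[str, Set[int]],
    new: Dict[str, Set[int]],
    threshold: float = 0.5,
) -> List[Event]:
    """Classify cluster changes between two windows.

    An old and a new cluster are linked when their shared members cover at
    least `threshold` of the smaller of the two (an assumed, simple rule).
    """
    links = {
        (o, n)
        for o, o_members in old.items()
        for n, n_members in new.items()
        if o_members and n_members
        and len(o_members & n_members) >= threshold * min(len(o_members), len(n_members))
    }
    successors = {o: sorted(n for (oo, n) in links if oo == o) for o in old}
    predecessors = {n: sorted(o for (o, nn) in links if nn == n) for n in new}

    events: List[Event] = []
    for o, ns in successors.items():
        if not ns:
            events.append(("disappearance", [o], []))
        elif len(ns) > 1:
            events.append(("split", [o], ns))
    for n, os_ in predecessors.items():
        if not os_:
            events.append(("appearance", [], [n]))
        elif len(os_) > 1:
            events.append(("merge", os_, [n]))
    for o, ns in successors.items():
        if len(ns) == 1 and len(predecessors[ns[0]]) == 1:
            events.append(("no change", [o], ns))
    return events


if __name__ == "__main__":
    # Hypothetical clusters, given as sets of object identifiers.
    old = {"A": {1, 2, 3, 4}, "B": {5, 6}, "C": {7, 8}, "D": {9, 10}, "E": {11, 12}}
    new = {"A1": {1, 2}, "A2": {3, 4}, "CD": {7, 8, 9, 10}, "E": {11, 12, 13}, "F": {20, 21}}
    for event in detect_changes(old, new):
        print(event)
```

Matching through shared members only works when the windows overlap; with disjoint windows, a distance between cluster centers would be a more natural linking criterion.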

Problems:
• Problem 3: Feature selection based clustering.
• An object can be represented by variables of different types (quantitative, qualitative or structured). The nature of the variables influences the definition of similarity between objects, so the choice of variables is very important.
• The question is how to select the relevant variables and eliminate the redundant ones.
• Applications include:
- medical diagnosis (cancer risk assessment, detection of cardiac arrhythmia, …)
- text categorization (classification of email as spam or not, classification of web pages, …)
- pattern recognition (face recognition, handwritten digits, ...)
- …
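
To make the idea of a redundant variable concrete, here is a minimal sketch of a correlation filter for quantitative variables: a variable is dropped when it is almost perfectly correlated with one already kept. This is a simple stand-in for illustration, not the feature selection method studied in this thesis; the function names, the 0.95 threshold, and the toy data are assumptions.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_redundant(features: Dict[str, Sequence[float]], max_abs_corr: float = 0.95) -> List[str]:
    """Keep a variable only if it is not strongly correlated with any kept one."""
    kept: List[str] = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < max_abs_corr for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical data: the same height measured in two units, plus a weight.
    features = {
        "height_cm": [150, 160, 170, 180],
        "height_in": [59.1, 63.0, 66.9, 70.9],
        "weight_kg": [70, 55, 80, 60],
    }
    print(drop_redundant(features))
```

The filter only addresses redundancy between quantitative variables; relevance to the clustering itself, and qualitative or structured variables, need other criteria.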

Methodology
• We use mathematical techniques, in particular optimization techniques, to address data mining problems. Many real-world optimization problems are nonconvex.
• To solve nonconvex optimization problems, we study DC (Difference of Convex functions) programming and DCA (DC Algorithms).
• DC programming and DCA were introduced in 1985 by Pham Dinh Tao and have been developed by Le Thi Hoai An and Pham Dinh Tao since 1994; they are now classic and increasingly popular.
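
In its standard form, DCA writes the objective as f = g - h with g and h convex, and at each iteration replaces h by its linearization at the current point: take y_k in the subdifferential of h at x_k, then take x_{k+1} as a minimizer of g(x) - <x, y_k>. The sketch below applies this scheme to a one-dimensional toy problem chosen for illustration (not taken from the thesis): f(x) = x^4 - 2x^2, with the assumed decomposition g(x) = x^4 and h(x) = 2x^2, for which the convex subproblem has a closed-form solution.

```python
import math


def f(x: float) -> float:
    """Nonconvex toy objective f = g - h, with g(x) = x**4 and h(x) = 2 * x**2."""
    return x ** 4 - 2 * x ** 2


def dca_step(x: float) -> float:
    """One DCA iteration for the decomposition above."""
    y = 4 * x  # y_k = h'(x_k), the gradient of h(x) = 2 * x**2
    # x_{k+1} minimizes the convex function x**4 - y * x, i.e. solves 4 * x**3 = y.
    t = y / 4
    return math.copysign(abs(t) ** (1 / 3), t)  # real cube root, valid for t < 0 too


def dca(x0: float, max_iters: int = 100, tol: float = 1e-12) -> float:
    """Iterate DCA steps until successive iterates are closer than `tol`."""
    x = x0
    for _ in range(max_iters):
        x_next = dca_step(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x


if __name__ == "__main__":
    for start in (0.5, -3.0, 0.0):
        x_star = dca(start)
        print(start, x_star, f(x_star))
```

Starting from a positive point the iterates approach x = 1, and from a negative point x = -1, both global minimizers of this f. Starting exactly at x = 0 the iteration stays at 0, which illustrates that DCA converges to a critical point that depends on the starting point and on the chosen decomposition.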

Results:
• TA Minh Thuy, LE-THI Hoai An, Lydia Boudjeloud-Assala: Clustering Data Stream Based on Sub-Windows: A DC Programming Approach. 15th Austrian-French-German Conference on Optimization (AFG11), Toulouse, France, 19-23 September 2011, pp. 135-136.
