
Term Paper

Term Paper. By Narendra Muppavarapu, Venkatasai Pulluri. Papers reviewed: Data Reduction in Very Large Spatio-Temporal Datasets; Exploratory spatio-temporal visualization: an analytical review; Spatio-temporal symbolization of multidimensional time series.


Presentation Transcript


  1. Term Paper. By Narendra Muppavarapu, Venkatasai Pulluri

  2. Data Reduction in Very Large Spatio-Temporal Datasets • Exploratory spatio-temporal visualization: an analytical review • Spatiotemporal symbolization of multidimensional time series

  3. Outline • Motivation • Introduction • Problem specification • Related work • Methodology • Conclusions • Future work • References

  4. Motivation • Spatio-temporal datasets are often very large, which makes them difficult to analyze and use. The papers reviewed here therefore focus on data mining techniques that reduce the data while limiting information loss. Several methods have been proposed to address this problem, such as clustering and the SDNN approach.

  5. Introduction • Spatio-temporal datasets are fundamental for decision support in many application contexts. Recently, a lot of interest has arisen in data mining techniques that filter out relevant subsets of very large data repositories and help visualization tools display results effectively. • Traditionally, the concept of data reduction has received several names, e.g. editing, condensing, filtering, thinning, etc., depending on the objective of the reduction task. • One approach to the intractable problem of learning from huge databases is to select a small subset of the data for mining.

  6. Databases may contain redundant data, so it helps to replace a large database with a small subset. The accuracy obtained from the reduced dataset should then be comparable to the accuracy obtained from the entire dataset. • Spatio-temporal symbolization of multidimensional time series deals with the symbolization algorithm. The main goal of symbolization is to estimate the symbolic sequence that minimizes the loss of information. • There are many data mining techniques that filter out relevant subsets of very large datasets and help visualization tools display results effectively.

  7. Problem specification • It is difficult to analyze spatio-temporal datasets that contain very large amounts of raw data. To solve this problem, a clustering approach is used, which reduces the large-scale data by retrieving the useful data without losing important information. • These datasets are often very large and grow at a rapid rate. The main idea is to reduce the size of the data by producing a smaller, knowledge-oriented representation of the dataset, as opposed to compressing the data and then uncompressing it later for reuse.

  8. Related work • Previous work mainly explored incremental techniques for clustering, especially in the context of data streams. Later work looked at optimizing the algorithms to reduce computational complexity and improve clustering time. • It is difficult to assume that the probability distributions on separate attributes are statistically independent of each other. • Another approach reviewed relevant temporal data mining and symbolization techniques based on two factors: the assumptions made about the time series, and the type of result a specific method produces. Time series methods can be classified into four groups. • The main idea is to reduce the size of the data by producing a smaller, knowledge-oriented representation of the dataset, rather than compressing the data and uncompressing it for later reuse.

  9. Methodologies • Spatio-temporal data mining framework: • The spatio-temporal data mining framework consists of two layers: • Mining layer • Visualization layer • The purpose of the first phase is to group the data based on their similarity and to represent these groups without losing important information. The purpose of the second phase is to apply mining techniques such as clustering.

  10. Clustering • In spatio-temporal data mining systems there are two important preprocessing utilities: • Discretizer: deals with the discretization method • Reducer: deals with the very large size of the data sets that the system has to analyze. • Clustering for compression: this approach is used to improve the reducer preprocessing utility. • The new data reduction method, based on the clustering approach, helps with the mining of very large spatio-temporal data sets, because the raw data set is too large for any algorithm to process. • The main strategy guides the design of the clustering technique.
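The slide does not say which discretization method the discretizer uses. As a minimal sketch only, assuming simple equal-width binning (the function name `equal_width_bins` and the parameter `n_bins` are hypothetical, not taken from the papers), a discretizer could map each continuous attribute value to a bin index like this:

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in [0, n_bins - 1].

    Assumption: equal-width binning over the observed range; the source
    does not name the discretization method actually used.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into one bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4))
```

Equal-width bins are easy to compute but sensitive to outliers; an equal-frequency scheme would be a reasonable alternative under the same caveat.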

  11. The following figure represents an overview of the mining strategy: • The main idea of this strategy is to reduce the size of the data by producing a smaller representation of the data sets.

  12. Clustering algorithm • Clustering is one of the fundamental techniques in data mining. • The authors implemented a center-based clustering method, also known as k-medoids. • Center-based clustering methods require the total number of passes to be specified. • With each pass, the centers are adjusted to minimize the total distance between the cluster centers and each record. • The k-medoid algorithm chooses the data object closest to the center of the cluster as the cluster representative, which is very useful for visualizing the clusters with their representatives.
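The passes described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: Euclidean distance, a deterministic farthest-first choice of the initial medoids, and the names `k_medoids`, `assign` and `passes` are all hypothetical. It follows the slide's description (move each center, then pick the closest data object as representative) rather than the classic PAM swap procedure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, medoids):
    """Group every point with its nearest medoid."""
    clusters = [[] for _ in medoids]
    for p in points:
        nearest = min(range(len(medoids)), key=lambda i: euclidean(p, medoids[i]))
        clusters[nearest].append(p)
    return clusters


def k_medoids(points, k, passes=10):
    """Center-based clustering whose representatives are actual data objects."""
    # Assumption: farthest-first initialisation, chosen here only to keep the
    # sketch deterministic; the source does not describe the initialisation.
    medoids = [points[0]]
    while len(medoids) < k:
        medoids.append(max(points, key=lambda p: min(euclidean(p, m) for m in medoids)))

    for _ in range(passes):
        clusters = assign(points, medoids)
        updated = []
        for old, members in zip(medoids, clusters):
            if not members:
                updated.append(old)  # keep the old medoid for an empty cluster
                continue
            # Center of the cluster, then the data object closest to that center.
            center = [sum(axis) / len(members) for axis in zip(*members)]
            updated.append(min(members, key=lambda m: euclidean(m, center)))
        if updated == medoids:
            break  # representatives stopped moving before the pass limit
        medoids = updated
    return medoids, assign(points, medoids)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    reps, groups = k_medoids(data, k=2, passes=5)
    print(reps)
    print([len(g) for g in groups])
```

Because every representative is a real record, the reduced dataset can be stored as the medoids plus the size of each cluster, which is the kind of smaller representation the reducer aims for.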

  13. The following figure represents the steps carried out by this algorithm • Data points that are highly similar to each other are grouped together.

  14. Generating partition as symbolization • The time series is treated as observable data generated by a nonlinear dynamical system. • A partition whose symbolic sequence does not lose any information is called a "generating" partition. • Probabilistic distribution of the generating partition, SDNN algorithm: • In the present algorithm, the first step in analyzing such a nonlinear dynamical system is to reconstruct the phase space from a given time series. • Symbolization assigns unique symbols, but in real data symbols may be lost or redundant, so SDNN generates a probabilistic distribution of symbols. • The steps are: • E-step: Bayesian update of the symbol distribution • M-step: dimension selection.
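The slide does not state how the phase space is reconstructed. A common choice for this first step is time-delay embedding, which is assumed in the sketch below; the function name `delay_embed` and the parameters `dim` and `lag` are hypothetical and not taken from the SDNN paper.

```python
def delay_embed(series, dim, lag):
    """Build delay vectors (x[t], x[t + lag], ..., x[t + (dim - 1) * lag]).

    Assumption: time-delay embedding is used for phase space reconstruction;
    the source only says that the phase space is reconstructed.
    """
    n_vectors = len(series) - (dim - 1) * lag
    if n_vectors <= 0:
        raise ValueError("series is too short for this dim and lag")
    return [
        tuple(series[t + j * lag] for j in range(dim))
        for t in range(n_vectors)
    ]


if __name__ == "__main__":
    # Each tuple is one point in the reconstructed phase space.
    print(delay_embed([1, 2, 3, 4, 5], dim=3, lag=1))
```

Each reconstructed point can then be assigned a symbol, which is where the E-step and M-step listed on the slide would operate.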

  15. Experimental results • The experimentation platform is a PC with a 3.4 GHz dual-core CPU and 3 GB RAM, using Java 1.6 on Linux kernel 2.6. The dataset for each time step includes 13 non-spatial attributes, so-called dimensions. • The following figure shows the data points for each time stamp before and after the reduction by center-based clustering (panels: Before, After).

  16. Conclusions • Clustering techniques are very useful for analyzing spatio-temporal data sets. • The visualization of clusters can help in understanding the structure of spatio-temporal data sets. • A simple similarity measure helps overcome the complexity of the data sets. • Clusters can be used to filter the data without losing important information. • Spatio-temporal data mining is an emerging research area that deals with interactive approaches for analyzing very large spatial and spatio-temporal data sets.

  17. SDNN is robust to noise and suitable for real-world datasets with unknown noise. • The clustering method is an improved algorithm for reducing very large spatio-temporal data sets.

  18. Future work • In this paper, we have presented the first task in our 2-pass strategy, where the objective is to find the data points that are most similar according to their static (non-spatial and temporal) parameters. • The second task is to cluster these groups of closely related data points in a meaningful way to produce new "meta-data" sets that are more suitable and acceptable for data mining techniques to analyze and produce results (i.e. models, patterns, rules, etc.). • In the future we intend to analyze different combinations of dimensions over more time steps to try to find hidden information on their relationships with each other.

  19. References • Michael Whelan, Nhien-An Le-Khac, M-Tahar Kechadi, "Data Reduction in Very Large Spatio-Temporal Datasets" • Natalia Andrienko, Gennady Andrienko, Peter Gatalsky, "Exploratory spatio-temporal visualization: an analytical review" • Shohei Hidaka and Chen Yu, "Spatio-temporal symbolization of multidimensional time series" • Emily Namey, Greg Guest, Lucy Thairu, and Laura Johnson, "Data Reduction Techniques for Large Qualitative Data Sets" • Michael Whelan, Nhien-An Le-Khac, and M-Tahar Kechadi, "A New Hybrid Clustering Method for Reducing Very Large Spatio-temporal Dataset"
