
Tools for Data Preparation


lynton




Presentation Transcript


  1. Tools for Data Preparation November 8, 2002

  2. Why Data Preparation? Source: D Pyle, Data Preparation for Data Mining, 1999

  3. Data Preparation Process
     • Data Selection
     • Data Cleaning
     • New Data Construction
     • Data Formatting

  4. Data Selection
     Selection is based on the following criteria:
     • Data quality properties: completeness and correctness.
     • Technical constraints imposed by the data mining tools, such as limits on data volume or data type.
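
The completeness criterion above can be made concrete with a small sketch. This is only an illustration: the function names and the 0.8 threshold are assumptions, not something the slides prescribe, and "missing" is taken to mean `None` or an empty string.

```python
def completeness(records, attribute):
    """Fraction of records in which the attribute has a usable value."""
    present = sum(1 for r in records if r.get(attribute) not in (None, ""))
    return present / len(records)


def select_attributes(records, attributes, min_completeness=0.8):
    """Keep attributes whose completeness meets the (hypothetical) threshold."""
    return [a for a in attributes
            if completeness(records, a) >= min_completeness]


# Toy data, invented for illustration only.
sample = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": None},
    {"name": "Cid", "age": 51},
    {"name": "Dee", "age": 29},
]
selected = select_attributes(sample, ["name", "age"])
```

Correctness is harder to check automatically and usually needs domain rules (valid ranges, allowed codes), so it is left out of this sketch.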

  5. Data Cleaning
     Possible techniques for data cleaning:
     • Data normalization, e.g., decimal scaling, min-max mapping into the range [0, 1], or standard deviation normalization.
     • Data smoothing, e.g., discretization of numeric attributes, which is helpful or even necessary for logic-based methods.
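
A minimal sketch of the normalization and discretization techniques named above, using only the Python standard library. The function names are illustrative, and equal-width binning is just one of several possible discretization schemes.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    power = 0
    while max_abs / (10 ** power) >= 1:
        power += 1
    return [v / (10 ** power) for v in values]


def z_score(values):
    """Standard deviation normalization: zero mean, unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 30, 40, 60]
scaled = min_max_scale(ages)
binned = equal_width_bins(ages, 4)
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range; standard deviation normalization is less affected, at the cost of an unbounded output range.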

  6. Data Cleaning Cont’d
     • Treatment of missing values: predict the missing values and replace them with the least biased estimates, e.g., estimates that preserve the relationships between variables.
     • Data reduction: the most common step is to examine the attributes and consider their predictive potential, e.g., attribute selection based on means and variances, or merging features using a linear transform.
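
One simple way to keep some of the relationship between variables when filling gaps is to impute from the mean of a related group rather than the overall mean. The sketch below assumes records are dictionaries and that missing values are stored as `None`; the function and field names are hypothetical, and this is not claimed to be the method the slides had in mind.

```python
from collections import defaultdict
from statistics import mean


def impute_by_group(records, group_key, target_key):
    """Fill missing target values with the mean of the record's group.

    Falls back to the overall mean when a group has no observed values.
    Assumes at least one record has an observed target value.
    """
    observed = defaultdict(list)
    for r in records:
        if r[target_key] is not None:
            observed[r[group_key]].append(r[target_key])
    overall = mean(v for vals in observed.values() for v in vals)

    filled = []
    for r in records:
        r = dict(r)  # copy so the input records are not modified
        if r[target_key] is None:
            vals = observed.get(r[group_key])
            r[target_key] = mean(vals) if vals else overall
        filled.append(r)
    return filled


# Toy data, invented for illustration only.
staff = [
    {"job": "clerk", "income": 30},
    {"job": "clerk", "income": None},
    {"job": "manager", "income": 90},
    {"job": "manager", "income": 110},
    {"job": "manager", "income": None},
]
completed = impute_by_group(staff, "job", "income")
```

Filling every gap with the overall mean would pull clerks and managers toward the same income and weaken the job/income relationship, which is the kind of bias the slide warns about.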

  7. Data Missing Example

  8. New Data Construction
     Constructive operations on the selected data include:
     • Derivation of new attributes from existing attributes.
     • Generation of new records.
     • Data transformation.
     • Merging of tables.
     • Aggregation: summarizing information from multiple records and/or tables.
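
Three of the operations listed above (derivation, aggregation and merging) can be sketched on two small in-memory tables. The table and column names are invented for illustration.

```python
from collections import defaultdict

# Toy tables, invented for illustration only.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"customer_id": 1, "quantity": 2, "unit_price": 5.0},
    {"customer_id": 1, "quantity": 1, "unit_price": 20.0},
    {"customer_id": 2, "quantity": 3, "unit_price": 4.0},
]

# Derivation: a new attribute computed from existing attributes.
for order in orders:
    order["amount"] = order["quantity"] * order["unit_price"]

# Aggregation: summarize multiple order records per customer.
totals = defaultdict(float)
counts = defaultdict(int)
for order in orders:
    totals[order["customer_id"]] += order["amount"]
    counts[order["customer_id"]] += 1

# Merging: join the aggregated values onto the customer table.
customer_summary = [
    {
        **c,
        "order_count": counts[c["customer_id"]],
        "total_spent": totals[c["customer_id"]],
    }
    for c in customers
]
```

The result has one record per customer, which is the shape most modeling tools expect as input.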

  9. Data Formatting
     Data formatting involves syntactic modifications required by the modeling tools:
     • Reordering of attributes or records.
     • Changes related to the constraints of the modeling tools, e.g., removing commas or tabs, trimming strings to the maximum allowed number of characters, and replacing special characters with characters from the allowed set.
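
The formatting steps above might look like the following sketch. The 20-character limit and the allowed character set are assumptions standing in for whatever a particular modeling tool actually requires.

```python
import re

MAX_LEN = 20  # hypothetical limit imposed by the target tool
DISALLOWED = re.compile(r"[^A-Za-z0-9 _.-]")  # hypothetical allowed set


def format_field(value, max_len=MAX_LEN):
    """Remove commas and tabs, replace special characters, then trim."""
    text = str(value).replace(",", " ").replace("\t", " ")
    text = DISALLOWED.sub("_", text)
    text = re.sub(r" +", " ", text).strip()
    return text[:max_len]


def format_record(record, column_order):
    """Reorder the attributes and format every field of one record."""
    return [format_field(record[name]) for name in column_order]


row = format_record({"b": "2,0", "a": "Smith, John\tJr."}, ["a", "b"])
```

Commas and tabs are replaced by spaces rather than dropped so that adjacent words do not run together; either choice satisfies a tool that reserves those characters as delimiters.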

  10. Data Preparation Tools
     • Data Junction Integration Studio - http://www.datajunction.com/
     • SPSS Base 11.5 - http://www.spss.com/
     • Informatica PowerCenter - http://www.informatica.com/
     • WizWhy - http://www.wizsoft.com/

  11. Data Junction Integration Studio
     It includes five visual design tools:
     • Process Designer
         • Full conditional flow control
         • Testing of global variables
         • Execution of external processes and a complete expression language allow for automation of complex event-driven or scheduled routines
         • Multi-threaded Integration Engine

  12. Data Junction Integration Studio Cont’d
     • Map Designer
         • Mapping source data to target structures
         • Defining rules for mapping complex hierarchical structures
         • Defining complex rules for record filtering
         • Error and reject record handling
         • Error logging

  13. Data Junction Integration Studio Cont’d
     • Metadata Query
         • Allows users to run queries against the Data Junction Metadata Repository
     • Record Layout Designer
         • A visual tool for defining or modifying data structures (including field names, sizes, lengths, offsets, data types, etc.) for both sources and targets

  14. Data Junction Integration Studio Cont’d
     • Universal Data Browser
         • Allows users to view files other than the sources and targets involved in a current design session
         • View data formats from applications not installed on the system

  15. SPSS Base 11.5 Data Preparation Components
     • Data Editor: a spreadsheet-like system for defining, entering, editing and displaying data.
     • Data preparation tools: get data ready for analysis. Use the Define Variable Properties tool to set up data dictionary information (such as value labels, variable labels and variable types) as a "template" that can be applied to other data files and to other variables within the same file. Apply the dictionary information using the Copy Data Properties tool.

  16. SPSS Base 11.5 Cont’d
     • Data Restructure Wizard: takes a data file that has multiple records per subject and restructures it so that the data for each subject are in a single record. There is no need to set up vectors or loops. This is particularly helpful with transactional data.
     • The wizard can also do the reverse: take data from a single record and spread it across multiple cases.
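
The restructuring idea (often called long-to-wide and wide-to-long) can be illustrated outside SPSS with a short sketch. This is not the wizard's implementation; the function names, the `value_index` column naming and the sample data are assumptions made for the example.

```python
def cases_to_variables(rows, id_key, index_key, value_key):
    """Collapse multiple records per subject into one record per subject."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_key], {id_key: row[id_key]})
        record[f"{value_key}_{row[index_key]}"] = row[value_key]
    return list(wide.values())


def variables_to_cases(records, id_key, index_key, value_key):
    """Reverse action: spread one record per subject across multiple cases.

    The index is recovered from the column name, so it comes back as a string.
    """
    prefix = value_key + "_"
    rows = []
    for record in records:
        for name, value in record.items():
            if name.startswith(prefix):
                rows.append({
                    id_key: record[id_key],
                    index_key: name[len(prefix):],
                    value_key: value,
                })
    return rows


# Toy data, invented for illustration only.
visits = [
    {"subject": 1, "visit": 1, "score": 10},
    {"subject": 1, "visit": 2, "score": 12},
    {"subject": 2, "visit": 1, "score": 7},
]
per_subject = cases_to_variables(visits, "subject", "visit", "score")
```

Subjects with fewer records simply end up with fewer columns here; a real tool would typically pad the missing columns with a system-missing value.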

  17. SPSS Base 11.5 Cont’d
     • Data transformations: work with combined data more reliably by "flipping" responses so that all the data point in the same direction. For example, this helps to create multiple-item indices from surveys that mix positively worded and negatively worded items.
     • Other transformation capabilities include conditional transformations, computing new variables and recoding values.
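
"Flipping" a response on a bounded scale is usually done as low + high - response. The sketch below assumes a 1 to 5 scale and hypothetical item names; it illustrates the idea rather than SPSS's own recode procedure.

```python
def reverse_code(response, low=1, high=5):
    """Flip a response on a bounded scale, e.g. 1<->5 and 2<->4 on a 1-5 scale."""
    if not low <= response <= high:
        raise ValueError(f"response {response} is outside {low}..{high}")
    return low + high - response


def composite_index(answers, negatively_worded, low=1, high=5):
    """Average the items after aligning negatively worded ones."""
    aligned = [
        reverse_code(value, low, high) if item in negatively_worded else value
        for item, value in answers.items()
    ]
    return sum(aligned) / len(aligned)


# Hypothetical respondent: q2 is a negatively worded item.
answers = {"q1": 5, "q2": 1, "q3": 4}
index = composite_index(answers, negatively_worded={"q2"})
```

Without the flip, a strongly positive respondent would score 5, 1 and 4 and the average would understate their attitude.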

  18. WizWhy Features
     • Performs Boolean as well as multi-value analysis
     • Analyzes the data by discovering all the if-then rules
     • Reveals necessary and sufficient conditions (if-and-only-if rules)
     • Calculates the error probability of each rule
     • Reveals the interesting phenomena in the data by uncovering the unexpected rules

  19. WizWhy Features Cont’d
     • Predicts new cases on the basis of the discovered rules
     • Explains predictions by listing the relevant rules
     • Calculates the prediction’s conclusive probability and error probability
     • Predictions are based on error costs (the cost of a miss vs. the cost of a false alarm) and are not influenced by subjective choices
     • Points out cases deviating from the discovered rules
     • Proven to be faster and more accurate than other data mining methods

  20. WizWhy Rules Report Example
     1) CUSTOMER starts with MORGA if and only if KEY is 985
        The rule exists in 32 records.
        Significance Level: Error probability is almost 0
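
To see what an if-and-only-if rule like the one above asserts about the data, the sketch below counts the records that support a given rule and the records that contradict it. It only checks a rule that is already known; it is not WizWhy's discovery algorithm and does not compute its error probability. The records are toy data invented for the example.

```python
def check_iff_rule(records, condition, conclusion):
    """Count supporting records and exceptions for 'condition iff conclusion'.

    A record supports the rule when both sides hold, and is an exception
    when exactly one side holds. Records where neither holds are ignored.
    """
    support = 0
    exceptions = 0
    for r in records:
        c, q = condition(r), conclusion(r)
        if c and q:
            support += 1
        elif c != q:
            exceptions += 1
    return support, exceptions


# Toy records, invented for illustration only.
toy_records = [
    {"CUSTOMER": "MORGAN", "KEY": 985},
    {"CUSTOMER": "MORGAN STANLEY", "KEY": 985},
    {"CUSTOMER": "SMITH", "KEY": 101},
]
result = check_iff_rule(
    toy_records,
    condition=lambda r: r["KEY"] == 985,
    conclusion=lambda r: r["CUSTOMER"].startswith("MORGA"),
)
```

A rule reported as holding in 32 records with an error probability of almost 0 corresponds to a high support count with few or no exceptions in a check of this kind.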
