DATA PREPARATION: Basic Definitions

Presentation Transcript


  1. DATA PREPARATION: Basic Definitions • Data Set (input): Collection of data objects and their attributes used as input for a machine learning scheme. • Data object (instance, record, case, sample, observation): An individual, independent data example of the concept to be learned, characterized by a number of attributes. • Attribute (feature): Property or characteristic of an object. • Model (concept): Pattern or description that is to be learned.

  2. DATA PREPARATION: Attribute Types • Attribute value: Measurement of the quantity of that particular attribute. • Two basic attribute types: Qualitative and Quantitative. • Qualitative (categorical): Attributes that lack the properties of numbers. • Quantitative (numeric): Attributes that are represented by numbers and have the properties of numbers.

  3. DATA PREPARATION: Attribute Types • Attribute types are further distinguished by the number of values they can take: Discrete versus continuous. • Discrete: A discrete attribute can have values from only a finite or countably infinite set of values. • Examples: Male/female, age in whole years • Continuous: A continuous attribute can have values from an uncountable set of values such as the real numbers. • Examples: Temperature, weight, distance, time

  4. DATA PREPARATION: Attribute Types • Nominal attribute: Qualitative names providing only enough information to distinguish objects from one another. No order or distance measure is implied. • Ordinal attribute: Qualitative names providing enough information to rank objects in order (Example: small, medium, large), but not enough to measure distance. • Interval attribute: Values are ordered, and differences between values are meaningful and measurable. • Ratio attribute: Both differences and ratios are meaningful and measurable.
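
The four attribute types above differ in which comparisons are meaningful. The sketch below is one way to write that down; the table `MEANINGFUL_OPS`, the helper `supports`, and the ranking `SIZE_RANK` are illustrative names, not part of the original slides.

```python
# Which operations are meaningful at each measurement level (illustrative only).
MEANINGFUL_OPS = {
    "nominal":  {"equality"},
    "ordinal":  {"equality", "order"},
    "interval": {"equality", "order", "difference"},
    "ratio":    {"equality", "order", "difference", "ratio"},
}


def supports(level: str, operation: str) -> bool:
    """Return True if `operation` is meaningful for attributes at `level`."""
    return operation in MEANINGFUL_OPS[level]


# Ordinal example from the slide: sizes can be ranked, but not subtracted.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}  # assumed encoding
sizes = ["large", "small", "medium"]
ranked = sorted(sizes, key=SIZE_RANK.__getitem__)
```

The integer ranks only encode order; the gap between "small" and "medium" is not claimed to equal the gap between "medium" and "large".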

  5. DATA PREPARATION: Data Set Characteristics • Dimensionality: Number of attributes possessed by the data set instances. • Sparsity: Sparse data sets are those in which most attribute values are zero. • Resolution: The degree of discernible detail of an attribute value, that is, how finely an attribute is measured.
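
Dimensionality and sparsity can be computed directly from a small data set. This is a minimal sketch, assuming the data is a list of equal-length numeric rows; the function names and the sample values are made up for illustration.

```python
def dimensionality(rows):
    """Number of attributes per instance (assumes all rows have equal length)."""
    return len(rows[0]) if rows else 0


def sparsity(rows):
    """Fraction of attribute values that are exactly zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


# Three instances with four attributes each; 9 of the 12 values are zero.
data = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 2],
]

print(dimensionality(data))  # 4
print(sparsity(data))        # 0.75
```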

  6. DATA PREPARATION: Data Sets • Sources of Data Sets • Databases • Web sites • Streaming data

  7. DATA PREPARATION: Data Sets • Data Input Formats • Data records • Text • Graph-based • Data matrix • Ordered data • Spatial data • Visual inputs • Video inputs

  8. DATA PREPARATION: Record Data • Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

  9. DATA PREPARATION: Data Matrix • Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
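
As a rough illustration of the m by n layout, the sketch below turns a list of records into a matrix with one row per object and one column per attribute. The attribute names and values are hypothetical, and `to_matrix` is an illustrative helper.

```python
def to_matrix(objects, attributes):
    """Build an m-by-n matrix: one row per object, one column per attribute.

    The column order is fixed by `attributes`, so every row lines up.
    """
    return [[float(obj[name]) for name in attributes] for obj in objects]


# Hypothetical objects sharing the same fixed set of numeric attributes.
attributes = ["length", "width", "thickness"]
objects = [
    {"length": 10.2, "width": 5.3, "thickness": 1.2},
    {"length": 12.6, "width": 6.2, "thickness": 1.1},
]

matrix = to_matrix(objects, attributes)
m, n = len(matrix), len(matrix[0])  # m = 2 objects, n = 3 attributes
```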

  10. DATA PREPARATION: Document Data • Document Data • Each document becomes a 'term' vector • Each term is a component (attribute) of the vector • The value of each component is the number of times the corresponding term occurs in the document.
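
The term-vector idea can be sketched in a few lines. This assumes whitespace tokenization and lowercasing, which is a simplification; the two sample documents and the helper name `term_vectors` are made up for illustration.

```python
from collections import Counter


def term_vectors(documents):
    """Return (vocabulary, vectors), where each vector holds term counts."""
    tokenized = [doc.lower().split() for doc in documents]
    # One component per distinct term, in a fixed (sorted) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the team won the game", "the game was lost"]
vocabulary, vectors = term_vectors(docs)

print(vocabulary)  # ['game', 'lost', 'team', 'the', 'was', 'won']
print(vectors[0])  # [1, 0, 1, 2, 0, 1]
print(vectors[1])  # [1, 1, 0, 1, 1, 0]
```

With a realistic vocabulary most components are zero, so document data is a typical example of a sparse data set.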

  11. DATA PREPARATION: Transaction Data • A special type of record data, where • each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
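
The grocery example maps naturally onto sets of items. The sketch below uses invented shopping baskets and also shows one common way, assumed here rather than taken from the slides, to turn transactions into fixed-width record data with one 0/1 column per item.

```python
from collections import Counter

# Each transaction is the set of items bought in one shopping trip (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "eggs"},
    {"milk", "diapers", "bread"},
]


def item_counts(transactions):
    """Count in how many transactions each item appears."""
    return Counter(item for basket in transactions for item in basket)


def to_binary_rows(transactions):
    """Encode transactions as 0/1 rows over the sorted list of all items."""
    items = sorted(set().union(*transactions))
    rows = [[1 if item in basket else 0 for item in items] for basket in transactions]
    return items, rows


items, rows = to_binary_rows(transactions)
print(items)  # ['bread', 'diapers', 'eggs', 'milk']
print(rows)   # [[1, 0, 0, 1], [1, 1, 1, 0], [1, 1, 0, 1]]
```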

  12. DATA PREPARATION: Graph Data • Graph Data

  13. DATA PREPARATION: Chemical Data • Chemical Data • Benzene Molecule: C6H6
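
A molecule such as benzene can be stored as graph data, with atoms as nodes and bonds as edges. The following is a minimal sketch using an adjacency-set representation; the atom labels (`C1`, `H1`, ...) are illustrative, and bond order (single versus aromatic) is deliberately ignored.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Benzene, C6H6: six carbons in a ring, each bonded to one hydrogen.
carbons = [f"C{i}" for i in range(1, 7)]
hydrogens = [f"H{i}" for i in range(1, 7)]
ring_bonds = [(carbons[i], carbons[(i + 1) % 6]) for i in range(6)]
ch_bonds = list(zip(carbons, hydrogens))

benzene = build_adjacency(ring_bonds + ch_bonds)

print(len(benzene))           # 12 atoms
print(sorted(benzene["C1"]))  # ['C2', 'C6', 'H1']
```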

  14. DATA PREPARATION: Ordered Data • Ordered Data • Genomic sequence data

  15. DATA PREPARATION: Ordered Data • Spatio-Temporal Data • Surface Air Temperature over North America, January–February 2014 • https://www.youtube.com/watch?v=VCCkyOTIS3o

  16. DATA PREPARATION: ARFF format • ARFF: Attribute-Relation File Format. • See Weka Documentation: http://weka.wikispaces.com/ARFF • XRFF (eXtensible attribute-Relation File Format): An XML-based extension of the ARFF format. • See Weka Documentation: http://weka.wikispaces.com/XRFF
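
To give a feel for the format, the sketch below writes a tiny ARFF file: a `@relation` line, one `@attribute` line per attribute (either `numeric` or a brace-enclosed list of nominal values), then `@data` followed by comma-separated rows, with `?` marking a missing value. The helper `to_arff` and the sample weather-style data are illustrative, and the sketch assumes names and values need no quoting (no spaces or commas).

```python
def to_arff(relation, attributes, rows):
    """Render a minimal ARFF document.

    attributes: list of (name, kind), where kind is the string 'numeric'
                or a list of nominal labels.
    rows:       list of tuples; None is written as '?' (missing value).
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ", ".join(kind) + "}"
        else:
            spec = kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "weather",
    [
        ("outlook", ["sunny", "overcast", "rainy"]),
        ("temperature", "numeric"),
        ("play", ["yes", "no"]),
    ],
    [("sunny", 85, "no"), ("overcast", None, "yes")],
)
print(arff_text)
```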

  17. DATA PREPARATION: Data Conversion • Weka supports other data input types via converters • C4.5 • CSV • LibSVM • SVM Light • Binary serialized instances

  18. DATA PREPARATION: Data Conversion • What if the desired data does not fit any of Weka's input types? • Translate manually (only works for small data sets) • Write your own specialized script • The problem is not unique to Weka • Data conversion is an often underappreciated problem • Mismatch between how data is collected and what the algorithm requires
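
A "specialized script" of the kind mentioned above might look like the following sketch, which converts CSV text to ARFF text using only the Python standard library. It is not Weka's own converter. It assumes a header row, treats a column as numeric when every non-empty value parses as a float, treats everything else as nominal, and does no quoting of names or values.

```python
import csv
import io


def is_number(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="converted"):
    """Convert CSV text (with a header row) to a minimal ARFF document."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [rec[col] for rec in records if rec[col] != ""]
        if values and all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            labels = ", ".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{labels}}}")
    lines += ["", "@data"]
    for rec in records:
        # Empty CSV cells become ARFF missing values.
        lines.append(",".join(v if v != "" else "?" for v in rec))
    return "\n".join(lines) + "\n"


sample_csv = "size,weight\nsmall,1.5\nlarge,\n"
print(csv_to_arff(sample_csv, relation="parcels"))
```

Type inference by inspection is a design shortcut; for real data it is usually safer to state each attribute's type explicitly.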

  19. DATA PREPARATION: Data Quality • Measurement errors • Noise • Artifacts • Equipment limitations • Data collection procedure errors • Human error • Precision • Bias • Accuracy
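
Precision and bias can be estimated from repeated measurements of a known quantity. The sketch below assumes a common convention that the slide does not spell out: bias is the mean of the measurements minus the true value, and precision is reported as their sample standard deviation. The readings are invented for illustration.

```python
from statistics import mean, stdev


def bias(measurements, true_value):
    """Systematic offset: mean measurement minus the known true value."""
    return mean(measurements) - true_value


def precision(measurements):
    """Spread of repeated measurements, as the sample standard deviation."""
    return stdev(measurements)


# Hypothetical repeated weighings of a 1.000 g reference mass.
readings = [1.015, 0.990, 1.013, 1.001, 0.986]

print(round(bias(readings, 1.000), 3))  # 0.001
print(round(precision(readings), 3))    # 0.013
```

Under this convention, a measurement process is accurate only when both values are small.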

  20. DATA PREPARATION: Data Quality • Handling Data • Outliers • Missing or incomplete values • Estimate? • Ignore? • Inaccurate values
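
The "Estimate?" and "Ignore?" options for missing values can be contrasted in a short sketch. It assumes missing values are marked with `None` and uses mean imputation as one simple way to estimate; other estimators are equally valid, and the sample values are made up.

```python
from statistics import mean


def drop_missing(values):
    """'Ignore' strategy: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """'Estimate' strategy: replace each missing value with the observed mean."""
    observed = drop_missing(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


values = [2.0, None, 4.0, 6.0]

print(drop_missing(values))  # [2.0, 4.0, 6.0]
print(impute_mean(values))   # [2.0, 4.0, 4.0, 6.0]
```

Ignoring shrinks the data set, while estimating keeps every instance at the cost of introducing values that were never measured.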

  21. DATA PREPARATION: Data Quality • Multiple data sources • Inconsistent data: how to handle? • Duplicate data • Age of data • Data relevance
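
Duplicate data often appears when multiple sources are merged and the same object is recorded with small formatting differences. The sketch below is one simple approach, assuming that records match when their key fields agree after trimming whitespace and lowercasing; the field names and records are hypothetical.

```python
def deduplicate(records, key_fields):
    """Keep the first record for each normalized key; drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(str(record[field]).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Two hypothetical sources describing overlapping customers.
source_a = [{"name": "Ann Lee", "city": "Boston"}]
source_b = [
    {"name": "ann lee ", "city": "boston"},
    {"name": "Raj Patel", "city": "Austin"},
]

merged = deduplicate(source_a + source_b, key_fields=["name", "city"])
print(len(merged))  # 2
```

Keeping the first record is an arbitrary choice; when the sources disagree on other fields, deciding which value to trust is the harder "inconsistent data" question.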

  22. DATA PREPARATION: Know Your Data

  23. NEXT STEP • Data Preprocessing
