DATA PREPARATION: Basic Definitions • Data Set (input): Collection of data objects and their attributes used as input for a machine learning scheme. • Data object (instance, record, case, sample, observation): An individual, independent data example of the concept to be learned, characterized by a number of attributes. • Attribute (feature): Property or characteristic of an object. • Model (concept): Pattern or description that is to be learned.
DATA PREPARATION: Attribute Types • Attribute value: The measured value of an attribute for a particular object. • Two basic attribute types: Qualitative and Quantitative. • Qualitative (categorical): Attributes that lack the properties of numbers. • Quantitative (numeric): Attributes that are represented by numbers and have the properties of numbers.
DATA PREPARATION: Attribute Types • Attribute types are further distinguished by the number of values they can take: Discrete versus continuous. • Discrete: A discrete attribute can take values from only a finite or countably infinite set. • Examples: Male/female, age in years • Continuous: A continuous attribute can take values from an uncountable set such as the real numbers. • Examples: Temperature, weight, distance, time
DATA PREPARATION: Attribute Types • Nominal attribute: Qualitative names that provide only enough information to distinguish one object from another. No order or distance measure is implied. • Ordinal attribute: Qualitative names that provide enough information to rank objects (Example: small, medium, large), but not enough to measure the distance between them. • Interval attribute: Values are ordered, and differences between values are meaningful and measurable. • Ratio attribute: Both differences and ratios are meaningful and measurable.
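The distinction between ordinal, interval, and ratio attributes can be illustrated with a small sketch. The size ranking and the Celsius/Kelvin temperatures are illustrative assumptions, not part of the slides; the point is only that differences survive a change of temperature scale while ratios do not.

```python
# Ordinal: ranking is meaningful, but distances between ranks are not.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}  # assumed example ordering
sizes = ["large", "small", "medium"]
ranked_sizes = sorted(sizes, key=SIZE_RANK.get)


def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Interval: the difference between two temperatures is the same on both scales.
diff_celsius = 20.0 - 10.0
diff_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

# The ratio is not preserved, so "20 C is twice as warm as 10 C" is not meaningful.
ratio_celsius = 20.0 / 10.0
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)
```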
DATA PREPARATION: Data Set Characteristics • Dimensionality: Number of attributes possessed by the data set instances. • Sparsity: Sparse data sets are those in which most attribute values are zero. • Resolution: The degree of discernible detail of an attribute value, that is, how finely an attribute is measured.
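Dimensionality and sparsity can be measured directly on a small data set. The following is a minimal sketch; the function names and the sample values are hypothetical, and storing only the non-zero entries is one common way to exploit sparsity, not the only one.

```python
def sparsity(matrix):
    """Return the fraction of attribute values that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


def to_sparse(matrix):
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


data = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
dimensionality = len(data[0])   # number of attributes per instance
zero_fraction = sparsity(data)  # share of zero-valued entries
sparse_data = to_sparse(data)
```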
DATA PREPARATION: Data Sets • Sources of Data Sets • Databases • Web sites • Streaming data
DATA PREPARATION: Data Sets • Data Input Formats • Data records • Text • Graph-based • Data matrix • Ordered data • Spatial data • Visual inputs • Video inputs
DATA PREPARATION: Record Data • Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes
DATA PREPARATION: Data Matrix • Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, with m rows, one for each object, and n columns, one for each attribute
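As a minimal sketch of the m by n representation, records with the same numeric attributes can be arranged into a list of rows. The attribute names and values here are made up for illustration.

```python
# Three objects (m = 3), each with the same two numeric attributes (n = 2).
records = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": 182.5, "weight_kg": 80.2},
    {"height_cm": 160.0, "weight_kg": 54.3},
]

# Fix the column order once so every row lists attributes in the same order.
attributes = ["height_cm", "weight_kg"]

# One row per object, one column per attribute.
matrix = [[record[name] for name in attributes] for record in records]
m, n = len(matrix), len(attributes)
```

Each row of `matrix` is then a point in an n-dimensional space.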
DATA PREPARATION: Document Data • Document Data • Each document becomes a 'term' vector. • Each term is a component (attribute) of the vector. • The value of each component is the number of times the corresponding term occurs in the document.
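The term-vector idea can be sketched in a few lines. This version assumes whitespace tokenization and lowercasing, which is a simplification; the function name and sample documents are hypothetical.

```python
from collections import Counter


def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary defines the vector components (one attribute per term).
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the team won the game", "the game was lost"]
vocabulary, vectors = term_vectors(docs)
# vocabulary == ['game', 'lost', 'team', 'the', 'was', 'won']
# vectors[0] == [1, 0, 1, 2, 0, 1]
```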
DATA PREPARATION: Transaction Data • A special type of record data, where • each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
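A rough sketch of the grocery example: each transaction is a set of items, and the same data can be rewritten as fixed-width binary records when an algorithm expects record data. The item names are invented for illustration.

```python
from collections import Counter

# Each transaction is the set of items bought in one shopping trip.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# How many transactions contain each item.
item_counts = Counter(item for transaction in transactions for item in transaction)

# Binary record form: one column per item, 1 if the item was purchased.
items = sorted(item_counts)
binary_rows = [[int(item in transaction) for item in items] for transaction in transactions]
```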
DATA PREPARATION: Graph Data • Graph Data
DATA PREPARATION: Chemical Data • Chemical Data • Benzene Molecule: C6H6
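As one possible graph encoding of the benzene example, atoms can be treated as nodes and bonds as undirected edges. The sketch below uses an adjacency-set representation with hypothetical atom labels (C1..C6, H1..H6) and deliberately ignores bond order, so the aromatic ring bonds are stored as plain edges.

```python
from collections import defaultdict


def benzene_graph():
    """Build C6H6 as an undirected graph: a ring of six carbons, each with one hydrogen."""
    adjacency = defaultdict(set)

    def add_bond(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    for i in range(6):
        carbon = f"C{i + 1}"
        add_bond(carbon, f"C{(i + 1) % 6 + 1}")  # bond to the next carbon in the ring
        add_bond(carbon, f"H{i + 1}")            # bond to this carbon's hydrogen
    return dict(adjacency)


graph = benzene_graph()
# Every bond is stored twice (once per endpoint), so halve the total degree.
bond_count = sum(len(neighbors) for neighbors in graph.values()) // 2
```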
DATA PREPARATION: Ordered Data • Ordered Data • Genomic sequence data
DATA PREPARATION: Ordered Data • Spatio-Temporal Data • Surface Air Temperature over North America, January–February 2014 • https://www.youtube.com/watch?v=VCCkyOTIS3o
DATA PREPARATION: ARFF format • ARFF: Attribute-Relation File Format. • See Weka Documentation: http://weka.wikispaces.com/ARFF • XRFF (eXtensible attribute-Relation File Format): An XML-based extension of the ARFF format. • See Weka Documentation: http://weka.wikispaces.com/XRFF
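To give a feel for the layout of an ARFF file, the sketch below renders a tiny data set as ARFF text: an `@relation` line, one `@attribute` line per attribute (either `numeric` or a nominal value list in braces), then `@data` followed by comma-separated rows, with `?` marking a missing value. The function name and the sample rows are hypothetical, and the sketch does not quote names or values containing spaces, which a real file would need.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind), where kind is "numeric"
    or a list of nominal labels.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # None is written as '?', the ARFF marker for a missing value.
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "weather",
    [
        ("outlook", ["sunny", "overcast", "rainy"]),
        ("temperature", "numeric"),
        ("play", ["yes", "no"]),
    ],
    [("sunny", 85, "no"), ("overcast", None, "yes")],
)
```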
DATA PREPARATION: Data Conversion • Weka supports other data input types via converters • C4.5 • CSV • LIBSVM • SVM Light • Binary serialized instances
DATA PREPARATION: Data Conversion • What if the desired data does not fit any of Weka's input types? • Translate manually (only works for small data sets) • Write your own specialized script • The problem is not unique to Weka • Data conversion is an often underappreciated problem • Mismatch between how data is collected and what the algorithm requires
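A specialized conversion script is often only a few lines long. The sketch below assumes a made-up input format (one record per line, written as `key=value` pairs separated by semicolons) and rewrites it as CSV. Using `?` for absent fields mirrors the ARFF missing-value marker; whether a particular CSV loader interprets it that way is an assumption to check.

```python
import csv
import io


def keyvalue_to_csv(lines, fieldnames):
    """Convert 'key=value;key=value' lines (hypothetical format) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = dict(pair.split("=", 1) for pair in line.split(";"))
        # Fill fields that are absent from this record with '?'.
        writer.writerow({name: record.get(name, "?") for name in fieldnames})
    return buffer.getvalue()


raw_lines = ["id=1;temp=21.5;status=ok", "id=2;status=fail"]
csv_text = keyvalue_to_csv(raw_lines, ["id", "temp", "status"])
```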
DATA PREPARATION: Data Quality • Measurement errors • Noise • Artifacts • Equipment limitations • Data collection procedure errors • Human error • Precision • Bias • Accuracy
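Precision and bias can be estimated when the same quantity is measured repeatedly against a known reference. The sketch below assumes the common convention that precision is reported as the standard deviation of the repeated measurements and bias as the difference between their mean and the true value; the readings are invented for illustration.

```python
from statistics import mean, stdev


def bias(measurements, true_value):
    """Systematic error: mean of the measurements minus the known true value."""
    return mean(measurements) - true_value


def precision(measurements):
    """Spread of repeated measurements, reported as the sample standard deviation."""
    return stdev(measurements)


# Hypothetical repeated weighings of a 1.0 g reference mass.
readings = [1.015, 0.990, 1.013, 1.001, 0.986]
reading_bias = bias(readings, 1.0)
reading_precision = precision(readings)
```

A small bias with a large standard deviation points to noisy but unbiased measurement, while the reverse points to a consistent systematic offset.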
DATA PREPARATION: Data Quality • Handling Data • Outliers • Missing or incomplete values • Estimate? • Ignore? • Inaccurate values
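The "Estimate?" and "Ignore?" options for missing values can be compared with a short sketch. Here `None` stands for a missing value, "ignore" is implemented as dropping incomplete rows, and "estimate" as replacing a missing value with the column mean. Mean imputation is only one of several possible estimates, and the function names are hypothetical.

```python
from statistics import mean


def drop_incomplete(rows):
    """Ignore: keep only rows with no missing values."""
    return [row for row in rows if None not in row]


def impute_mean(rows):
    """Estimate: replace each missing value with the mean of its column.

    Assumes every column has at least one observed value.
    """
    columns = list(zip(*rows))
    means = [mean(v for v in column if v is not None) for column in columns]
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
complete_rows = drop_incomplete(rows)
imputed_rows = impute_mean(rows)
```

Dropping rows discards two of the three objects here, while imputing keeps all of them at the cost of introducing estimated values.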
DATA PREPARATION: Data Quality • Multiple data sources • Inconsistent data: how to handle? • Duplicate data • Age of data • Data relevance
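Duplicates across multiple sources often differ only in formatting. The sketch below assumes two records describe the same object when their name and email match after trimming whitespace and lowercasing; that matching rule, the field names, and the sample records are all illustrative assumptions.

```python
def normalize(record):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return (record["name"].strip().lower(), record["email"].strip().lower())


def merge_sources(*sources):
    """Combine sources, keeping the first record seen for each normalized key."""
    seen = {}
    for source in sources:
        for record in source:
            seen.setdefault(normalize(record), record)
    return list(seen.values())


source_a = [{"name": "Ada Lovelace", "email": "ada@example.org"}]
source_b = [
    {"name": " ada lovelace ", "email": "ADA@example.org"},
    {"name": "Alan Turing", "email": "alan@example.org"},
]
merged = merge_sources(source_a, source_b)
```

Keeping the first record is a simple policy; when sources disagree on other fields, a rule based on data age or source reliability may be more appropriate.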
DATA PREPARATION: Know Your Data
NEXT STEP • Data Preprocessing