1 / 50

Understanding on Data

Understanding on Data. 201 4 . 9. What is Data?. Attributes. Data sets are made up of data objects. Data objects are described by attributes An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc.

Download Presentation

Understanding on Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Understanding on Data 2014. 9

  2. What is Data? Attributes • Data sets are made up of data objects. • Data objects are described by attributes • An attribute is a property or characteristic of an object • Examples: eye color of a person, temperature, etc. • Attribute is also known as variable, field, characteristic, dimension, or feature • Object is also known as record, point, case, sample, entity, tuple, or instance Objects

  3. Attributes • Attribute: a data field, representing a characteristic or feature of a data object. • E.g., customer _ID, name, address • Types: • Nominal • Binary • Ordinal • Numeric: quantitative • Interval-scaled • Ratio-scaled

  4. Attributes • Nominal: categories, states, or “names of things” • Hair_color = {auburn, black, blond, brown, grey, red, white} • marital status, occupation, ID numbers, zip codes, nationality • Binary • Nominal attribute with only 2 states (0 and 1) • Symmetric binary: both outcomes equally important • e.g., gender • Asymmetric binary: outcomes not equally important. • e.g., medical test (positive vs. negative) • Convention: assign 1 to most important outcome (e.g., HIV positive) • Ordinal • Values have a meaningful order (ranking) but magnitude between successive values is not known. • Size = {small, medium, large}, grades, army rankings

  5. Numeric Attribute Types • Quantity (integer or real-valued) • Interval-scaled • Measured on a scale of equal-sized units • Values have order • Difference ( but not ratio) is meaningful • E.g., temperature in C˚or F˚, calendar dates • No true zero-point • Ratio-scaled • Inherent zero-point • Both difference and ratio are meaningful • 10 K˚ is twice as high as 5 K˚ • e.g., temperature in Kelvin, length, counts, monetary quantities, age, etc

  6. Properties of Attribute Values • The type of an attribute depends on which of the following properties it possesses: • Distinctness: =  • Order: < > • Addition: + - • Multiplication: * / • Nominal attribute: distinctness • Ordinal attribute: distinctness & order • Interval attribute: distinctness, order & addition • Ratio attribute: all 4 properties

  7. Discrete vs. Continuous Attributes • DiscreteAttribute • Has only a finite or countably infinite set of values • E.g., zip codes, profession, or the set of words in a collection of documents • Sometimes, represented as integer variables • Note: Binary attributes are a special case of discrete attributes • ContinuousAttribute • Has real numbers as attribute values • E.g., temperature, height, or weight • Practically, real values can only be measured and represented using a finite number of digits • Continuous attributes are typically represented as floating-point variables

  8. Types of Data and Data Sets • Record • Relational records • Data matrix, e.g., numerical matrix, crosstabs • Document data: text documents: term-frequency vector • Transaction data • Graph and network • World Wide Web • Social or information networks • Molecular Structures • Ordered • Video data: sequence of images • Temporal data: time-series • Sequential Data: transaction sequences • Genetic sequence data • Spatial, image and multimedia: • Spatial data: maps • Image data: • Video data:

  9. Important Characteristics of Structured Data • Dimensionality • Curse of dimensionality • Sparsity • Only presence counts • Resolution • Patterns depend on the scale • Distribution • Centrality and dispersion

  10. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

  11. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

  12. Document Data • Each document becomes a `term' vector, • each term is a component (attribute) of the vector, • the value of each component is the number of times the corresponding term occurs in the document.

  13. Transaction Data • A special type of record data, where • each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.

  14. Graph Data • Examples: Generic graph and HTML Links

  15. Graph Data : Social Network

  16. Graph data: Food Relationship

  17. Chemical Data • Benzene Molecule: C6H6

  18. Ordered Data • Sequences of transactions Items/Events An element of the sequence

  19. Ordered Data • Genomic sequence data

  20. Ordered Data • Spatio-Temporal Data Average Monthly Temperature of land and ocean

  21. Summary Statistics • Summary statistics are numbers that summarize properties of the data • Summarized properties include frequency, location and spread • Examples: location - meanspread - standard deviation • Most summary statistics can be calculated in a single pass through the data

  22. Frequency and Mode • The frequency of an attribute value is the percentage of time the value occurs in the data set • For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. • The mode of a an attribute is the most frequent attribute value • The notions of frequency and mode are typically used with categorical data

  23. Measures of Location: Mean and Median • The mean is the most common measure of the location of a set of points. • However, the mean is very sensitive to outliers. • Thus, the median or a trimmed mean is also commonly used.

  24. Symmetric vs. Skewed Data • Median, mean and mode of symmetric, positively and negatively skewed data

  25. Measuring the Dispersion of Data • Variance and standard deviation • Variance: • Standard deviationσis the square root of variance σ2 • Percentile • The value that meets the given percentage of data sets • Five number summary • min, Q1, median,Q3, max • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Outlier: points beyond a specified outlier threshold, plotted individually • Graphically represented as a boxplot

  26. Variance • The variance or standard deviation is the most common measure of the spread of a set of points. • However, this is also sensitive to outliers, so that other measures are often used.

  27. Percentiles • For continuous data, the notion of a percentile is more useful. • Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value of x such that p% of the observed values of x are less than . • For instance, the 50th percentile is the value such that 50% of all values of x are less than . • Matlab: prctile

  28. Percentile graph

  29. Boxplot Analysis • Five-number summary of a distribution • Minimum, Q1, Median, Q3, Maximum • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Inter-quartile range: IQR = Q3 –Q1 • Five number: min, Q1, median,Q3, max • Boxplot • Data is represented with a box • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR • The median is marked by a line within the box • Whiskers: two lines outside the box extended to Minimum and Maximum • Outliers: points beyond a specified outlier threshold, plotted individually • X > Q3 + 1.5 *IQR or X < Q1 – 1.5*IQR

  30. Example of Box Plots • Box plots can be used to compare attributes

  31. Homework • DB_score 파일의 모든 필드(attendance, mid-term, final, report, project, score)에 대하여 다음을 계산하시오. • mean, median • standard deviation, AAD, MAD (3)five-number summary,outliers (boxplot은 그림으로 표현할 것.) • Programming language: C, C++, Java

  32. Graphic Displays of Basic Statistical Descriptions • Boxplot: graphic display of five-number summary • Matlab: boxplot • Histogram: x-axis are values, y-axis represents frequencies • Matlab: hist • Quantile plot: each value xiis paired with fi indicating that approximately 100*fi % of data are xi • Matlab: quantile • Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane • Matlab: scatter

  33. Histogram Analysis • Histogram: Graph display of tabulated frequencies, shown as bars • It shows what proportion of cases fall into each of several categories • The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

  34. Histograms Often Tell More than Boxplots • The two histograms shown in the left may have the same boxplot representation • The same values for: min, Q1, median, Q3, max • But they have rather different data distributions

  35. Quantile Plot • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) • Plots quantile information • For a data xidata sorted in increasing order, fiindicates that approximately 100 * fi% of the data are below or equal to the value xi

  36. Scatter plot • Shows the relationship and dependency between two variables • Univariate analysis, Bivariateanalysis, Multivariate analysis • Each pair of values is treated as a pair of coordinates and plotted as points in the plane

  37. Positively and Negatively Correlated Data • The left half fragment is positively correlated • The right half is negative correlated

  38. Uncorrelated Data

  39. Visually Evaluating Correlation Matlab: plotmatrix

  40. Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are. • Is higher when objects are more alike. • Often falls in the range [0,1] • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies • Proximity refers to a similarity or dissimilarity

  41. Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

  42. Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.

  43. Minkowski Distance: Examples • r = 1. City block (Manhattan, taxicab, L1 norm) distance. • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors • r = 2. Euclidean distance • r. “supremum” (Lmaxnorm, Lnorm) distance. • This is the maximum difference between any component of the vectors • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.

  44. Example: Minkowski Distance Manhattan (L1) Euclidean (L2) Supremum

  45. Common Properties of a Distance • Distances, such as the Euclidean distance, have some well known properties. • d(p, q)  0 for all p and q and d(p, q) = 0 only if p= q. (Positive definiteness) • d(p, q) = d(q, p) for all p and q. (Symmetry) • d(p, r)  d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. • A distance that satisfies these properties is a metric

  46. Common Properties of a Similarity • Similarities, also have some well known properties. • s(p, q) = 1 (or maximum similarity) only if p= q. • s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects), p and q.

  47. Similarity Between Binary Vectors • Common situation is that objects, p and q, have only binary attributes • Compute similarities using the following quantities M01= the number of attributes where p was 0 and q was 1 M10 = the number of attributes where p was 1 and q was 0 M00= the number of attributes where p was 0 and q was 0 M11= the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attributes values = (M11) / (M01 + M10 + M11)

  48. SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01= 2 (the number of attributes where p was 0 and q was 1) M10= 1 (the number of attributes where p was 1 and q was 0) M00= 7 (the number of attributes where p was 0 and q was 0) M11= 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

  49. Cosine Similarity • A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document. • Applications: information retrieval, biologic taxonomy, gene feature mapping, ... • Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1, d2) = (d1d2) /||d1|| ||d2|| , where  indicates vector dot product, ||d||: the length of vector d

  50. Example: Cosine Similarity • Ex: Find the similarity between documents 1 and 2. d1= (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25 ||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481 ||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12 cos(d1, d2 ) = 0.94

More Related