
Data Quality Issues


- Data quality:
- proper understanding is crucial to success of any project involving geographic data
- no geographic data sets can be said to be error-free
- “garbage in, garbage out”

- Error:
- Difference between the real world and the geographic data representation of it.
- Accuracy: (another way of describing error)
- Extent to which map data values match true values
- Example: Imagine a point is at 219 meters elevation above sea level, but a map represents it as 210 meters above sea level.
- Error: This data point is represented with 9 meters of error.
- Accuracy: This data point is accurate to within 9 meters.

- Location errors
- Example: a schoolhouse is located 30 feet away from its marked location on a map
- A 300 meter contour line is offset 5 meters to the northwest
- A satellite image pixel is located 2.4 meters away from its actual location on the ground
- Attribute errors
- A schoolhouse is incorrectly labeled as a church
- A 300 meter contour line is actually supposed to be a 310 meter contour line
- A 300 meter contour line actually represents an elevation of 302 meters
- A classified satellite image pixel is labeled forest when it is actually a field

- One data point – error/accuracy can be easily defined.
- Data sets/maps – error/accuracy must be summarized.
- How is accuracy determined and summarized?
- Very accurate data must be collected (sampled) about a subset of the full dataset/map.
- This accurate sample is then compared with the original data
- A summary is created that compares these 2 datasets (the sample with the same measurements from the original data)

- Nominal data is right or wrong. Period.
- Examples:
- Landcover type: a pixel is classified as forest or field.
- A building is classified as a school or a church
- A county is named Orange County or Durham County

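| | forest | fields | urban | water | wetlands | Total |
|---|---|---|---|---|---|---|
| forest | 80 | 4 | 0 | 15 | 7 | 106 |
| fields | 2 | 17 | 0 | 9 | 2 | 30 |
| urban | 12 | 5 | 9 | 4 | 8 | 38 |
| water | 7 | 8 | 0 | 65 | 0 | 80 |
| wetlands | 3 | 2 | 1 | 6 | 38 | 50 |
| Total | 104 | 36 | 10 | 99 | 55 | 304 |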

- An example is when you determine the accuracy of a landcover classification.
- We can build something called a confusion matrix:
- This compares your classification with your ground-truth sample (the very accurate sample data, as mentioned)

Matrix axis labels: Reference, Classification (fifth class: wetlands).

- Summarizing a confusion matrix:
- Row and column summaries are made.
- The most basic overall summary statistic is the percent correctly classified
- This is calculated by taking the total of the diagonal entries, dividing by the grand total, and multiplying by 100 to produce a percentage
- From our example: 209 / 304 * 100% = 68.8%
- BUT chance alone (random assignment of classes) would give a score better than 0
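The percent-correct calculation described above can be sketched in Python. The matrix values are those of the example; the function names are illustrative and not from the slides. Because the source does not say how "row and column accuracies" are computed, the sketch assumes each is the diagonal entry divided by its row or column total.

```python
# Confusion matrix from the example.
# Class order on both axes: forest, fields, urban, water, wetlands.
CLASSES = ["forest", "fields", "urban", "water", "wetlands"]
MATRIX = [
    [80, 4, 0, 15, 7],
    [2, 17, 0, 9, 2],
    [12, 5, 9, 4, 8],
    [7, 8, 0, 65, 0],
    [3, 2, 1, 6, 38],
]


def percent_correct(matrix):
    """Diagonal total divided by the grand total, times 100."""
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    grand_total = sum(sum(row) for row in matrix)
    return 100.0 * diagonal / grand_total


def row_accuracies(matrix):
    """Assumed definition: diagonal entry as a percentage of its row total."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def column_accuracies(matrix):
    """Assumed definition: diagonal entry as a percentage of its column total."""
    columns = list(zip(*matrix))
    return [100.0 * col[i] / sum(col) for i, col in enumerate(columns)]


if __name__ == "__main__":
    print(f"Overall percent correct: {percent_correct(MATRIX):.1f}%")
    for name, r, c in zip(CLASSES, row_accuracies(MATRIX), column_accuracies(MATRIX)):
        print(f"{name:9s} row {r:5.1f}%  column {c:5.1f}%")
```

The overall figure is 209 / 304, matching the example on the slide.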

- A Kappa index:
- Determined through a “semi-complex” computation.
- It is another measure describing overall accuracy of a classification, typically falling between 0 and 100% (it can be negative when a classification does worse than chance).
- A Kappa index can be used to test if a classification is statistically significantly better than a random classification.
- The Kappa index for our example evaluates to 58.3%
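The slides do not spell out the computation. A minimal sketch, assuming the standard Cohen's kappa formula (observed agreement minus chance agreement, divided by one minus chance agreement), reproduces the 58.3% quoted for the example. The function name is illustrative.

```python
# Confusion matrix from the example
# (class order: forest, fields, urban, water, wetlands).
MATRIX = [
    [80, 4, 0, 15, 7],
    [2, 17, 0, 9, 2],
    [12, 5, 9, 4, 8],
    [7, 8, 0, 65, 0],
    [3, 2, 1, 6, 38],
]


def kappa(matrix):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    print(f"Kappa: {100.0 * kappa(MATRIX):.1f}%")
```

Chance agreement here is about 25%, which is why the Kappa index (58.3%) is lower than the raw percent correct (68.8%).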

- The Overall accuracy (and row and column accuracies) are generally considered good/acceptable if they are above 85%. The USGS uses this as a guideline.
- The Kappa statistic describes agreement between the classified data and the reference data (it represents the increased accuracy of the performed classification over that of a random classification). A Kappa statistic of:
- Above 80% is considered to have strong agreement.
- Between 40% and 80% is considered to have moderate agreement.
- Below 40% is considered to have poor agreement.

- The overall magnitude of errors in ratio measurements can be summarized using the root mean square error (RMSE):
- Calculated by taking the square root of the average squared error
- This is a kind of average error
- This is the primary measure of accuracy used in map accuracy standards and GIS databases
- e.g. we might state that the elevations in a certain digital elevation model have an RMSE of 2 meters.
- 2 meters is a sort of “average error” for a data point.

- However, data error will range above and below this number.
Question: is this an example of locational error or attribute error?
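As a rough illustration of the RMSE calculation, the sketch below uses made-up DEM and check-survey elevations (not data from the slides), chosen so that the RMSE comes out at 2 meters while individual errors range from 0 to 3 meters.

```python
import math


def rmse(measured, reference):
    """Square root of the average squared difference between paired values."""
    if not measured or len(measured) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(m - r) ** 2 for m, r in zip(measured, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values in meters: DEM cells and highly accurate check points.
dem_elevations = [213.0, 297.0, 151.0, 99.0, 400.0]
survey_elevations = [210.0, 300.0, 150.0, 100.0, 400.0]

if __name__ == "__main__":
    errors = [d - s for d, s in zip(dem_elevations, survey_elevations)]
    print("Individual errors (m):", errors)
    print("RMSE (m):", rmse(dem_elevations, survey_elevations))
```

The individual errors are 3, -3, 1, -1 and 0 meters, so some points are worse than the RMSE and some are better.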

- Locational data accuracy can also be summarized with RMSE.
- A kind of average of the distance between where points/pixels are represented and their actual location on the ground.
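The same idea applies to positions. A minimal sketch, assuming coordinates are in a projected system with linear units (for example meters) and using hypothetical point pairs:

```python
import math


def horizontal_rmse(mapped, actual):
    """RMSE of the straight-line distance between mapped and actual (x, y) positions."""
    if not mapped or len(mapped) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_distances = [
        (mx - ax) ** 2 + (my - ay) ** 2
        for (mx, my), (ax, ay) in zip(mapped, actual)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))


# Hypothetical coordinates in meters: positions in the data set
# versus positions surveyed on the ground.
mapped_points = [(3.0, 4.0), (0.0, 0.0), (10.0, 10.0)]
actual_points = [(0.0, 0.0), (0.0, 5.0), (10.0, 10.0)]

if __name__ == "__main__":
    print("Horizontal RMSE (m):", horizontal_rmse(mapped_points, actual_points))
```

Two of the points are displaced by 5 meters and one is exact, so the result lies between 0 and 5.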

- Locational data can also be summarized in other ways:
- For horizontal data, the USGS uses the US National Map Accuracy Standards:
- 90% of all measurable points are within 1/50 of an inch (measured on the map) for maps at scales of 1:20,000 or smaller, and within 1/30 of an inch for maps at scales larger than 1:20,000.
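To make the standard concrete, the map tolerance can be converted to a ground distance by multiplying by the scale denominator. The sketch below is illustrative only: the function names and test errors are hypothetical, and it assumes positional errors are supplied as ground distances in meters.

```python
INCH_TO_METERS = 0.0254


def nmas_ground_tolerance_m(scale_denominator):
    """Ground distance (m) equal to the allowed map error at 1:scale_denominator."""
    # Larger scale than 1:20,000 means a denominator below 20,000.
    map_inches = 1.0 / 30.0 if scale_denominator < 20000 else 1.0 / 50.0
    return map_inches * scale_denominator * INCH_TO_METERS


def meets_nmas(errors_m, scale_denominator):
    """True if at least 90% of the tested points are within the tolerance."""
    tolerance = nmas_ground_tolerance_m(scale_denominator)
    within = sum(1 for e in errors_m if e <= tolerance)
    # Integer comparison avoids floating-point trouble at exactly 90%.
    return within * 10 >= len(errors_m) * 9


if __name__ == "__main__":
    print("Tolerance at 1:24,000 (m):", round(nmas_ground_tolerance_m(24000), 3))
    # Hypothetical test: nine points off by 5 m, one off by 30 m.
    print("Meets standard:", meets_nmas([5.0] * 9 + [30.0], 24000))
```

At 1:24,000 the tolerance works out to 480 inches on the ground, about 12.2 meters.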

- Precision:
- Level of detail at which data values are recorded.
- Often referred to as ‘significant digits’.
- Example:
- A cell in a raster DEM recorded as 219 meters is less precise than a cell recorded at 219.05 meters.

- Error is unbiased when the error is in ‘random’ directions.
- GPS data
- Human error in surveying points

- Error is biased when there is systematic variation in accuracy within a geographic data set
- Example: GIS tech mistypes coordinate values when entering control points to register map to digitizing tablet
- all coordinate data from this map is systematically offset (biased)

- Example: the wrong datum is being used
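One simple way to see the difference between unbiased and biased error is to look at the mean signed error: it is near zero when errors fall in random directions and clearly non-zero when there is a systematic offset. The numbers below are hypothetical, and the sketch looks at only one coordinate axis.

```python
from statistics import mean

# Hypothetical easting errors in meters (mapped minus actual).
random_errors = [1.5, -2.0, 0.5, -1.0, 1.0]

# The same errors with a systematic 10 m shift, as might result from
# mistyped control points or use of the wrong datum.
systematic_shift = 10.0
biased_errors = [e + systematic_shift for e in random_errors]


def mean_signed_error(errors):
    """Average of the signed errors; an estimate of the bias."""
    return mean(errors)


if __name__ == "__main__":
    print("Unbiased set, mean error (m):", mean_signed_error(random_errors))
    print("Biased set, mean error (m):  ", mean_signed_error(biased_errors))
```

Both sets have the same scatter; only the biased set has a mean error far from zero.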

- Error can propagate…
- e.g., what happens if layer digitized with a spatial bias problem is used as the spatial reference to create another, new layer?
- Propagation can be additive
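A simplified sketch of additive propagation, assuming each step introduces a pure translation (no rotation or scale change) and that offsets simply sum. The offsets and the function name are hypothetical.

```python
def combine_offsets(*offsets):
    """Sum a sequence of (dx, dy) offsets into one total offset."""
    total_dx = sum(dx for dx, _ in offsets)
    total_dy = sum(dy for _, dy in offsets)
    return (total_dx, total_dy)


# Hypothetical offsets in meters.
base_layer_bias = (12.0, -5.0)   # bias in the layer used as the spatial reference
new_layer_error = (3.0, 4.0)     # further offset introduced while creating the new layer

if __name__ == "__main__":
    print("New layer offset from ground truth (m):",
          combine_offsets(base_layer_bias, new_layer_error))
```

Under these assumptions the new layer inherits the full bias of its reference layer in addition to its own error.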


- Compatibility: can two or more geographic data sets be used together properly?
- e.g. is it meaningful to overlay roads data digitized at 1:10,000 scale with road hazard sites digitized at 1:250,000?

- Completeness: does a given data set adequately cover a study area? Are there gaps in space or time?
- Example: a city’s municipal cadastral database -- do all parcel polygons have attribute information? Are any parcels missing?
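The two completeness questions in the cadastral example can be checked mechanically. The sketch below uses an invented parcel list, invented field names (`owner`, `land_use`) and an invented list of expected parcel IDs; a real database would have its own schema.

```python
# Hypothetical required attributes for every parcel.
REQUIRED_FIELDS = ("owner", "land_use")

# Hypothetical parcel records.
parcels = [
    {"id": "P-001", "owner": "A. Smith", "land_use": "residential"},
    {"id": "P-002", "owner": None, "land_use": "commercial"},
    {"id": "P-003", "owner": "B. Jones"},
]


def incomplete_parcels(records, required_fields=REQUIRED_FIELDS):
    """IDs of parcels where any required attribute is absent or empty."""
    return [
        rec["id"]
        for rec in records
        if any(rec.get(field) in (None, "") for field in required_fields)
    ]


def missing_parcels(records, expected_ids):
    """Expected parcel IDs that do not appear in the records at all."""
    present = {rec["id"] for rec in records}
    return sorted(set(expected_ids) - present)


if __name__ == "__main__":
    print("Missing attributes:", incomplete_parcels(parcels))
    print("Missing parcels:", missing_parcels(parcels, ["P-001", "P-002", "P-003", "P-004"]))
```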

- Consistency: are geographic data sets consistent in terms of content, format, etc?
- Example: a landcover data layer for a study area, with different sub-areas produced from two satellite scenes:
- one from Landsat TM, classified into 10 classes
- the other from Landsat MSS, classified into 5 classes


- Your responsibility:
- assessing the applicability of a data set for your needs.
- given the resolution, accuracy, precision, bias, compatibility, completeness, & consistency of a data set or analysis result, is it appropriate or suitable for the intended use?

- Make use of lineage information/metadata. At a minimum:

- description of source data
- how was the data transformed in preparation or analysis?

- Not “fuzzy logic”.
- Becoming more common in academic settings.
- Used with nominal data.
- Useful for landcover classifications.

- Consider a landcover classification with these classes:
- Forest
- Field
- Urban
- Water

- We don’t assign a single class to each landcover pixel.
- Instead, we create a probability of membership to each class.
- We create 4 layers:
- Layer 1:
- The attribute data for each pixel is the probability that pixel is in forest.
- Layer 2:
- The attribute data for each pixel is the probability that pixel is a field.
- Layer 3:
- The attribute data for each pixel is the probability that pixel is urban.
- Layer 4:
- The attribute data for each pixel is the probability that pixel is water.
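The four-layer structure can be sketched with a tiny 2 x 2 raster. All values are invented, and the sketch assumes the four probabilities at each pixel sum to 1, which the slides do not state. It also shows one possible way to combine the layers back into a single-class map, by taking the most probable class at each pixel.

```python
# One layer per class; each cell holds the probability that the pixel
# belongs to that class. Values are hypothetical.
layers = {
    "forest": [[0.70, 0.10], [0.25, 0.05]],
    "field":  [[0.20, 0.60], [0.25, 0.05]],
    "urban":  [[0.05, 0.25], [0.40, 0.10]],
    "water":  [[0.05, 0.05], [0.10, 0.80]],
}


def memberships_at(class_layers, row, col):
    """Probability of membership in each class for one pixel."""
    return {name: grid[row][col] for name, grid in class_layers.items()}


def most_likely_class(class_layers, row, col):
    """Class with the highest membership probability at one pixel."""
    memberships = memberships_at(class_layers, row, col)
    return max(memberships, key=memberships.get)


def combined_map(class_layers):
    """Single-class map built by taking the most likely class at each pixel."""
    first_grid = next(iter(class_layers.values()))
    return [
        [most_likely_class(class_layers, r, c) for c in range(len(first_grid[0]))]
        for r in range(len(first_grid))
    ]


if __name__ == "__main__":
    print(memberships_at(layers, 1, 0))
    print(combined_map(layers))
```

The pixel at row 1, column 0 is only 40% likely to be urban, information that a single-class map would discard.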

Figures: membership maps for bare soils, alpine meadows, and forests; spatial distribution of the three types by combining the fuzzy maps.