From Data to Uncertainty: Principles of Data Quality

Presentation Transcript


  1. From Data to Uncertainty: Principles of Data Quality
     Arthur D. Chapman, Australian Biodiversity Information Services
     Ocean Biodiversity Informatics, Hamburg, 29 Nov 2004
     (Photo: Albatrosses, Kaikoura, New Zealand)

  2. The Data Equation: Oceans of Data (Photo: Praia de Forte, Brazil)

  3. The Data Equation: Rivers of Information (Photo: Doubtful Sound, New Zealand)

  4. The Data Equation: Streams of Knowledge (Photo: Wasatch, Utah, USA)

  5. The Data Equation: Drops of Understanding (Nix 1984)

  6. Taking Data to Information
     [Diagram labels]
     Species data (photos): Crab, Florianopolis, Brazil; Armeria maritima, Argentina; Brown Algae, Argentina; Rock Cormorants, Argentina; Algae, New Zealand; Wandering Albatross, NZ; Orca, San Francisco; Corals, Australia
     Environmental data: Temp Range; Rain June; Rain Jan; GIS Data
     Information: Models; Decision Support
     Decisions; Policy; Conservation; Management

  7. Why do we need to use models? The Need for Modelling

     Region      Population    Collections     Plants   Reptiles   Mammals   Insects
     Cont. USA   292 million   0.5-1 billion   18,000   350        428       ?150,000
     Brazil      172 million   50 million      70,000   470-650    394       ?1.7 million
     Australia   20 million    35 million      20,000   850-890    305       ?220,000

     Oceans: Population: ~0; Collections: ?10 million; Plants: ??; Vertebrates: ??; Invertebrates: ??
     (From OBIS 2004)

  8. What do we mean by ‘Data Quality’?
     "An essential or distinguishing characteristic necessary for [spatial] data to be fit for use." (SDTS 02/92)
     "The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data." (Chrisman, 1991)

  9. Data quality - fitness for use?
     • Fitness for use
       • Does species ‘A’ occur in Tasmania?
       • Does species ‘A’ occur in National Park ‘y’?

  10. The Biological Data Domains
      Plant and animal specimen data held in museums provide a vast information resource, providing not only present day information on the locations of these entities, but also historic information going back several hundred years (Chapman and Busby 1994).
      Errors can occur in any one of these.

  11. Loss of data quality
      Loss of data quality can occur at many stages:
      • At the time of collection
      • During digitisation
      • During documentation
      • During storage and archiving
      • During analysis and manipulation
      • At time of presentation
      • And through the use to which they are put
      "Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor." (Redman 2001)

  12. Principles of data quality
      The Vision:
      • It is important for organizations to have a vision with respect to having good quality data.
      • As well as a vision, an organization needs a policy to implement that vision.
      • And a strategy for implementation.
      "Experience has shown that treating data as a long-term asset and managing it within a coordinated framework produces considerable savings and ongoing value." (NLWRA 2003)

  13. The data quality vision
      A vision may involve:
      • Not reinventing information management wheels
      • Looking for efficiencies in data collection and quality control procedures
      • Sharing of data, information and tools
      • Using existing standards or developing new, robust standards
      • Fostering the development of networks and partnerships
      • Presenting a sound business case for data collection and management
      • Reducing duplication in data collection and data quality control
      • Looking beyond immediate use and examining requirements of users
      • Ensuring that good documentation and metadata procedures are implemented

  14. Strategies
      Short term
      - Data that can be assembled and checked over a 6-12 month period
      Intermediate
      - Data that can be entered over about an 18-month period with small investment of resources
      - Data that can be checked using simple in-house methods
      Long term
      - Data that can be entered and/or checked over a longer time frame, using collaborative arrangements

  15. Information management chain
      "Assign responsibility for the quality of data to those who create them. If this is not possible, assign responsibility as close to data creation as possible." (Redman 2001)
      (Figure from: Chapman 2004)

  16. Data Cleaning Principles - 1
      "Data ownership and custodianship not only confers rights to manage and control access to data, it confers responsibilities for its management, quality control and maintenance. Custodians also have a moral responsibility to superintend the data for use by future generations." (Chapman 2004)
      • Planning is essential
        • Develop a vision, a policy and strategy
        • Total Data Quality Management Cycle

  17. Data Cleaning Principles - 2
      • Organising data improves efficiency
        • Organizing data prior to data checking, validation and correction can improve efficiency and considerably reduce the time and costs of data cleaning.
        • For example, by sorting data on location, efficiency gains can be achieved through checking all records pertaining to the one location at the same time, rather than going back and forth to key references.
        • Similarly, by sorting records by collector and date, it is possible to spot errors where a record may be at an unlikely location for that collector on that day.
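
The collector-and-date check in slide 17 could be sketched as below. This is a minimal illustration, not the presenter's method: the record field names (`id`, `collector`, `date`, `lat`, `lon`), the sample records and the 200 km limit are all assumptions chosen for the example, and distances use a spherical-Earth approximation.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def unlikely_same_day_records(records, max_km=200.0):
    """Return pairs of record ids collected by one person on one day but more than max_km apart."""
    # Group records by (collector, date) so each group is checked in one pass.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["collector"], rec["date"])].append(rec)

    suspects = []
    for key in sorted(groups):
        recs = groups[key]
        for i in range(len(recs)):
            for j in range(i + 1, len(recs)):
                a, b = recs[i], recs[j]
                if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) > max_km:
                    suspects.append((a["id"], b["id"]))
    return suspects


# Hypothetical records: two near Hobart, one near Darwin, all on the same day.
records = [
    {"id": "A1", "collector": "Smith", "date": "1998-03-02", "lat": -42.88, "lon": 147.33},
    {"id": "A2", "collector": "Smith", "date": "1998-03-02", "lat": -42.90, "lon": 147.30},
    {"id": "A3", "collector": "Smith", "date": "1998-03-02", "lat": -12.46, "lon": 130.84},
]
suspect_pairs = unlikely_same_day_records(records)
```

A flagged pair only says that at least one of the two records deserves a second look; it does not say which one is wrong.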

  18. Data Cleaning Principles - 3
      • Prevention is better than cure
        • It is far cheaper and more efficient to prevent an error from happening than to have to detect it and correct it later. It is also important that, when errors are detected, feedback mechanisms ensure that the error doesn’t occur again during data entry, or that there is a much lower likelihood of it re-occurring.
      (Photo: Asplenium bulbiferum, New Zealand)

  19. Data Cleaning Principles - 4
      • Responsibility belongs to everyone (collector, custodian and user)
        • The principal responsibility belongs to the data custodian.
        • The collector has responsibility to respond to the custodian’s questions when the custodian finds errors or ambiguities that may refer back to the original information supplied by the collector. These may relate to ambiguities on the label, errors in the date or location, etc.
        • The user also has a key responsibility to feed back to custodians information on any errors or omissions they may come across, including errors in the documentation associated with the data.

  20. Data Cleaning Principles - 5
      Yours is not the only organization that is dealing with data quality.
      • Partnerships improve efficiency
        • By developing partnerships, many data validation processes won’t need to be duplicated, errors will more likely be documented and corrected, and new errors won’t be incorporated by inadvertent “correction” of suspect records that are not in error.
      • Partnerships with:
        • Data collectors
        • Other institutions with duplicate collections
        • Like-minded institutions developing tools, standards and software
        • Key data brokers (e.g. OBIS, GBIF)
        • Data users (good feedback mechanisms)
        • Statisticians and data auditors

  21. Data Cleaning Principles - 6
      • Prioritisation reduces duplication
        • Prioritisation helps reduce costs and improves efficiency. It is often of value to concentrate on those records where lots of data can be cleaned at the lowest cost.
        • For example, those that can be examined using batch processing or automated methods, before working on the more difficult records.
        • By concentrating on those data that are of most value to users, there is also a greater likelihood of errors being detected and corrected.
      (Photo: Tierra del Fuego, Argentina)

  22. Prioritising data quality procedures
      "Not all data are created equal, so focus on the most important, and if data cleaning is required, make sure it never has to be repeated." (Chapman 2004)
      • Focus on most critical data first
      • Concentrate on discrete units (taxonomic, geographic, etc.)
      • Ignore data that are not used or for which data quality cannot be guaranteed
      • Consider data that are of broadest value, are of greatest benefit to the majority of users and are of value to the most diverse of uses
      • Work on those areas whereby lots of data can be cleaned at the lowest cost (e.g. through use of batch processing)

  23. Data Cleaning Principles - 7
      • Set targets and performance measures
        • Performance measures are a valuable addition to quality control procedures.
        • They help an organization manage its data cleaning processes.
        • Performance measures may include statistical checks on the data (for example, 95% of all records are within 1,000 meters of their reported position) or on the level of quality control (for example, 65% of all records have been checked by a qualified taxonomist within the previous 5 years; 90% have been checked by a qualified taxonomist within the previous 10 years).
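
A positional performance measure like the one in slide 23 reduces to a simple proportion. The sketch below assumes that a positional error in meters has already been measured for each audited record; the function names and the sample values are hypothetical.

```python
def fraction_within(errors_m, limit_m=1000.0):
    """Fraction of audited records whose positional error is at most limit_m meters."""
    if not errors_m:
        raise ValueError("no audited records supplied")
    return sum(1 for e in errors_m if e <= limit_m) / len(errors_m)


def meets_target(errors_m, limit_m=1000.0, target=0.95):
    """True if at least `target` of the records fall within limit_m meters."""
    return fraction_within(errors_m, limit_m) >= target


# Hypothetical audit sample: measured positional error per record, in meters.
audit_errors_m = [50, 120, 300, 800, 950, 1000, 400, 20, 2500, 600]
share_ok = fraction_within(audit_errors_m)   # 9 of the 10 records are within 1,000 m
target_met = meets_target(audit_errors_m)    # 0.9 is below the 0.95 target
```

The same pattern applies to the taxonomic measure: replace the error list with the number of years since each record was last checked and adjust the limit and target.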

  24. Data Cleaning Principles - 8
      • Minimise duplication and re-working of data
        • Duplication is a major factor with data cleaning in most organizations.
        • Many organizations add the geocode at the same time as they database the record. As records are seldom sorted geographically, this means that the same locations will be chased up a number of times.
        • By carrying out the georeferencing as a special operation, records from similar locations can be sorted and the appropriate map-sheet only has to be extracted once.
        • Some institutions also use the database itself to help reduce duplication by searching to see if the location may already have been georeferenced.
      (Photo: Nothofagus antarctica, Argentina)
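
The last point of slide 24, checking whether a location has already been georeferenced before looking it up again, could be sketched as a small cache. Everything here is illustrative: `GeoreferenceCache`, `normalise_locality` and `demo_lookup` are hypothetical names, the stand-in gazetteer holds one approximate coordinate, and a real system would query the institution's own database.

```python
def normalise_locality(text):
    """Lower-case a locality string, drop commas and collapse whitespace."""
    return " ".join(text.lower().replace(",", " ").split())


class GeoreferenceCache:
    """Remember earlier georeferences so each distinct locality is looked up once."""

    def __init__(self, lookup):
        self._lookup = lookup   # function: normalised locality -> (lat, lon) or None
        self._cache = {}
        self.lookups = 0        # number of times the expensive lookup actually ran

    def georeference(self, locality):
        key = normalise_locality(locality)
        if key not in self._cache:
            self.lookups += 1
            self._cache[key] = self._lookup(key)
        return self._cache[key]


def demo_lookup(key):
    """Stand-in for a manual map-sheet or gazetteer search (approximate coordinates)."""
    gazetteer = {"kaikoura new zealand": (-42.40, 173.68)}
    return gazetteer.get(key)


cache = GeoreferenceCache(demo_lookup)
first = cache.georeference("Kaikoura, New Zealand")
second = cache.georeference("kaikoura   new zealand")  # same place, different formatting
```

Normalising the text before lookup is a design choice: it catches trivial formatting differences, but it will not match genuinely different spellings of the same locality.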

  25. Data Cleaning Principles - 9
      • Feedback is a two-way street
        • Users of the data will inevitably carry out error detection, and it is important that they feed back the results to the custodians.
        • It is essential that data custodians encourage feedback from users of their data, and take the feedback that they receive seriously.
        • Data custodians also need to feed back information on errors to the collectors and data suppliers where relevant.
        • In this way there is a much higher likelihood that the incidence of future errors will be reduced and the overall data quality improved.

  26. Data Cleaning Principles - 10
      • Education and training improve techniques
        • Poor training, especially at the data collection and data entry stages of the Information Quality Chain, is the cause of a large proportion of the errors in primary species data.
        • Good training of data entry operators can reduce the error associated with data entry considerably, reduce data entry costs and improve overall data quality.
      (Photo: Brown Algae, Argentina)

  27. Data Cleaning Principles - 11
      • Accountability, transparency and audit-ability are important
        • Haphazard and unplanned data cleaning exercises are very inefficient and generally unproductive.
        • Within data quality policies and strategies, clear lines of accountability for data cleaning need to be established.
        • To improve the “fitness for use” of the data and thus their quality, data cleaning processes need to be transparent and well documented, with a good audit trail to reduce duplication and to ensure that once corrected, errors never re-occur.

  28. Data Cleaning Principles - 12
      • Documentation is the key to good data quality
        • Without good documentation, it is difficult for users to determine the fitness for use of the data and difficult for custodians to know what data quality checks have been carried out and by whom.
        • Documentation is generally of two types.
        • The first is tied to each record and records what data checks have been done and what changes have been made and by whom.
        • The second is the metadata that records information at the dataset level.
        • Both are important, and without them, good data quality is compromised.
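
Record-level documentation of the first type in slide 28 could look like the sketch below, which keeps a history of checks and changes alongside each record. The class and field names (`CheckEntry`, `SpecimenRecord`, `history`) are hypothetical and only show the shape of such an audit trail; dataset-level metadata is not covered.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CheckEntry:
    """One documented check: what was done, by whom, when, and what changed."""
    check: str
    checked_by: str
    date: str
    change: str = ""


@dataclass
class SpecimenRecord:
    record_id: str
    values: Dict[str, str]
    history: List[CheckEntry] = field(default_factory=list)

    def correct(self, field_name, new_value, check, checked_by, date):
        """Apply a correction and log the old and new value in the record's history."""
        old_value = self.values.get(field_name)
        self.values[field_name] = new_value
        change = f"{field_name}: {old_value!r} -> {new_value!r}"
        self.history.append(CheckEntry(check, checked_by, date, change))


# Hypothetical record with a misspelt state name.
rec = SpecimenRecord("R1", {"state": "Tasmnia"})
rec.correct("state", "Tasmania", check="gazetteer spelling check",
            checked_by="curator", date="2004-11-29")
```

Keeping the old value in the log means a mistaken "correction" can be traced and reversed later.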

  29. Recording Accuracy and Error
      • Additional accuracy fields
        • Preferably in meters (Point-Radius)
      Documenting validation tests
      • Who
      • What
      • How
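
Storing accuracy as a point plus a radius in meters, as slide 29 suggests, makes simple consistency checks possible. The sketch below is an illustration under stated assumptions: `PointRadius` and `may_overlap` are hypothetical names, the coordinates are invented, and distance uses a spherical-Earth approximation.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


@dataclass(frozen=True)
class PointRadius:
    """A georeference stored as a point with an uncertainty radius in meters."""
    lat: float
    lon: float
    radius_m: float


def distance_m(a, b):
    """Great-circle (haversine) distance in meters between two PointRadius centres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def may_overlap(a, b):
    """True if the two uncertainty circles touch or overlap."""
    return distance_m(a, b) <= a.radius_m + b.radius_m


# Hypothetical georeferences for duplicate specimens of one collecting event.
original = PointRadius(-42.400, 173.68, 1000.0)
duplicate_near = PointRadius(-42.405, 173.68, 500.0)   # roughly 0.56 km away
duplicate_far = PointRadius(-42.500, 173.68, 500.0)    # roughly 11 km away
```

Two georeferences whose circles do not overlap cannot both describe the same place within their stated accuracy, so at least one of them is worth rechecking.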

  30. Methods for geocode validation
      • Internal database checks
      • External database checks
      • Outliers in geographic space - GIS
      • Outliers in environmental space - Models
      • Statistical outliers
      (Photo: Butterfly, Florida, USA)

  31. Internal/External Database Checks
      • Logical inconsistencies within the database
        • Checking one field against another
        • Text location vs geocode or District/State
      • Checking one database against another
        • Gazetteers
        • DEM
        • Collectors
      (Photo: Magellanic Penguin, Argentina)
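
An internal check of one field against another, such as State against geocode in slide 31, could be sketched as below. The bounding box is a rough, illustrative rectangle for Tasmania and not an authoritative boundary; a production check would use proper administrative polygons in a GIS. The names `STATE_BOXES` and `geocode_matches_state` are hypothetical.

```python
# Rough illustrative bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# These values are assumptions for the example, not surveyed boundaries.
STATE_BOXES = {
    "Tasmania": (-43.7, -39.5, 143.8, 148.5),
}


def geocode_matches_state(record, boxes=STATE_BOXES):
    """Return True/False if the geocode is inside/outside the named state's box.

    Returns None when the state has no box on file, so the record can be
    routed to a manual check instead of being passed or failed silently.
    """
    box = boxes.get(record["state"])
    if box is None:
        return None
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= record["lat"] <= max_lat and min_lon <= record["lon"] <= max_lon


ok_record = {"state": "Tasmania", "lat": -42.88, "lon": 147.33}
sign_error = {"state": "Tasmania", "lat": 42.88, "lon": 147.33}   # latitude sign dropped
unknown = {"state": "Atlantis", "lat": 0.0, "lon": 0.0}
```

A dropped minus sign on latitude or longitude is exactly the kind of error this check tends to catch cheaply.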

  32. Error
      "Error is inescapable and it should be recognised as a fundamental dimension of data." (Chrisman 1991)

  33. Geographic outliers - GIS
      Country, State, named district, etc.
      Gazetteer of Brazilian localities

  34. Diva-GIS - Outlier
      • Reverse jack-knifing technique
      • Threshold value t = 0.95√(n) + 0.2
      www.diva-gis.org

  35. CRIA - Data Cleaning http://splink.cria.org.br/dc/

  36. Principal Components Analysis - FloraMap
      Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data. A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.

  37. Cumulative Frequency Curves - Diva-GIS
      Results from Diva-GIS showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5 percentile.
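
The percentile idea behind slide 37 can be illustrated without any GIS software. The sketch below is not Diva-GIS or BIOCLIM code: it simply flags records whose value for one climate variable falls outside a lower and an upper percentile. The 97.5 upper bound comes from the slide; the symmetric 2.5 lower bound, the linear-interpolation percentile and the sample temperatures are assumptions.

```python
def percentile(sorted_values, p):
    """Percentile p (0-100) of an ascending list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("no values supplied")
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def outside_percentiles(values, lower=2.5, upper=97.5):
    """Indices of records whose value lies below the lower or above the upper percentile."""
    ordered = sorted(values)
    low_cut = percentile(ordered, lower)
    high_cut = percentile(ordered, upper)
    return [i for i, v in enumerate(values) if v < low_cut or v > high_cut]


# Hypothetical mean annual temperatures (deg C) at 21 collection sites;
# the last value is far from the rest, as a mis-geocoded record might be.
temps = list(range(10, 30)) + [55]
flagged = outside_percentiles(temps)
```

With these cut-offs the extremes of any sizeable dataset are always flagged, so the output is a list of candidates for checking against the map, not a list of confirmed errors.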

  38. Environmental Outliers
      • Cumulative Frequency Curves

  39. Errors in data
      "Although most data gathering disciplines treat error as an embarrassing issue to be expunged, the error inherent in (spatial) data deserves closer attention and public understanding." (Chrisman, 1991)

  40. Errors in data - 2
      "In general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use." (Chrisman, 1991)
      (Photo: Mizodendrum sp., Argentina)

  41. Future Challenges
      • Improved data quality
      • Improved documentation of data
      • Improved access to distributed data
      • Improved methods for modelling in aquatic (including marine) environments
      • Decision Support Systems
      • Enlightened Policy / Decision Makers!!!

  42. Thank You… Questions?

  43. Acknowledgements
      • Brazilian Biota/FAPESP Virtual Biodiversity Institute Program
      • Reference Centre for Environmental Information, Brazil (CRIA)
      • Global Biodiversity Information Facility (GBIF)
      • UNESCO
      • Wesleyan University, Connecticut, USA
      • Peabody Museum, Yale University, USA
      • ETI, Holland
      • UN Food and Agriculture Organization (FAO)
      • Environmental Resources Information Network, Australia (ERIN)
      • Commission on Data for Science and Technology (CODATA)
