
Review goals and requirements for producing enhanced data description materials

Building a Geospatial Data Dictionary: Enhanced Data Description NEARC, Fall 2011 Brian Hebert Solutions Architect ScribeKey, LLC www.scribekey.com. Workshop Outline. Review goals and requirements for producing enhanced data description materials Look at approaches to data description



Presentation Transcript


  1. Building a Geospatial Data Dictionary: Enhanced Data Description. NEARC, Fall 2011. Brian Hebert, Solutions Architect, ScribeKey, LLC. www.scribekey.com

  2. Workshop Outline
     • Review goals and requirements for producing enhanced data description materials
     • Look at approaches to data description
     • US Census data as sample
     • Review ScribeKey shareware tools
     • Discussion and Q&A

  3. Goals
     • Make data as easy to understand and use as possible; reduce the learning curve.
     • Learning about data takes a lot of time and effort, and a given dataset is often only one part of a larger data use and mission.
     • Make full use of the tools we have.
     • Apply these ideas to your own use cases.
     • Whether you are a user, provider, broker, or creator of data, help people use it in the best way.

  4. Lessons Learned
     • Global FGDC Metadata and data description materials for large volume commercial geospatial data sets, containing 1000s of data layers and tables.
     • Assess, describe, and standardize large collection of geospatial datasets and metadata.
     • Borrow from data warehousing, business intelligence, and library science approaches.
     [Slide graphic, first collection: 200+ Countries, 72 Layers, 100s of Attributes, 100s of Domains, Quarterly Updates]
     [Slide graphic, second collection: 50+ States, 400 Layers, 1000s of Attributes, 100s of Domains, Annual Updates]

  5. Background: Industrial Strength Metadata Generation
     • Sample data is reviewed and profiled. Any metadata is imported into repository.
     • From profile, existing user documentation, technical support staff, and website, a metadata repository is populated and metadata document templates are developed.
     • FGDC/ISO Metadata generated, as XML/HTML reports, from metadata repository.
     [Diagram labels: Metadata Repository, Metadata Templates, Metadata Export App; outputs: PDF, DOC, FGDC XML, HTML]

  6. Sample Data: US Census
     • US Census Data for Saratoga County, NY
     • Good example
     • Lots of detail
     • Has CSDGM metadata
     • Has its own vocabulary
     [Slide graphic: Saratoga County, NY Personal GeoDb]

  7. How Do People Learn About Data?
     [Diagram labels: Website, Metadata, Documentation, Email, User, Tech Support, Data Itself]
     Users learn how to use data through a variety of sources.

  8. Challenges
     • Documentation: Large volume, time consuming
     • FGDC Metadata: Sets of separate XML documents, redundancy, cumbersome, different format than data being described, etc.
     • Website: Lots of great info, somewhat unstructured
     • Tech Support: Availability, cost
     • Data Itself: Familiarity takes time
     • How can we consolidate all of this information in a single place in an easy-to-use format?

  9. Solution: 2 Data Dictionary Formats: 1) HTML Pages 2) GIS Metalayers
     [Slide labels: Integrated Data/Metadata, Flexible, Familiar, Simplification; Lightweight, Flexible, Familiar, Static or Dynamic]

  10. Essentials: It’s All Metadata
      [Diagram labels: Meaning, Structure, Contents]
      Q: What does it mean to be familiar with data?
      A: Users know where to find something and how to make detailed maps and reports.

  11. Creating FGDC CSDGM Metadata
      Identification_Information:
        Citation:
          Citation_Information:
            Originator: John Hancock
            Publication_Date: 2008
            Title: Boston Streets
        Description:
          Abstract: The Boston Streets dataset provides a complete set of single line street segments for the town of Boston, Massachusetts.
          Purpose: The purpose of the Boston Streets dataset is to provide a basic street base map for general purpose use by the town and its people.
        Time_Period_of_Content:
          Time_Period_Information:
            Single_Date/Time:
              Calendar_Date: 2008
          Currentness_Reference: Publication Date
        Status:
          Progress: Complete
          Maintenance_and_Update_Frequency: Quarterly
        Spatial_Domain:
          Bounding_Coordinates:
            West_Bounding_Coordinate: -70.00
            East_Bounding_Coordinate: -69.00
            North_Bounding_Coordinate: 45.00
            South_Bounding_Coordinate: 44.00
        Keywords:
          Theme:
            Theme_Keyword_Thesaurus: None
            Theme_Keyword: Streets
        Access_Constraints: This dataset may be freely accessed by the public.
        Use_Constraints: This dataset may be freely used by the public.
      Metadata_Reference_Information:
        Metadata_Date: 20080219
        Metadata_Contact:
          Contact_Information:
            Contact_Person_Primary:
              Contact_Person: Sam Adams
            Contact_Address:
              Address_Type: Mailing
              Address: 100 Beacon Street
              City: Boston
              State_or_Province: MA
              Postal_Code: 02108
            Contact_Voice_Telephone: 508-429-1234
        Metadata_Standard_Name: FGDC Content Standards for Digital Geospatial Metadata
        Metadata_Standard_Version: FGDC-STD-001-1998
      Checklist:
      • CSDGM Core
      • Only 26 Values
      • Attribute Definitions
      • Domain Values and Definitions
      • Use USGS MP Tool

  12. Geospatial Metadata Issues
      • There is no real support for non-geometric entities, e.g., tables. For example, the record count element is buried inside a geospatial element, so there is no place to put a record count for a simple table.
      • There is incomplete representation for domains. Domains can’t be shared. Domains have no name of their own, but exist only as info added to an attribute. Domains can only have 2 values, so they can’t support 3 related values, e.g., MA, Massachusetts, 25.
      • Attribute information is optional. Unlike the most basic RDBMS metadata available in any system, there are no elements for attribute data type and length.
      • There are no elements at the entity level for specifying relationships, through joins, etc.
      • Metadata at the level of the individual feature record is not supported.
      • Describing data layers resulting from combinations of N source datasets is not supported.

  13. Geospatial Metadata Issues (cont.)
      • Because they are managed using two different physical implementations, geospatial data and metadata get out of sync.
      • Metadata is available as separate, independent documents. It cannot easily be queried as a set. For example, getting a simple list of features/tables requires a custom XML application.
      • The FGDC CSDGM XML based standard is complex and difficult to understand, both for end users and for vendors building tools. Based on XML with variable length records and nesting, it is basically the schema for an object oriented database, not a relational or object relational database.
      • The new ISO standards are even more confusing and difficult to understand. ISO layer metadata and entity, attribute, domain metadata are also now separated into two different standards. Current recommendation by FGDC is to continue using CSDGM.
      • http://www.fgdc.gov/metadata/geospatial-metadata-standards

  14. CSDGM Physical Implementation Guidelines
      • The FGDC/CSDGM standard clearly states that the standard describes content, and not physical implementation. From the CSDGM Workbook:
        “The standard specifies information content, but not how to organize this information in a computer system or in a data transfer, or how to transmit, communicate, or present the information to a user. There are several reasons for this approach: There are many means by which metadata could be organized in a computer. These include incorporating data as part of a geographic information system, in a separate data base, and as a text file. Organizations can choose the approach which suits their data management strategy, budget, and other institutional and technical factors.”
      In spite of these statements, geospatial metadata implementation has not been approached using industrial strength RDBMS data access technology, but rather relies on sets of separate XML files, using an entirely different data access and management paradigm than that used by the data it is describing.

  15. Centralizing Meaning, Structure, and Content: The RDBMS Based Metadata Repository
      [Diagram labels: Data and Metadata Sources (Roads, Parcels, Buildings, each with FGDC XML Metadata); RDBMS: Structure & Contents; XML: Meaning & Geospatial; Data Description Tools (Data Profiling, Metadata Import); METADATA REPOSITORY]

  16. How Does Data Profiling Help? Data profiling is an essential tool for enhanced metadata: it shows the end user actual sample values, data types, lengths, formats, percent complete, etc. This valuable contents information is typically not found in geospatial metadata.
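A minimal sketch of what a column profile of this kind might look like, written in Python against an in-memory SQLite table. This is not the ScribeKey profiler; the function name `profile_table`, the dictionary keys, and the four sample rows are assumptions made up for illustration (the table and column names borrow from the Census EDGES vocabulary mentioned in the slides).

```python
import sqlite3

def profile_table(conn, table):
    """Return one dict per column: declared type, max length, percent complete, samples."""
    cur = conn.cursor()
    total = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profiles = []
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    columns = cur.execute(f'PRAGMA table_info("{table}")').fetchall()
    for _cid, name, decl_type, *_rest in columns:
        # COUNT(col) counts only non-NULL values; LENGTH measures the text form.
        filled, max_len = cur.execute(
            f'SELECT COUNT("{name}"), MAX(LENGTH("{name}")) FROM "{table}"'
        ).fetchone()
        samples = [row[0] for row in cur.execute(
            f'SELECT DISTINCT "{name}" FROM "{table}" '
            f'WHERE "{name}" IS NOT NULL ORDER BY 1 LIMIT 3'
        )]
        profiles.append({
            "column": name,
            "type": decl_type,
            "max_length": max_len,
            "percent_complete": round(100.0 * filled / total, 1) if total else 0.0,
            "samples": samples,
        })
    return profiles

# Invented sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (TLID INTEGER, FULLNAME TEXT, MTFCC TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (1, "Main St", "S1400"),
    (2, None, "S1100"),
    (3, "Broadway", "S1400"),
    (4, None, "S1400"),
])
report = profile_table(conn, "edges")
for entry in report:
    print(entry)
```

Even this small profile surfaces facts that CSDGM has no element for, such as the declared type and the share of rows where FULLNAME is filled in.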

  17. CSDGM Core into the RDB
      [Diagram: several XML Metadata documents, IMPORT]
      When metadata is imported into an RDB, the full flexibility of SQL becomes available for querying and managing large volumes of data description information.
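The import step might be sketched as follows, assuming Python with SQLite standing in for the repository. The table names `CsdgmEnt` and `CsdgmAtt` appear elsewhere in this deck, but their column layouts here are assumptions, as are the function name `import_csdgm` and the sample document. The element names (`idinfo`, `eainfo`, `detailed`, `enttyp`, `enttypl`, `enttypd`, `attr`, `attrlabl`, `attrdef`) are the CSDGM XML short names as I understand them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny invented CSDGM-style document, for illustration only.
SAMPLE_XML = """<metadata>
  <idinfo><citation><citeinfo><title>Boston Streets</title></citeinfo></citation></idinfo>
  <eainfo><detailed>
    <enttyp><enttypl>Streets</enttypl><enttypd>Single line street segments</enttypd></enttyp>
    <attr><attrlabl>NAME</attrlabl><attrdef>Street name</attrdef></attr>
    <attr><attrlabl>CLASS</attrlabl><attrdef>Road class code</attrdef></attr>
  </detailed></eainfo>
</metadata>"""

def import_csdgm(conn, xml_text):
    """Flatten entity and attribute descriptions from one document into two tables."""
    root = ET.fromstring(xml_text)
    title = root.findtext("idinfo/citation/citeinfo/title", default="")
    for detailed in root.iter("detailed"):
        entity = detailed.findtext("enttyp/enttypl", default="")
        definition = detailed.findtext("enttyp/enttypd", default="")
        conn.execute("INSERT INTO CsdgmEnt VALUES (?, ?, ?)", (title, entity, definition))
        for attr in detailed.findall("attr"):
            conn.execute(
                "INSERT INTO CsdgmAtt VALUES (?, ?, ?)",
                (entity, attr.findtext("attrlabl", default=""),
                 attr.findtext("attrdef", default="")),
            )

conn = sqlite3.connect(":memory:")
# Column layouts are assumptions, not the actual repository schema.
conn.execute("CREATE TABLE CsdgmEnt (Title TEXT, Entity TEXT, Definition TEXT)")
conn.execute("CREATE TABLE CsdgmAtt (Entity TEXT, Attribute TEXT, Definition TEXT)")
import_csdgm(conn, SAMPLE_XML)

# Once loaded, a list of entities with attribute counts is a single query.
rows = conn.execute(
    "SELECT Entity, COUNT(*) FROM CsdgmAtt GROUP BY Entity ORDER BY Entity"
).fetchall()
print(rows)
```

Calling `import_csdgm` once per XML file would accumulate many documents in the same two tables, which is what makes set-based queries across the whole collection possible.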

  18. Tools Demonstration
      [Slide labels: Data Profiling, Metadata Import; Windows Based, Batch Command Line, .NET, .mdb Files, Logging]

  19. Inside the Repository: Tables and Views
      • PROFILE: DiTABLES, DiCOLUMNS, DiDOMAINS, DiDomainValues
      • METADATA INGEST: CsdgmEnt, CsdgmAtt, CsdgmDomVal
      • VIEWS: EntRpt, AttRpt, DomRpt
      Elements from Profile and Metadata Ingestion can be combined through SQL views. Data structure, contents, and meaning are housed in a table-centric RDBMS repository that is easy to access, query, and share. If you didn’t have CSDGM attribute metadata before, the data profile really helps by providing a baseline.
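As a rough illustration of combining profile and ingested metadata through a view, the sketch below builds a stand-in `AttRpt` in SQLite from Python. Only the table and view names come from the slide; every column name, the join keys, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed column layouts; the real repository schema is not shown in the slides.
CREATE TABLE DiCOLUMNS (TableName TEXT, ColumnName TEXT, DataType TEXT, MaxLength INTEGER);
CREATE TABLE CsdgmAtt (Entity TEXT, Attribute TEXT, Definition TEXT);

INSERT INTO DiCOLUMNS VALUES ('EDGES', 'MTFCC', 'TEXT', 5);
INSERT INTO DiCOLUMNS VALUES ('EDGES', 'FULLNAME', 'TEXT', 100);
INSERT INTO CsdgmAtt VALUES ('EDGES', 'MTFCC', 'MAF/TIGER Feature Class Code');

-- LEFT JOIN keeps every profiled column, even when no metadata definition exists,
-- so the profile acts as the baseline and metadata fills in meaning where available.
CREATE VIEW AttRpt AS
SELECT c.TableName, c.ColumnName, c.DataType, c.MaxLength,
       COALESCE(a.Definition, '(no definition in metadata)') AS Definition
FROM DiCOLUMNS AS c
LEFT JOIN CsdgmAtt AS a
       ON a.Entity = c.TableName AND a.Attribute = c.ColumnName;
""")

rows = conn.execute(
    "SELECT ColumnName, Definition FROM AttRpt ORDER BY ColumnName"
).fetchall()
for row in rows:
    print(row)
```

The placeholder text for missing definitions doubles as a to-do list: any attribute that shows it still needs a written definition.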

  20. Helping with the Data Provider/End User Communication Gap
      User Language: “Layer Table Attribute Map Symbol Centroid Join Report”
      Provider Language: “Impute FROMHN EDGES ADDRFN Internal Point MTFCC S1100”
      Data providers and users have different languages and understandings of data. Use of keywords, aliases, and definitions in the data dictionary helps bridge this gap by providing a translation.

  21. Schemas and Semantics
      [Diagram labels: Data Modelers, ISO/OGC Schemas (UML, XSD, GML, ISO 19XXX, Ontologies); GIS Users (Layers, Attributes, Symbols, Towns …); “The Tower of Babel”; “What does this mean?”; “Abracadabra”]

  22. Next Steps: Clarification and Completion
      • We’ve integrated profile and metadata info
      • Now need to refine this information
      • Make sure everything is clear
      • Make sure everything is complete
      • Library Science to the rescue

  23. Library Science Artifacts
      • Indexing and Abstracting
      • The Dictionary Hierarchy
      • Types and Taxonomies
      • The Thesaurus
      • The Glossary
      With the Metadata Repository loaded, a number of useful data description artifacts can be developed.

  24. Indexing and Abstracting: The Overview Page
      • The most essential information
      • Clear concise writing
      • Links to details
      • Automated tools are no substitute for subject matter expertise
      • Limits of FGDC or ISO schemas as template
      • Data driven

  25. The Data Dictionary Hierarchy: Categories, Entities, Attributes, Domains
      • Data typically falls into higher level categories
      • Entities include layers and tables and relationships among them
      • Attribute data types, lengths, domain contents provide the heart of data detail for query, reporting, and mapping
      • A streamlined and flexible view of metadata

  26. Feature Types and Taxonomies
      • Users need to be able to search through metadata and data easily, using feature names they are familiar with.
      • Domain profiles and metadata are starting points for developing a feature description typology.
      • Isolated domain information doesn’t always present the entire picture.
      This HTML page allows users to look up a feature name and find the corresponding layer and attribute SQL query that can be used to filter for it.

  27. The Thesaurus: What’s in a name?
      [Slide labels: US Census MTFCC, SDTS Entities]

  28. Choosing the Best Names
      • If you’re developing a new set of names for data categories, entities, attributes, and domain values, use words that your data user audience is familiar with.
      • Don’t invent new words when existing ones will do. Reuse taxonomies.
      • “Consistency is the last refuge of the unimaginative” (Oscar Wilde)
      • Natural language is often inconsistent, but can still be very clear for end users.

  29. Choosing the Best Names (cont.)
      The Google Test: lon/lat: 201,000,000; lat/lon: 7,870,000

  30. Tool Demonstration: Sql2Html

  31. Glossaries
      http://textalyser.net/
      • Which words and terms need to be described?
      • Text analysis tools are freely available for helping with this task.
      • This list was generated from entity definitions.
      • Can also be used as input to list of keywords for FGDC metadata.
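For readers who want to try the idea without an online tool, here is a rough Python sketch of a term-frequency pass over entity definitions. It is not how textalyser.net works; the stop-word list, the three sample definitions, and the function name `glossary_candidates` are all assumptions for illustration, and there is deliberately no stemming, so "edge" and "edges" count separately.

```python
import re
from collections import Counter

# Small hand-picked stop-word list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "such", "one", "two", "more"}

def glossary_candidates(definitions, top_n=5):
    """Count words across definitions and return the most frequent ones."""
    counts = Counter()
    for text in definitions:
        words = re.findall(r"[a-z]+", text.lower())
        # Skip very short words and stop words; both rarely need a glossary entry.
        counts.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(top_n)

# Invented entity definitions, for illustration only.
definitions = [
    "A linear feature such as a road edge or rail edge.",
    "An edge is bounded by two nodes.",
    "A face is bounded by one or more edges.",
]
candidates = glossary_candidates(definitions)
print(candidates)
```

The frequent terms are only candidates. Deciding which of them a user would actually stumble over still takes subject matter judgment.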

  32. Metalayers: Metadata as GIS Data. Tables from the Metadata Repository can be easily accessed in ArcMap, and joined with polygon layers to provide access to fully integrated data/metadata.

  33. Metalayers: Metadata as GIS Data (cont.) Metadata Repository layer/table information, as populated from data profiling and FGDC metadata ingestion, for US Census data, Saratoga County area, against full backdrop of New York towns.

  34. Table-Centric Metadata in ArcGIS
      • Metadata tables can be added to your ArcMap .mxd files.
      • If you have multiple sets of heterogeneous data, you can link metadata tables with polygons depicting data coverage areas.
      • Metadata can now be used like any other geospatial data, as the basis for color shading, symbology, reports, etc.
      • Metadata can be used to first find data, through a lighter weight wrapper, then drill through to the actual underlying data.

  35. Are Data Aggregation Results Metadata?
      • Data aggregation provides a key component of decision support information systems, also known as Business Intelligence (BI).
      • Provides a smaller, faster, high level summary and simplification of large volumes of data.
      • Helps decision makers focus in on what’s important.
      • Created using standard RDBMS SQL aggregation constructs (SUM, COUNT, and GROUP BY) and OLAP technology.
      [Diagram labels: AGGREGATE, BASE DATA]
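The aggregation constructs named above can be shown in a few lines. This is a sketch in Python with SQLite; the `roads` table, its columns, and all four rows are invented for illustration and are not Census figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base data: one row per road segment (values invented).
CREATE TABLE roads (town TEXT, mtfcc TEXT, length_km REAL);
INSERT INTO roads VALUES ('Ballston', 'S1400', 1.5);
INSERT INTO roads VALUES ('Ballston', 'S1400', 2.0);
INSERT INTO roads VALUES ('Ballston', 'S1100', 4.0);
INSERT INTO roads VALUES ('Malta', 'S1400', 3.0);
""")

# Aggregate: one row per town, far smaller than the base table.
summary = conn.execute("""
    SELECT town, COUNT(*) AS segments, SUM(length_km) AS total_km
    FROM roads
    GROUP BY town
    ORDER BY town
""").fetchall()
for row in summary:
    print(row)
```

A result like this can be stored as its own table and joined to town polygons, which is the step that turns an aggregate into a metalayer.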

  36. Metalayers: Aggregation

  37. Metalayer Drilldown and Rollup
      [Figure: increasingly detailed views, COUNTY, TOWN, CENSUS TRACT]
      Applying a Pivot Table like view, with Drilldown and Rollup, to hierarchical geography units.

  38. Meta-Layer Geometry Creation and Management
      Spatial_Domain:
        Bounding_Coordinates:
          West_Bounding_Coordinate: -167.946360
          East_Bounding_Coordinate: 179.001991
          North_Bounding_Coordinate: 71.298141
          South_Bounding_Coordinate: 17.678360
      [Figure: Lon/Lat Bounding Boxes, panels 1, 2, 3]
      Three basic approaches to generating layer coverage polygons: 1) bounding boxes, 2) convex/concave hulls and tessellations, and 3) existing administrative or other polygons. The choice is based on presentation and data management requirements.
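The first approach, a bounding box polygon built from the four CSDGM bounding coordinates, might look like this in Python. The function name `bbox_to_wkt` is hypothetical, and WKT is only one possible output format, chosen here because most GIS tools can read it.

```python
def bbox_to_wkt(west, east, north, south):
    """Build a closed rectangular WKT polygon (lon lat order) from bounding coordinates."""
    if south > north:
        raise ValueError("south bounding coordinate must not exceed north")
    # Counter-clockwise ring; the first point is repeated to close the ring.
    ring = [(west, south), (east, south), (east, north), (west, north), (west, south)]
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON (({coords}))"

# Bounding coordinates taken from the Spatial_Domain example on this slide.
wkt = bbox_to_wkt(west=-167.946360, east=179.001991, north=71.298141, south=17.678360)
print(wkt)
```

The sketch does not compare west against east, because a box that crosses the antimeridian can legitimately have a west value larger than its east value; handling that case properly would need extra logic.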

  39. Convex Hull of Census Edges Layer. Convex hulls are useful for describing arbitrary Metalayer coverage areas when no existing political or administrative boundary polygons are available.
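The slides do not say how the hull was computed, and a GIS package would normally do this for you. As a self-contained illustration, here is Andrew's monotone chain algorithm in Python, applied to a handful of invented points standing in for layer vertices.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]

# Invented vertices: a square with one interior point and one point on an edge.
hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)])
print(hull)
```

Interior and collinear points are dropped, so the hull polygon stays small even when the source layer has many vertices.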

  40. Summary and Take-Aways: 5 Phases
      • Developed standardized geospatial metadata
      • Profiled data
      • Integrated profile results and metadata in an RDBMS repository
      • Refined information, using library science approach and artifacts
      • Exported metadata from repository in 2 convenient formats, HTML and geospatial data layers.

  41. Take Away: Lightweight HTML Data Dictionary. Full descriptions of data categories, entities, attributes, and domain values. Information integrated from documentation, data profiles, metadata, and data provider website. Available as standalone HTML or on a web site.
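One way to picture the HTML export is a function that renders any repository query as an HTML table. The sketch below is in Python and is not the SkSql2Html tool; the function name `query_to_html`, the `entity_report` table, and its single row are hypothetical.

```python
import sqlite3
from html import escape

def query_to_html(conn, sql, title):
    """Run a query and render the result set as a simple HTML table."""
    cur = conn.execute(sql)
    headers = [col[0] for col in cur.description]
    lines = [f"<h2>{escape(title)}</h2>", "<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>")
    for row in cur:
        # Escape every value so characters like & and < cannot break the page.
        cells = "".join(
            f"<td>{escape('' if value is None else str(value))}</td>" for value in row
        )
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)

# Hypothetical stand-in for a repository report table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity_report (Entity TEXT, Definition TEXT)")
conn.execute("INSERT INTO entity_report VALUES ('EDGES', 'Roads & other linear features')")
page = query_to_html(conn, "SELECT Entity, Definition FROM entity_report", "Entities")
print(page)
```

Because the page is generated from a query, it can be regenerated whenever the repository changes, or written once to a file for a static data dictionary.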

  42. Take Away: Metalayers. Use data profiles and metadata to create GIS layers that support a variety of map presentations, reports, etc., to summarize and highlight datasets by metadata values.

  43. Take Away: Data Description Checklist
      [Slide labels: Meaning, Structure, Contents]
      • Is there a Data User Guide? A glossary and index?
      • Are primary data categories and entities fully described?
      • Are all acronyms, abbreviations, provider vocabulary terms explained?
      • Are short, cryptic database field names and values explained?
      • Are data types, lengths, keys, nulls allowed, formats, lists clear to help user form SQL queries?
      • Is FGDC/ISO Metadata available?
      • Are sample values and data profiles available?
      • Are data presentations, maps, symbols, reports prepared for quick start?
      • All this info in one place?
      Complete metadata describes Meaning, Structure, and Contents. Maximize understanding of details by end user to help create queries/reports/maps.

  44. Take Away: Use a Geospatial Metadata Repository
      [Diagram labels: METADATA REPOSITORY; Data Dictionary, Data Layers, Enhanced User Views, Metadata, Pivot Tables, Areas, Entities, Derivative Datasets, Documents, Metalayers, Assessments, Attributes, Domains, New Schemas]
      The Metadata Repository, implemented as an RDBMS, is populated with automated tools and then used to generate metadata outputs, data dictionary content, schemas, maps, etc.

  45. The Future: Structured vs. Unstructured Query/Access. Structured data queries require that a user know the exact entity.attribute=value construct to find data. Unstructured data queries can use underlying metadata tables, like the FeatureFilter, to locate the correct entity.attribute=value construct to find data. Metadata is also generally much smaller in volume than the data it describes and can be queried very quickly.
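A lookup of this kind could be sketched as below, in Python with SQLite. Only the table name `FeatureFilter` comes from the slide; its columns, the function name `find_filters`, and the three rows are assumptions (the MTFCC codes shown are the Census road classes as I understand them).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout: one row per familiar feature name and the filter that selects it.
CREATE TABLE FeatureFilter (FeatureName TEXT, Entity TEXT, Attribute TEXT, Value TEXT);
INSERT INTO FeatureFilter VALUES ('Primary Road', 'EDGES', 'MTFCC', 'S1100');
INSERT INTO FeatureFilter VALUES ('Secondary Road', 'EDGES', 'MTFCC', 'S1200');
INSERT INTO FeatureFilter VALUES ('Local Neighborhood Road', 'EDGES', 'MTFCC', 'S1400');
""")

def find_filters(conn, term):
    """Map a free-text term to the structured queries that select matching features."""
    rows = conn.execute(
        "SELECT FeatureName, Entity, Attribute, Value FROM FeatureFilter "
        "WHERE FeatureName LIKE ? ORDER BY FeatureName",
        (f"%{term}%",),
    ).fetchall()
    # The generated SQL is text to show the user, built from trusted metadata rows.
    return [
        (name, f"SELECT * FROM {entity} WHERE {attribute} = '{value}'")
        for name, entity, attribute, value in rows
    ]

matches = find_filters(conn, "primary")
for name, sql in matches:
    print(name, "->", sql)
```

The user types a word they already know, and the metadata table supplies the provider's entity, attribute, and code value for them.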

  46. ScribeKey Shareware Tools
      • Data Profiler: SkProfile.exe
      • Metadata Importer: SkMtd2Db.exe
      • SQL To HTML Generator: SkSql2Html.exe
      • MS Access Metadata Repository
      • Look at ReadMe.txt files
      • Work with Personal Geodatabases
      • Requires .NET runtime

  47. Thank you. Q&A. Brian Hebert, Solutions Architect, ScribeKey, LLC. www.scribekey.com