1 / 17

UIS Data Transformation and Validations

UIS Data Transformation and Validations. As it pertains to the SDMX TWG EXL Initiative. Gathering Data. Each data point to be collected is described with dimensions prior to collection Unique identifier is assigned to each data point/dimensional grouping Data is collected via surveys

raziya
Download Presentation

UIS Data Transformation and Validations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. UIS Data Transformation and Validations As it pertains to the SDMX TWG EXL Initiative

  2. Gathering Data • Each data point to be collected is described with dimensions prior to collection • Unique identifier is assigned to each data point/dimensional grouping • Data is collected via surveys • Data is inserted into the database by country/year for each survey returned • Data goes through a cleaning process that involves both human and automated validation (ERS)

  3. Data Encoding EMC_ID: Internal unique identifier used to store data. Each EMC_ID summarizes a set of dimension for data that we collect. In this case, the data point refers to ENROLLMENT (EC_UNIT=210) in ISCED 1 (EC_ISCED = 10). Labels for each dimensional value are stored in separate dimension tables. For a more human legible format, each EMC_ID used in indicator definitions is also given an alphanumeric code that summarizes the dimensions. In this case, “E.1” is used, for ENROLLMENT in ISCED 1.

  4. Raw Data Validation (ERS) • Database (T-SQL) implementation. • Stored procedures and reporting services • Based on CONCEPTS Example: Concept: Redundant Data Check Description: UIS Surveys often have cells that are redundant in order to verify that the value entered in one cell is accurate and not the victim of a human input error Purpose: Verify that one cell equals another, redundant, cell Method: Validates that a specific “MASTER” cell is equal to any other redundant cell. Redundant cells are identified by having all dimensional values equal to the master cell with the exception of the PRIORITY dimension.

  5. Preparing Indicators (transformations) • Indicators are encoded in XML using extended MathML • Resulting XML file can render in a friendly manner in any browser, providing immediate documentation • Indicator XML file is “parsed” to convert the XML into database records • Indicator definitions are validated when parsed to ensure completeness as well as the existence of any needed indicators

  6. Indicator Definition Indicators are defined using MathML, with custom tags implemented by the UIS. <indicator name="GAP"> <label> <en>Graduation age population</en> <fr>Population d age de graduation</fr> </label> <formulas> <formula> <roll wildcard="isc" list="1,2.GPV,2"/> <roll wildcard="sex" list="F,T"/> <suffix>(isc).(sex)</suffix> <offset wildcard="age" low="Ag1" up="Ag25"> <sum> <d>thAge.(isc)</d> <d>thDur.(isc)</d> </sum> <c>1</c> <d src="POP" >P.(age).(sex)</d> </offset> <synonym wildcard="isc" use="2.A.GPV" for="2.GPV" /> <synonym wildcard="isc" use="2.A.GPV" for="2" /> </formula> </formulas> </indicator>

  7. Indicator Definition (cont.) • When loaded into a MathML enabled browser, the indicator definition becomes human readable and self documenting. • Rendering the XML in a browser also helps to validate that the XML indicator specification is well formed.

  8. Indicator Definition (cont.) • A parser is then used to convert the XML indicator specification to a database structure for use in processing the transformations

  9. calcIndic • Seasoned for 7 years • Currently on 4th version • Entirely developed using database stored procedures and T-SQL • Leverages well seasoned database functionality • Data, indicator definitions and transformation code all in a single database. Fast.

  10. calcIndic (part 2) • Indicator definitions are read • Each <d> (data) or <i> (indicator) tag is resolved by joining the required data point to the indicator definition for each country and year involved in the transformation • The steps for performing the calculation are performed based on the indicator definition • Data is written to domain-specific tables • Indicator validations are performed and problematic results are flagged. The reasons for each flag are logged to permit easy auditing.

  11. User Defined Indicator Validation (DIVA)(in development) • XML based. Validation rules for a particular indicator are defined alongside the indicator definition. • MathML based with extended custom tags • Validation process is SQL based • As with the indicator definition, browser plugin makes the XML definition self-documenting

  12. User Defined Indicator Validation (DIVA)(in development) <diva> <!-- relative change is not greater than 10< --> <formula cid="12" cids="1" range="0.1"> <roll wildcard="sex" list="M,F,T" /> <roll wildcard="isc" list="0,1,23,4,56" /> <suffix>(isc).(sex)</suffix> <test> <i>SAP.(isc).(sex)</i> <irelyear="-1">SAP.(isc).(sex)</i> <filter> <i>SAP.(isc)</i><i>SAP.(isc).M</i><i>SAP.(isc).F</i> </filter> </test> </formula> </diva>

  13. Dealing with missing/special data • Both ERS and calcIndic allow for special processing of missing data • Rules coding allow for custom treatment of special data • Normal rule for formulas: “Special data” properties are viral. If you add a list of numbers together, and one value is “missing”, the sum will be “missing”. • Normal rule for comparisons: Special data is only equal to similar special data (missing = missing).

  14. Dealing with missing/special data(cont.) • Specifying alternate processing rules possible on a case-by-case basis. • When defining an indicator, each data point can have a rule specified to enable an alternate way of dealing with special data • When defining a validation concept in ERS, each concept can have an alternate rule specified for comparisons

  15. ERS: Example of special data rules for comparisons Default Comparison Alternate Comparison for INCLUSION (when the data is included in the master cell)

  16. calcIndic: Example of special data rules for calculations <d>E.(isc).(age).(sex)</d> By default, if the above data point is missing, the indicator calculated will also be labeled as missing. <d MG=“2”>E.(isc).(age).(sex)</d> The MG=“2” code above alters the behavior of the data point. Missing data for this data point will now be considered ‘nil’ or 0

  17. Future Development • DIVA • Ability to launch “on command”, instancing • Ability to calculate only the indicators that are affected by an underlying data change

More Related