uis data transformation and validations n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
UIS Data Transformation and Validations PowerPoint Presentation
Download Presentation
UIS Data Transformation and Validations

Loading in 2 Seconds...

play fullscreen
1 / 17

UIS Data Transformation and Validations - PowerPoint PPT Presentation


  • 138 Views
  • Uploaded on

UIS Data Transformation and Validations. As it pertains to the SDMX TWG EXL Initiative. Gathering Data. Each data point to be collected is described with dimensions prior to collection Unique identifier is assigned to each data point/dimensional grouping Data is collected via surveys

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'UIS Data Transformation and Validations' - raziya


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
uis data transformation and validations

UIS Data Transformation and Validations

As it pertains to the SDMX TWG EXL Initiative

gathering data
Gathering Data
  • Each data point to be collected is described with dimensions prior to collection
  • Unique identifier is assigned to each data point/dimensional grouping
  • Data is collected via surveys
  • Data is inserted into the database by country/year for each survey returned
  • Data goes through a cleaning process that involves both human and automated validation (ERS)
data encoding
Data Encoding

EMC_ID: Internal unique identifier used to store data. Each EMC_ID summarizes a set of dimension for data that we collect.

In this case, the data point refers to ENROLLMENT (EC_UNIT=210) in ISCED 1 (EC_ISCED = 10). Labels for each dimensional value are stored in separate dimension tables.

For a more human legible format, each EMC_ID used in indicator definitions is also given an alphanumeric code that summarizes the dimensions. In this case, “E.1” is used, for ENROLLMENT in ISCED 1.

raw data validation ers
Raw Data Validation (ERS)
  • Database (T-SQL) implementation.
  • Stored procedures and reporting services
  • Based on CONCEPTS

Example:

Concept: Redundant Data Check

Description: UIS Surveys often have cells that are redundant in order to verify that the value entered in one cell is accurate and not the victim of a human input error

Purpose: Verify that one cell equals another, redundant, cell

Method: Validates that a specific “MASTER” cell is equal to any other redundant cell. Redundant cells are identified by having all dimensional values equal to the master cell with the exception of the PRIORITY dimension.

preparing indicators transformations
Preparing Indicators (transformations)
  • Indicators are encoded in XML using extended MathML
  • Resulting XML file can render in a friendly manner in any browser, providing immediate documentation
  • Indicator XML file is “parsed” to convert the XML into database records
  • Indicator definitions are validated when parsed to ensure completeness as well as the existence of any needed indicators
indicator definition
Indicator Definition

Indicators are defined using MathML, with custom tags implemented by the UIS.

<indicator name="GAP">

<label>

<en>Graduation age population</en>

<fr>Population d age de graduation</fr>

</label>

<formulas>

<formula>

<roll wildcard="isc" list="1,2.GPV,2"/>

<roll wildcard="sex" list="F,T"/>

<suffix>(isc).(sex)</suffix>

<offset wildcard="age" low="Ag1" up="Ag25">

<sum>

<d>thAge.(isc)</d>

<d>thDur.(isc)</d>

</sum>

<c>1</c>

<d src="POP" >P.(age).(sex)</d>

</offset>

<synonym wildcard="isc" use="2.A.GPV" for="2.GPV" />

<synonym wildcard="isc" use="2.A.GPV" for="2" />

</formula>

</formulas>

</indicator>

indicator definition cont
Indicator Definition (cont.)
  • When loaded into a MathML enabled browser, the indicator definition becomes human readable and self documenting.
  • Rendering the XML in a browser also helps to validate that the XML indicator specification is well formed.
indicator definition cont1
Indicator Definition (cont.)
  • A parser is then used to convert the XML indicator specification to a database structure for use in processing the transformations
calcindic
calcIndic
  • Seasoned for 7 years
  • Currently on 4th version
  • Entirely developed using database stored procedures and T-SQL
  • Leverages well seasoned database functionality
  • Data, indicator definitions and transformation code all in a single database. Fast.
calcindic part 2
calcIndic (part 2)
  • Indicator definitions are read
  • Each <d> (data) or <i> (indicator) tag is resolved by joining the required data point to the indicator definition for each country and year involved in the transformation
  • The steps for performing the calculation are performed based on the indicator definition
  • Data is written to domain-specific tables
  • Indicator validations are performed and problematic results are flagged. The reasons for each flag are logged to permit easy auditing.
user defined indicator validation diva in development
User Defined Indicator Validation (DIVA)(in development)
  • XML based. Validation rules for a particular indicator are defined alongside the indicator definition.
  • MathML based with extended custom tags
  • Validation process is SQL based
  • As with the indicator definition, browser plugin makes the XML definition self-documenting
user defined indicator validation diva in development1
User Defined Indicator Validation (DIVA)(in development)

<diva>

<!-- relative change is not greater than 10< -->

<formula cid="12" cids="1" range="0.1">

<roll wildcard="sex" list="M,F,T" />

<roll wildcard="isc" list="0,1,23,4,56" />

<suffix>(isc).(sex)</suffix>

<test>

<i>SAP.(isc).(sex)</i>

<irelyear="-1">SAP.(isc).(sex)</i>

<filter>

<i>SAP.(isc)</i><i>SAP.(isc).M</i><i>SAP.(isc).F</i>

</filter>

</test>

</formula>

</diva>

dealing with missing special data
Dealing with missing/special data
  • Both ERS and calcIndic allow for special processing of missing data
  • Rules coding allow for custom treatment of special data
  • Normal rule for formulas: “Special data” properties are viral. If you add a list of numbers together, and one value is “missing”, the sum will be “missing”.
  • Normal rule for comparisons: Special data is only equal to similar special data (missing = missing).
dealing with missing special data cont
Dealing with missing/special data(cont.)
  • Specifying alternate processing rules possible on a case-by-case basis.
  • When defining an indicator, each data point can have a rule specified to enable an alternate way of dealing with special data
  • When defining a validation concept in ERS, each concept can have an alternate rule specified for comparisons
ers example of special data rules for comparisons
ERS: Example of special data rules for comparisons

Default Comparison

Alternate Comparison for INCLUSION

(when the data is included in the master cell)

calcindic example of special data rules for calculations
calcIndic: Example of special data rules for calculations

<d>E.(isc).(age).(sex)</d>

By default, if the above data point is missing, the indicator calculated will also be labeled as missing.

<d MG=“2”>E.(isc).(age).(sex)</d>

The MG=“2” code above alters the behavior of the data point. Missing data for this data point will now be considered ‘nil’ or 0

future development
Future Development
  • DIVA
  • Ability to launch “on command”, instancing
  • Ability to calculate only the indicators that are affected by an underlying data change