
Managing the Data Acquisition & Exchange Relationship

By Michael Scofield

Manager, Data Asset Development, ESRI, Inc., Redlands, CA

Asst. Professor, Health Information Management, Loma Linda University

[email protected]

Vers. 32 MSP June 9, 2008 L-3


About Michael Scofield

  • Michael Scofield is Manager of Data Asset Development at ESRI in Redlands, California. He is a popular speaker on topics of data management, data quality, and data warehouse design, as well as satellite imagery interpretation and emergency communications. His career spans education and private industry in data warehousing and data management. His articles appear in DM Review, the B-Eye Newsletter, InformationWeek magazine, the IBI Systems Journal, and other professional journals.

  • He has spoken to over 120 professional audiences for groups such as Data Management Assn chapters, European Metadata Conferences, Information Quality Conferences, The Data Warehousing Institute, Oracle User Groups, Institute of Internal Auditors, Assn. of Government Accountants, Quality Assurance Association chapters, Assn. for Computing Machinery and other professional and civic audiences.

  • Mr. Scofield is also Asst. Professor of Health Information Management at Loma Linda University.



Alternate titles:

  • “Managing the Data Acquisition Relationship”

  • “How Not to Mess Up When You Import Data”

“Data acquisition” has traditionally referred to science and engineering instrumentation.

“data”

Source

User


Topics & Areas of Concern

  • Spelling out the relationship

  • Difference between data and information

  • Understanding specific data and information needs

  • Asking for the right data and finding what you need

  • Data value and utility

  • Assessing the burden on potential data providers

  • Scope and completeness of data


Topics (cont.)

  • Versioning and timeliness

  • Media and physical format

  • Compatibility of logical data architectures

  • Data quality assessment

  • Updates and refresh issues

  • Data collection bias

  • Legal issues

  • Continuing data flow surveillance


How do you describe a dataset?

  • Architecture: What subjects (things) are described by a record; facts/fields/attributes/columns; the logical data model

  • Scope: What records are included or excluded, on dimensions such as time, geography, and organization

  • Currency: Compared to declared scope; table-level and column-specific

  • Quality: Precision, completeness (by column), accuracy

However… data acquisition is much, much more.


Introduction

  • Why talk about this?

  • Because…

  • …we want more and more data, and we don’t generate it all ourselves.

  • So… we acquire it somewhere else.


Never a simple flow of data!

Relationship

User(“target”)

“data”

Source

Expectations: subjects covered by the data; scope of data; quality of data; currency of data

Expectations: money; how you use the data; burden on others?

Terms: Usage rights

Often forgotten topics: updates and refresh; corrections; documentation; other measures of quality


Kinds of data “flows”

Trigger events:

Elapsed time (day, week, month, sub-day)

Source business event (usually a transaction)

Target business event (a transaction requests limited data; e.g. a balance check)

Human decision (e.g. BI)

Record growth trigger (e.g. every 5,000 records in a source transaction file)

“Push” vs. “pull”:

When the trigger happens, which side does the heavy work?

Push: the data host compiles and sends a data file.

target environment

App.

Pull: Data requestor sends query to source database.

query

target environment

Appl. database

Import app.

results
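The push/pull distinction can be sketched in code. This is a minimal illustration with a hypothetical in-memory source database, not anything from the deck itself:

```python
# Hypothetical in-memory stand-ins for a source database and a target store.
source_db = [
    {"id": 1, "balance": 100},
    {"id": 2, "balance": 250},
]

def push_extract(db):
    """Push: the data host does the heavy work, compiling a full extract file."""
    return list(db)

def pull_query(db, record_id):
    """Pull: the requestor sends a narrow query (e.g. a balance check)."""
    return [r for r in db if r["id"] == record_id]

target = push_extract(source_db)                   # big batch file arrives
balance = pull_query(source_db, 2)[0]["balance"]   # one-record request: 250
```

Either way, the trigger event decides *when* this runs; push vs. pull decides *which side* does the heavy work.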


Flows exist in many places

Data supplier

Enterprise

Appl. A

Appl. B

Un-coordinated applications

DW

Outside data user

Acquired division

Business Intelligence

Appl. G


Each source has a data architecture

Data supplier

Enterprise

Constraints

Appl. A

Appl. B

Expectations

Un-coordinated applications

DW

Outside data user

Acquired division

Business Intelligence

Appl. G


What is data architecture?

  • The logical and semantic structure of the business (or that part of the business) and the data which describes and supports it.

  • Described by a data model

  • Subject entities; relationships; attributes; entity-relationship diagram

  • Is abstract (not understood by many)

  • Can be complex


Each FLOW has a data architecture

Data supplier

Enterprise

Appl. A

Appl. B

Expectations

Un-coordinated applications

DW

Outside data user

Acquired division

Appl. G

B.I.


Enterprise-captured data life cycle

Transaction-based data capture

Data derivation & enhancement

Executive summary reports

Business application

Association with own history

Integration with other lateral data

Computing derived data (ratios, aggregates, etc.)

other in-house applications

Business database

other in-house applications

export

DW

Archive


Reasons to import data

  • Enhance an internal DW for support of improved executive decision-making.

  • Bolster operational data resources independent of the data exchange relationship.

  • Engage in new business processes involving a B2B partnership formed through data exchange.

  • E-discovery: litigation

DW

B.I.


Reasons to import data

Timing:

DW

B.I.

Periodic big batch files:

daily, weekly, monthly, etc.

Transaction-driven:

“micro” data flows (SOA)

One-time


Introduction

Spelling out the Relationship

Data & information

Universe of knowledge

Asking for the right data

Potential data providers

Physical forms and media

Logical data architecture

Semantics & meaning

Documentation & metadata

Scope & completeness

Fund. of data quality

Update & refresh issues

Data collection bias

Ownership & legal

Confidentiality

Data flow surveillance

Conclusion

Spelling out the relationship


Key questions:

  • What are your expectations?

  • What are your uses of the data?

  • What motivates the source to give it to you?

  • What are the political-cultural barriers between you and the source?

  • What are your expectations of…

  • quality, completeness, currency, media, updates and refresh

  • How can you strengthen the relationship?


Political & cultural barriers

Separate system

them

you

Peer division or department

them

you

Totally unrelated legal entity

you

them

“Information is power!”People don’t want to give up power.


Typical risks and surprises

  • To save money, the source does not maintain previous quality in data capture and processing. Updates show lower quality.

  • To expand its market, the source alters the logical and physical data architecture without telling you.

  • In response to business morphing pressures, the source alters the coding scheme for one or more fields.

  • The source discovers some errors, but does not inform you or supply corrections or corrected records.


Mitigating strategies

  • Spell out all expectations about the data.

  • Develop language, words, & models to enhance precision of communication about data expectations.

  • Rigorous testing of data prior to purchase

  • Strengthen the relationship through cooperative data testing strategies: offer to test their updates; provide non-threatening feedback on DQ; get the source to seek you out as a consultant on DQ (this will allow you to monitor their morphing pressures)



Data & information

structured data and unstructured data

What makes data (information) useful?


Data vs. information

data

information

simple (single) observation, fact, or declaration

data (facts) with context to be more meaningful and useful

“Knowledge: valuable information from the human mind”

For many thinkers, there is a subtle, almost philosophical difference between data and information.


Initial definitions

  • Reality: things and events.

  • Data: a single observation about reality, clearly defined.

  • Information: one or more items of data, with definition and context to make it meaningful.

  • Knowledge: simultaneous awareness of much information, and the ability to cognitively integrate it.

  • Wisdom: knowing not to sleep through this lecture.


Structural elements of tabular “data”

Database

Piece of data; “a fact,” a.k.a. “cell”

2

Record

Table

What are you seeking? A fact, a record, a table, or a database?


Acquiring data or information?

Tabular

Semi-structured

Unstructured

Text document

Cartesian dataset

Web page

diary, memoirs

The web is not a source!

It is a medium!

multi-table database

Raster


Data vs. meaning

Are these the same data?

Source A

Source B

Same meaning? Yes. But not the same data.

Mixed case is difficult to derive correctly from ALL CAPS.
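A quick sketch of that problem (illustrative only): Python's standard `str.title()` guesses casing word-by-word, so names with irregular capitalization come out wrong once the source has flattened them to ALL CAPS.

```python
# Naive re-casing of ALL CAPS source data cannot recover information
# that the original mixed-case text carried.
raw = ["MAIN STREET", "MCDONALD", "VAN DER BERG"]
naive = [s.title() for s in raw]
# "Main Street" is fine, but "Mcdonald" (should be "McDonald") shows why
# correct mixed case cannot be derived mechanically from ALL CAPS.
```

So two sources can hold the same *meaning* while holding different (and differently recoverable) *data*.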



Universe of knowledge, information, & data


Structured vs. unstructured data

Structured data

Unstructured data

Personal letters

Memoirs, diaries

Literature (history, poetry, fiction)

Most books

Still images (paintings, photos, x-ray, ultrasound)

Sounds (sound recordings, EKG, SOSUS)

Moving images(cinema, TV, etc.)

Most tabular databases:

business; government; science & research

Can fit into RDBMS


Structured vs. unstructured data

Structured data

Unstructured data

Geospatial data

Vector data: streets, areas

“points, lines, polygons”

Raster imagery: topos

GIS data


Parsing and processing data

Tabular data

Unstructured data

tabular

unstructured

Computers are good at processing.

SQL, relational model, etc.

Humans are good at processing.

memory, free association.


Processing unstructured data

Examples:

Hearing classical music, and correctly guessing the composer.

Recognizing the signature style of an oil painting.

Recognizing voices

Reading emotions on faces

Understanding incomplete sentences.

Seeing humor (intended and not).

Unstructured data

Humans are good at processing.

memory, free association.


Asking for the right data


…or…

Asking for the right information


Who is the first user? …the final user?

  • Analytical support of macro-decisions

  • Data warehouse and business intelligence: probably to be manipulated by analysts; a high-level decision-maker will use the final output

  • Operational business system(micro-decisions)

  • Geocoding customers; CRM; oil exploration; agricultural field characteristics

  • Pure, undirected research

  • Discovery for litigation


What do decision-makers want? Data or information?


“Yeah, we got data. Lots of data!”

  • 010011010111001001111011101100100010110111000101101100011001000000000001111000000111001110000000011101101110110001000010000010111001001111011101100100010110111000101100101100101110010011110111011001000110010000000000011110000001110011100000000111011011101100101101110011001011011000110010011100111000000001110000000001101110110001000010100100100110011000000000000000110001101011001100100111001110000000011101101110110001000010000010011001111010011010101011100100111101110110010001011011100010110110001100100000000000111100000011100111000000001110110111011000100001000001011100100111101110110010001011011100010110010110010111001001111011101100100011001000000000001111000000111001110000000011101101110110010110111001100101101100011001001110011100000000111000000000110111011000100001010010010011001100000000000011001001110011100000000100110011110100100110010011100111000000001110110111011000100001000001001100111101001000001100011010110011001001110011010011001111000110101101011100100111101110000010010001001110010101010001000010010001001001001000100000110010001011011100010110110001111100110011100111000000001110110111011000100001000001001100111100011010100000010111000011101101110110001000010000010011001111010010011001001110011100000000111011011101100010000100


Always strive to make information more useful to the recipient
Always strive to make information more useful to the recipient!

Basketball scores

Los Angeles LXXIV

San Antonio LXVII

Detroit LXXXV

Boston LXXIII

Seattle LXXV

Phoenix LXXIX


Data vs. expression

Sales division   % sales to minorities
--------------------------------------
NORTHEAST        12.3
SOUTHEAST        39.1
MIDWEST          21.3
SOUTHWEST        17.6
PACIFIC          14.9
--------------------------------------
TOTAL U.S.       20.8

Executive may ask for this:

Are you going to ask for just six records from your source?

This information (report) has a high probability of being inadequate. The executive will inevitably ask for more.

No! Why?


Supporting macro-decisions is iterative.

Sales division   % sales to minorities
--------------------------------------
NORTHEAST        12.3
SOUTHEAST        39.1
MIDWEST          21.3
SOUTHWEST        17.6
PACIFIC          14.9
--------------------------------------
TOTAL U.S.       20.8

Knowledge worker(s)

External sources

Process: filtering, aggregation, exploration, correlation, analysis

B.I.

Internal sources

Data whse

Data mart(s)

ETL


Raw data vs. derived data

  • You always want raw data, at the most granular level possible!

  • No ratios or averages: they cannot be aggregated.

Population Density

=

Avg = 219.2

derived data

raw data
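The "ratios cannot be aggregated" point is easy to demonstrate with arithmetic. The county figures below are hypothetical, chosen to make the distortion obvious:

```python
# Averaging per-county population densities is NOT the same as
# recomputing density from the raw totals.
counties = [
    {"pop": 1_000_000, "area": 2_000},   # dense county: 500 people/sq unit
    {"pop": 10_000,    "area": 5_000},   # sparse county: 2 people/sq unit
]
densities = [c["pop"] / c["area"] for c in counties]   # [500.0, 2.0]
avg_of_ratios = sum(densities) / len(densities)        # 251.0 -- misleading
true_density = (sum(c["pop"] for c in counties)
                / sum(c["area"] for c in counties))    # 1,010,000 / 7,000 ≈ 144.3
```

With raw population and area you can compute any aggregate correctly; with only the pre-derived ratios, the information needed to do so is already gone.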


Anticipate the analysis and information delivery.

  • Have data analysis tools ready.

  • Output will be iterative.

  • Best output allows for graphic analysis

  • Time series are valuable…

  • …but require history.

  • Don’t neglect history when asking for data.


Why trend graphs (a.k.a. “time series”)?

Deaths per 100,000 persons due to heart disease in CY-2000

257

This statistic alone lacks meaning!

We must give it context!


How do executives make decisions?

  • Cognitive vs. feelings

When executives ask for data or information, be sure they understand the total costs.


Tables (raw data) are hard to understand

U.S. Monthly unemployment statistics


Unemployment

(Chart: U.S. monthly unemployment rate across the Clinton and Bush II administrations. Source: Bureau of Labor Statistics web site.)


Placing data points into context yields information!

Surround your requested data points with context!

Time series

Peer data

Causal factors

Breakdown / drilldown

Graphical expression

All these require many more data points than the executive originally requested!

On nearly every dimension.


Choices in detail of data

  • Original or derivative

  • Granular or summary

  • Filtered or not

  • Translated or not

Data is always easier to aggregate than to disaggregate!

It is always easier to filter out unneeded data than to request more data later.


Converting data to information

  • A query and reporting tool is required

  • Needed functions:

  • Aggregation

  • Sorting and filtering

  • Association and joining

  • Clustering and stratification

  • Graphics


Converting data to information

Add context

Raw data

50,000

200

deaths

from TB in Baker County, CY-2004

Avg. population, Baker County, CY-2004

Add context

4 deaths per 1,000 pop., Baker Co., CY-2004

Compute ratio

Conclusion: Baker County has the lowest TB rate of 5 peer counties.
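The ratio step on this slide is plain arithmetic, and it is worth seeing that the numbers actually work out:

```python
# 200 TB deaths against an average population of 50,000
# yields 4 deaths per 1,000 population.
deaths = 200
avg_population = 50_000
rate_per_1000 = deaths / avg_population * 1_000   # 4.0
```

The raw counts alone mean little; the ratio, plus peer-county context, is what becomes useful information.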


Converting data to information

Raw data

Add context

Compute ratio

Add context

Useful information


But time series is even better.

Baker County


Can precision be distracting?

Rounded to $ mil.

Source: IBM Annual Report, 2005. Pennies contrived.


How much scope?

  • My fundamental bias:

  • Get as much as you can get for the same price.

  • Time

  • Organizational

  • Cost is mainly labor: creating the extract file.

  • It takes the same labor to get four years of history as two.

  • Media and storage costs are trivial.


Why more data?

  • Testing

  • Continuity of definitions over time.

  • Reasonableness of row counts, etc.

  • Test predictive models on historical data.

  • Decision-makers will expand scope of query later.

Context! You can never have too much context!


Options in granularity

Product summary

Line item detail

Prod./cust summary

Customer value

raw data file

derivative data files



Potential data providers:

your impact upon them


Key questions:

  • Does the data come from their operations?

  • Do they log business transactions adequately?

  • Do they log changes to kernel-stable entities adequately?

  • What “enhancements” must they make to their application to extract the data you desire?

  • What cutoff policies do they have on transactions?


Kinds of data source organizations

Selling (providing) data is a sideline to their primary business

Selling data is a major source of revenue

Sharingdata is a cultural value, not for revenue

Credit bureaus (Equifax, Experian, TransUnion); marketing companies (D&B, DMA); suppliers of maps and imagery; news organizations (UPI); knowledge sellers (LexisNexis)

Banks; credit card issuers; healthcare organizations; insurance; retailers; airlines; telephone companies

Government agencies

Academic research

NGO’s


Kinds of data source organizations

Selling (providing) data is a sideline to their primary business

Selling data is a major source of revenue

Sharingdata is a cultural value, not for revenue

May sell you the data, but more guarded about the documentation.

Writing external data documentation is an annoyance!

These people ought to get the documentation right and rich!

Ask to see it first.

Simpler datasets require less semantic documentation.


What burden will you place on the data provider?

  • Depends upon…

  • how they store and manage their data

  • …and…

  • your needs of scope, architecture, timing, quality.


Two kinds of data generation

Data as byproduct of business processes

Data gathered as non-business research

commercial sector

banking; manufacturing; retail sales; customer service activities (utilities, communications, etc.); hospital patient records & billing; insurance policy setup and claims; education: student enrollment, grades, etc.

governments

social welfare and public assistance; tax collection; city services (trash, utilities); voting; public libraries (patron activity)

field surveys of land, topo, etc.

observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies

satellite & aerial imagery

Hybrid: strategic intelligence, police surveillance, mineral exploration, etc.


Two kinds of data generation

Data as byproduct of business processes

Data gathered as non-business research

Captured through business applications

Complex logical data architectures

May not have complete logging

Data extract may be a logistical and programming burden

Generally must be done for DW.

Captured through special studies

Generally simple logical data architectures

Not stored in application databases

Easier to extract and make copies


Research data

Data gathered as non-business research

  • field surveys of land, topo, etc.

  • observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies

  • satellite & aerial imagery

Simple logical data architecture


Research data

Data gathered as non-business research

Example: interview-based research -- census

Interviewee

Simple logical data architecture

Family

Employment

Residence


Research data

Data gathered as non-business research

Example: cancer research study

Patient

kernel-stable

Simple logical data architecture

Family

Hospital stay

episode

kernel-stable

Examination& diagnosis

Treatment

event

event


Data created in business processes

users

business transactions

Business application software

database reads and writes

Application database


Application characteristics

Business application software

Built to facilitate business operations.

Data captured to support ops.

Has a logical data architecture: you need to understand it.

Generally designed to meet on-line performance expectations.

Memory (versioning) often not important on many entities (particularly customers).

Business will not stop (“freeze”) for you to extract data.


Application database characteristics

Architecture supports application.

Hopefully well-normalized.

May or may not include business event logging.

DBMS: IMS, relational, network, flat

Application database


Application database characteristics

Application database

transactions

change logs

master files

business logs or DBMS logs

kernel-stable

events


Over-time life cycle of subject entities

long lives

“kernel-stable”

limited life

“episode”

point-in-time transactions

“events”


Kernel-stable entities

  • ships and aircraft

  • library holding

  • properties

  • cost centers

  • accounts (bank, credit card, G/L)

  • corporation

  • institution

  • groupings of above ***

  • customers

  • parties

  • people

  • departments

  • products

  • services

  • facilities

  • vehicles


Episode-like entities

  • hotel room stay

  • prison sentence

  • unemployment benefit

  • conference registration

  • college enrollment

  • phone call (successful)

  • accounting period

  • hospital stay

  • subscription

  • illness

  • maintenance & support contract

  • employment period

  • project

  • library check-out


Event or transaction entities

  • customer's order

  • shipment

  • invoice

  • G/L posting

  • phone call (failed)

  • sale of asset

  • treatment

  • test or observation

  • airline flight

  • inquiry

  • turnstile passage

  • application for college

  • graduation

  • credit card charge

  • paycheck


Kernel-stable entities

Stable identity and existence.

Two kinds of changes: change to existence or ID; change to a non-key attribute

Changes to attributes occur rarely.

Are such changes logged in the application?

Are all changes logged?

Is versioning valued?

  • customers; parties; people; departments; products; services; facilities; vehicles; library holdings; properties; cost centers; G/L accounts; institutions; groupings


Episode-like entities

Always exist over a finite period of time.

End-point not always known.

Often confused with the starting event. (may have same key)

May have many kinds of subordinate events.

May have subordinate episodes.

May or may not be mutually exclusive with peers.

  • hospital stay; subscription; illness; contract for service; employment period; project; library check-out; hotel room stay; prison sentence; unemployment benefit; conference registration; college enrollment; phone call


Event or transaction entities

Not designed to last a long time.

Generally only one key date/time.

Revisions may occur, but rare.

May be negated or reversed by subsequent transaction.

Mutually exclusive with peers.

May be subordinate to one or more episodes.

Almost always subordinate to other kernel entity(s).

  • customer's order; shipment; invoice; G/L posting; phone call (failed); sale of asset; treatment; test or observation; airline flight; turnstile passage; application; graduation; credit card charge; paycheck


Event or transaction entities

BIG QUESTION!

Can records be changed (updated, corrected) after creation?

This has profound consequence upon your over-time updating of your copy of the data.

  • customer's order; shipment; invoice; G/L posting; phone call (failed); sale of asset; treatment; test or observation; airline flight; turnstile passage; application; graduation; credit card charge; paycheck


Basic accounting entities

accounting period

All accounting data are either…

1. events (postings)

2. aggregates of events over a time period (episode), …or…

3. statement of condition at a point in time (balance sheet).

episode

balance sheetJan. 1, 2005

balance sheetDec. 31, 2005

event

event


Confusing episodes and events

book checked out

Other ambiguities:

incarceration

phone call

hospital admission

airline flight

episode

check-out event

expected return date

event

event


Episodes can contain events

Hotel roomstay

Hospitalstay

Project

War

episode

episode

episode

episode

Campaign

Charge

Test

Tasks

episode

event

Medication

episode

Battle

Laborcharges

episode

event

Episodes may contain “sub-episodes”

Casualty

event

event


Data created in business processes

users

business transactions

What data do you need from this environment?

What timing? once, continuous

How do you expect it to be extracted?

Who is going to make that happen?

Business application software

database reads and writes

Application database

Architecture and application design are often barriers to sharing data in an organization.


Are your data needs one-time, or continuous?

Jan.

Feb.

Mar.

Apr.

May

over-time “refresh” or update

big first extract and load

What data do you want in “updates”?

Incremental or complete refresh?

How about corrections?

time


Full refresh vs. incremental updates

month-end copies

Simple extract for source: delete and reload target.

Complex processing both at source and at target

Consumes more resources

Resource (bandwidth) efficient.

You don’t know what changed, and what was deleted.

You can measure rate of change easily.

Both may require coding and paradigm translation


If logging takes place… where?

users

business transactions

Business event logging

Business application software

database reads and writes

DBMS technical logging

Application database


How are backups taken?

users

business transactions

Business application software

database reads and writes

DBMS backup for archive, recovery

Application database


How are backups taken?

Business-readable full refresh

users

business transactions

Business application software

Extract business data from technical backup

database reads and writes

Application database

DBMS backup for archive, recovery


Is there a data warehouse?

users

Does this have the data you need?

business transactions

Business application software

DW

database reads and writes

Application database

Periodic extracts


Dangers of DW extract data:

  • May not have the granularity you need; may already have been aggregated

  • May not have desired fields

  • May not have required scope (org, geo, etc.)

  • May not include corrections

  • May not match your needs of time covered

  • May have been transformed, cleansed, or filtered in some way.

DW

ETLfile


Distinguish between update & correction

Jan.

Feb.

Mar.

Apr.

May

incremental updates

big first extract and load

corrections

time


Update vs. correction

Address was 123 Main St.

Is now 548 Elm St.

Record showed 519 Fern St.

Should have been 984 Mills.

He moved on April 4 (effective date)

We learned about it May 25.

We posted it on June 3 (record change date)

He never lived at 519 Fern Street.

It was an error.

It was never true.
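The update/correction distinction above can be captured by carrying both an effective date and a posted date on every change record. A minimal sketch in Python; the record shape and field names are illustrative, not from the original system:

```python
from dataclasses import dataclass

@dataclass
class AddressChange:
    # Hypothetical change-log record; field names are illustrative.
    new_address: str
    effective_date: str   # when the fact became true in the real world
    posted_date: str      # when we recorded it
    is_correction: bool   # True means the old value was never true

# An update: he really lived at 123 Main St., then moved on April 4.
update = AddressChange("548 Elm St.", "2008-04-04", "2008-06-03", False)

# A correction: 519 Fern St. was an error; it was never true.
correction = AddressChange("984 Mills", "2008-01-01", "2008-06-03", True)

def true_address_history(changes):
    """Only updates contribute to real-world history; a correction
    retroactively replaces a value that was never true."""
    return [c for c in changes if not c.is_correction]
```

An epidemiology or credit study that treats corrections as moves would fabricate residence history, which is why the flag matters downstream.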


Logical and physical structure of…

extract(big bulk snapshot)

update(new, change, delete)

correction

Logical data architecture

Logical data architecture

Logical data architecture

somewhat similar, but not the same

describes

data model

data model

data model

Will the physical file transfer format recognize nulls?


Source burden for incremental update

  • Create a record when any major table experiences a…

  • new record

  • change in an existing record

  • delete (or tag “delete”) of existing record

A change to a kernel-stable record generally reflects a business event, and thus should be logged by the application.

But is it? Or are all kernel entities so logged?
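One minimal way a source can carry this burden is an application-level change log written alongside every insert, update, and delete. A sketch, assuming an in-memory table and illustrative record shapes:

```python
change_log = []

def log_change(table, action, key, payload=None):
    # action: "N" = new record, "U" = change, "D" = delete (or tag "delete")
    change_log.append({"table": table, "action": action,
                       "key": key, "payload": payload})

def insert_customer(db, key, record):
    db[key] = record
    log_change("CUSTOMER", "N", key, record)

def update_customer(db, key, record):
    db[key] = record
    log_change("CUSTOMER", "U", key, record)

def delete_customer(db, key):
    del db[key]
    log_change("CUSTOMER", "D", key)  # no payload needed for a delete

db = {}
insert_customer(db, 101, {"name": "Acme"})
update_customer(db, 101, {"name": "Acme Corp"})
delete_customer(db, 101)
```

The recipient then receives the log rather than re-diffing full snapshots; the open question in the slide remains whether every kernel entity is actually logged this way.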


Is it important for you to know what changed? Why?

  • Are the major changes to kernel-stable entities important to know?

  • Yes, they are, if they serve as dimensions.

  • Discontinuities of dimensions are problematic (an understatement)!


Example of kernel-stable entity changes

Who wants to know about street address history?

marketing analysis, epidemiology studies, credit rating analysis, security clearance research

Customer

Address in time

Change log

Is it important to know address of customer for past history?

Does application software maintain address history?

If not, do you need to track such changes (going forward)?

Are such changes being logged by application?


What are volatile fields (attributes)?

  • Volatile attributes:

  • Street address

  • Cell phone number

  • e-mail address

Stable attributes:

Spouse, children


Documentation burden upon source

Nobody likes writing data documentation

(except, perhaps, some data bigots).

Especially so…

…when incidental to their primary duties.

Especially so…

…long after the system change was made.

Possible solution:

In exchange for a discount, offer to send data-behavior documentation back to them.

Requires reverse data engineering


Other issues

  • Be suspicious of tabularizing unstructured data

  • Often requires coding taxonomies…

  • … are they sufficiently granular?


Example: coding traffic fatalities

1 Auto-pedestrian

2 Auto-auto

3 Auto-fixed object

4 Auto-railroad

  • Roll over after skid

  • Hit center divider

  • Hit bridge abutment

  • Drove off a cliff

  • Drove into drainage ditch

  • Hit a deer

  • Tree fell on vehicle

  • Collision with parked trailer

  • Bicyclist hit tree


Kinds of data source organizations

Selling (providing) data is a sideline to their primary business

Selling data is a major source of revenue

Sharing data is a cultural value, not for revenue

Credit bureaus (Equifax, Experian, TransUnion); marketing companies (D&B, DMA); suppliers of maps and imagery; news organizations (UPI); knowledge sellers (LexisNexis)

Banks, credit card issuers, healthcare organizations, insurance, retailers, airlines, telephone companies

Government agencies

Academic research

NGO’s


Introduction

Spelling out the Relationship

Data & information

Universe of knowledge

Data coming from bureaucracies

Asking for the right data

Potential data providers

Physical forms and media

Logical data architecture

Semantics & meaning

Documentation & metadata

Scope & completeness

Fund. of data quality

Update & refresh issues

Data collection bias

Ownership & legal

Confidentiality

Data flow surveillance

Conclusion

Physical form and media


Key questions:

  • Is the data being supplied on media which you can read with your technology?

  • Is a special program or database management system required to read it?

  • Is the documentation supplied in a manner which you can read and copy?

  • Is the data supplied in bulk, or incrementally, or even one transaction at a time?

  • Are there any compression techniques used on all or certain types of data in the file?


Structured vs. unstructured

Structured

Unstructured but indexed.

Unstructured NOT indexed.

Anything in a… spreadsheet, DBMS, or file with defined fields

Automatically managed… documents (document mgmt systems), medical records, medical imaging, satellite imagery, sound and video

Encyclopedia

Memoirs, personal letters, literature, meeting minutes, blogs, pictures of my vacation

Library books are catalogued as a whole, but not in part.


Search engines and indexing

  • The internet is a medium, not a source and certainly not an “authoritative source”.

  • Each web site probably has an agenda and bias.

  • Search engines find text—not meaning.

  • Web sites can mask tabular data from search engines.

  • Search engines may not see some academic sources (peer-reviewed journals, etc.) because of cost of access.


Physical media for structured data

Physically moved media:

Punched cards, half-inch mag tape, IBM tape cartridges, 8-inch floppy disk, 5-1/4 inch floppy disk, 3-1/2 inch floppy disk, other cassettes or cartridges, CD-ROM, DVD, … paper (yikes!)

Data moved virtually:

Electronic files, messages (transactions)

Physical formats:

Full database (req. DBMS); flat file (positional): single-format, or multiple-format w/ record type; character-delimited file; MS/Excel (or MS/Word); XML; zip file; other



Details in flat files

Two record types!

business key

Rec type A

Rec type B

record type code


Details in flat files (cont.)

Two record types!

business key

Note:

variable-length records

Note:

Children not grouped with parents.

record type code


Worse scenario of a flat file

business key, but found only in rec.type A

Two record types!

Important!

Record sequence has vital significance!

Bad technique!

record type code


Mixed media

Whole package you are provided

Data in RDBMS

Metadata in XML


Typical elements of a Geodatabase

Feature class

Topology(rules)

Table 1

Table 2

Metadata in XML

Raster dataset(s)

Geometric network

domain

relationship class


XML?

  • XML labels data items.

  • “self-documenting”

  • Means labeling, but not full, rich documentation of business meaning.

  • It does not describe attributes or entities(fields, or tables) from a business perspective.

  • XML takes more space – often much more.(the opposite of data compression?)


The opposite of XML: data compression!

What compression techniques, if any, might the source use when sending you the data?

Can you read it or unpack it?
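The space penalty of XML labeling, and how thoroughly generic compression claws it back, can be demonstrated in a few lines of Python (the record layout here is invented for illustration):

```python
import gzip

# One customer record, repeated to simulate a file of 1,000 rows.
csv_row = b"004127,Brown,123 Main St.\n" * 1000
xml_row = (b"<customer><key>004127</key><name>Brown</name>"
           b"<address>123 Main St.</address></customer>\n") * 1000

# XML labeling more than triples the raw size of this file...
xml_overhead = len(xml_row) / len(csv_row)

# ...but the repetitive tags are exactly what gzip removes best.
compressed_xml = gzip.compress(xml_row)
```

So "XML takes more space" and "can you unpack it?" are two sides of the same exchange question: verbose self-labeled files usually arrive compressed.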


Logical data architecture


Key questions:

  • What kind of things in the real world are described by the dataset?

  • How many kinds of tables or records are contained?

  • What are the cardinality rules between them?

  • Are the described instances in the real world mutually exclusive?

  • Are there format standards (industry or discipline) for this kind of data?

  • Does this data conform to those format standards?


Key questions: (cont.)

  • What is the meaning of each record? What “thing” in reality does it represent?

  • What is the business meaning of each field?

  • Are any fields employed for more than one purpose?

  • Is the value or meaning of any field contingent upon the value in another?

  • What coding conventions are employed?

  • How are names and addresses structured?

  • Are you going to be integrating this source with other data?


Ambiguity of terms!

“interface”

“System” A

“System” B

“bridge”

“connect”

“data access”

“migrate data”

“flow”

“link”


Inherently ambiguous terms about “link”

interface

bridge

“integrate with”

support

connect, connector, interconnect

exchange data

migrate data

publish

provide access

exchange data

All have in common:

data movement

What kind of data? What fields? What architecture? What causes data to move?


Linking organizations together: “Ha!”

Agency “A”

Agency “B”

Architecturally different

Business

Business

DataApplication Database

Semantic compatibility.

DataApplication Database

Logical data arch.

Application

Software

Protocol compatibility.

Application

Software

Op Sys

Op Sys

XML, etc.

infrastructure

infrastructure

Physical communication.

Landline, WiFi, mobile, etc.

A business has an architecture!


Linking organizations together: “Ha!” (cont.)

Agency “A”

Agency “B”

Architecturally different

Business

Business

DataApplication Database

Semantic compatibility.

DataApplication Database

Logical data arch.

Semantic compatibility:

Presence of data elements; field format compatibility; definitional consistency; keys don't clash (homonyms, non-reuse, etc.); subject entities have similar life cycles

These are subtle, abstract concepts. Not understood by executives or hardware people.


Two levels of architecture matching

Agency “A”

Agency “B”

Architecturally different

Business

Business

DataApplication Database

Semantics & meaning

Structural architecture

DataApplication Database

Logical data arch.


Semantics and meaning (field level)

  • Two fields (in two environments) can have… same name, same format, but different domain.

Source-A

Source-B


Semantics and meaning (table level)

  • Two tables (in two environments) can have… same name, same format (and column list), but different scope or entity meaning.

Source-A

Source-B

same format

unfilled

month total

Customer orders

Customer orders

different scope


What do we mean by “link”?

Reposit data into a data warehouse (periodic, in batch)

Application databases

Datawhse

1

Appl. 1

2

Appl. 2

ODS

Appl. 1

Reposit data into an ODS (at time of transaction)

Appl. 2

Application databases

1

Appl. 1

Replicate data instantly (at time of transaction)

2

Appl. 2


Instant, transactional replication

Exchange services

Appl. 1

Appl. 2

API

API

1

2

Are the architectures compatible?

Probably not!


Semantic integration means bringing the data together so it makes sense.

Total logical data architecture level

Presence or absence of entities/tables; cardinalities

Table (subject entity) level

Definitions are the same; field lists are the same

Column (field) level

Formats are the same; business definitions are the same; domains & meanings are the same

A

B

=

Cust-A

Cust-B

=

B

A

=


Data integration involves matching “things” from multiple sources

Instance level:

Person

Store

Address

Vehicle

Neighborhood

Event (or episode)

Dimension level:

Time period

Brand or product

Market

Category (“type”, “class”)

Geography

Other grouping

Benefits from “singular” characteristic of entity

Problematic matching between sources


Northern New Jersey sales zone

Long Island Sales Zone

Metro NY sales zone

Central NJ sales zone

New York media market


Name & address formats

  • Are you going to do name and/or address matching?

  • Many causes of non-matches.

“They will have the name and addresses in the record.”

“Oh, that’s fine.”


Address formats – parsing

Source format:

05 CUST_NAME PIC X(30).
05 ADDRESS_1 PIC X(40).

…to be matched to…

05 FIRST_NAME PIC X(20).
05 LAST_NAME PIC X(25).
05 MIDDLE_INIT PIC X(01).
05 STR_NUMBER PIC X(10).
05 STREET_NAME PIC X(30).
05 APT_NO PIC X(10).

Target format
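Matching the single-field source layout to the parsed target layout puts parsing logic on the recipient's side. A deliberately naive sketch; real address parsing (apartment numbers, PO boxes, directionals) needs a far richer grammar or a dedicated library:

```python
def parse_address_line(address_1):
    """Naively split ADDRESS_1 into STR_NUMBER and STREET_NAME:
    a leading all-digit token is treated as the street number."""
    parts = address_1.strip().split(None, 1)
    if parts and parts[0].isdigit():
        return {"STR_NUMBER": parts[0],
                "STREET_NAME": parts[1] if len(parts) > 1 else ""}
    # No leading number (e.g. "One Corporate Plaza"): leave number empty.
    return {"STR_NUMBER": "", "STREET_NAME": address_1.strip()}

parsed = parse_address_line("123 Main St.")
```

Even this toy shows why "they will have the name and addresses in the record" is not reassuring: the parse rules, not the field widths, decide whether records will match.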


Address formats – parsing (2)

Source format:

05 CUST_NAME PIC X(30).
05 ADDRESS_1 PIC X(40).

…to be matched to…

Target format

05 CUST_NAME PIC X(30).
05 ADDRESS_1 PIC X(40).

Are these going to match?


Address formats – parsing (3)

Source format:

Ironically…

The meaning is the same, but the data is different!

05 CUST_NAME PIC X(30).
05 ADDRESS_1 PIC X(40).

…to be matched to…

Target format

05 CUST_NAME PIC X(30).
05 ADDRESS_1 PIC X(40).

Are these going to match?


Address formats – parsing (4)

Which of these will match in native SQL?

None.


Conclusion on data architecture:

  • Even if you have an exact physical format match…

  • …source to target…

  • Field names Field format

  • The contents may not match.

  • And the meaning may not match.


Semantics & meaning


Key questions:

  • Are the languages of text fields and the character set appropriate to your needs?

  • Are numeric fields in units-of-measure which you expect?

  • How is the “null” condition symbolized in each field?

  • Is it clear what the business meaning of the null condition is?

  • What fields need to be translated into your desired coding domain?

  • Does the meaning of any field (or elements of its domain) change over time or over any other scope dimension?


Potential coding variations


Thin documentation can be misleading

  • “Address”

  • Current address?

  • Current address for mailing purposes

  • but not for billing purposes.

  • Current address for delivery purposes

  • but not for mailing, or billing.


“Null”

  • Though the “null” value may be stored in the original database, …

  • …will it be transferred effectively through the ETL process?

  • There is also the question: “Why is it null?”

  • That answer can be another kind of metadata.

  • Not applicable

  • Declined to state

  • Will be supplied later
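The "why is it null?" answers can be made explicit by mapping each sentinel the source uses to a reason code during ETL. A sketch with hypothetical sentinel conventions; a real source's conventions must come from its documentation:

```python
# Hypothetical sentinels a source might use for the null condition.
NULL_REASONS = {
    "":     "not supplied",
    "N/A":  "not applicable",
    "DECL": "declined to state",
    "TBD":  "will be supplied later",
}

def decode_nullable(raw):
    """Return (value, null_reason); null_reason is None for real values,
    so the reason survives the transfer as cell-level metadata."""
    token = raw.strip().upper()
    if token in NULL_REASONS:
        return None, NULL_REASONS[token]
    return raw.strip(), None
```

This keeps "not applicable" distinct from "declined to state" after the transfer, instead of collapsing both to an empty cell.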


Testing for semantic discontinuities

Fields may change meaning over time (or other dimensions)

Codes may change meaning over time

Every code is potentially volatile over time.

Invoice type, account type, customer number, sales division

Stable codes tend to be OUTSIDE the organization…

…e.g. standard govt codes.


Are Domains Stable Over Time? Customer File: Invoice Type Code

  • INVOICE_TYPE_CODE XTAB AGAINST MONTH

    MONTH  01   02   03   04   05   06   07
    ----------------------------------------
    AA     87   91   96   78   88   92   97
    BB    142  148  153  162  149  167  173
    CC    197  204  211  225    0    0    0
    DD     45   48   51   47   46   48   49
    EE     77   76   81   79   84   82   79
    F1      4    3    8    5    9    7   11
    F2      9    8    4    7   12    9    8

Type “CC” not consistently used over time.
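A code like "CC" that goes dark mid-year can be found mechanically by comparing each code's set of active months against the full set. A sketch over illustrative (code, month) records:

```python
from collections import defaultdict

def code_months(rows):
    """Map each code to the set of months in which it appears;
    a gap reveals a code not used consistently over time."""
    seen = defaultdict(set)
    for code, month in rows:
        seen[code].add(month)
    return seen

# Illustrative data: AA appears in all seven months, CC stops after month 4.
rows = [("AA", m) for m in "1234567"] + [("CC", m) for m in "1234"]
all_months = set("1234567")
dropped = {c for c, s in code_months(rows).items() if s != all_months}
```

Running this kind of check on every code-valued field is a cheap way to surface semantic discontinuities the source never announced.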


Are the codes consistent over time?

Cust. 3

Cust. 21

Cust. 41

Cust. 11

Cust. 7

Cust. 8

Cust. 16

Cust. 28

Cust. 24

Cust. 5

Cust. 29

Cust. 19

Cust. 6


Customers are grouped into regions.

Cust. 3

Cust. 21

Region 1

Cust. 41

Cust. 11

Cust. 7

Cust. 8

Cust. 16

Region 3

Cust. 28

Cust. 24

Region 2

Cust. 5

Cust. 29

Cust. 19

Cust. 6


Regions get “redefined” – “realigned”.

Cust. 3

If this happens…

…will the source tell you?

Can you detect it on your own?

Cust. 21

Cust. 41

Region 1

Cust. 11

Cust. 7

Cust. 8

Cust. 16

Region 3

Cust. 28

Cust. 24

Region 2

Cust. 5

Cust. 29

Cust. 19

Cust. 6


Documentation & metadata


Key questions:

  • Has format and meaning documentation been provided prior to your decision to acquire the data?

  • Is the documentation current?

  • Can you get sample data to test against?

  • Is the documentation thorough and in sufficient detail?

  • Does the documentation include data quality standards?



Traditional “normative” data documentation covers only this.

Documentation topics

  • Format and structure

  • Meaning of fields and segments

  • Language & units of measure

  • Entity life cycle and extract filters

  • Scope

  • Vintage (date ranges)

  • Projections (GIS)

  • Reference (GIS)

Function of program code

Function of job parameters


Format alone does not describe data

Batch job step INV04G

Batch job step INV21K

Utah

Arizona

N.M.

California

Format-A

Format-B

Same format, but different scopes

Same scope and timing, but different format.


Data documentation must be more than format… much more.

Metadata

“data about data” / “information about data” / “information about information”

Many kinds of metadata!

Industry and cultural contexts.

The word “metadata” is inherently ambiguous.

Format(s)

Contentandmeaning

tangible data file


Technical vs. business metadata

01 CUSTOMER_MASTER.
   05 CUST_NUM PIC X(08).
   05 CUST_NAME PIC X(30).
   05 ADDRESS_1 PIC X(30).
   05 ADDRESS_2 PIC X(30).
   05 CITY PIC X(25).
   05 STATE PIC X(02).

“Customers in this file include…

current active customers, prospective customers, dormant customers, recipients of samples

Other subtypes include:

industrial vs. retail; domestic vs. international; broker vs. direct; platinum vs. regular”

Well-structured, Machine-readable

Unstructured, meaningful only to a human.


Normative vs. dynamic metadata

If the file is being updated, then source-ID and quality are NOT characteristics of the entire table.

Observed in 1985

Low quality

Source-A

This has nothing to do with structural metadata.

Observed in 2003

Source-B

High quality


Record-level metadata

Source and update date for the whole record (all fields)

source of all info in this record

when record created or updated

all non-key data acquired as a single unit


Embedded metadata – cell level

Some facts are acquired individually, unrelated to peer cells in a record.

Credit bureau record on person

These 3 data elements belong together.

fact

where we got the fact

when we got the fact


Where is metadata stored?

Complex dataset

scattered

Metadata in 3-ring binder

copy

Metadata in XML

copy

Central Metadata Repository

Classic data management problem:

Two copies of knowledge, no rigorous enforcement of refresh and update.


Documentation standards

  • Easy to establish; people are sometimes reluctant to fulfill them.

  • Letter but not the spirit of documentation.

  • Nobody wants to write documentation

INVOICE_AMT DECIMAL (11.2) Def. Total amount of the invoice.

INVOICE_AMT DECIMAL (11.2) Def. This data element contains the total invoice amount.

Tautological: The use of redundant language

"If you don't get any better, you'll never improve" --Yogi Berra


Documentation standards (cont.)

Good metadata discusses the anomalies!

  • Easy to establish; people are sometimes reluctant to fulfill them.

  • Letter but not the spirit of documentation.

  • Nobody wants to write documentation

INVOICE_AMT DECIMAL (11.2)

Def. The total amount to be paid on a regular invoice to the customer; equals the sum of all extended costs of line items, net of discounts. Also includes special charges unrelated to specific products. Always in U.S. dollars.

On invoice reversals, this field is normally negative. On credit memos, this field is normally negative.


Scope & completeness


Key questions: Scope

  • Are you getting all the attributes (fields, columns, data elements) which you expect?

  • Are you getting other attributes you didn’t ask for?

  • Are you getting all the records you expected?

  • Are you getting any records outside of your scope of request or interest?

  • For each field, is the column populated as completely as is appropriate?


Are you getting the data you expect to get?

  • Scope

  • Geography

  • Time

  • Range of customers by name, account, etc.

  • Is there any way your source might have truncated your input data?


Kinds of scope

  • Scope in time

  • Scope in geography

  • Organizational scope

  • Types or subtypes of major entities

  • Entity life cycle and duplication


Time scope

Tally data over time

MONTH  1998  1999  2000  2001
------------------------------
01      492   742   711   841
02      512   701   782   812
03      588   689   733   845
04      522     0   746   829
05      581   618   697   792
06      566   682   709   841
07      599   623   728   823
08      492   593   692   784
09      509   608   717   824
10      527   631   729   781
11      488   597   744   807
12      611   845   714   892

TABLE FILE GG1
SUM RECCNT
ACROSS MONTH
BY YEAR
END
-RUN
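The same record tally can be sketched in Python (or a SQL GROUP BY) when a FOCUS-style reporting tool is not available; the date format here is an assumption:

```python
from collections import Counter

def tally_by_year_month(dates):
    """Count records per (year, month) from 'YYYY-MM-DD' strings.
    A zero or an outlier in the resulting grid signals a scope
    discontinuity, like the empty 1999-04 cell above."""
    return Counter((d[:4], d[5:7]) for d in dates)

tally = tally_by_year_month(["1998-01-03", "1998-01-17", "1998-02-09"])
```

The point of the tally is not the counts themselves but their shape: a smooth grid with one hole is a question to put to the source.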


Time scope (cont.)

Probably a discontinuity in definition, inclusion criteria, or scope.


Cropping

Order History Table

  • Count records by month

You may get more records than you expect!

A purge process exists, but some records had to remain (still outstanding dispute).


Kinds of scope: Geography

DATASET COVERAGE


Kinds of scope: Organizational

  • All the divisions, or just some?

  • All the sales, or just sales by employee sales reps (thus excluding broker-negotiated sales)?

  • Domestic activity only or including international?

GargantuanIndustries, Inc.

Mining &minerals

Toys

Defense& weapons

Health & beauty aids


Kinds of scope: Types and subtypes

  • Vehicle file includes…

  • Owned vehicles but not leased vehicles

  • Cars but not utility trucks

  • Dataset of employees

  • Full-time but not part-time

  • Current but not former employees

  • Volunteers?


Subtypes of a hospital employee entity

  • Employee[Emp Num]

  • Candidates

  • Active empl.

  • Former empl.

  • Doctors

  • Contract empl.

Are subtypes mutually exclusive?

Are some data fields present for some, but not all subtypes?


Subtypes of a hospital employee entity (cont.)

  • Employee[Emp Num]

  • Candidates

  • Active empl.

  • Former empl.

  • Doctors

  • Contract empl.

  • Perm. Full-time

  • Temporary

Subtypes can have subtypes


Kinds of scope: Entity life cycle & duplication

Big issue: mutual exclusivity of records, vs. duplication

Can the same instance be represented by multiple records…

…possibly in multiple stages of its life cycle?

Are all records logical peers to each other?


Students at a university

  • Student[ID Num]

  • Freshman

  • Sophomore

  • Junior

  • Senior

  • Graduate

In reality (business policy), mutually exclusive?

In a file of students, are you getting only one record per student? ….

Or, one record per student-year?


Students at a university (cont.)

  • Student[ID Num]

  • Freshman

  • Sophomore

  • Junior

  • Senior

  • Graduate

This gets us back to architecture.

What distinct subject entity does a record represent?

Name change between academic years?


Other fragments of scope

File of prospective customers from outside source

  • MOST COMMON VALUES OF LAST_NAME
    -------------------------------
    BROWN     12,943
    DAVIS      9,542
    ANDERSON   7,227
    CLARK      5,344
    ALLEN      4,715
    CAMPBELL   4,014
    ADAMS      3,800
    BAKER      3,635
    EVANS      3,271
    COLLINS    3,180
    CARTER     3,143
    EDWARDS    3,129
    COOK       2,772
    COOPER     2,646

What’s wrong with this picture?


Look at distribution of first character of text fields!

  • Name

  • Address

  • City

  • Comments

  • Need query tool which can create new variables (fields) based on mask.
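Where the query tool cannot derive a first-character field, a few lines of Python do the same profiling; the sample values are invented for illustration:

```python
from collections import Counter

def first_char_distribution(values):
    """Distribution of the first character of a text field, folded to
    upper case; a surname file cut off partway through the alphabet
    shows up immediately in the result."""
    return Counter(v[0].upper() for v in values if v)

dist = first_char_distribution(["Brown", "Davis", "baker", "Evans"])
```

Comparing such a profile against an expected population distribution (like the surname table that follows) is what exposes accidental truncation.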


Distribution of first character, Last Name field, purchased input file.

  • A  8,240    N  86
    B 31,210    O  47
    C 17,221    P  24
    D 10,929    Q  13
    E  4,507    R  14
    F  8,081    S   4
    G     77    T  23
    H     71    U  21
    I      8    V  13
    J     63    W   2
    K     36    X   1
    L     94    Y   5
    M     82    Z   7


Reasonable surname (1st character) distribution in American society.

  • A  8,240    N  4,486
    B 31,210    O  3,347
    C 17,221    P 11,724
    D 10,929    Q    513
    E  4,507    R 12,864
    F  8,081    S 23,604
    G 11,977    T  8,623
    H 17,171    U    571
    I  1,008    V  3,453
    J  7,163    W 14,302
    K  8,636    X     36
    L 11,094    Y  1,435
    M 21,682    Z  1,187


Accidental truncation of data

  • What ways can your source truncate your data?

  • Name

  • Organizational (e.g. forgot broker sales)

  • Life cycle (e.g. forgot former employees)

  • Time (clipped in creation date range, but not ship date)


Are you getting MORE data than you wanted?

  • Test records

  • Beyond original scope


Detecting duplicate data

  • Domain each key (if it is a truly unique key)

SELECT CUST_KEY, REC_COUNT
FROM (SELECT CUST_KEY, COUNT(*) AS REC_COUNT
      FROM CUST_MAST
      GROUP BY CUST_KEY)
ORDER BY REC_COUNT DESC;

CUST_KEY REC_COUNT-------------------004001 1004002 1004003 1004004 1004005 1

CUST_KEY REC_COUNT-------------------004127 5004039 4004113 4004834 3004225 3

Desired results

Duplicate keys


Detecting duplicate data (cont.)

  • Test for duplicate data, with non-dup keys

Entire record duplicated

Different key

Same person



Customer Order Table

Incremental updates

January activity: 4 records

This is typical for extract to a data warehouse.

Source application

update file

translate & load

Datawarehouse

Customer Order

extract



Customer Order Table

Incremental updates

January activity: 4 records

February activity: 4 records

A change had been made to order 10004, posted Feb. 4, AFTER the incremental extract of January data.



Customer Order Table

Incremental updates

February change file:

How it should look.

“U” Update, “N” New

Dilemma: How far back into history must you look to be sure you have all the changes posted in February?

This places a burden on the source system!


Overlap of date ranges of update files

Jan.

Feb.

Mar.

Apr.

May

Jan.

February

March

April

May

14 days back into previous month
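The 14-day lookback can be expressed as a simple date-window filter on the extract. A sketch, assuming records carry a record-change date in a hypothetical `change_date` field:

```python
from datetime import date, timedelta

def extract_window_start(month_start, lookback_days=14):
    """Begin the monthly extract `lookback_days` before the month starts,
    so changes posted late for the previous month are still caught."""
    return month_start - timedelta(days=lookback_days)

def select_changes(records, month_start, month_end, lookback_days=14):
    lo = extract_window_start(month_start, lookback_days)
    return [r for r in records if lo <= r["change_date"] < month_end]

records = [
    {"order": 10004, "change_date": date(2008, 1, 20)},  # late January change
    {"order": 10007, "change_date": date(2008, 2, 10)},
    {"order": 10001, "change_date": date(2008, 1, 10)},  # already extracted
]
february = select_changes(records, date(2008, 2, 1), date(2008, 3, 1))
```

Because the windows overlap, some records are delivered twice, so the target must apply changes idempotently (upsert by business key) rather than blindly inserting.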


Scope: Are all records peers to each other?

  • Can you have detail and summary records co-existing?

detail records

aggregate record


Beware of masking data for political or confidentiality reasons.

Cells with data masked because total figure is too low.


World economic statistics

Source: CIA web site


Reference data & foreign keys

  • Codes need interpretations!

  • Two places to do it: documentation, or active reference tables.

master files(kernel-stable)

transactions(events)

reference tables(validation)


Reference data & foreign keys (cont.)

small domain

large domain

Volatile

customer cd

vendor cd

DRG

employee

ICD-9

countries

product cd

transaction type

facility

invoice type

low volatility

U.S. states

Gender


Inconsistent coverage

  • “We did door-to-door interviews in the towns, but we are only estimating the rural areas of the county.”

  • Sampling and projection.

  • Examples of sampling and projection:

  • Exit polls

  • Radio & TV audience

More reliable statistics:

Journal & newspaper subs

Web ad viewing


Detecting estimates

Look at frequently-occurring values

  • MOST COMMON VALUES OF PLACE POPULATION

    POPULATION   RECORDS
    --------------------
           25     9,542
          100     7,227
          200     5,344
           50     4,715
          150     4,014
          300     3,635
          250     3,180
          400     3,143
          125     3,129
          120     2,772
           40     2,573

Domain study, most frequent observed values.
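Spikes at suspiciously round values like those above can be scored directly: measure what fraction of the values are divisible by typical estimation bases. A sketch; the choice of bases and the threshold are judgment calls:

```python
def round_value_fraction(values, bases=(25, 50, 100)):
    """Fraction of values divisible by a 'suspiciously round' base.
    A high fraction suggests estimates rather than measurements."""
    round_count = sum(1 for v in values if any(v % b == 0 for b in bases))
    return round_count / len(values)

frac = round_value_fraction([25, 100, 200, 50, 150, 37, 41])
```

Genuinely measured populations produce a much lower fraction, so a score like this is a quick flag for "we are only estimating the rural areas."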


Detecting estimates (cont.)

Look at low end of value range.

Spikes suggest estimates


Detecting estimates (cont.)

  • Day of month in dates:

  • Date of birth

  • Filing date

  • Posting date

Analysis requires a query tool which will extract day of month


Detecting estimates (cont.)


Key questions: Best available?

  • How does this dataset compare in quality with alternate sources?

  • Quality

  • Currency

  • Granularity

  • Price


Ask questions! Obvious, but they sound insulting.

  • How many employees do you have? How many records are in the Employee File you are sending us?

  • Are there any duplicate records in your file? How are they duplicates? Why? In a business sense, what does that mean?


Fundamentals of data quality


Key questions:

  • Are the values observed in each column valid?

  • For codes, do they conform to a consistent domain?

  • For quantities, do they conform to a reasonable range?

  • For quantities, are there any significant outliers?

  • Are the values observed in each column reasonable (given context)?

  • Are the values in each column accurate?

  • Is the definition for each field (and the data contained therein) consistent over the entire dataset?

  • What is the precision of each numeric field?

  • Is that precision consistent over the entire dataset?

  • How does this dataset compare in quality with alternate sources?
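The first few of these questions (domain conformance, ranges, outliers) can be checked mechanically. A minimal sketch, with made-up column names, rules, and thresholds:

```python
# Made-up domain rules and rows; real rules come from the source's
# documentation and your own profiling.
VALID_GENDER = {"M", "F", "U"}
QTY_RANGE = (0, 1000)   # assumed reasonable range for the quantity

rows = [
    {"gender": "M", "qty": 12},
    {"gender": "X", "qty": 15},      # invalid code
    {"gender": "F", "qty": -3},      # below range
    {"gender": "U", "qty": 90000},   # far above range (outlier)
]

def check_row(row):
    """Collect validity and range problems for one record."""
    problems = []
    if row["gender"] not in VALID_GENDER:
        problems.append("invalid gender code")
    if not (QTY_RANGE[0] <= row["qty"] <= QTY_RANGE[1]):
        problems.append("qty outside reasonable range")
    return problems

report = {}
for i, row in enumerate(rows):
    problems = check_row(row)
    if problems:
        report[i] = problems
print(report)
```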


Again: defining data quality

  • “High quality data accurately describes reality, according to its complete definition.”

  • --Michael Scofield



Components of data quality

  • Instance (row) present? (issue of scope of entire file)

  • Cell populated? (need to recognize null condition)

  • Is value in cell valid? (compare against rules)

  • Is value in cell reasonable? (requires context)

  • Is value in cell accurate? (requires definition)

  • How precise is the data in the cell?

  • Is value in cell current? (time dimension of definition)

  • Is the definition consistent over all dimensions?


“Completeness” of data (first definition)

  • Complete table (all the rows)

  • Incomplete table (some of the rows)

  • 70% complete


“Completeness” of data (2nd definition)

  • Complete table (all the fields)

  • Incomplete table (some of the fields)
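Both definitions of completeness reduce to simple ratios. A toy illustration (the expected row count is an assumption you must get from the source):

```python
# Toy table with an assumed expected-row count (which only the source
# can tell you).
expected_rows = 10
rows = [
    {"id": 1, "name": "Ada",  "city": "Derby"},
    {"id": 2, "name": "Alan", "city": None},
    {"id": 3, "name": None,   "city": None},
]

# First definition: share of expected instances (rows) present.
row_completeness = len(rows) / expected_rows

# Second definition: share of cells actually populated.
cells = [v for row in rows for v in row.values()]
cell_completeness = sum(v is not None for v in cells) / len(cells)

print(row_completeness, cell_completeness)
```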


Don’t confuse validity with accuracy!

  • Validity of data means it conforms to rules.

  • It is not necessarily reasonable

  • . . . or accurate.


Reasonability may be evaluated in context.

City            Temperature (°F)
------------    ----------------
PITTSBURGH            49
ERIE                  95
CLEVELAND             96
HARRISBURG            89
PHILADELPHIA          88
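A reasonability test for readings like these can compare each value to its peers; the 20-degree tolerance from the group median used below is an assumed threshold.

```python
from statistics import median

# Each reading is valid on its own, but one is unreasonable next to
# its peers.
temps = {"PITTSBURGH": 49, "ERIE": 95, "CLEVELAND": 96,
         "HARRISBURG": 89, "PHILADELPHIA": 88}

m = median(temps.values())
# The 20-degree tolerance is an assumed threshold.
flagged = {city: t for city, t in temps.items() if abs(t - m) > 20}
print(flagged)
```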


Kinds of reports for data analysis

  • Goal: Give visibility to data behavior.

  • 1. Domain studies (high-value format vs. low-value format)

  • 2. Inter-field dependency tests

  • 3. Referential integrity tests

  • 4. Formatted dumps

  • 5. Other reasonability tests
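A referential integrity test (item 3) can be sketched as a lookup of every foreign-key value against the reference domain; the table and column names here are invented.

```python
# Invented reference domain and detail rows.
products = {"P1", "P2", "P3"}                     # reference table keys
orders = [("O1", "P1"), ("O2", "P9"),
          ("O3", "P2"), ("O4", "")]               # (order id, product cd)

# Every foreign-key value should resolve to the reference table;
# anything that doesn't is an orphan worth reporting.
orphans = [(order_id, prod) for order_id, prod in orders
           if prod not in products]
print(orphans)
```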


Where do you look at the data?

  • Staging database.

  • Exact replica of source data.

[Diagram: External data source → simple ETL → replica (staging database), examined with a query tool; replica → complex ETL → target database. The replica’s data architecture is the same as the source’s.]


Update and refresh issues



Key questions:

  • Is this the only dataset you are going to acquire over time?

  • Or, are you going to get new versions or updates?

  • Will any updates you get be incremental (just the changes) or complete refreshment?

  • Are you going to get corrections as soon as the source knows about them?

  • How will the source differentiate updates from corrections?

  • Would they be found in the same dataset?


Update: incremental vs. full refresh

Can you distinguish between legitimate changes vs. error corrections?

Can you detect changes or corrections? Do you need to know about them? (downstream propagation)

Incremental updates save time; a full refresh keeps the processing simple.
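A snapshot diff can at least detect what changed between refreshes. Note that it cannot by itself distinguish a legitimate update from an error correction; that distinction must come from the source. The keys and fields below are illustrative.

```python
# Two snapshots of the same file, keyed by an assumed stable identifier.
previous = {"E1": {"dept": "SALES"}, "E2": {"dept": "HR"}}
current  = {"E1": {"dept": "SALES"}, "E2": {"dept": "IT"},
            "E3": {"dept": "OPS"}}

added   = sorted(set(current) - set(previous))
removed = sorted(set(previous) - set(current))
changed = sorted(k for k in set(previous) & set(current)
                 if previous[k] != current[k])

# "changed" mixes legitimate updates with error corrections; only the
# source can say which is which.
print(added, removed, changed)
```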


Data collection bias



Key questions:

  • What was the original purpose of the data collection effort?

  • What is your purpose or goal in acquiring the dataset?

  • What is the business value of the data to you?

  • What values and goals entered into the collection, organizing, and other preparation of the data?

  • What purpose does the source/provider have in making the data available (profit, persuasion, altruism)?


Two kinds of data generation

Data as byproduct of business processes

Data gathered as non-business research

Remember this?

commercial sector

banking; manufacturing; retail sales; customer service activities (utilities, communications, etc.); hospital patient records & billing; insurance policy setup and claims; education: student enrollment, grades, etc.

governments

social welfare and public assistance; tax collection; city services (trash, utilities); voting; public libraries (patron activity)

field surveys of land, topo, etc.

observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies

satellite & aerial imagery

Hybrid: strategic intelligence, police surveillance, mineral exploration, etc.


Bias towards “certainty”

  • Coded fields on data-entry screens discourage ambiguity…

  • …and encourage illusion of precision.

  • Premature entry of coded data results in non-nulls…

  • …and an illusion of completeness.

  • Tabular structures don’t let you say “about” next to a piece of data.

  • Tabular data implies / expects precision

data ambiguity lite


“About” …

or, as they say in Canada…

  • Textual and non-tabular expression allow imprecision.

  • “When does your plane leave?”

  • “About 3 PM.”

  • Q. “How old is the suspect?”

  • A. “In his mid-forties.”

You cannot tabularize that!



Do you understand the data-gathering process?

  • What policies / methods / procedures / screens introduce data errors?

  • What are sources of bias?


Ownership, usage, and liability



Key questions: ownership & usage

  • Is your usage of the data unlimited?

  • Can you resell the data?

  • Can you share the data with other organizations?

  • What restrictions are placed on you in using the data?


Key questions: ownership & usage (2)

  • Does the source (person or organization) take any responsibility for the quality, completeness, accuracy of the data?

  • Are there any implied limits to the “suitable” usage of the data?

  • Are you planning on using the data for something it was not originally designed for?



Confidentiality


Key questions: confidentiality

  • Does the dataset contain any data which (other than its proprietary nature) should be considered confidential?

  • By what criteria?

  • Why might it be confidential or sensitive?

  • Who are the interested parties?


Strategic and competitive

To cover up the fact that the Allies were reading Japanese codes, American news agencies were told that civilian coast-watchers in the Solomons saw Yamamoto boarding a bomber in the area.

  • Even the fact that you simply have the data may be secret.

[Map: Guadalcanal area: Bougainville, Ballale airfield, Henderson Field. Photo: Adm. Isoroku Yamamoto]



Data flow surveillance


Key questions:

  • Can you count on your source to notify you about any changes in…

  • logical data architecture

  • scope

  • quality

  • units of measure

  • precision

  • bias


Designing for the future

Designing an ongoing surveillance program for monitoring the stability of source data behavior, quality, and meaning, and the appropriateness of your mapping.

In other words….Preventing nasty surprises!


Designing ongoing data surveillance to protect yourself in the future.

  • First time analysis is tedious.

  • Lots of exploration of the data.

  • Lots of decisions.

  • Do you want to do it with every incremental update of the database?

  • No, but you can’t assume that next month’s data will behave the same as the first tape.

  • Expect the unexpected.


Goal of testing imported data

Semper vigilans

  • Protect yourself against injury because of errors or inconsistencies in imported data.

  • Be aware of changes to meaning of incoming data.

  • Be aware of changes of scope of incoming data.

  • Catch the problems as soon as possible… …not during the database update process.

  • Hence, make the loading process as fast and smooth as possible.

Always vigilant


The challenge of imported data

  • Each piece of data is an observation about reality, far from where you sit.

  • You cannot go out there and verify each piece of data which you import.

  • Even sampling is very difficult.

  • You can only test the data against . . .
  • Absolute rules about behavior
  • Reasonability tests to spot problems


“What can possibly go wrong?”

  • On updates, your data supplier can…

  • Stop populating a field

  • Filter out records for some reason

  • Redefine a code used by them internally

  • Re-use a field for a new meaning

  • Give you new data you didn’t expect

  • Change their source (and quality) of a given field
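Several of these failure modes can be caught mechanically by profiling each batch against the prior one. A sketch with invented field names and thresholds, flagging a jump in null rate (a field no longer populated) and unexpected new codes:

```python
# Profile of the prior batch (null rate and code domain per field) and a
# new batch to compare against it; names and thresholds are invented.
prior_null_rate = {"region": 0.02}
prior_codes = {"region": {"N", "S", "E", "W"}}

batch = [{"region": "N"}, {"region": None}, {"region": "Z"},
         {"region": None}, {"region": "S"}]

null_rate = sum(r["region"] is None for r in batch) / len(batch)
new_codes = ({r["region"] for r in batch if r["region"] is not None}
             - prior_codes["region"])

alerts = []
if null_rate > prior_null_rate["region"] + 0.10:
    # the source may have stopped populating the field
    alerts.append("null rate jumped")
if new_codes:
    # the source may have redefined or re-used the code
    alerts.append(f"unexpected codes: {sorted(new_codes)}")
print(alerts)
```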


Watch for . . .

  • Unexpected changes in data architecture of source:
  • New record types or segments
  • Changes in cardinality between logical entities
  • New fields
  • Changes in field length or usage

  • Unexpected changes in a field or column:
  • Changes in the domain of valid values
  • Changes in numeric behavior (e.g. going negative)
  • Changes in null or “missing value” behavior

Semper vigilans


Kinds of tests of imported data

  • Conformance to absolute rules (e.g. valid-value tests). Rules relative to: 1. expected values, 2. own record, 3. other records.

  • Reasonability testing: detecting anomalous behavior based on context (“This doesn’t seem right!”). Contexts and scope: 1. own record, 2. own tape, 3. prior tapes from this source, 4. whole database.


Flow of data through tests (ideal vision)

[Diagram: New data passes through absolute checks (driven by data rules); failures reject (suspense) the whole tape or just the record. Data then passes through reasonability tests (driven by context from the database); anomalies sound a warning and hold the tape (record) for further tests or explanation. Data that is OK (so far) is scrubbed into a temp area, then loaded into the database.]

Test incoming data A.S.A.P.

  • Test the data as soon as you have the source available, … not when you are updating the database or DW.

  • Test the data before you do any scrubbing.

  • If you do scrub, apply original tests and other tests again after scrubbing.


Data ambiguity

  • Tabular structures seduce us into thinking all cells have equal reliability.

  • Not necessarily so!



Summary & conclusion

  • Understand your users and their data needs.

  • Understand the politics of your source.

  • Do they have reason to be guarded?

  • Understand the burden your request for data places upon your source.

  • Decide if you need one-time, or repeated updates.


Summary & conclusion (cont.)

  • Before signing the agreement…

  • Review the data documentation

  • Understand the data architecture and paradigm

  • Thoroughly test some sample data

  • Be sure it conforms to your expectations…

  • format (case, etc.); scope; unit of measure; quality and precision

Try to break it!


The End …unless we keep going.

Michael Scofield
[email protected]

“No vegetables were harmed in the making of this presentation.”

