Historical Data Integration based on Collective Intelligence

Vladimir Zadorozhny

Graduate Information Science and Technology Program

School of Information Sciences

University of Pittsburgh

NADM Group

WHD Colloquium, March 27, 2012


Challenge

Diagram (layers, top to bottom):

  • Consolidated structured information

  • WHD Data Integration Infrastructure

  • Diverse, heterogeneous, semi-structured data sources


Web of Data?

  • Linked Data: using the Web to create typed links between data from different sources

  • Linked Data uses RDF (Resource Description Framework) to make typed statements (triples)

  • Expected result: a Web of Data that extends the Web with a global data space connecting diverse domains (people, companies, publications, etc.)

  • In general, the Web of Data has the potential (still questionable) to support loose data coupling, which may facilitate more efficient data utilization

  • While WHD can utilize Linked Data (LD) and related Web mashup technologies to some extent, it would be premature to rely on the Linked Data infrastructure
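
As a rough illustration of the typed statements (triples) mentioned in the bullets above, the sketch below models RDF-style triples as plain Python tuples and follows a typed link from one source to another. The example.org URIs and the location predicate are hypothetical; no RDF library is used, and this is not part of the WHD system.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One typed statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical predicate for illustration; owl:sameAs is the standard OWL link type.
LOCATION = "http://example.org/vocab/location"
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Source A states where a report is located (hypothetical URIs).
SOURCE_A = [
    Triple("http://example.org/a/report/1", LOCATION,
           "http://example.org/a/place/NYC"),
]

# Source B links A's place to its own identifier for the same city.
SOURCE_B = [
    Triple("http://example.org/a/place/NYC", SAME_AS,
           "http://example.org/b/city/New_York_City"),
]


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return the objects of all triples matching subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def linked_places(report_uri: str) -> List[str]:
    """Follow report -> location -> sameAs across both sources."""
    merged = SOURCE_A + SOURCE_B
    places = objects_of(merged, report_uri, LOCATION)
    return [same for place in places
            for same in objects_of(merged, place, SAME_AS)]
```

Calling `linked_places("http://example.org/a/report/1")` walks the typed link from source A into source B's identifier space, which is the kind of loose coupling the slide refers to.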

Dataverse Network?

  • An open source application for publishing, sharing, referencing, extracting, and analyzing research data, which makes it easier to make data available to others

  • "Dataverse owners can upload any file type and format (excel, txt, pdf, doc, etc.), and the files will be stored and made available in the original format" (http://thedata.org/files/dataversehandout.pdf)

  • Information consumers must still integrate the data sources themselves to perform analysis using multiple "dataverses".

  • While WHD aims to be a part of the Dataverse Network, it would not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository while submitting the data.

  • To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task.

General WHD Architecture

Diagram labels (listed from information providers to information consumers):

  • Information Providers

  • Heterogeneous historical data sources

  • Wrapper Generation, Wrapper Registration, Wrappers

  • Data Submission System

  • Structured homogeneous historical data

  • Internal Data Reliability Assessment

  • External Data Reliability Assessment

  • Annotated historical data

  • Data Fusion

  • Fused historical data

  • Information Consumers

Simple Scenario

Query against the WHD Infrastructure:

select * from Population

Extendable Target Schema (relational is not mandatory):

Source | Location | From | To | Population |

Mapping:

Territories -> Location
Population -> Population
Data Aggregation -> Total
Year -> From, To

Mapping:

region -> Location
Population -> Population
Data Aggregation -> Total
Year -> From, To

Wrapper: Keep Data Remotely

Wrapper: Materialize Data

Source | Location    | From       | To         | Population |
s2     | Liberia     | 01/01/1950 | 12/31/1950 | 824,000    |
s2     | Liberia     | 01/01/1960 | 12/31/1960 | 1,052,000  |
s2     | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000  |
s2     | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000  |
s1     | Mauritania  | 01/01/1950 | 12/31/1950 | 692,000    |
s1     | Mauritania  | 01/01/1960 | 12/31/1960 | 892,000    |
s1     | Senegal     | 01/01/1950 | 12/31/1950 | 2,543,000  |
s1     | Senegal     | 01/01/1960 | 12/31/1960 | 3,277,000  |

Data Source: s1 (xl)

Data Source: s2 (doc)

According to the 2006 revision of the World Population Prospects, the total population in the region of Liberia in 1950 was 824,000. The average population growth percent per year for the following ten years was 2.5. For Ivory Coast those numbers are 2,505,000 and 3.6, respectively.
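
The column mappings in this scenario can be sketched as a small wrapper function that renames source columns and expands a year into the From and To dates of the target schema. This is only an illustration under stated assumptions: the input row layout (a `Year` column next to `Territories` and `Population`), the function name `wrap_row`, and the use of the first mapping for source s1 are hypothetical, and the "Data Aggregation -> Total" setting is not modeled. The actual WHD wrappers are built with a data integration engine, not with this code.

```python
from datetime import date
from typing import Dict

# Column mapping taken from the slide: source column -> target column.
MAPPING = {"Territories": "Location", "Population": "Population"}


def wrap_row(source_id: str, row: Dict[str, object],
             year_column: str = "Year") -> Dict[str, object]:
    """Translate one source row into the target schema
    Source | Location | From | To | Population."""
    year = int(row[year_column])
    target: Dict[str, object] = {"Source": source_id}
    for source_col, target_col in MAPPING.items():
        target[target_col] = row[source_col]
    # "Year -> From, To": a single year becomes a one-year interval.
    target["From"] = date(year, 1, 1).strftime("%m/%d/%Y")
    target["To"] = date(year, 12, 31).strftime("%m/%d/%Y")
    return target


# Hypothetical spreadsheet rows; the values match the s1 rows in the table.
S1_ROWS = [
    {"Territories": "Mauritania", "Year": 1950, "Population": 692000},
    {"Territories": "Senegal", "Year": 1960, "Population": 3277000},
]

WRAPPED = [wrap_row("s1", row) for row in S1_ROWS]
```

Each element of `WRAPPED` has exactly the five target columns, so rows from differently structured sources can be stored and queried together.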


Big Picture: continuously growing infrastructure (a la Wikipedia)

Diagram: WHD Infrastructure, with three activities: Data Collection, Data Curation, Data Utilization

WHD Prototype

  • Group of graduate IS students: special project in Advanced Data Management class (INFSCI2711)

  • Content Management → Pligg (Open Source Content Management System; Apache, PHP, and MySQL based)

  • Data Integration Engine → Pentaho Kettle (Open Source Data Integration Engine; Java-based GUI and Command Line Tools; XML-based data transformation file)

  • Data providers

    • download Wrapper Generating Software

    • configure wrappers on their workstation (using preconfigured templates)

    • register wrappers on WHD Server

Figure labels: Data Source, Data Transformation, Transformed Data, XML Wrapper


Data Reliability Assessment and Data Fusion

  • Systems based on crowdsourcing require mechanisms to ensure data quality.

  • WHD Infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.

  • As the system continuously receives new historical reports, WHD estimates the reliability of this data, which evolves with respect to new evidence.

  • WHD uses a measure of inconsistency caused by a report to assess its internal reliability.

  • WHD also allows users to submit their subjective feedback on the reliability of data to assess external reliability.

  • WHD utilizes subjective logic to combine internal and external reliability assessment
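
To make the last bullet more concrete, here is a minimal sketch of how two reliability opinions could be combined in subjective logic. It uses the cumulative fusion (consensus) operator on binomial opinions and the usual evidence-to-opinion mapping with a prior weight of 2. The slides do not say which operator WHD applies, so the operator choice, the class and function names, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial opinion: belief + disbelief + uncertainty == 1."""
    belief: float
    disbelief: float
    uncertainty: float

    def expected(self, base_rate: float = 0.5) -> float:
        """Projected probability that the report is reliable."""
        return self.belief + base_rate * self.uncertainty


def from_feedback(positive: int, negative: int,
                  prior_weight: float = 2.0) -> Opinion:
    """Map counts of positive and negative user feedback to an opinion."""
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total, prior_weight / total)


def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Fuse two opinions about the same report (assumed operator)."""
    if a.uncertainty == 0 and b.uncertainty == 0:
        # Both opinions are fully certain: fall back to a plain average.
        return Opinion((a.belief + b.belief) / 2,
                       (a.disbelief + b.disbelief) / 2, 0.0)
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    return Opinion(
        (a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        (a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        (a.uncertainty * b.uncertainty) / k,
    )


# Hypothetical inputs: an internal opinion derived elsewhere from an
# inconsistency measure, and an external opinion from 3 positive and
# 1 negative user ratings.
internal = Opinion(0.6, 0.2, 0.2)
external = from_feedback(positive=3, negative=1)
combined = cumulative_fusion(internal, external)
```

The fused opinion has lower uncertainty than either input, which reflects the idea that reliability estimates firm up as new evidence arrives.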


Historical Data: Redundancy

Temporal Overlaps

t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300

Total number of Measles cases in New York City from 1900 to 1930:

700 + 300 = 1000 ??? Temporal overlap between t1 and t2

Spatial Overlaps

t3 | source_ref1 | Smallpox | NY  | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600

Total number of Smallpox cases in New York State from 1900 to 1930:

500 + 600 = 1100 ??? Spatial overlap between t3 and t4

Timeline figure (1900 to 1930): Measles reports of 700 and 300; Smallpox reports of 500 (NY) and 600 (NYC)

Naming Overlaps

t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis    | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B  | NY | 10/20/1910 | 10/30/1930 | 300

Total number of Hepatitis cases in New York State from 1920 to 1930:

700 + 700 + 300 = 1700 ??? Naming overlap between t5, t6 and t7
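
The temporal and spatial cases above can be flagged mechanically before any totals are computed. The sketch below checks pairs of reports on the same disease: identical locations with intersecting periods count as a temporal overlap, and locations where one contains the other count as a spatial overlap. The `CONTAINED_IN` map, the class and function names, and the decision to leave naming overlaps out (they would need a disease vocabulary) are assumptions for illustration, not the method used in WHD.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Report:
    rid: str
    source: str
    disease: str
    location: str
    start: date
    end: date
    cases: int


# Hypothetical containment map: New York City lies inside New York State.
CONTAINED_IN = {"NYC": "NY"}


def periods_intersect(a: Report, b: Report) -> bool:
    """True if the two reporting periods share at least one day."""
    return a.start <= b.end and b.start <= a.end


def one_contains_other(a: Report, b: Report) -> bool:
    """True if one location is recorded as part of the other."""
    return (CONTAINED_IN.get(a.location) == b.location
            or CONTAINED_IN.get(b.location) == a.location)


def classify_overlaps(reports: List[Report]) -> List[Tuple[str, str, str]]:
    """Return (id, id, kind) for every suspicious pair of reports."""
    flagged = []
    for a, b in combinations(reports, 2):
        if a.disease != b.disease:
            continue
        if a.location == b.location and periods_intersect(a, b):
            flagged.append((a.rid, b.rid, "temporal"))
        elif one_contains_other(a, b):
            flagged.append((a.rid, b.rid, "spatial"))
    return flagged


# Rows t1 to t4 from the slide.
REPORTS = [
    Report("t1", "source_ref1", "Measles", "NYC", date(1900, 10, 10), date(1920, 10, 10), 700),
    Report("t2", "source_ref2", "Measles", "NYC", date(1910, 10, 20), date(1930, 10, 30), 300),
    Report("t3", "source_ref1", "Smallpox", "NY", date(1900, 10, 20), date(1920, 10, 20), 500),
    Report("t4", "source_ref1", "Smallpox", "NYC", date(1920, 10, 30), date(1930, 10, 30), 600),
]

FLAGGED = classify_overlaps(REPORTS)
```

Pairs that appear in `FLAGGED` are the ones whose counts cannot simply be added, which is why the naive sums on the slide carry question marks.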


Historical Data: Inconsistency

Timeline figure: Measles reports in NYC over time from two report series, R1 (700, 200) and R2 (500, 400, 300, ...), which are redundant and inconsistent.

Information Consumer Toolset: Data Visualization Dashboard


ICTS: Map Exhibits and Timeline Widgets


ICTS: Motion Chart Animation

Conclusion

  • We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence

  • We implement this approach in the WHD infrastructure for consolidating heterogeneous historical data

  • Major challenge: how to engage a large community of researchers to share their data and collectively resolve the data heterogeneities in a continuously growing large-scale distributed historical repository?

    • contributions from CHAI members (only a small fraction of Wikipedia users contributes information to ensure its growth)

    • as the infrastructure evolves, users may become interested in “embedding” their data in a larger context to perform global analysis and to utilize WHD tools

    • open development platform (extendable data transformation library and toolsets)

Acknowledgements

Doctoral Students:

Ying-Feng Hsu

Julian Lee

Graduate IS Students (WHD system development team):

Andrew Barnett (team leader)

Andrew Entin

Thomas Junker

Jidapa Kraisangka

Han Liao

Eric Miller

Ye Peng

Evan Pulgino

Henry Quattrone

Mark Swartz

Miao Tan

Liu Yuchen

Lihong Zhang
