
BiographyNet: Managing Provenance at multiple levels and from different perspectives



BiographyNet: Managing Provenance at multiple levels and from different perspectives. 21 October 2013. Niels Ockeloen, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne. The Network Institute, VU University Amsterdam. http://wm.cs.vu.nl

BiographyNet: Managing Provenance at multiple levels and from different perspectives

21 October 2013

  • Niels Ockeloen, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne

  • The Network Institute, VU University Amsterdam

  • http://wm.cs.vu.nl

BiographyNet: Managing Provenance at multiple levels and from different perspectives. Linked Science (LISC) – ISWC 2013, Sydney, Australia – 21 October 2013


Overview

Overview of this presentation

  • Introduction of the project

  • Short overview of use cases

    • Illustrative use case example

  • Why provenance is important

    • Requirements from the perspective of the Historian

    • Requirements from the perspective of the Computer scientist

  • The BiographyNet schema

    • Foundations

    • Extending the schema with Provenance

    • Aggregated provenance information

    • Detailed provenance information

What is BiographyNet?

BiographyNet: Extracting relations between people, places and historic events

  • Multidisciplinary E-History project

What is E-history?

E-humanities

Investigates what can be done in the humanities with modern techniques that we could not do before, or only with a great deal of effort

E-history

  • Subdomain of E-humanities that aims at improving existing methods of historical research rather than introducing a whole new way of doing historical research *

* Zaagsma, G.: Doing history in the digital age: history as a hybrid practice (2013) http://gerbenzaagsma.org/blog/16-03-2013/doing-history-digital-age-history-hybrid-practice

What is BiographyNet?

BiographyNet: Extracting relations between people, places and historic events

  • Multidisciplinary E-History project

  • Funded by the Netherlands eScience Center

  • Partners are the Netherlands eScience Center, the Huygens/ING Institute of the Royal Dutch Academy of Sciences and VU University Amsterdam

  • Starting Point: The Biographical Portal of the Netherlands http://www.biografischportaal.nl

    • 125,000 short biographical descriptions with limited meta data from a variety of Dutch biographical dictionaries

    • 76,000 individuals

    Short biographical descriptions with limited meta data

    [Chart: individuals with available information (%)]

    Project Goals

    Main project goals

    • Provide a richer historic knowledge base by creating a semantic layer on top of the data from the Biographical Portal

      • Convert the available data to RDF (first conversion available)

      • Enrichments (NLP) and Aggregations

      • Link to other sources

    • Inspire Historians in setting up new research projects by providing them with interesting leads

      • Development of a demonstrator

      • Quantitative analysis, visualisation and browsing techniques

    • Re-usable deliverables

      • Open-source release of the platform for analyzing texts about people

      • Methodology for extraction of a relation network between people, places and events

    Use Case Overview

    Twelve use cases have been developed so far, involving quantitative analysis, relation discovery, thematic research, etc.

    • Simple:

      • Group analysis of Governors-general of the Dutch Indies

    • More complex:

      • When did Dutch elites get involved with the ‘New World’?

    • Highly complex:

      • What can we say about nationalism in biographical dictionaries from the nineteenth and twentieth century?

    Illustrative use case

    Governors-General of the Dutch Indies

    • Highest Official in the Dutch Indies (1610-1949)

      • 129 Biographies describing 71 individuals

    • What can we say about these men as a group?

    • What properties did they need to have to be appointed?

      • Personal qualities

      • Relations (already more difficult)

    Governors General: Data Mining

    Focus on the following information

    • Family connections

      • Parents

      • Partner

      • Children

    • Dates

      • Birth

      • Appointment

      • Death

    • Motivation

      • Education

      • Religion

      • Reasons for appointment

      • Reasons for leaving the office

    Governors General: Time and effort

    Manual analysis

    “More than one full week to manually mine this information from the Biography Portal.” (Serge ter Braake)

    The question

    “Can a historian do this with (almost) the same results in less than an hour when using the demonstrator?”

    Text mining using Natural Language Processing (NLP)

    Basic System for data enrichment using text:

    • Identifying meta data in text

      • Linguistically naïve supervised machine learning

    • Linguistic processing

      • Detection of (co-referenced) named-entities (persons, places and dates) and events

      • Concept identification

    NLP: Challenges

    Challenges for NLP within BiographyNet:

    • Deal with alternative spelling

      • Texts vary from 19th century Dutch to contemporary Dutch

      • Variations in the naming of people and places

    • OCR-ed texts contain errors

    • Used methods may introduce bias:

      • Example: location identification with GeoNames. Heuristic: when a name has multiple candidates, take the one in, or closest to, the Netherlands

      • Problem: ‘America’ is a place in the Netherlands, but what about trade with the New World?
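
The bias described above can be made concrete with a small sketch. It does not call GeoNames; the candidate list, the reference point for the Netherlands and all coordinates are approximate, hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

# Rough reference point for the Netherlands (assumed, approximate centre).
NL_LAT, NL_LON = 52.2, 5.3


@dataclass(frozen=True)
class Candidate:
    name: str
    country: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def pick_closest_to_netherlands(candidates):
    """Prefer a candidate inside the Netherlands, otherwise the nearest one."""
    return min(
        candidates,
        key=lambda c: (c.country != "NL", haversine_km(NL_LAT, NL_LON, c.lat, c.lon)),
    )


# Hypothetical gazetteer hits for the string "America" (coordinates approximate).
hits = [
    Candidate("America (village, Limburg)", "NL", 51.44, 5.98),
    Candidate("America (continent)", "", 19.0, -96.0),
]

chosen = pick_closest_to_netherlands(hits)
print(chosen.name)
```

With these inputs the heuristic always selects the Dutch village, even in a text about overseas trade, which is why the choice of heuristic has to be recorded alongside the result.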

    NLP: Preliminary results – Governors

    [Chart: presence of information in text vs. meta data (% of 71 individuals)]

    Towards the demonstrator

    Before development of the actual demonstrator can commence, we first need to:

    • Convert the data of the Biography Portal to RDF

      • Prevent loss of information

    • Devise a schema

      • Structure the data

      • Provide compatibility with other interesting sources

      • Facilitate the recording of provenance information on the manipulation of the data

    Requirements from the perspective of the Historian

    Two main requirements for the demonstrator:

    • A trace back to all original sources (texts and meta data) involved in producing a certain result

      • Which sources were used for the overall outcome and how often?

      • What potentially relevant data was excluded from the end result?

      • Which piece of data led to a specific result (e.g. the age of a specific governor at his appointment)?

    • Insight into the processes manipulating and selecting the data

      • Indication of overall performance: Focus on recall or precision?

      • Global description of the used heuristics should be provided

      • Indication of responsibility: whom to contact when results are called into question?
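
The first requirement, tracing a result back to its original sources and counting how often each was used, can be sketched as a walk over recorded derivation links. The identifiers and the `derived_from` mapping below are hypothetical and only illustrate the idea; they are not taken from the BiographyNet data.

```python
from collections import Counter

# Hypothetical provenance links: each derived item maps to the items it was derived from.
# The links are assumed to be acyclic.
derived_from = {
    "result:age_at_appointment": ["enrichment:birth_date", "enrichment:appointment_date"],
    "enrichment:birth_date": ["source:bio_A", "source:bio_B"],
    "enrichment:appointment_date": ["source:bio_A"],
}


def original_sources(item, links):
    """Count how often each original source is reached from `item`.

    An item with no recorded inputs is treated as an original source.
    """
    counts = Counter()
    stack = [item]
    while stack:
        current = stack.pop()
        parents = links.get(current, [])
        if not parents:
            counts[current] += 1
        else:
            stack.extend(parents)
    return counts


usage = original_sources("result:age_at_appointment", derived_from)
for source, times in sorted(usage.items()):
    print(source, times)
```

Sources that exist in the data set but never appear in `usage` are candidates for the second question, namely which potentially relevant data was excluded from the end result.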

    Requirements from the perspective of the Computer Scientist / Computational Linguist

    Reproducing results:

    • Reproducing results in NLP is non-trivial

      • Details in implementations or experimental setup can influence results up to a point where they tell a different story

    • Clear registration of all steps involved and storage of intermediate system output can improve reproducibility

    • Systematic testing can help to gain insight into the variation in the outcome of our systems and hence lead to more insight into their performance

      Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen and Nuno Freire (2013) Offspring from Reproduction Problems: What Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria, August 2013.

    Requirements from the perspective of the Computer Scientist / Computational Linguist

    Translation into requirements for the demonstrator:

    • Facilitate Replication and Reproduction

      • Recording of information on the tools used, such as creator, version number, etc.

      • Recording of any kind of pre- / post-processing done on input/output data.

      • Recording of the intention behind the various steps in the NLP pipeline, including assumptions made and possible biases.

      • Intermediate results need to be preserved for debugging purposes

    • The schema needs to be both generic and flexible

      • NLP pipeline design can change

      • Which tools and formats will be used in the future is still unclear

    The BiographyNet Schema

    Foundations of the schema:

    • Based on the structure of the original XML files

    • Needs to facilitate the coupling of different biographies of the same person, without compromising the original data

    • Needs to facilitate the incorporation of several enrichments, following from NLP, as well as aggregations

    • Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.

    The conversion process

    Very simplified BP XML example:

    <XML>
      <BioDes>
        <FileDes>              <!-- Source meta data -->
          <Author></Author>
        </FileDes>
        <PersonDes>            <!-- Person meta data -->
          <Name></Name>
        </PersonDes>
        <BioPart>              <!-- Biographical text -->
          <Snippet></Snippet>
        </BioPart>
      </BioDes>
    </XML>

    Purely syntactic conversion

    • Preserve the original structure of the data

    • Prevent loss of information

    • Allow for reinterpretation of the original data in the future

    Data Preservation
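
As a rough illustration of what a purely syntactic conversion means, the sketch below mirrors an XML tree one-to-one as subject, predicate, object triples. It is not the XMLRDF tool for ClioPatria used in the project; the element values, the `_:n` node labels and the `to_triples` helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Toy record following the simplified structure above; the values are made up.
XML = """
<BioDes>
  <FileDes><Author>Example Author</Author></FileDes>
  <PersonDes><Name>Example Person</Name></PersonDes>
  <BioPart><Snippet>Example text.</Snippet></BioPart>
</BioDes>
"""


def to_triples(element, subject, triples, counter):
    """Mirror the XML tree one-to-one: each child element becomes one triple."""
    for child in element:
        text = (child.text or "").strip()
        if len(child) == 0:
            # Leaf element: keep its text as a literal value.
            triples.append((subject, child.tag, text))
        else:
            # Nested element: mint a placeholder node and recurse into it.
            counter[0] += 1
            node = f"_:n{counter[0]}"
            triples.append((subject, child.tag, node))
            to_triples(child, node, triples, counter)
    return triples


root = ET.fromstring(XML.strip())
triples = to_triples(root, "_:n0", [], [0])
for triple in triples:
    print(triple)
```

No element is interpreted or renamed, so the original structure can be reinterpreted later. A real converter would also have to carry over attributes and mixed content, which this sketch ignores.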

    The conversion process

    Conversion steps:

    • Retrieval of XML dump of the Biography Portal

    • Initial conversion to ‘crude’ RDF

      • Using ClioPatria and the XMLRDF tool for ClioPatria

    • RDF restructuring

      • Correction of purely syntactic inefficiencies in the data

    • TODO: Linking to other sources

      • Essential step in the ‘Linked Data’ philosophy

    Adding Provenance Information

    Provenance information is information on how Entities come into existence

    • What are entities?

      • Documents, Articles, Pictures, etc.

      • Basically anything that can be ‘produced’ by something or someone

    • What kind of information?

      • Who did what?

      • Using which entities?

      • In which processes?

    • Why use the PROV-DM, i.e. PROV-O?

      • PROV-DM now an official W3C recommendation
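
The three questions above map onto core PROV-O terms. The sketch below writes a few statements as plain tuples and looks them up; the `prov:` names are terms from the W3C PROV-O vocabulary, while every `ex:` resource and the `objects_of` helper are invented for illustration.

```python
# Prefixed names use the W3C PROV-O vocabulary; the "ex:" resources are invented.
statements = [
    ("ex:biography_text", "rdf:type", "prov:Entity"),
    ("ex:enriched_metadata", "rdf:type", "prov:Entity"),
    ("ex:nlp_run_1", "rdf:type", "prov:Activity"),
    ("ex:researcher", "rdf:type", "prov:Agent"),
    # Using which entities?
    ("ex:nlp_run_1", "prov:used", "ex:biography_text"),
    # In which processes?
    ("ex:enriched_metadata", "prov:wasGeneratedBy", "ex:nlp_run_1"),
    # Who did what?
    ("ex:nlp_run_1", "prov:wasAssociatedWith", "ex:researcher"),
]


def objects_of(subject, predicate):
    """Return all objects recorded for a given subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


activity = objects_of("ex:enriched_metadata", "prov:wasGeneratedBy")[0]
print("generated by:", activity)
print("used:", objects_of(activity, "prov:used"))
print("responsible:", objects_of(activity, "prov:wasAssociatedWith"))
```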

    Provenance in BiographyNet

    Based on the requirements for the demonstrator, provenance needs to be modeled:

    • From several perspectives:

      • Information involved: sources, but also NER input data, etc.

      • Processes involved: all steps in enrichment, aggregation, etc.

      • People involved: who was responsible for pipeline, tool, etc.

    • At multiple levels:

      • An aggregated level, i.e. per enrichment: targeted at the historian

      • A detailed level, i.e. all individual processes: targeted at the computer scientist and computational linguist

    Recap: Why is provenance info important for BiographyNet?

    Needed to ensure credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool

    • One needs to be able to validate results

      • Replication: Retrieving the same results later using the demonstrator

      • Reproducibility: Manually by the historian

    • The aggregated level – Targeted at the historian

      • Which original sources were involved?

      • Whom to contact in case results are called into question?

    • The detailed level – Targeted at the computer scientist

      • Detailed information on each individual step

      • Allows for debugging the internal processing pipeline

    BiographyNet: Schema illustration

    http://www.biographynet.nl/schema

    BiographyNet: Enrichment example

    [Diagram: the NNBW File (with Meta Data) holds the Biographical Description “Thorbecke”, consisting of Person Meta Data (Birth Event: 1798) and Biography Parts with the text “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (English: “Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”). An Enrichment, produced by the NLP Pipeline (prov:plan), yields a new Biographical Description for Thorbecke whose Person Meta Data holds the Birth Event 1798-01-14, Zwolle.]
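
To make the enrichment in this example tangible, here is a toy extraction that turns the Thorbecke snippet into the enriched birth date and place. It is a hand-written regular expression for this one sentence pattern, not the project's NLP pipeline; `PATTERN`, `DUTCH_MONTHS` and `extract_birth` are hypothetical names.

```python
import re

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Matches phrases of the form "in 1798 geboren op 14 januari in Zwolle".
PATTERN = re.compile(
    r"in (?P<year>\d{4}) geboren op (?P<day>\d{1,2}) (?P<month>[a-z]+) in (?P<place>[A-Z][\w-]*)"
)


def extract_birth(text):
    """Return the birth date (ISO format) and place, or None if the pattern is absent."""
    match = PATTERN.search(text)
    if match is None or match["month"] not in DUTCH_MONTHS:
        return None
    month = DUTCH_MONTHS[match["month"]]
    date = f"{int(match['year']):04d}-{month:02d}-{int(match['day']):02d}"
    return {"date": date, "place": match["place"]}


snippet = (
    "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle "
    "en komt uit een half-Duitse…"
)
birth = extract_birth(snippet)
print(birth)
```

The original meta data only held the year; the enriched description adds day, month and place, and the provenance record is what ties those additions back to the text and to the process that extracted them.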

    More than just Provenance:

    Provenance and Plans (P-PLAN):* Represent the plans that guided the execution of scientific processes

    • ‘Plans’ describe the original idea behind an activity

      • Each ‘Plan’ can consist of one or more ‘Steps’

      • Each ‘Step’ corresponds to an ‘Activity’

    • ‘Variables’ describe the input/output of an activity

      • Structure, format, quantity, etc.

      • Each ‘Variable’ corresponds with an input/output ‘Entity’ of an ‘Activity’

    • ‘Plans’ have their own provenance info

      • E.g. who was responsible for the creation of a plan?

        *Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan

    Why model plans besides provenance?

    P-PLAN is used to not only model what actually happened, but also what was supposed to happen

    • Forces the recording of what an activity and its input/output should look like

      • Provides abstract description of original idea behind activity

      • As such, can provide info on heuristics and assumptions

    • Allows for comparing the actual activity and its input/output with the original plan and its variables

      • Do they differ from each other and to what extent?

      • Makes finding errors much easier, as more information is available about what the input/output should look like
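
Comparing what happened with what was supposed to happen can be sketched as a check of produced entities against planned variables. The classes below are simplified stand-ins for P-PLAN variables and PROV entities, and the format strings are invented; this is a sketch under those assumptions, not the BiographyNet schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """What a step's input or output should look like (cf. a P-PLAN variable)."""
    name: str
    expected_format: str


@dataclass(frozen=True)
class Entity:
    """What an executed activity actually consumed or produced (cf. a PROV entity)."""
    name: str
    actual_format: str
    corresponds_to: str  # name of the planned variable


def deviations(variables, entities):
    """List every entity that differs from the variable it corresponds to."""
    planned = {v.name: v for v in variables}
    problems = []
    for entity in entities:
        variable = planned.get(entity.corresponds_to)
        if variable is None:
            problems.append(f"{entity.name}: no planned variable '{entity.corresponds_to}'")
        elif variable.expected_format != entity.actual_format:
            problems.append(
                f"{entity.name}: expected {variable.expected_format}, got {entity.actual_format}"
            )
    return problems


plan_vars = [Variable("birth_date", "YYYY-MM-DD"), Variable("birth_place", "place name")]
run_entities = [
    Entity("out_1", "YYYY", "birth_date"),
    Entity("out_2", "place name", "birth_place"),
]
report = deviations(plan_vars, run_entities)
print(report)
```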

    BiographyNet: Schema illustration

    [Schema diagram; node labels: Variable, Plan, Agent, Person, Association, NLP Tool, Entity, Activity]


    Current Status

    Main components of the demonstrator

    • Initial schema available

      • Schema models enrichments and aggregations alongside original sources

      • Allows for storing various levels of provenance information

      • Model will be adapted while progressing with building the demonstrator

    • Initial conversion to RDF available

      • Structure according to devised schema

      • Next step is linking to external sources

    • Initial NLP system setup available

    • Interface

      • First ideas and sketches

    Thank you for your attention

    www.biographynet.nl

    Feel free to ask questions
