Automatic extraction from and reasoning about genealogical records a prototype
Download
1 / 40

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype - PowerPoint PPT Presentation


  • 128 Views
  • Uploaded on

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype. By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science. Digital Images – Human Index.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Automatic Extraction From and Reasoning About Genealogical Records: A Prototype' - emera


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Automatic extraction from and reasoning about genealogical records a prototype

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

By

Charla J. Woodbury

A thesis submitted to the faculty of

Brigham Young University

in partial fulfillment of the requirements for the degree of

Master of Science


Digital images human index
Digital Images – Human Index Records: A Prototype

  • Large number of competing family history websites

    • Digital images

    • Human indexes

  • Researchers hunting through records and indexes

  • to put families together

  • 2


    Problem
    Problem Records: A Prototype

    • Large amounts of primary genealogical data

    • Big projects to index and extract records

    • Two independent indexers and adjudication

    • Millions of human hours used to index or match records for names and families

    3


    Automated extraction solution
    Automated Extraction Solution Records: A Prototype

    • Create a specialized extraction ontology to interpret and label genealogical data

    • Add rules and logic that

      • Label family roles - husband, daughter, etc.

      • Link family relationships

        • HUSBAND – WIFE

        • PARENT – CHILD

    4


    Outline
    Outline Records: A Prototype

    5

    Data Preparation

    Ontology Extraction System (OntoES)

    OWL File and SWRL Rules

    SPARQL Queries

    Experimental Results

    Conclusions


    1 data preparation
    1. Data Preparation Records: A Prototype

    • Collect machine-readable records from three different countries

    • Format in HTML format for extraction

    • Prepare lexicons for names, places, etc.

    6





    South petherton marriages from genuki
    SOUTH PETHERTON 1668-1849MARRIAGES (from genuki)

    10


    Transcript of the original record
    Transcript of the original record 1668-1849

    1576/1577 eodem die Nicholaus Patch Christinam Denman

    26 Jan 1605 Richard Patch et Joanna Lavor

    1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis cominguntur

    1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur

    1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio cominguntur

    1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch

    1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio cominguntur

    1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio cominguntur

    1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio cominguntur

    1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio cominguntur


    2 ontology extraction system
    2. Ontology Extraction System 1668-1849

    OntoES: automatically interpret and correctly label genealogical data using

    • Data frames

      • Regular expressions

      • Lexicons

      • Date conversion methods

    12


    Marriage ontology
    Marriage Ontology 1668-1849

    13


    Data frame editor
    Data Frame Editor 1668-1849

    14


    Regular expressions
    Regular expressions 1668-1849

    • MARDATE

      • Value expression

        • EXAMPLE type 25-Sep 1613

          (0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)

      • Keyword expression

        (\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)


    Sample month lexicon

    1Ober 1668-1849

    7ber

    8ber

    9ber

    apr

    april

    aprilis

    aug

    august

    augusti

    augustus

    avr

    avril

    avrilis

    dec

    december

    decembr

    decembre

    decembri

    feb

    febr

    februari

    february

    jan

    januarij

    january

    jul

    juli

    julius

    july

    jun

    june

    Sample MONTH LEXICON

    16


    Object level
    Object Level 1668-1849

    17


    CANONICALIZATION METHODS 1668-1849inside the ontology

    • Regularize date (Julian format: YYYYddd)

      1620 2-May → 1620093

    • Display stored Julian format as DD MMM YYYY

    • 1620093 →2 MAY 1620

    18


    Feast dates
    Feast Dates 1668-1849

    Dates expressed as a holy day

    • Fixed Dates

      Christmas 1720 →25 DEC 1720

    • Moveable Dates around Easter

      (36 possible Easter dates with leap year variation)

      • 1723 Dnica Septuagesima →24 JAN 1723

    • Same day as previous entry

    19


    Run ontology
    Run Ontology 1668-1849

    • Input

      • Ontology (Created with OntoES)

      • HTML data (Hypertext Markup Language)

    • Output

      • RDF database (Resource Description Format)

      • OWL file (Ontology Web Language)

    20


    Ontology workbench
    Ontology Workbench 1668-1849

    21


    Extracted marriages
    Extracted Marriages 1668-1849

    22


    3 owl file and swrl rules
    3. OWL File and SWRL Rules 1668-1849

    • OWL HEADER

      • <owl:Class rdf:ID="MarriageRecord"/>

      • <owl:Class rdf:ID="Person"/>

      • <owl:Class rdf:ID="NameU"/>

      • <owl:DatatypeProperty rdf:ID="NameUValue">

      • <rdfs:domain rdf:resource="#NameU"/>

      • <rdfs:range rdf:resource="&xsd;string"/>

      • </owl:DatatypeProperty>

    • PERSON - NAMEU

      • <owl:ObjectProperty rdf:ID="Person-NameU">

      • <rdfs:domain rdf:resource="#Person"/>

      • <rdfs:range rdf:resource="#NameU"/>

      • <owl:inverseOf>

      • <owl:ObjectProperty rdf:ID="NameU-Person"/>

      • </owl:inverseOf>

      • </owl:ObjectProperty>


    Sample rdf triples
    Sample RDF Triples 1668-1849

    Person_10 | sameAs | Person_10

    Person_10 | type | Thing

    Person_10 | type | Person

    NameU_0 | NameUValue | “Christian Denman”

    NameU_0 | sameAs | NameU_0

    NameU_0 | type | Thing

    NameU_0 |type | NameU

    NameM_4 | NameMValue | “Nicholas Patch”

    NameM_4 | sameAs | NameM_4

    NameM_4 | type | Thing

    NameM_4 |type | NameM


    Swrl rules
    SWRL Rules 1668-1849

    • Define OWL Class

      • Example – Husband

      • <owl:Class rdf:ID="Husband"/>

    • Define Rule

      • Example – Person with male name is a Husband

      • Person-NameM(?x,?y) -> Husband(?x)

    ?y

    ?x

    25


    Marriage person 10 to person 4
    Marriage – Person_10 to Person_4 1668-1849

    • Person_10

      • <Person rdf:ID="Person_10">

      • <Person-NameU rdf:resource="#NameU_0" />

      • </Person>

    • MarriageRecord_7

      • <MarriageRecord rdf:ID="MarriageRecord_7">

      • <MarriageRecord-Person rdf:resource="#Person_4" />

      • <MarriageRecord-Person rdf:resource="#Person_10" />

      • </rdf:MarriageRecord>

    • NameM_4

      • <NameM rdf:ID="NameM_4">

      • <NameMValue> Nicholas Patch</NameMValue>

      • </NameM>

    • Person_4

      • <Person rdf:ID="Person_4">

      • <Person-NameM rdf:resource="#NameM_4" />

      • </Person>


    Rule head in owl file
    Rule HEAD in OWL file 1668-1849

    <swrl:Imp rdf:ID="Def-Husband">

    <swrl:head rdf:parseType=“Collection”>

    <swrl:ClassAtom>

    <swrl:argument1 rdf:resource="#x"/>

    <swrl:classPredicate rdf:resource="#Husband"/>

    </swrl:ClassAtom>

    </swrl:head>


    Rule body in owl file
    Rule BODY in OWL file 1668-1849

    <swrl:body>

    <swrl:IndividualPropertyAtom>

    <swrl:propertyPredicate rdf:resource="#Person-NameM"/>

    <swrl:argument1 rdf:resource="#x"/>

    <swrl:argument2 rdf:resource="#y"/>

    </swrl:IndividualPropertyAtom>

    </swrl:body>

    </swrl:Imp>


    Related rules
    Related Rules 1668-1849

    • NameF is populated then value in NameU is Husband

      Person-NameF(?w,?v)  MarriageRecord-Person(?z,?w) 

      MarriageRecord-Person(?z,?x) Person-NameU(?x,?y)

      -> Husband(?x)

    ?z

    ?v

    ?w

    ?x

    ?y

    29


    Husbandof rule
    HusbandOf 1668-1849 Rule

    Husband(?x)  Wife(?y)  MarriageRecord-Person(?z,?x)

     MarriageRecord-Person(?z,?y)

    -> HusbandOf(?x,?y)

    30


    Auxiliary name rules

    NameM(?x) -> Name(?x) 1668-1849

    NameF(?x) -> Name(?x)

    NameU(?x) -> Name(?x)

    NameMValue(?x) -> NameValue(?x)

    NameFValue(?x) -> NameValue(?x)

    NameUValue(?x) -> NameValue(?x)

    Person-NameM(?x,?y) -> Person-Name(?x,?y)

    Person-NameF(?x,?y) -> Person-Name(?x,?y)

    Person-NameU(?x,?y) -> Person-Name(?x,?y)

    Auxiliary Name Rules

    31


    Sparql query who is husband of christian denman
    SPARQL Query 1668-1849Who is Husband of Christian Denman?

    PREFIX : http://www.deg.byu.edu/ontology/Marriage#

    SELECT ?Husband

    WHERE

    {

    ?X :NameValue "Christian Denman" .

    ?Y :Person-Name ?X .

    ?W :HusbandOf ?Y .

    ?W :Person-Name ?V .

    ?V :NameValue ?Husband

    }

    32


    Query results
    Query Results 1668-1849

    Husband

    =======================================

    "Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string

    33


    Query results1
    Query Results 1668-1849

    Husband

    =======================================

    "Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string

    South Petherton Marriages

    same day 1576 Nicholas Patch and Christian Denman

    26 Jan 1605 Richard Patch and Joan Lavor

    25-Sep 1613 John Elliott and Joan Woodbery

    7-Aug 1615 Thomas Prime and Maria Parry

    29-Jan 1616 William Woodbery and Elizabeth Patch

    2-May 1620 William Hillerd and Fortu: Patch

    17-Sep 1622 Nicholas Patch and Elizabeth Owsley

    22-Jan 1627 Richard Patch and Mary White

    15-Jan 1630 Andrew Elliott and Joan Patch

    12-Feb 1639 Andrew Elliott and Joan Pitts

    “Nicholas Patch” because:

    NameValue(“Nicholas Patch”) and Name-NameValue(n1, “Nicholas Patch”)

    and Name(n1) is NameM(n1) and Person-NameM(p1, n1)

    NameValue(“Christian Denman”) and Name-NameValue(n2, “Christian Denman”)

    and Name(n2) is NameU(n2) and Person-NameU(p2, n2)

    Husband(p1) because:

    Person-NameM(p1, n1)

    Wife(p2) because:

    Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1)

    and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1)

    HusbandOf(p1, p2) because:

    Husband(p1) and Wife(p1) and MarriageRecord-Person(r1, p1)

    and MarriageRecord-Person(r1, p1)

    34


    5 experimental results
    5. Experimental Results 1668-1849

    • Extraction Results

    • American Extraction Problem

    • Rule Results

    35


    Extraction results
    Extraction Results 1668-1849

    36


    American difficulty
    American Difficulty 1668-1849

    BIRTH

    WOODBURY, Charles Henry [Charles William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845.

    • Extra information inside brackets & parentheses

      • Charles William – twin of Charles Henry

      • Henry [housewright – identified as NAME

      • Henrietta (Galloup) –identified as NAME

    37


    Rules results
    Rules Results 1668-1849

    • 100% Precision and Recall

      (Once rules are well-defined, the results are perfect.)

    • Database Size

      (The RDF database is much larger when rule triples are added.)

      • NEW PROPERTIES – husband, wife, parent, child

      • NEW LINKS

    38



    6 conclusions
    6. Conclusions 1668-1849

    • Speed up data indexing

    • Make production of a full index easier

    • Ground the index in original documents

    • Provide for inferred facts

    • Simplify as well as augment record search

    • Help link records and form family groups and ancestral lines

    40


    ad