Semiautomatic domain model building from text data

Petr Šaloun, Petr Klimánek, Zdenek Velart

SMAP 2011, Vigo, Spain, December 1-2, 2011


Introduction and goals

  • The basic tasks in creating a domain model:

    • selection of domain and scope

    • consideration of reusability

    • finding important terms

    • defining classes and class hierarchy

    • defining properties of classes and constraints

    • creation of instances of classes

  • Goals

    • designing a method for semiautomatic domain model creation

    • support for different input documents

    • support for different languages

    • design and implementation of a tool


State of the art

  • algorithms and tasks that work with a domain model

  • different document formats

  • different languages

  • domain model

    • concepts, relations

    • domain model creation is time-consuming

      • manual creation

      • automatic creation

      • semiautomatic creation


Tools and methods

  • natural language processing – NLP

    • Stanford NLP

      • Stanford Parser

      • Stanford POS tagger

      • Stanford Named Entity Recognizer

  • multi-language environment – Google Translate

  • WordNet (synsets)

  • tool implementation – Java, Swing, XML, JTidy, JAWS, SNLP, JUNG


Processing of text documents

<html><body><p>An integer character constant has type int.</p></body></html>

An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.


Processing of text documents - extraction, cleaning, translation

  • input formats: TXT, HTML, PDF

  • removal of special characters using regular expressions

    • numeric designations of chapters and references

    • single-letter prepositions (the pattern leaves the article "a" in place)

      (\\s+[^Aa\\s\\.]{1})+\\s+

    • parentheses, dashes, and others

  • translation into English – the tools work only with English text

    • Google Translate
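
The cleaning steps above can be sketched in Python. Only the single-letter pattern comes from the slide (rewritten as a raw string, without the Java string escaping). The patterns for chapter numbers, references, and punctuation, and the function name clean, are assumptions made for illustration and are not the authors' actual expressions.

```python
import re

# Pattern from the slide: one or more single-character tokens (anything except
# "A", "a", whitespace, or a dot), each preceded by whitespace, then whitespace.
SINGLE_LETTER = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")

# Assumed patterns (not from the slides) for the remaining cleaning steps.
CHAPTER_NUMBER = re.compile(r"^\s*\d+(\.\d+)*\.?\s+", re.MULTILINE)  # "6.4.4 "
REFERENCE = re.compile(r"\[\d+\]")                                   # "[3]"
PUNCTUATION = re.compile(r"[()\[\]{}\u2013\u2014]|\s-\s")            # brackets, dashes


def clean(text: str) -> str:
    """Apply the cleaning steps in order and normalise whitespace."""
    text = CHAPTER_NUMBER.sub("", text)
    text = REFERENCE.sub("", text)
    text = PUNCTUATION.sub(" ", text)
    text = SINGLE_LETTER.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean("6.4.4 Constants [3] are values v of a type (see below)"))
```

The article "a" survives because the character class in the slide's pattern excludes "A" and "a". Translation is left out of the sketch because it depends on an external service.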


Processing of text documents - annotation

  • Stanford CoreNLP

    • Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer

    • machine learning over large data sets, maximum-entropy statistical models

    • trained models are included

  • Activities

    • tokenization

    • sentence splitting

    • POS (part-of-speech) tagging

    • lemmatization

    • NER - Named Entity Recognition


Example

<html><body><p>An integer character constant has type int.</p></body></html>

An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
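
The tagged line uses the token/TAG notation with Penn Treebank tags (DT determiner, NN noun, VBZ verb). A minimal sketch of how such output could be read back into (token, tag) pairs follows. The helper name parse_tagged is hypothetical and not part of any Stanford API.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'token/TAG' items into (token, tag) pairs.

    rpartition is used so that a token containing '/' keeps its own slashes
    and only the last '/' separates the tag.
    """
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
pairs = parse_tagged(tagged)

# Noun tags in the Penn Treebank tag set all start with "NN".
nouns = [token for token, tag in pairs if tag.startswith("NN")]
print(nouns)
```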


Mining concepts

  • tokens marked by the POS tagger as nouns are the first concept candidates

  • one-word or multi-word nouns

  • a token is confirmed as a concept by disambiguation against WordNet

    • assigning a synset – automatic or manual

    • using the domain term for searching

    • an incorrect synset (one with a different meaning) may be selected
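
The first step, collecting noun candidates, can be sketched as follows. The grouping rule (every noun in a run counts on its own, and the whole run counts once as a multi-word candidate) and the function name noun_candidates are assumptions made for illustration. The slides do not say how multi-word candidates are assembled or ranked, and the WordNet synset assignment is not covered here.

```python
from collections import Counter
from typing import Iterable, Tuple


def noun_candidates(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count one-word and multi-word noun candidates in tagged tokens."""
    candidates = Counter()
    run = []
    # The empty sentinel pair flushes a noun run that ends the input.
    for token, tag in list(pairs) + [("", "")]:
        if tag.startswith("NN"):
            run.append(token.lower())
            continue
        if run:
            for word in run:
                candidates[word] += 1
            if len(run) > 1:
                candidates[" ".join(run)] += 1
            run = []
    return candidates


pairs = [("An", "DT"), ("integer", "NN"), ("character", "NN"),
         ("constant", "NN"), ("has", "VBZ"), ("type", "NN"),
         ("int", "NN"), (".", ".")]
for term, count in noun_candidates(pairs).most_common():
    print(count, term)
```

A naive rule like this also proposes runs such as "type int", which is one reason the candidates still need to be checked against WordNet or by hand.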


Mining relations

  • unoriented / oriented

  • unnamed / named

  • WordNet – concept must have synset

    • hypernyms and hyponyms – IsA relations

    • holonyms and meronyms – partOf relations

    • relation orientation based on concept order

  • only direct relations

  • from text

    • lexical-syntactic patterns

    • decomposition of multi-word terms – the right part of the term corresponds to an existing concept

      assignment expression

      assignment expression IsA expression

    • sentence syntax analysis – the parser's amod dependency (adjectival modifier), i.e. an adjective followed by a noun

      integral type IsA type
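
The decomposition rule for multi-word terms can be sketched as follows. It covers only the decomposition pattern, not the WordNet lookup or the amod analysis. The function name isa_by_decomposition and the sample concept list are hypothetical; trying the longest suffix first is an assumption chosen so that only the direct parent is proposed, in line with the "only direct relations" point.

```python
from typing import Iterable, List, Tuple


def isa_by_decomposition(concepts: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Propose 'term IsA head' when the right part of a term is a known concept."""
    known = set(concepts)
    relations = []
    for term in sorted(known):
        words = term.split()
        # Longest proper suffix first, so the closest existing concept wins.
        for start in range(1, len(words)):
            head = " ".join(words[start:])
            if head in known:
                relations.append((term, "IsA", head))
                break
    return relations


concepts = ["assignment expression", "expression", "constant",
            "character constant", "integer character constant"]
for relation in isa_by_decomposition(concepts):
    print(*relation)
```

With this input, "integer character constant" is attached to "character constant" rather than directly to "constant", because the longer suffix is already a concept.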


Tool


Experiment

  • ANSI/ISO C language

  • comparison with an existing, manually created ontology

  • 2 experiments

    • all concept candidates

    • only the first 200 candidates

    • 3 variants of the experiment

      • only candidates

      • candidates and IsA proposals

      • candidates and IsA proposals and NER entities
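
One possible way to compare mined candidates with the concepts of a reference ontology is a set overlap with precision and recall. The slides do not state which measures the authors used, so the metric, the function name compare, and the sample terms below are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Set, Tuple


def compare(candidates: Iterable[str], reference: Iterable[str],
            limit: Optional[int] = None) -> Tuple[Set[str], float, float]:
    """Return (matched terms, precision, recall) for case-insensitive matches.

    limit keeps only the first N candidates, e.g. 200.
    """
    ranked = list(candidates)
    if limit is not None:
        ranked = ranked[:limit]
    cand = {term.lower() for term in ranked}
    ref = {term.lower() for term in reference}
    hits = cand & ref
    precision = len(hits) / len(cand) if cand else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return hits, precision, recall


# Hypothetical data, not taken from the experiment.
candidates = ["type", "expression", "int", "page"]
reference = ["Type", "Expression", "Declaration"]
hits, precision, recall = compare(candidates, reference, limit=200)
print(sorted(hits), precision, recall)
```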


First 30 candidates


Experiment


Experiment

  • variant of the experiment without IsA relations, only with NER entities


Conclusions and further work

  • concepts => lightweight ontology

    • enables better automatic relation mining


Contacts

Petr Šaloun

FEECS, VSB–Technical University of Ostrava

petr.saloun@vsb.cz

Petr Klimánek

(was: Faculty of Science, University of Ostrava)

p.klimanek@gmail.com

Zdenek Velart

FEECS, VSB–Technical University of Ostrava

zdenek.velart@gmail.com

