Linking Korean Resources to LOD: Issues in Localization

Mun Y. Yi
Aug. 14, 2012, 2012 IASLOD

Presentation Transcript
slide2

Agenda

  • Project Scope
  • System Architecture
  • Silk in Action
  • Korean Traditional Knowledge Data
  • Localization Issues
slide3

LOD2 Work Packages

  • The project is structured into twelve consecutively numbered work packages (WPs). WP1 to WP6 cover development of the LOD2 Stack. WP7 to WP9 validate and demonstrate the developed technology on a carefully selected, representative set of demonstrator applications with potentially great impact. WP10 (SWC) is devoted to training, awareness, and dissemination. WP11 covers exploitation and standardization activities, as well as technical coordination with other projects. WP12 handles high-level project coordination, reporting to the EC, and activities related to resolving IPR issues and maintaining the Consortium Agreement.

LOD2 Work Package

slide4

Simplified LOD2 Stack High-Level Architecture

  • The main result of LOD2 will be the LOD2 Stack, an integrated distribution of aligned tools that support the whole life cycle of Linked Data, from creation through enrichment, interlinking, and fusing to maintenance.

LOD2 Stack Architecture

slide5

Project Scope: Tasks & Deliverables

  • In Task 4.1, a semi-automatic machine learning technique will be developed and implemented to simplify the creation of mappings between knowledge bases and the assessment of their quality.
  • KAIST will contribute to this task by providing a platform for automatic linking with Korean, Chinese, and Japanese RDF resources.
slide6

Project Scope: Tasks & Deliverables (Cont’d)

  • In Task 4.5, methods for fusing data about a single concept from multiple sources will be devised and implemented.
  • KAIST will work on the fusion of multilingual DBpedia datasets, thereby addressing issues that also affect other multilingual resources.
slide7

Phased Approaches

  • The project is carried out in three iterative cycles.
  • Each cycle focuses on specific tasks, and lessons learned carry over into the next cycle.
  • In the first cycle, preliminary RDF data was generated. In the second cycle, we localized Silk to support Korean resource linking. The last cycle focuses on enhancing data quality.
  • 1st Cycle (~Feb. 2012)
    • Understanding of the Task Domain
    • Semantic Web
    • LOD2 Concept
    • Software Architecture
    • Data Model (Relational2RDF)
    • Pilot Project
    • Korean Traditional Recipe data
  • 2nd Cycle (~July 2012)
    • Implementation of Korean Resource Linking Assistant
    • Silk Localization
    • Linking with Silk Framework
    • Internal publication
  • 3rd Cycle (~Aug. 2012)
    • Quality Enhancement
    • Linking Quality
    • Publish to the LOD2 cloud
slide8

Silk in Action

  • URL: http://lod.kaist.ac.kr/silk-workbench/
  • A file or a SPARQL endpoint can serve as a source or a target.
  • Define a project
  • Define a source & a target
  • Define a task
  • Define an output
  • And then click Open
slide9

Silk in Action (Cont’d)

  • Multiple operators can be used for complex tasks.
  • Outputs can be displayed or written into a file.
  • Interim results can be exported as final results or used as training data for machine learning.
  • The learned algorithm can then be used to generate the final links.
  • Define a source & a target from Property Paths
  • Define operator(s)
  • Click GenerateLinks
  • Click Start
slide11

Korean Traditional Knowledge

  • Data includes
    • Food (3,236 records)
      • Food name
      • Food type
      • Recipe, ingredients
      • Cooking process (images)
    • Medicine, sickness, and treatment (38,121 records)
    • Agriculture (2,775 units)
    • Life (4,438 units)
slide12

System Architecture

  • Proprietary RDFgen for transforming relational model to RDF model
  • Silk for link generation
  • Virtuoso triple store for serving RDF

[Figure: system architecture in three stages. Transformation: source data in a relational DB, RDFgen, ontology and instances. Link creation: Silk with new Korean similarity measures, RDF links, DBpedia. Publication: Virtuoso triple store.]


slide13

Key Linking Issues

  • Data Preprocessing
  • Address Encoding: URI vs. IRI
  • Korean String Similarity Measure
  • Handling Transliterated Data
slide14

Data Preprocessing : Mapping Relation to RDF

  • Our goal is to make Korean traditional food recipes openly available.
  • The original data from the relational database was transformed into tables by object-relational mapping.
  • Related recipe ontologies: LinkedRecipe.com, www.mindswap.org.
  • Tool and IngredientPortion are not implemented in this phase.
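
The relational-to-RDF step can be illustrated with a minimal Python sketch. RDFgen itself is proprietary and is not shown in these slides, so this is not its implementation. The `example.org` namespaces, the column names (`id`, `name`, `food_type`, `ingredients`), and the `foodType` and `ingredient` properties are all assumptions chosen for illustration.

```python
from urllib.parse import quote

# Hypothetical namespaces; the real dataset's URIs are not given in the slides.
BASE = "http://example.org/recipe/"
VOCAB = "http://example.org/vocab#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def literal(value, lang=None):
    """Escape a value as an N-Triples literal, optionally language-tagged."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def row_to_triples(row):
    """Yield N-Triples lines for one row of a (hypothetical) food table."""
    subject = f"<{BASE}{quote(str(row['id']))}>"
    yield f"{subject} <{RDFS_LABEL}> {literal(row['name'], 'ko')} ."
    yield f"{subject} <{VOCAB}foodType> {literal(row['food_type'], 'ko')} ."
    for ingredient in row["ingredients"]:
        yield f"{subject} <{VOCAB}ingredient> {literal(ingredient, 'ko')} ."


if __name__ == "__main__":
    # Made-up sample row, not taken from the actual data.
    sample = {"id": 1, "name": "비빔밥", "food_type": "밥",
              "ingredients": ["쌀", "고추장"]}
    for line in row_to_triples(sample):
        print(line)
```

Each row becomes one subject URI, and each column becomes one property, which is the simplest form of table-to-RDF mapping.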
slide15

Handling Non-Latin Data

  • Resources may be described in non-Latin characters.
  • It is not known whether the tools support non-Latin characters.

Writing Systems of the world today - Wikipedia

slide16

Address Encoding

  • The URI is a core component of linked data.
  • URIs are used as names for things.
  • A URI allows only US-ASCII characters in resource names.
  • W3C recommendation for URIs: UTF-8 character set and URI encoding
    • Use the UTF-8 character set for URIs, and percent-encode special and non-Latin characters.
    • ex) http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0
    • But it’s hard to understand what it is…
  • Another W3C recommendation: IRI (Internationalized Resource Identifier)
    • ex) http://ko.wikipedia.org/wiki/베를린
    • Now we can understand what it means.
    • But some characters look so similar that the chance of spoofing increases. (ex) Å)
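
The encoding trade-off above can be sketched in Python with the standard library. The helper name `iri_to_uri` is hypothetical, and the sketch only percent-encodes the non-ASCII characters (a non-ASCII host name would need separate handling). The look-alike pair used here, U+212B (ANGSTROM SIGN) and U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE), is an assumed example of the spoofing risk, since the slide's own example is incomplete.

```python
import unicodedata
from urllib.parse import quote


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8 bytes.

    URI delimiters and existing '%' escapes are left untouched.
    """
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%")


iri = "http://ko.wikipedia.org/wiki/베를린"
uri = iri_to_uri(iri)
print(uri)

# Two visually identical characters with different code points.
angstrom_sign = "\u212b"
a_with_ring = "\u00c5"
print(angstrom_sign == a_with_ring)  # different code points
# NFC normalization maps both to the same code point.
print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)
```

Normalization removes only canonically equivalent look-alikes. Characters from different scripts that merely look alike need a separate confusable check.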
slide17

Localization: Silk Workbench Address Encoding

  • Silk Workbench is a GUI for generating links.
  • Silk Workbench displays encoded URIs ‘as is’, which makes non-Latin datasets hard to read.
  • Decoding the URIs lets a non-Latin dataset be displayed in its native script, which makes it much easier to work with.
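
The decoding step for display can be sketched as follows. This is an illustration in Python and not the actual Silk Workbench change. The function name `display_form` is hypothetical, and falling back to the raw URI on invalid input is a design choice assumed here.

```python
from urllib.parse import unquote


def display_form(uri):
    """Return a human-readable form of a percent-encoded URI.

    If the escapes are not valid UTF-8, the URI is returned unchanged
    so that the display never shows replacement characters.
    """
    try:
        return unquote(uri, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return uri


print(display_form("http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0"))
```

Only the displayed label is decoded. The encoded URI should still be used as the identifier when links are written out.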
slide18

Localization: Korean String Similarity Measures

  • Two kinds of Korean resources exist: resources in Korean and resources in transliterated Korean.
  • We need to calculate similarity distances for both.
  • The Korean alphabet has 14 consonants and 10 vowels (together with consonant clusters and diphthongs).
  • For resources in Korean
    • e.g., ‘비빔밥’ in the Korean DBpedia
    • Most of the resources in Korea
  • For resources in transliterated Korean
    • e.g., ‘bibimbap’ in the English DBpedia
    • Most of the resources abroad
  • Most of the comparators in Silk are based on string comparison
    • e.g., Levenshtein distance
    • However, writing systems differ from language to language.
    • So are comparators designed for the Latin alphabet appropriate for the Korean alphabet?
  • String Similarity Distance Measures for Korean
    • KorED
    • GrpSim
    • OneDSim2
    • KorPhoD (our approach) = (sD-1)*3 + min(pD), sD: syllable distance, pD: phoneme distance
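
To show why the level of comparison matters for Hangul, here is a minimal Python sketch that computes Levenshtein distance once over whole syllables and once over phonemes (jamo). It is not an implementation of KorPhoD, KorED, GrpSim, or OneDSim2, whose details are not given in the slides. Decomposing syllables with Unicode NFD normalization is an assumption made for brevity, and the function names are hypothetical.

```python
import unicodedata


def levenshtein(a, b):
    """Classic edit distance between two sequences of characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def to_jamo(text):
    """Decompose Hangul syllables into conjoining jamo via NFD."""
    return unicodedata.normalize("NFD", text)


def syllable_distance(a, b):
    """Edit distance where each precomposed syllable is one symbol."""
    return levenshtein(unicodedata.normalize("NFC", a),
                       unicodedata.normalize("NFC", b))


def jamo_distance(a, b):
    """Edit distance where each jamo (phoneme) is one symbol."""
    return levenshtein(to_jamo(a), to_jamo(b))


for left, right in [("밥", "밤"), ("밥", "국")]:
    print(left, right, syllable_distance(left, right), jamo_distance(left, right))
```

At the syllable level both pairs look equally different, while the jamo level separates a pair that differs in one phoneme from a pair that differs in all three. That finer distinction is the kind of information a Korean-specific measure can exploit.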
slide19

Localization: Korean String Similarity Measures (Cont’d)

  • Several Korean similarity distances exist that reflect the characteristics of the Korean alphabet.
  • We devised a new measure based on the distribution of phonemes (KorPhoD).
  • We implemented a KoreanPhonemeDistance operator in Silk and used it to build links among Korean resources.

Application of Edit Distance to Korean Resources

Comparison of Similarity Measures for Korean

sD: syllable distance, pD: phoneme distance

Performance Comparison

  • Precision: 1.28% vs. 17.78% (an improvement of about thirteen times)
  • F-score: 0.0223 vs. 0.0896 (about four times more effective at finding correct links)
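
For reference, precision, recall, and F-score for a set of generated links can be computed as in the Python sketch below. The slides do not say which F-score variant was used, so the balanced F1 is an assumption here. The function name and the toy link pairs are hypothetical and do not come from the evaluation data.

```python
def evaluate_links(generated, reference):
    """Return (precision, recall, f1) for generated links vs. a reference set."""
    generated, reference = set(generated), set(reference)
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: four generated links, two reference links, one in common.
generated = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference = {("a", "x"), ("b", "q")}
print(evaluate_links(generated, reference))
```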
slide20

Localization: Transliterated Korean Similarity Measures

  • Two kinds of transliteration involve Korean: from English to Korean and from Korean to English.
  • For now, we focus on transliteration from Korean to English to build links for resources in Korean.
  • The biggest problem is that many different schemes for transliterating Korean into English have been used.
  • From English to Korean
    • ‘Digital’ -> ‘디지털’, ‘디지틀’, ‘디지탈’, …
  • From Korean to English
    • ‘칼국수’ -> ‘Kalguksu’, ‘Kalguksoo’, ‘Kalgugsoo’, …
  • Transliteration schemes for Korean
    • McCune-Reischauer (1937): former official standard (from 1984 to 2000)
      • Uses breves (˘: indicates a short vowel), apostrophes, and diereses (¨: indicates that a vowel is sounded in a separate syllable)
    • Yale (1942)
    • Revised Romanization (2000): current official standard
      • Generally similar to MR, but uses no diacritics or apostrophes, and uses distinct letters for ㅌ/ㄷ (t/d), ㅋ/ㄱ (k/g), ㅊ/ㅈ (ch/j), ㅍ/ㅂ (p/b), etc.
    • and probably many more…
    • We found that many academic and government websites still mostly use MR.
  • However, Silk has no phonetic similarity measures
    • e.g., Soundex
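
As an illustration of what a phonetic measure offers for transliterated names, here is a minimal Python sketch of the standard American Soundex code. It is not part of Silk and is not the KoTlit measure. It shows only that the spelling variants of ‘칼국수’ listed above collapse to one code.

```python
# Soundex digit for each consonant; vowels, H, W, and Y have no digit.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return the four-character American Soundex code of an ASCII word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate consonants with equal codes
        code = _CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]


for name in ["Kalguksu", "Kalguksoo", "Kalgugsoo"]:
    print(name, soundex(name))
```

Because all three variants share one code, a Soundex-style comparator treats them as matches. This is consistent with a measure that favors recall over precision.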
slide21

Localization: Transliterated Korean Similarity Measures (Cont’d)

  • We compare performance from both a string-similarity perspective and a phonetic-similarity perspective.
  • Levenshtein performs well on precision, and Soundex performs well on recall.
  • KoTlit performs well on both precision and recall, and we are still optimizing the algorithm.

Performance Comparison

slide22

Concluding Remarks

  • Localization issues are important for Asian and other countries with non-Latin scripts.
  • Each language needs its own similarity measures, for both string similarity and phonetic similarity.
  • Silk is likely to become a key linking assistant program for LOD.
  • LOD is a major movement to define the next version of the Internet.
slide23

Thank you!

  • MunYong Yi
  • Department of Knowledge Service Engineering, KAIST
  • http://kslab.kaist.ac.kr
  • mail: munyi@kaist.ac.kr