Aug. 14, 2012 2012 IASLOD

Linking Korean Resources to LOD: Issues in Localization Mun Y. Yi Aug. 14, 2012 2012 IASLOD

Agenda • Project Scope • System Architecture • Silk in Action • Korean Traditional Knowledge Data • Localization Issues

LOD2 Work Packages • The project is structured into twelve consecutively numbered work packages (WPs). • WP1 to WP6 are concerned with the development of the LOD2 Stack. • WP7 to WP9 validate and demonstrate the developed technology on a carefully selected, representative set of demonstrator applications with potentially great impact. • WP10 (SWC) is devoted to training, awareness, and dissemination. • WP11 is concerned with exploitation and standardization activities, as well as technical coordination with other projects. • WP12 covers high-level project coordination, reporting to the EC, resolution of IPR issues, and maintenance of the Consortium Agreement. [Figure: LOD2 Work Packages]

Simplified LOD2 Stack High-Level Architecture • The main result of LOD2 will be the LOD2 Stack, an integrated distribution of aligned tools that support the whole life cycle of Linked Data, from creation through enrichment, interlinking, and fusing to maintenance. [Figure: LOD2 Stack Architecture]

Project Scope: Tasks & Deliverables • In Task 4.1, a semi-automatic machine learning technique will be developed and implemented to simplify the creation of mappings between knowledge bases and the assessment of their quality. • KAIST will contribute to this task by providing a platform for automatic linking with Korean, Chinese, and Japanese RDF resources.

Project Scope: Tasks & Deliverables (Cont’d) • In Task 4.5, methods for fusing data about a single concept from multiple sources will be devised and implemented. • KAIST will work on the fusion of multilingual DBpedia datasets, so that the same issues can be addressed for other multilingual resources.

Phased Approaches • The project was carried out in three iterative cycles. • Each cycle focuses on specific tasks, and lessons learned are carried over into the next cycle. • In the 1st cycle, preliminary RDF data was generated. In the 2nd cycle, we localized Silk to support Korean resource linking. The 3rd cycle focuses on enhancing data quality. • 1st Cycle (~Feb. 2012) • Understanding of the Task Domain • Semantic Web • LOD2 Concept • Software Architecture • Data Model (Relational2RDF) • Pilot Project • Korean Traditional Recipe data • 2nd Cycle (~July 2012) • Implementation of Korean Resource Linking Assistant • Silk Localization • Linking with Silk Framework • Internal publication • 3rd Cycle (~Aug. 2012) • Quality Enhancement • Linking Quality • Publish to the LOD2 cloud

Silk in Action • URL: http://lod.kaist.ac.kr/silk-workbench/ • A file or a SPARQL endpoint can serve as a source or a target. • Define a project • Define a source & a target • Define a task • Define an output • And then click Open

Silk in Action (Cont’d) • Multiple operators can be combined for complex tasks. • Outputs can be displayed or written to a file. • Interim results can be exported as the final result or used as training data for machine learning. • The learned algorithm can then be used to generate the final links. • Define a source & a target from Property Paths • Define operator(s) • Click GenerateLinks • Click Start

Korean Traditional Knowledge Portal

Korean Traditional Knowledge • Data includes • Food (3,236 records) • Food name • Food type • Recipe, ingredients • Cooking process (images) • Medicine, sickness, and treatment (38,121 records) • Agriculture (2,775 units) • Life (4,438 units)

System Architecture • Proprietary RDFgen for transforming the relational model to the RDF model • Silk for link generation • Virtuoso triple store for serving RDF [Figure: three-stage pipeline. Transformation: source data in a relational DB is converted by RDFgen into an ontology and instances. Link Creation: Silk, extended with new Korean similarity measures, generates RDF links to DBpedia. Publication: Virtuoso triple store.]

Key Linking Issues • Data Preprocessing • Address Encoding: URI vs. IRI • Korean String Similarity Measure • Handling Transliterated Data

Data Preprocessing: Mapping Relation to RDF • Our goal is to open up the recipes of Korean traditional food. • The original data from the relational database was transformed into tables by object-relational mapping. • Related ontologies for recipes: LinkedRecipe.com, www.mindswap.org. • Tool and IngredientPortion are not implemented in this phase.
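
The relational-to-RDF step can be pictured with a small sketch. This is not RDFgen, which is proprietary and not shown in these slides; the table layout, the `http://example.org/ktk/` namespace, the `foodType` property, and the sample row are all hypothetical, chosen only to show one row becoming a few N-Triples statements.

```python
import sqlite3

# Hypothetical namespace; the real dataset's URIs are not given in the slides.
BASE = "http://example.org/ktk/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def literal(value: str, lang: str = "ko") -> str:
    """Format a language-tagged N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def rows_to_ntriples(conn: sqlite3.Connection) -> list[str]:
    """Turn each row of the (hypothetical) food table into N-Triples lines."""
    lines = []
    query = "SELECT id, name, type FROM food ORDER BY id"
    for food_id, name, food_type in conn.execute(query):
        subject = f"<{BASE}food/{food_id}>"
        lines.append(f"{subject} <{RDFS_LABEL}> {literal(name)} .")
        lines.append(f"{subject} <{BASE}ontology/foodType> {literal(food_type)} .")
    return lines


# Illustrative in-memory table with one made-up record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food (id INTEGER PRIMARY KEY, name TEXT, type TEXT)")
conn.execute("INSERT INTO food VALUES (1, '비빔밥', '밥')")
triples = rows_to_ntriples(conn)
print("\n".join(triples))
```

The primary key becomes part of the subject URI and each column becomes one predicate, which is the simplest form of a relational-to-RDF mapping.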

Handling Non-Latin Data • Resources may be described in non-Latin characters. • It is not known in advance whether the tools support non-Latin characters. [Figure: Writing systems of the world today (source: Wikipedia)]

Address Encoding • The URI is a core component of Linked Data. • URIs are used as names for things. • A URI allows only US-ASCII characters in the name of a resource. • W3C recommendation for URIs: UTF-8 character set and URI encoding • Encode the URI as UTF-8 and percent-encode special/non-Latin characters (%XX). • ex) http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0 • But it’s hard to understand what it is… • Another W3C recommendation: IRI (Internationalized Resource Identifier) • ex) http://ko.wikipedia.org/wiki/베를린 • Now we can understand what it means. • But some characters look so similar that the chance of spoofing increases. • ex) Å
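
A minimal Python sketch of the two points above, using only the standard library. The first part percent-encodes the path of the Korean Wikipedia IRI from the slide. The second part illustrates the spoofing concern with one pair of look-alike characters, U+00C5 and U+212B; that particular pair is an assumption for illustration, since the example on the slide did not survive extraction.

```python
import unicodedata
from urllib.parse import quote


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8, leaving ':' and '/' alone.

    Simplification: a full IRI-to-URI conversion would also handle
    internationalized host names, which this sketch ignores.
    """
    return quote(iri, safe=":/")


uri = iri_to_uri("http://ko.wikipedia.org/wiki/베를린")
print(uri)

# Two characters that render almost identically but are different code points.
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"  # ANGSTROM SIGN
print(a_ring == angstrom)              # different code points
print(quote(a_ring), quote(angstrom))  # hence different percent-encodings

# Unicode normalization (NFC) maps the Angstrom sign onto A-with-ring,
# which is one way to detect this kind of look-alike before comparing names.
print(unicodedata.normalize("NFC", angstrom) == a_ring)
```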

Localization: Silk Workbench Address Encoding • Silk Workbench is a GUI for generating links. • Silk Workbench displays encoded URIs ‘as is’, which makes non-Latin datasets hard to read. • Decoding the URIs lets a non-Latin dataset be displayed in its native script, which makes it much easier to work with.
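
The decoding step itself is small. The sketch below is not the Silk Workbench code (Silk is written in Scala and its patch is not shown here); `display_form` is a hypothetical helper that shows the idea in Python: decode for display only, and fall back to the raw URI when the escapes are not valid UTF-8.

```python
from urllib.parse import unquote


def display_form(uri: str) -> str:
    """Return a human-readable form of a percent-encoded URI for display.

    The stored identifier should stay encoded; only the label shown to the
    user is decoded. Invalid UTF-8 escapes are left untouched.
    """
    try:
        return unquote(uri, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return uri


print(display_form("http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0"))
```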

Localization: Korean String Similarity Measures • Two kinds of Korean resources exist: resources in Korean and resources in transliterated Korean. • We need to calculate similarity distances for both. • The Korean alphabet has 14 basic consonants and 10 basic vowels (plus consonant clusters and diphthongs). • Resources in Korean • e.g., ‘비빔밥’ in Korean DBpedia • Most of the resources in Korea • Resources in transliterated Korean • e.g., ‘bibimbap’ in English DBpedia • Most of the resources abroad • Most of the comparators in Silk are based on string comparison • e.g., Levenshtein distance • However, writing systems differ from language to language. • So are comparators designed for the Latin alphabet appropriate for the Korean alphabet? • String Similarity Distance Measures for Korean • KorED • GrpSim • OneDSim2 • KorPhoD (our approach) = (sD - 1) * 3 + min(pD), where sD: syllable distance, pD: phoneme distance
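
To see why a plain string comparator is a poor fit for Hangul, it helps to compare edit distance at the syllable level with edit distance at the phoneme (jamo) level. The sketch below is an illustration of that contrast only; it is not KorED, GrpSim, OneDSim2, or KorPhoD, whose exact definitions are not given in these slides. The decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables.

```python
# Jamo tables in Unicode order: 19 initials, 21 medials, 27 finals (plus "none").
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:
            out.append(CHO[code // 588])
            out.append(JUNG[(code % 588) // 28])
            out.append(JONG[code % 28])
        else:
            out.append(ch)  # leave non-Hangul characters unchanged
    return "".join(out)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


# At the syllable level both pairs differ by one character...
print(levenshtein("디지털", "디지탈"), levenshtein("디지털", "디지김"))
# ...but at the jamo level the near-miss spelling is much closer.
print(levenshtein(to_jamo("디지털"), to_jamo("디지탈")),
      levenshtein(to_jamo("디지털"), to_jamo("디지김")))
```

A syllable-level comparator treats a one-vowel spelling variant and a completely different syllable as equally distant, which is the kind of information a phoneme-aware measure is meant to recover.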

Localization: Korean String Similarity Measures (Cont’d) • Several Korean similarity distances exist that reflect the characteristics of the Korean alphabet. • We devised a new measure based on the distribution of phonemes (KorPhoD). • We implemented a KoreanPhonemeDistance operator in Silk and used it to build links among Korean resources. [Figure: Application of Edit Distance to Korean Resources] [Table: Comparison of Similarity Measures for Korean (sD: syllable distance, pD: phoneme distance)] Performance Comparison • Precision: 1.28% vs. 17.78% (roughly a 14-fold improvement) • F-score: 0.0223 vs. 0.0896 (about four times as effective at finding correct links)
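
For reference, precision and F-score for a set of generated links are computed against a set of known-correct reference links. The sketch below shows the standard definitions; the link pairs are made up for illustration and are unrelated to the figures reported on the slide.

```python
def evaluate(generated: set, reference: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of generated links against reference links."""
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (source, target) link pairs.
generated_links = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference_links = {("a", "x"), ("b", "y"), ("e", "v")}

precision, recall, f1 = evaluate(generated_links, reference_links)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```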

Localization: Transliterated Korean Similarity Measures • Two kinds of transliteration involve Korean: from English to Korean and from Korean to English. • For now, we focus on transliteration from Korean to English to build links for resources in Korean. • The biggest problem is that many different schemes have been used to transliterate Korean into English. • From English to Korean • ‘Digital’ -> ‘디지털’, ‘디지틀’, ‘디지탈’, … • From Korean to English • ‘칼국수’ -> ‘Kalguksu’, ‘Kalguksoo’, ‘Kalgugsoo’, … • Transliteration schemes for Korean • McCune-Reischauer (1937): the official standard in the past (from 1984 to 2000) • Uses breves (˘: indicates a short vowel), apostrophes, and diereses (¨: the vowel is sounded in a separate syllable) • Yale (1942) • Revised Romanization (2000): the current official standard • Generally similar to MR, but uses no diacritics or apostrophes, and uses distinct letters for ㅌ/ㄷ (t/d), ㅋ/ㄱ (k/g), ㅊ/ㅈ (ch/j), ㅍ/ㅂ (p/b), etc. • and probably many more… • We found that many academic and government websites still use MR. • Silk does not have phonetic similarity measures, though (e.g., Soundex).
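
A short sketch of why a phonetic measure helps with romanization variants. The function below is a plain Python implementation of standard American Soundex, written for illustration; it is not part of Silk and is not the KoTlit measure. Soundex was designed for English surnames, so its behavior on romanized Korean is a side effect rather than a design goal.

```python
def soundex(word: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    result = [letters[0]]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result.append(digit)
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return ("".join(result) + "000")[:4]


variants = ["Kalguksu", "Kalguksoo", "Kalgugsoo"]
codes_for_variants = {v: soundex(v) for v in variants}
print(codes_for_variants)
```

All three romanizations of ‘칼국수’ from the slide collapse to the same code, which shows how a phonetic key can match spellings that a character-level comparator would treat as different strings.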

Localization: Transliterated Korean Similarity Measures (Cont’d) • We compare performance from both a string-similarity perspective and a phonetic-similarity perspective. • Levenshtein performs well on precision, and Soundex performs well on recall. • KoTlit performs well on both precision and recall, and we are still optimizing the algorithm. [Figure: Performance Comparison]

Concluding Remarks • Localization issues are important for Asian and other non-Latin-script countries. • Each language needs its own similarity measures, both string similarity and phonetic similarity. • Silk is likely to become a key linking assistant program for LOD. • LOD is a major movement to define the next version of the Internet.

Thank you! • MunYong Yi • KAIST Department of Knowledge Service Engineering • http://kslab.kaist.ac.kr • mail: munyi@kaist.ac.kr
