
Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology

NLP&DBPEDIA 2015 WORKSHOP. Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology. Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi. KIRD, KAIST. Motivation: DBpedia extracts structured information from Wikipedia. Example: the Wikipedia page on Pope Saint Felix III.


Presentation Transcript


  1. NLP&DBPEDIA 2015 WORKSHOP • Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology • Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi • KIRD, KAIST

  2. Motivation • DBpedia • Extracts structured information from Wikipedia • Example: the Wikipedia page on Pope Saint Felix III yields the triples (dbpedia:Pope_Felix_III, dbo:birthPlace, dbpedia:Rome) and (dbpedia:Pope_Felix_III, dbo:deathPlace, dbpedia:Odoacer)

  3. Motivation • Errors in DBpedia • Incorrect data: type, datatype, value • Ambiguity: URI, property • The quality of the data has become important • Example: for dbpedia:Pope_Felix_III, the dbo:birthPlace object dbpedia:Rome has rdf:type dbo:Place, but the dbo:deathPlace object dbpedia:Odoacer has rdf:type dbo:Person (error)

  4. Motivation • Data Quality Assessment • TripleCheckMate[3], LinkQA[6], WIQA[7], DaCura[8] • Based on an ontology built from the target data (e.g. DBpedia) • But • These methods cannot be applied to data that has no ontology • Ontology generation is difficult and time-consuming • Automatic ontology generation works only for English and for limited domains

  5. Introduction • Goal • Quality assessment of linked data without requiring an ontology • Idea • A large portion of the data in a knowledge resource is valid • Analyze the data patterns in the resource and keep the patterns that appear frequently • Evaluate the quality of the data against those patterns

  6. Overview of approach

  7. Quality Assessment Criteria • Data Quality Test Pattern (DQTP) • DQTP = tuple(V, S) • V is a set of typed pattern variables, S is a SPARQL query template • RDF triples (subject, predicate, object) • Domain: all types that the subject can have • Range: all types that the object can have • Literal values must have the datatype determined by the property used
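
A minimal sketch of the DQTP = tuple(V, S) idea in Python, assuming a range check as the example. The class name `DQTP`, the variable names `P1` and `T1`, and the exact SPARQL text are illustrative assumptions, not taken from the slides or from RDFUnit.

```python
from dataclasses import dataclass
from string import Template
from typing import Tuple


@dataclass(frozen=True)
class DQTP:
    """Hypothetical container for a Data Quality Test Pattern (V, S)."""
    variables: Tuple[str, ...]  # V: names of the pattern variables
    template: Template          # S: SPARQL query template with $-placeholders

    def instantiate(self, **bindings: str) -> str:
        """Bind every pattern variable and return a concrete test-case query."""
        missing = set(self.variables) - set(bindings)
        if missing:
            raise ValueError(f"unbound pattern variables: {sorted(missing)}")
        return self.template.substitute(bindings)


# Assumed range pattern: select triples whose object lacks the expected type.
RANGE_PATTERN = DQTP(
    variables=("P1", "T1"),
    template=Template(
        "SELECT ?s ?o WHERE { ?s $P1 ?o . "
        "FILTER NOT EXISTS { ?o rdf:type $T1 } }"
    ),
)

query = RANGE_PATTERN.instantiate(P1="dbo:deathPlace", T1="dbo:Place")
print(query)
```

Binding `P1` and `T1` turns the generic pattern into one concrete test case; the same pattern can be reused for every property and type pair.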

  8. Test Case Pattern Generation Algorithm • Example: Range pattern (dbo:deathPlace) • STEP 1: Check the patterns in the knowledge resource • STEP 2: Compute the appearance ratio of each pattern • STEP 3: Select the top k patterns and compute their ratios • STEP 4: Set the threshold (average of the top k ratios), e.g. average of the top 5 ratios = threshold (17%) • STEP 5: Build the test case pattern
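
The five steps can be sketched roughly as follows. This is an illustration under stated assumptions: each triple contributes a single object type, the type counts are invented toy data, and keeping the types whose ratio reaches the threshold is an assumed reading of STEP 5, since the slide does not spell out how the threshold is applied.

```python
from collections import Counter


def range_pattern_types(object_types, k=5):
    """Return (accepted_types, threshold) for one property's object types."""
    if not object_types:
        raise ValueError("no object types observed for this property")
    counts = Counter(object_types)                       # STEP 1: observed patterns
    total = sum(counts.values())
    ratios = {t: n / total for t, n in counts.items()}   # STEP 2: appearance ratios
    top_k = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:k]  # STEP 3
    threshold = sum(r for _, r in top_k) / len(top_k)    # STEP 4: mean of top-k ratios
    # STEP 5 (assumption): keep the types at or above the threshold.
    accepted = [t for t, r in top_k if r >= threshold]
    return accepted, threshold


# Toy data (invented) for the objects of dbo:deathPlace.
observed = (
    ["dbo:City"] * 50
    + ["dbo:Country"] * 30
    + ["dbo:Person"] * 15
    + ["dbo:Organisation"] * 5
)
accepted, threshold = range_pattern_types(observed, k=5)
print(accepted, round(threshold, 2))
```

With this toy distribution the rare types fall below the threshold, which reflects the idea that frequent patterns are treated as valid and infrequent ones as suspect.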

  9. Evaluation of approach • Test Case Pattern Generation • Compare the patterns generated by the approach with the benchmark patterns • The approach generates patterns without using an ontology • The benchmark generates patterns using an ontology • Quality Assessment Accuracy • Evaluate a localized DBpedia that does not have an ontology

  10. Validation 1) Test Case Pattern Generation • Ground truth • RDFUnit[4] compiled a library of data quality test case patterns for quality assessment • Ontology of English DBpedia • Definition of Test Case Patterns

  11. Validation 1) Test Case Pattern Generation • Data • DBpedia 2015 (dbo, dbp) • Test Case Pattern Generation • The average ratio of the top 5 types is 22% for DQP and 17% for RQP • For TQP, most of the triples have a single data pattern • The approach generates patterns from the triples in DBpedia, whereas RDFUnit uses the ontology

  12. Validation 1) Test Case Pattern Generation • A: pattern generation rate • B: pattern generation accuracy of the approach • Quantities used: total number of patterns generated by the approach, total number of consistent patterns from the approach, total number of patterns in the benchmark • Results (A, B in %): DQP 99.2, 89.4 • RQP 97.8, 80.2 • TQP 99.0, 67.7

  13. Validation 1) Test Case Pattern Generation • In the case of TQP, the generated patterns are equivalent in meaning to those of RDFUnit, but they come from different resources, e.g. rdf:langString, xsd:string

  14. Validation 2) Quality Assessment Accuracy • How to validate the quality assessment accuracy? • The localized versions of DBpedia in 125 languages do not have their own ontologies • Most of the labels in the DBpedia Ontology are English • The approach is able to handle a localized DBpedia and evaluate the quality of its data

  15. Validation 2) Quality Assessment Accuracy • Data • Localized version of DBpedia (Korean DBpedia 2015) • 32 million triples with 18617 different properties • 1070 localized properties that are carried by more than 100 triples • Test Case Pattern Generation • The average ratio of the top 5 types is 18% for DQP and 16% for RQP • For TQP, most of the triples have a single data pattern, covering not only the datatype but also the language tag (e.g. @en)

  16. Validation 2) Quality Assessment Accuracy • Result of Data Quality Assessment • 1438 test case patterns generated from 1070 properties • 1.4 million triples tested from Korean DBpedia

  17. Validation 2) Quality Assessment Accuracy • Gold standard data • 1000 randomly selected triples (95% confidence, 3.5% error) • 2 human evaluators (kappa 0.7207) • Annotated the correct type of the subject and object based on the predicate • Evaluation measures • Precision, recall, and F1-measure • Accuracy
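
For reference, the evaluation measures and the inter-annotator agreement named above can be computed as in the sketch below. The function names, the choice of "triple flagged as erroneous" as the positive class, and all counts and labels are illustrative assumptions; none of the numbers come from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    expected = sum(
        (labels_a.count(lab) / n) * (labels_b.count(lab) / n)
        for lab in set(labels_a) | set(labels_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented counts: positive class = "triple flagged as erroneous".
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
# Invented annotations from two hypothetical evaluators.
kappa = cohen_kappa(["ok", "ok", "err", "err"], ["ok", "err", "err", "err"])
print(metrics, kappa)
```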

  18. Validation 2) Error Analysis • Error Analysis on Korean DBpedia • The error rate over all tested triples is 36.31% (pass: 63.69%) • The most common error is the rdfs:range violation[3,4,18] • The object is literal or string data, not a URI • Object range validation cannot be performed[4]

  19. Validation 2) Error Analysis • Error Analysis on Korean DBpedia • Incorrect datatype setting, e.g. a date must be set as xs:date, but it is set to xs:integer • Incorrect object value, e.g. the object value of prop-ko:활동기간 (= active period) should be a period of time, but only the starting point of the period is given • Property ambiguity, e.g. the object of prop-ko:종목 (event) can have two entirely different types: the name of the event or the number of events

  20. Limitations • Lack of specific domain/range setting • Quality assessment uses only one triple at a time, e.g. dbpedia:Michael_Jackson has dbo:birthDate 1958-08-29 (xsd:date) and dbo:deathDate 1009-06-25 (xsd:date); each triple is well typed on its own, but dbo:birthDate has to be earlier than dbo:deathDate
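
The single-triple limitation can be illustrated with a short sketch: each value passes a per-triple datatype check, and only a constraint spanning both triples exposes the problem. The helper names are hypothetical, and the cross-triple check is an illustration of what such a rule could look like, not part of the presented approach.

```python
from datetime import date


def is_xsd_date(value: str) -> bool:
    """Per-triple check: does the literal parse as a YYYY-MM-DD date?"""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def birth_before_death(birth: str, death: str) -> bool:
    """Cross-triple check (hypothetical): birth date must precede death date."""
    return date.fromisoformat(birth) < date.fromisoformat(death)


# Values from the slide's example for dbpedia:Michael_Jackson.
birth_date = "1958-08-29"
death_date = "1009-06-25"

single_triple_ok = is_xsd_date(birth_date) and is_xsd_date(death_date)
cross_triple_ok = birth_before_death(birth_date, death_date)
print(single_triple_ok, cross_triple_ok)  # True False
```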

  21. Conclusion • The approach semi-automatically generates patterns from a knowledge resource • Patterns are instantiated into test cases to measure the quality of the data • More than 97% of the patterns are generated by the approach • This work opens a new possibility of conducting quality assessment without requiring an ontology • It can be applied to any language and any domain

  22. Ongoing work • Utilizing external resources, e.g. WordNet, thesauri • Pattern expansion • Creating a complete validation system for determining trustworthiness

  23. Questions?

  24. Reference • Linked data quality assessment [2] Quality assessment methodologies for linked open data. Zaveri, A. et al. Submitted to Semantic Web Journal (2013) [5] Weaving the pedantic web. Hogan, A. et al. (2010) [6] Assessing linked data mappings using network measures. Guéret et al. In The Semantic Web: Research and Applications (pp. 87-102). Springer Berlin Heidelberg (2012) [8] Improving curated web-data quality with structured harvesting and assessment. Feeney et al. International Journal on Semantic Web and Information Systems (IJSWIS), 10(2), 35-62 (2014) [16] SWIQA - a semantic web information quality assessment framework. Fürber et al. In ECIS (Vol. 15, p. 19) (2011) [17] Using semantic web resources for data quality management. Fürber et al. In Knowledge Engineering and Management by the Masses (pp. 211-225). Springer Berlin Heidelberg (2010)

  25. Reference • Data Quality Assessment of DBpedia [3] User-driven quality evaluation of DBpedia. Zaveri, A. et al. In Proceedings of the 9th International Conference on Semantic Systems (pp. 97-104). ACM (2013) [4] Test-driven evaluation of linked data quality. Kontokostas et al. In Proceedings of the 23rd International Conference on World Wide Web (pp. 747-758). ACM (2014) [18] Crowdsourcing linked data quality assessment. Acosta et al. In The Semantic Web - ISWC 2013 (pp. 260-276). Springer Berlin Heidelberg (2013) [19] Detecting incorrect numerical data in DBpedia. Wienand et al. In The Semantic Web: Trends and Challenges (pp. 504-518). Springer International Publishing (2014) [20] DL-Learner: learning concepts in description logics. Lehmann, J. The Journal of Machine Learning Research, 10, 2639-2642 (2009) • Automatic Ontology generation [13] Automatic ontology generation using schema information. Sie et al. In Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on (pp. 526-531). IEEE (2006) [14] Text2Onto. Cimiano et al. In Natural Language Processing and Information Systems (pp. 227-238). Springer Berlin Heidelberg (2005) [21] Automatic generation of OWL ontology from XML data source. Yahia et al. arXiv preprint arXiv:1206.0570 (2012) [24] A robust approach to aligning heterogeneous lexical resources. Pilehvar et al. AP A 1 (2014): c2.
