
Treatment of Semantic Heterogeneity ...

Treatment of Semantic Heterogeneity. ... using Meta-Data Extraction and Query Translation. Robert Strötgen Social Science Information Centre, Bonn euroCRIS 2002, 29th August 2002. Outline. What is semantic heterogeneity? Meta-Data extraction Semantic relations Query translation Outlook.

Presentation Transcript


  1. Treatment of Semantic Heterogeneity using Meta-Data Extraction and Query Translation • Robert Strötgen, Social Science Information Centre, Bonn • euroCRIS 2002, 29th August 2002

  2. Outline • What is semantic heterogeneity? • Meta-Data extraction • Semantic relations • Query translation • Outlook

  3. Project CARMEN • Metadata (Dublin Core Element Set in RDF, “Meta-Maker”, digital signatures) • Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents) • Methods for treating the semantic heterogeneity that remains in CARMEN

  4. Semantic Heterogeneity • Technical heterogeneity (different platforms, databases, formats) is not the concern of CARMEN • Semantic heterogeneity appears when different data collections use • different thesauri or classifications for content description • varying metadata or no metadata at all • or when intellectually indexed documents meet completely un-indexed Internet pages

  5. Material: Social Sciences • SOLIS/FORIS vs. Internet documents from social sciences • specialized documentation databases with high-quality content description like abstract, controlled keywords and classification • Internet documents in the majority of cases without any metadata, high semantic and formal heterogeneity

  6. Extraction of Meta-Data

  7. Meta-Data in Test Corpus • Size: 3,661 documents • File format: only HTML documents • TITLE: • Correct title tags: 96 % • Title, but incorrectly coded: 17.7 % of the rest • KEYWORD: • Correct keyword tags: 25.5 % • ABSTRACT: • Correct description tags: 21 % • Abstract, but incorrectly coded: 39.4 % of the rest

  8. Extraction from HTML files - Some Problems • Missing or irregular use of meta tags (author, keywords, DC tags) • Inconsistent use of semantic HTML tags (title, h1, h2, address etc.) • Irregular formatting style for context information (type size, type style, horizontal orientation etc.) • Missing context information (date, author, institution etc.) • HTML that does not conform to the specification!

  9. Converting HTML → XML • Advantages: • (syntactical) homogenisation of HTML files • XML allows the use of many existing tools for document analysis, particularly the query language XPath. • Disadvantage: • Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
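The conversion step can be illustrated with a minimal sketch. It assumes a deliberately simplified converter built on Python's standard library; the class `HtmlToXml`, the helper `html_to_xml`, the `document` wrapper element and the sample page are hypothetical and are not CARMEN's actual tooling. The point is only that sloppy HTML, once forced into a well-formed tree, can be queried with XPath-style expressions.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class HtmlToXml(HTMLParser):
    """Builds a well-formed element tree from sloppy HTML (simplified)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the matching open tag; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text that follows a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_xml(html: str) -> ET.Element:
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


# Made-up page with mixed-case tags and unclosed <h1>/<p> elements.
page = (
    '<html><head><TITLE>Migration Studies</title>'
    '<meta name="keywords" content="migration, labour"></head>'
    '<body><h1>Migration Studies<p>First paragraph<p>Second</body></html>'
)
root = html_to_xml(page)
print(root.findtext(".//title"))
print(root.find(".//meta[@name='keywords']").get("content"))
print(ET.tostring(root, encoding="unicode"))
```

ElementTree supports only a subset of XPath; a full XPath engine would be needed for expressions beyond simple paths and attribute predicates.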

  10. HTML Heuristic: Title (part)
      /* "contains" / "does not contain" always means a case-insensitive substring search */
      if (<title> exists && <title> does not contain "untitled" && HMAX exists) {
          if (<title> == HMAX)              { <1> Title[1]   = <title> }
          elsif (<title> contains HMAX)     { <2> Title[0,8] = <title> }
          elsif (HMAX contains <title>)     { <3> Title[0,8] = HMAX }
          else                              { <4> Title[0,8] = <title> + HMAX }
      }
      elsif (<title> exists && S exists)    { <5> Title[0,5] = <title> + S }
          /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
      elsif (<title> exists)                { <6> Title[0,5] = <title> }
      elsif (<Hx> exists)                   { <7> Title[0,3] = HMAX }
      elsif (S exists)                      { <8> Title[0,1] = S }
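The rule cascade on this slide can be written as a small function. The following is a sketch under stated assumptions, not the project's implementation: the bracketed values such as `Title[0,8]` are read as confidence weights with a decimal comma (0.8), `HMAX` is taken to be the text of the highest-ranking heading, `S` the text of an emphasised passage such as `//p/b`, and `+` is taken to mean concatenation with a space. The function name `extract_title` and the `(text, weight)` return shape are hypothetical.

```python
from typing import Optional, Tuple


def contains(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test, as the slide's comments require."""
    return needle.lower() in haystack.lower()


def extract_title(
    title: Optional[str], hmax: Optional[str], s: Optional[str]
) -> Optional[Tuple[str, float]]:
    """Return (title text, assumed confidence weight) or None if nothing is found."""
    if title and hmax and not contains(title, "untitled"):
        if title == hmax:
            return title, 1.0                # rule <1>
        if contains(title, hmax):
            return title, 0.8                # rule <2>
        if contains(hmax, title):
            return hmax, 0.8                 # rule <3>
        return f"{title} {hmax}", 0.8        # rule <4>
    if title and s:
        return f"{title} {s}", 0.5           # rule <5>
    if title:
        return title, 0.5                    # rule <6>
    if hmax:
        return hmax, 0.3                     # rule <7>
    if s:
        return s, 0.1                        # rule <8>
    return None


# Made-up inputs for illustration.
print(extract_title("Migration", "Migration Studies", None))
print(extract_title(None, "Working Paper 12", "Institute for Sociology"))
```

The sketch follows the slide literally, so a `<title>` containing "untitled" can still be returned by rules 5 and 6; whether the remaining part of the heuristic filters such cases is not shown on the slide.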

  11. Results and Outlook • Extraction of Meta-Data • TITLE: 80 % extracted with medium or high quality • KEYWORDS: nearly 100 % extracted with high quality • ABSTRACTS: 90 % extracted with medium/high quality • Conclusion • In principle transferable to other domains • Expensive maintenance • Only a compromise solution until authors of web pages use Dublin Core or another Meta-Data standard

  12. Semantic Relations • Intellectual transfer relations (cross-concordances) • Tools for creation: SIS-TMS for thesauri, CarmenX for classifications • Statistical transfer relations (co-occurrence analysis)

  13. Cross-Concordances in SIS-TMS

  14. SIS-TMS Correlation Editor

  15. Parallel Corpus

  16. Corpus with Internet Documents • Internet documents from the social sciences are not indexed using a thesaurus or classification

  17. Simulating a Parallel Corpus

  18. Result: Simulated Parallel Corpus

  19. Term-Term-Matrix

  20. Tool: Jester • Java Environment for Statistical TransfERs: support and assistance for creating statistical transfer relations from a parallel corpus
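How a term-term matrix might be derived from a parallel corpus can be sketched as follows. The slides do not state which association measure Jester uses, so this example assumes a plain conditional probability P(free term | thesaurus term) estimated from document co-occurrence; the function names, the threshold and the tiny corpus are all made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# One document of the parallel corpus: (thesaurus terms, free terms).
Document = Tuple[Set[str], Set[str]]


def term_term_matrix(corpus: Iterable[Document]) -> Dict[str, Dict[str, float]]:
    """Estimate P(free term | thesaurus term) from document co-occurrence."""
    source_counts: Counter = Counter()
    pair_counts: Dict[str, Counter] = defaultdict(Counter)
    for thesaurus_terms, free_terms in corpus:
        for source in thesaurus_terms:
            source_counts[source] += 1
            for target in free_terms:
                pair_counts[source][target] += 1
    return {
        source: {target: n / source_counts[source] for target, n in targets.items()}
        for source, targets in pair_counts.items()
    }


def transfer_terms(
    matrix: Dict[str, Dict[str, float]], term: str, threshold: float = 0.5
) -> List[str]:
    """Free terms whose weight for `term` reaches the (assumed) threshold."""
    weights = matrix.get(term, {})
    return sorted(t for t, w in weights.items() if w >= threshold)


# Toy corpus, invented for this example only.
corpus = [
    ({"Leiharbeit"}, {"zeitarbeit", "arbeitsmarkt"}),
    ({"Leiharbeit", "Arbeitsmarkt"}, {"zeitarbeit"}),
    ({"Arbeitsmarkt"}, {"arbeitsmarkt", "statistik"}),
]
matrix = term_term_matrix(corpus)
print(matrix["Leiharbeit"])
print(transfer_terms(matrix, "Leiharbeit", threshold=0.6))
```

Other measures (for example mutual information or a log-likelihood ratio) would fit the same structure; only the weight computation would change.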

  21. Query Transformation

  22. Binding of Query Languages • Pluggable QueryParsers and QueryPrinters for different query languages make it easy to reuse the transfer in other contexts.
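The plug-in idea can be sketched as a pair of interfaces around a shared internal query representation. This is only an illustration of the pattern: the names `QueryParser` and `QueryPrinter` come from the slide, but their methods, the `Condition` type and both toy query syntaxes are assumptions and do not reflect the actual CARMEN Java classes.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Condition:
    """Hypothetical internal representation: one field/term pair."""
    field: str
    term: str


class QueryParser(Protocol):
    def parse(self, text: str) -> List[Condition]: ...


class QueryPrinter(Protocol):
    def render(self, conditions: List[Condition]) -> str: ...


class ColonSyntaxParser:
    """Parses a toy syntax such as 'keyword:Leiharbeit AND title:Migration'."""

    def parse(self, text: str) -> List[Condition]:
        conditions = []
        for part in text.split(" AND "):
            field, term = part.split(":", 1)
            conditions.append(Condition(field.strip(), term.strip()))
        return conditions


class EqualsSyntaxPrinter:
    """Prints a toy syntax such as 'keyword = "Leiharbeit" & title = "Migration"'."""

    def render(self, conditions: List[Condition]) -> str:
        return " & ".join(f'{c.field} = "{c.term}"' for c in conditions)


def translate(text: str, parser: QueryParser, printer: QueryPrinter) -> str:
    """Any parser can be combined with any printer via the shared representation."""
    return printer.render(parser.parse(text))


print(translate("keyword:Leiharbeit AND title:Migration",
                ColonSyntaxParser(), EqualsSyntaxPrinter()))
```

With this layout, supporting a further query language means adding one parser and one printer rather than a translator for every language pair.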

  23. CARMEN Transfer Architecture • Retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service • Exchange of partial queries using XML/XIRQL • Transfer service runs as a servlet on a Tomcat server

  24. Evaluation of Transfer Modules • Retrieval tests using transfer modules (using a corpus of Internet documents indexed with Fulcrum SearchServer) • Limitation: the weights of the transfer relations are not used • Tested transfer: SOLIS/IZ-Thesaurus → SoWi Internet documents/free terms • Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer • 2 exemplary searches in each of 3 domains (women's studies, migration, sociology of industry)

  25. Exemplary Search: „Dominanz“ • „Dominanz“ (“dominance”): 16 relevant documents • 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste): 14 additional documents, 7 of them relevant (50%, an increase of 44%) • Precision: 77%

  26. Exemplary Search: „Leiharbeit“ • „Leiharbeit“ (“temporary work”): 10 relevant documents • 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung): 10 additional documents, 2 of them relevant (20%, an increase of 20%) • Precision: 60%
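The percentages on the two example slides can be reproduced with simple arithmetic. The slides do not spell out the formulas, so the following is an assumed reading: the baseline search is taken to return only the relevant documents listed, the "increase" is the number of newly found relevant documents divided by the baseline relevant documents, and precision is computed over the combined result set. The function name and dictionary keys are hypothetical.

```python
from typing import Dict


def transfer_metrics(
    baseline_relevant: int, added_total: int, added_relevant: int
) -> Dict[str, float]:
    """Evaluation figures under the assumption that the baseline hits are all relevant."""
    baseline_total = baseline_relevant  # assumption, not stated on the slides
    return {
        # Share of relevant documents among those added by the transfer terms.
        "relevant_share_of_added": added_relevant / added_total,
        # Gain in relevant documents relative to the search without transfer.
        "increase_in_relevant": added_relevant / baseline_relevant,
        # Precision of the combined result set (baseline plus added documents).
        "precision": (baseline_relevant + added_relevant)
        / (baseline_total + added_total),
    }


for name, figures in {
    "Dominanz": transfer_metrics(16, 14, 7),
    "Leiharbeit": transfer_metrics(10, 10, 2),
}.items():
    print(name, {key: f"{value:.0%}" for key, value in figures.items()})
```

Under this reading, „Dominanz“ gives 7/14 = 50%, 7/16 ≈ 44% and 23/30 ≈ 77%, and „Leiharbeit“ gives 2/10 = 20%, 2/10 = 20% and 12/20 = 60%, matching the figures on the slides.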

  27. Results • All exemplary searches using transfers lead to additional relevant documents compared with a search without transfer • Share of relevant documents among all new documents: between 13% and 55% • Transfer terms are not always self-evident (example: „Wüste“ (“desert”)) • In some cases a very large number of transfer terms (user parametrization or better algorithms needed)

  28. Outlook (What needs to be done?) • Improvement of the double corpora: • Kind of documents • Diversity of document types • Diversity of institutions / web sites • Domain • Corpus size • Comparison of transfers using statistical relations vs. intellectual relations • Improvement of algorithms • Effect of interactive, repeated retrieval and user parametrization / adjustment • User tests

  29. Exploitation • Services (transfer) • Software (Java classes) • Projects: • Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz) • European Schools Treasury Browser (ETB) • Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex) • Contact: soe@bonn.iz-soz.de
