VOCABULARY TERM MAPPING

By: Namrata Lele. Mentors: Dave Vieglais, Bruce Wilson.

Presentation Transcript


  1. VOCABULARY TERM MAPPING. By: Namrata Lele. Mentors: Dave Vieglais, Bruce Wilson. VDC/TWG Meeting August 09

  2. Outline: • Problem Overview • Implementation Details • Deliverables • Results

  3. Problem Overview: vocabulary term mapping. Document A and Document B each go through a parser that parses the document format and extracts terms. The semantic relatedness of the terms from documents A and B is then measured.

  4. Implementation: Implemented parsers to parse the following metadata documents: • DublinCore • DarwinCore • EML. Deliverables: • DublinCore Parser • DarwinCore Parser • EML Parser

  5. Implementation: Measure semantic relatedness of terms using the following libraries: • Lucene: a full-featured text search engine written entirely in Java; it is an open source project available for free download. • GTM (General Text Matcher): measures the similarity between texts; it is written in Java and is open source, released under the BSD license.

  6. Lucene scoring: Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. The idea behind the VSM is that the more times a query term appears in a document, relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
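The VSM idea on this slide can be illustrated with a simplified tf-idf sum. This is only a sketch, not Lucene's full scoring formula: it omits norms, boosts, and the coordination factor, and it uses document frequency (how many documents contain the term) for the "all the documents" part. The function name `tf_idf_score` and the sample term descriptions are hypothetical and not taken from the project data.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, all_docs):
    """Score one document against a query with a simplified tf-idf sum."""
    counts = Counter(doc_tokens)
    num_docs = len(all_docs)
    score = 0.0
    for term in query_terms:
        freq = counts[term]
        if freq == 0:
            continue  # a term absent from this document contributes nothing
        # Number of documents in the collection that contain the term.
        doc_freq = sum(1 for d in all_docs if term in d)
        tf = math.sqrt(freq)                              # dampened term frequency
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))   # rarer terms weigh more
        score += tf * idf
    return score

# Hypothetical, already-tokenized term descriptions (illustration only).
docs = {
    "country": ["full", "name", "country"],
    "title": ["name", "given", "resource"],
    "creator": ["entity", "responsible", "resource"],
}
all_docs = list(docs.values())
query = ["name", "country"]
scores = {term: tf_idf_score(query, tokens, all_docs) for term, tokens in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy collection "country" occurs in one description and "name" in two, so the description containing both ranks first and the rarer word contributes more to its score.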

  7. Using Lucene and WordNet (vocabulary): Vocabulary Term Mapper workflow.
     Inputs: parsed input document A (XML file) and parsed input document B (XML file).
     Index for document B: Lucene Index Builder (stemming and stop-word filtering); expand description for synonyms using the WordNet index; result: Lucene index.
     Mapping steps: 1) Read term-description from document A. 2) Stem, stop-word filter, and expand the description (query) for synonyms using the WordNet index. 3) Search term and description in the Lucene index for document B.
     Output: get matching terms (Lucene documents) and create the XML output file.

  8. Deliverables: • Lucene Index Builder • Lucene Vocabulary Term Mapper • WordNet Index Builder

  9. Example. Original description: the full, unabbreviated name of the country. Stop-word filtered, stemmed, and synonym-expanded description: the full entire fully good total undivided wax wide unabbrevi name advert appoint call cite constitute describe diagnose discover distinguish epithet figure gens identify key list make mention nominate refer of countri
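The preprocessing behind this example (stop-word filtering, stemming, synonym expansion) might look roughly like the sketch below. Everything in it is a stand-in: `toy_stem` is a crude suffix stripper rather than the stemmer the project used, `SYNONYMS` is a small hand-written table in place of a WordNet index lookup, and the stop-word list is an assumption. Unlike the slide's output, which still shows "the" and "of", this sketch drops stop words entirely.

```python
import re

# Assumed stop-word list (illustration only).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}

# Hypothetical synonym table standing in for a WordNet index lookup.
SYNONYMS = {
    "full": ["entire", "total"],
    "name": ["call", "identify"],
}

def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ated", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"
    return word

def expand_description(description):
    """Lowercase, tokenize, drop stop words, stem, and append synonyms."""
    tokens = re.findall(r"[a-z]+", description.lower())
    expanded = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        expanded.append(toy_stem(token))          # stemmed original word
        expanded.extend(SYNONYMS.get(token, []))  # synonyms of the unstemmed word
    return expanded

query_terms = expand_description("the full, unabbreviated name of the country")
print(query_terms)
```

The expanded token list is what would then be used as the query against the index built from document B.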

  10. Using General Text Matcher (without thesauri): Vocabulary Term Mapper workflow.
      Inputs: parsed input document A (XML file) and parsed input document B (XML file).
      Steps: read term-description from document A (stem and stop-word filter); read term-description from document B (stem and stop-word filter); get similarity score using GTM; maintain the top 5 scores for every term-description in document A.
      Output: score, term, description, and XPath to the XML file.
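The "maintain top 5 scores" step can be sketched as follows. The similarity function here is a plain unigram-overlap F-measure used as a stand-in; it is not GTM's actual algorithm. The names `overlap_f1` and `top_matches`, the sample entries, and the XPath strings are all hypothetical.

```python
import heapq
from collections import Counter

def overlap_f1(tokens_a, tokens_b):
    """Unigram-overlap F-measure; a simple stand-in for a GTM similarity score."""
    if not tokens_a or not tokens_b:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    common = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if common == 0:
        return 0.0
    precision = common / len(tokens_b)
    recall = common / len(tokens_a)
    return 2 * precision * recall / (precision + recall)

def top_matches(desc_a, entries_b, k=5):
    """Return the k best (score, term, xpath) tuples from document B for one description."""
    scored = ((overlap_f1(desc_a, tokens), term, xpath) for term, xpath, tokens in entries_b)
    return heapq.nlargest(k, scored, key=lambda item: item[0])

# Hypothetical stemmed, stop-word-filtered entries from document B: (term, xpath, tokens).
entries_b = [
    ("country", "/record/country", ["full", "name", "countri"]),
    ("title", "/record/title", ["name", "resourc"]),
    ("creator", "/record/creator", ["entiti", "respons", "resourc"]),
    ("date", "/record/date", ["date", "event"]),
    ("locality", "/record/locality", ["name", "place"]),
    ("format", "/record/format", ["file", "format"]),
]
desc_a = ["full", "unabbrevi", "name", "countri"]
best = top_matches(desc_a, entries_b)
for score, term, xpath in best:
    print(f"{score:.3f}  {term}  {xpath}")
```

Using `heapq.nlargest` keeps only the best k candidates per description, so the full list of scores against document B does not need to be sorted.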

  11. Deliverables: • Modified GTM Library • GTM Vocabulary Term Mapper

  12. Results from Lucene vocabulary term mapping: The results obtained from various mappings are as follows: • DublinCore – DarwinCore • DarwinCore – DublinCore • EML – DarwinCore • DarwinCore – EML • EML – DublinCore • DublinCore – EML

  13. Dublin Core elements in EML: The following terms obtained from the Lucene Vocabulary Term Mapper match the existing mapping between DublinCore and EML: • Title – Title • Creator – Creator • Publisher – Publisher • Format – Physical • Coverage – Coverage • Rights – Intellectual Rights

  14. Results from GTM vocabulary term mapping: The results obtained from various mappings are as follows: • DublinCore – DarwinCore • DarwinCore – DublinCore

  15. Sample output XML file

  16. To-dos: Fix a bug in the EML parser. Provide two versions of the EML parser: one that has descriptions for all terms in the hierarchy and one that has a description for only the current term.

  17. Conclusion: • Got a chance to learn new libraries (Lucene and GTM) • Learnt new concepts about semantic similarity • Honed my XML skills • Enjoyed working with this team

  18. References: • General Text Matcher: http://nlp.cs.nyu.edu/GTM/ • http://lucene.apache.org/java/2_4_0/scoring.html • http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/package-summary.html

  19. VDC/CCIT Meeting June 09
