
Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA

Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm. Database Research Group, University of Leipzig.


Presentation Transcript


  1. Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm Database Research Group, University of Leipzig 9th Intl. Conf. on Data Integration in the Life Sciences (DILS), Montreal, July 2013

  2. Ontologies • Multiple interrelated ontologies in a domain • Example: anatomy (Mouse Anatomy, SNOMED, NCI Thesaurus, UMLS, MeSH, GALEN, FMA) • Identify overlapping information between ontologies • Information exchange, data integration purposes, reuse • … → Need to create mappings between ontologies

  3. Matching Example • Two ‘small’ anatomy ontologies O and O’ • Concepts with attributes (name, synonym) • Possible match strategy in GOMMA* • Compare name/synonym values of concepts by a string similarity function, e.g., n-gram or edit distance • Two concepts match if one value pair is highly similar • 5×4 = 20 similarity computations • MO,O’ = {(c0,c0’), (c1,c1’), (c2,c2’)} * Kirsten, Groß, Hartung, Rahm: GOMMA: A Component-based Infrastructure for Managing and Analyzing Life Science Ontologies and their Evolution. Journal of Biomedical Semantics, 2011

  4. Problems • Evaluation of Cartesian product O×O’ • Especially for large life science ontologies • Different strategies: pruning, blocking, mapping reuse, … • Excessive usage of similarity functions • Applied O(|O|·|O’|) times during matching • How efficiently (runtime, space) does a similarity function work? • Experiences from GOMMA • Optimized implementation of n-gram similarity function • Application on massively parallel hardware • Graphics Processing Units (GPU) • Multi-core CPUs

  5. Trigram (n=3) Similarity with Dice • Trigram similarity • Input: two strings A and B to be compared • Output: similarity sim(A,B) ∈ [0,1] between A and B • Approach • Split A and B into tokens of length 3 • Compute intersect (overlap) between both token sets • Calculate Dice metric based on the sizes of the intersect and the token sets • (Optional) Assign pre-/postfixes of length 2 (e.g., ##, $$) to A and B before tokenization

  6. Trigram Similarity - Example sim(‘TRUNK’, ‘TRUNCUS’) • Token sets • {TRU, RUN, UNK} • {TRU, RUN, UNC, NCU, CUS} • Intersect • {TRU, RUN} • Dice metric • 2·2 / (3+5) = 4/8 = 0.5
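The slide's approach can be sketched in Java (GOMMA itself is Java-based); class and method names here are illustrative, not GOMMA's actual API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TrigramSim {

    // Split a string into overlapping tokens of length n (n = 3 for trigrams).
    static List<String> tokenize(String s, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            tokens.add(s.substring(i, i + n));
        }
        return tokens;
    }

    // Dice metric: 2 * overlap / (|aTokens| + |bTokens|).
    static double sim(String a, String b) {
        List<String> aTokens = tokenize(a, 3);
        List<String> bTokens = tokenize(b, 3);
        Set<String> bSet = new HashSet<>(bTokens);
        int overlap = 0;
        for (String t : aTokens) {
            if (bSet.contains(t)) {
                overlap++;
            }
        }
        return 2.0 * overlap / (aTokens.size() + bTokens.size());
    }
}
```

sim("TRUNK", "TRUNCUS") reproduces the 0.5 from the example; the optional ##/$$ pre-/postfixes are omitted for brevity.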

  7. Naïve Solution • Method • Tokenization (trivial) • Result: two token arrays aTokens and bTokens • Intersect computation with nested loop (main part) • For each token in aTokens look for the same token in bTokens • If found: increase overlap counter and go on with next token in aTokens • Final similarity (trivial) • 2·overlap / (|aTokens|+|bTokens|) • Example: aTokens = {TRU, RUN, UNK}, bTokens = {TRU, RUN, UNC, NCU, CUS} → overlap: 2, #string-comparisons: 8
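A minimal Java sketch of this naive nested loop (illustrative names, not GOMMA's code):

```java
public class NaiveIntersect {

    // Nested-loop overlap: for each token in aTokens, scan bTokens for an
    // equal token; O(m*n) string comparisons in the worst case.
    static int overlap(String[] aTokens, String[] bTokens) {
        int overlap = 0;
        for (String a : aTokens) {
            for (String b : bTokens) {
                if (a.equals(b)) {   // expensive string comparison
                    overlap++;
                    break;           // go on with next token in aTokens
                }
            }
        }
        return overlap;
    }
}
```

For aTokens = {TRU, RUN, UNK} and bTokens = {TRU, RUN, UNC, NCU, CUS}, this performs 1 + 2 + 5 = 8 equals() calls and finds an overlap of 2, matching the counts on the slide.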

  8. “Sort-Merge”-like Solution • Optimization ideas • Avoid string comparisons • String comparisons are expensive, especially for equal-length strings (e.g., equals of the String class in Java) • Dictionary-based transformation of tokens into unique integer values (e.g., ‘UNC’ = ‘UNK’? becomes 3 = 8?) • Avoid nested-loop complexity • O(mn) comparisons to determine intersect of token sets • A-priori sorting of token arrays → make use of ordered tokens during comparison (O(m+n), see sort-merge join) • Amortization of sorting → token sets are used multiple times for comparison

  9. “Sort-Merge”-like Solution - Example sim(‘TRUNK’, ‘TRUNCUS’) • Tokenization → integer conversion → sorting • TRUNK → {TRU, RUN, UNK} → {1, 2, 3} • TRUNCUS → {TRU, RUN, UNC, NCU, CUS} → {1, 2, 4, 5, 6} • Intersect with interleaved linear scans • aTokens = {1, 2, 3}, bTokens = {1, 2, 4, 5, 6} → overlap: 2, #integer-comparisons: 3
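The interleaved linear scan over the two sorted integer arrays is a standard sort-merge intersect and can be sketched as follows (illustrative, not GOMMA's actual code):

```java
public class SortMergeIntersect {

    // Overlap of two sorted int arrays in O(m + n): advance the pointer
    // with the smaller current value, count matches on equality.
    static int overlap(int[] a, int[] b) {
        int i = 0, j = 0, overlap = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                overlap++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return overlap;
    }
}
```

For {1, 2, 3} and {1, 2, 4, 5, 6} this terminates after three integer comparisons with an overlap of 2; the one-time cost of integer conversion and sorting amortizes because each token set is compared many times during matching.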

  10. GPU as Execution Environment • Design goals • Scalability with 100’s of cores and 1000’s of threads • Focus on parallel algorithms • Example: CUDA programming model • CUDA Kernels and Threads • Kernel: function that runs on a device (GPU, CPU) • Many CUDA threads can execute each kernel • CUDA vs. CPU threads • CUDA threads extremely lightweight (little creation overhead, instant switching, 1000’s of threads used to achieve efficiency) • Multi-core CPUs can only consume a few threads • Drawbacks • A-priori memory allocation, basic data structures

  11. Bringing n-gram to GPU • Problems to be solved • Which data structures are possible for input / output? • How to cope with fixed / limited memory? • How can n-gram be parallelized on GPU?

  12. Input/Output Data Structure • Three-layered index structure for each ontology • ci: concept index representing all concepts • ai: attribute index representing all attributes • gi: gram (token) index containing all tokens • Two arrays to represent top-k matches per concept • A-priori memory allocation possible (array size of k·|O|) • Short (2 bytes) instead of float (4 bytes) data type for similarities → reduce memory consumption
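The short-instead-of-float idea amounts to fixed-point encoding of a similarity in [0,1]. A sketch follows; the scale factor 10000 is an assumption for illustration, not necessarily GOMMA's choice:

```java
public class SimEncoding {

    static final int SCALE = 10000; // assumed precision: four decimal digits

    // Encode a similarity in [0, 1] into 2 bytes instead of a 4-byte float.
    static short encode(float sim) {
        return (short) Math.round(sim * SCALE);
    }

    static float decode(short s) {
        return (float) s / SCALE;
    }
}
```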

  13. Input/Output Data Structure - Example • Input: index layers ci, ai, gi for O and O’ (offsets in ci point into ai, offsets in ai point into gi) • Output: top-k = 2 matches per concept with sim > 0.7, stored in the corrs and sims arrays • Resulting mapping MO,O’ = {c0-c0’, c1-c1’, c2-c2’}
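A minimal sketch of such a three-layered offset index; the concrete array values below are made up for illustration and do not reproduce the slide's original figure:

```java
public class OntologyIndex {

    // ci[i] .. ci[i+1] is concept i's attribute range in ai;
    // ai[j] .. ai[j+1] is attribute j's token range in gi.
    int[] ci = {0, 2, 3, 5};                                // 3 concepts, 5 attributes
    int[] ai = {0, 3, 5, 8, 10, 13};                        // 5 attributes, 13 tokens
    int[] gi = {1, 2, 3, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11};  // sorted token ids per attribute

    // Integer token ids of attribute a (global attribute id).
    int[] tokensOf(int a) {
        return java.util.Arrays.copyOfRange(gi, ai[a], ai[a + 1]);
    }

    // Number of attributes of concept c.
    int numAttributes(int c) {
        return ci[c + 1] - ci[c];
    }
}
```

Flat integer arrays like these map directly onto a-priori allocated GPU memory, which is why this layout suits the CUDA constraints mentioned earlier.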

  14. Limited Memory / Parallelization • Ontologies and mapping too large for GPU memory • Size-based ontology partitioning* • Ideal case: one ontology fits completely in GPU memory • Each kernel computes n-gram similarities between one concept of Pi and all concepts in Qj • Re-use of already stored partition in GPU (e.g., keep P0 loaded, replace Q3 with Q4) • Hybrid execution on GPU/CPU possible * Groß, Hartung, Kirsten, Rahm: On Matching Large Life Science Ontologies in Parallel. Proc. DILS, 2010
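The replacement strategy can be illustrated with a toy scheduling loop: a partition Pi of one ontology stays resident while partitions Qj of the other are streamed through GPU memory. All names are illustrative; actual transfers and kernel launches are only represented by task labels here:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSchedule {

    // Enumerate match tasks so that each loaded partition Pi is reused
    // against every Qj before Pi is replaced (minimizing transfers of Pi).
    static List<String> matchTasks(int numP, int numQ) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < numP; i++) {        // load Pi once ...
            for (int j = 0; j < numQ; j++) {    // ... stream Q0 .. Q(numQ-1)
                tasks.add("P" + i + "-Q" + j);  // e.g., one batch of kernels
            }
        }
        return tasks;
    }
}
```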

  15. Evaluation • FMA-NCI match problem from OAEI 2012 Large Bio Task • Three sub tasks (small, large, whole) • With blocking step to be comparable with OAEI 2012 results* • Hardware • CPU: Intel i5-2500 (4x3.30GHz, 8GB) • GPU: Asus GTX660 (960 CUDA cores, 2GB) * Groß, Hartung, Kirsten, Rahm: GOMMA Results for OAEI 2012. Proc. 7th Intl. Ontology Matching Workshop, 2012

  16. Results for one CPU or GPU • Different implementations of the trigram similarity • Naïve nested loop, hash-set lookup, sort-merge • Sort-merge solution performs significantly better • GPU further reduces execution times (~20% of CPU time)

  17. Results for hybrid CPU/GPU usage • NoGPU vs. GPU • Increasing number of CPU threads • Good scalability for multiple CPU threads (speed up of 3.6) • Slightly better execution time with hybrid CPU/GPU • One thread required to communicate with GPU

  18. Summary and Future Work • Experiences from optimizing GOMMA’s ontology matching workflows • Tuning of n-gram similarity function • Preprocessing (integer conversion, sorting) for more efficient similarity computation • Execution on GPU • Overcoming GPU drawbacks (fixed memory, a-priori allocation) • Significant reduction of execution times • 104 min → 99 sec for FMA-NCI (whole) match task • Future Work • Execution of other similarity functions on GPU • Flexible ontology matching / entity resolution workflows • Choice between CPU, GPU, cluster, cloud infrastructure, …

  19. Thank You for Your Attention! Questions?
