
Scalable Semantic Search Embedding for Improved Task Performance

This research paper discusses an innovative approach to scalable semantic search embedding, highlighting its potential benefits in information retrieval, entity disambiguation, de-duplication, recommendation, clustering, and subject prediction. The paper introduces embedding approaches, including global co-occurrence count-based methods and local context predictive methods, and explores the use of random projection for entity and document embeddings. The effectiveness of the approach is evaluated using benchmarks and experiments. The authors emphasize the need for human involvement and appropriate evaluation measures in embedding efforts.


Presentation Transcript


  1. AIDR2019 • May 13, 2019 An innovative approach to scalable semantic search embedding Shenghui Wang, Rob Koopman, Titia van der Werf and Jean Godby OCLC Research

  2. Why semantic embedding? Many of our tasks could be improved by semantic embedding • Information retrieval • Entity disambiguation • De-duplication • Recommendation • Clustering • Subject prediction • ...

  3. Foundation for semantic embedding • Distributional Hypothesis (Harris, 1954) “words that occur in similar contexts tend to have similar meanings” • Statistical Semantics (Weaver, 1955, Firth 1957) “a word is characterized by the company it keeps”

  4. An example by Stefan Evert: what is bardiwac? • He handed her her glass of bardiwac. • Beef dishes are made to complement the bardiwacs. • Nigel staggered to his feet, face flushed from too much bardiwac. • Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine. • I dined on bread and cheese and this excellent bardiwac. • The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish. ⇒ ‘bardiwac’ is a heavy red alcoholic beverage made from grapes

  5. Word embedding • Words are represented as vectors of numbers that capture their meaning. • Semantically similar words are mapped to nearby points, that is, they “are embedded nearby each other” • A desirable property: computable similarity (see the sketch below) https://medium.com/@jayeshbahire/introduction-to-word-vectors-ea1d4e4b84bf
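To make “computable similarity” concrete, here is a minimal NumPy sketch (not from the talk); the three 4-dimensional vectors are invented purely for illustration:

```python
# Minimal sketch: cosine similarity between word vectors.
# The vectors are made-up 4-dimensional examples, not real embeddings.
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

wine     = np.array([0.9, 0.1, 0.4, 0.0])
bardiwac = np.array([0.8, 0.2, 0.5, 0.1])
keyboard = np.array([0.0, 0.9, 0.1, 0.7])

print(cosine_similarity(wine, bardiwac))   # high: similar contexts
print(cosine_similarity(wine, keyboard))   # low: unrelated contexts
```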

  6. Embedding approaches • Global co-occurrence count-based methods (e.g. LSA) • Based on statistics of how often some word co-occurs with its neighbour words in a large text corpus • Dimension reduction methods • Local context predictive methods (e.g. word2vec) • Learning to predict a word from its neighbours or vice versa • More complex and powerful deep learning models for embedding words, sentences and documents
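For orientation, a toy sketch of the count-based family (LSA-style), not the method used in the talk: count co-occurrences, then reduce dimensions with a truncated SVD. The two-sentence corpus is invented:

```python
# Toy count-based embedding: co-occurrence counts followed by truncated SVD.
import numpy as np

corpus = [["she", "drank", "bardiwac"], ["she", "drank", "wine"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within each sentence.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j, v in enumerate(sent):
            if i != j:
                counts[index[w], index[v]] += 1

# Truncated SVD gives low-dimensional word vectors.
u, s, _ = np.linalg.svd(counts)
k = 2
word_vectors = u[:, :k] * s[:k]
for w in vocab:
    print(w, word_vectors[index[w]].round(2))
```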

  7. Why not deep learning methods? • Computationally expensive to train from scratch • Often requiring GPUs • Difficult to find the optimal hyperparameter settings • Pre-trained word embeddings may not capture domain-specific semantics • Medical information retrieval, special collection exploration, etc. • Standard benchmarks and evaluation methods often do not answer practical needs

  8. Embedding based on Random Projection (RP)
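The slide does not spell out the mechanics, so here is a generic random-projection sketch (Johnson-Lindenstrauss style) rather than the exact Ariadne implementation; the vocabulary size and the sparse plus/minus-one projection matrix are assumptions, while D = 256 matches the next slide:

```python
# Generic random projection: a fixed random matrix maps sparse, high-dimensional
# co-occurrence vectors to dense D-dimensional embeddings.
import numpy as np

vocab_size = 10_000   # assumed vocabulary size
D = 256               # target dimensionality

rng = np.random.default_rng(7)
projection = rng.choice([-1.0, 0.0, 1.0], size=(vocab_size, D), p=[1/6, 2/3, 1/6])

def project(cooccurrence_counts):
    """Map a vocab_size-dimensional count vector to a D-dimensional unit vector."""
    v = cooccurrence_counts @ projection
    return v / (np.linalg.norm(v) + 1e-9)

# Example: an entity that co-occurred with a handful of vocabulary items.
counts = np.zeros(vocab_size)
counts[[12, 345, 6789]] = [3, 1, 5]
print(project(counts).shape)  # (256,)
```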

  9. After random projection • Each entity (words, subjects, authors, …) is embedded as a D-dimensional vector (in our case, a 256-byte vector) • Each document is also embedded as a vector in the same semantic space • A document is represented as the weighted average of the vectors of its associated entities • Cosine similarity reflects semantic similarity
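A minimal sketch of the shared semantic space described above, assuming the entity vectors already exist; the entity names, weights, and random placeholder vectors are invented:

```python
# Documents and entities live in the same space; a document is the weighted
# average of its entities' vectors, and cosine similarity compares anything
# to anything.
import numpy as np

D = 256
rng = np.random.default_rng(0)

# Placeholder entity embeddings standing in for the random-projection output.
entity_vectors = {
    "myocardial infarction": rng.normal(size=D),
    "aspirin": rng.normal(size=D),
    "jazz": rng.normal(size=D),
}

def embed_document(entities, weights):
    """Weighted average of the document's entity vectors, normalised to unit length."""
    v = sum(w * entity_vectors[e] for e, w in zip(entities, weights))
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = embed_document(["myocardial infarction", "aspirin"], [1.0, 0.8])
print(cosine(doc, entity_vectors["aspirin"]))
print(cosine(doc, entity_vectors["jazz"]))
```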

  10. What’s special in Ariadne RP • Entity embeddings are updated online while going through the corpus once • No need to store the original co-occurrence matrix • No iterations over the corpus, highly efficient implementation
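One plausible way to realise the single-pass, matrix-free update, sketched with generic random indexing; the hashing scheme and update rule are assumptions, not the actual Ariadne code:

```python
# Online entity updates in one pass over the corpus: index vectors are derived
# on the fly from a hash, so no co-occurrence matrix is ever stored.
import hashlib
from collections import defaultdict
import numpy as np

D = 256

def index_vector(token):
    """Deterministic pseudo-random +/-1 vector for a token."""
    seed = int(hashlib.sha1(token.encode("utf-8")).hexdigest()[:8], 16)
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=D)

entity_vectors = defaultdict(lambda: np.zeros(D))

def update(document_tokens):
    """Each entity accumulates the index vectors of the tokens it co-occurs with."""
    doc_vec = sum(index_vector(t) for t in document_tokens)
    for t in document_tokens:
        entity_vectors[t] += doc_vec - index_vector(t)  # exclude self co-occurrence

# The corpus is processed exactly once, document by document.
for doc in [["aspirin", "reduces", "infarction", "risk"],
            ["jazz", "concert", "tonight"]]:
    update(doc)
```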

  11. Orthogonal projection and weight adjustment • Vectors are projected onto the hyperplane orthogonal to an average language vector • Removing the stop-wordiness improves the discriminating power • Weights are calculated automatically • The more similar a vector is to the average vector, the less weight it gets • No need to remove stop words first • Crucial for getting distinctive document embeddings (see the sketch below)
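A hedged sketch of that step: subtract the component along the mean (“average language”) vector and down-weight vectors that resemble it. The exact weighting formula used in Ariadne may differ:

```python
# Orthogonal projection and automatic weighting against an average vector.
import numpy as np

def orthogonalize_and_weight(vectors):
    """Project each vector onto the hyperplane orthogonal to the mean vector and
    give less weight to vectors that were similar to that mean."""
    mean = vectors.mean(axis=0)
    mean /= np.linalg.norm(mean)
    projected, weights = [], []
    for v in vectors:
        unit = v / (np.linalg.norm(v) + 1e-9)
        sim = float(unit @ mean)             # "stop-wordiness" of this vector
        p = v - (v @ mean) * mean            # remove the average-language direction
        projected.append(p / (np.linalg.norm(p) + 1e-9))
        weights.append(max(0.0, 1.0 - sim))  # the more average, the less weight
    return np.array(projected), np.array(weights)

rng = np.random.default_rng(1)
proj, w = orthogonalize_and_weight(rng.normal(size=(5, 256)))
print(w.round(2))
```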

  12. Semantic Textual Similarity (STS) benchmark

  13. Effect of orthogonal projection and weight adjustment Rob Koopman, Shenghui Wang and Gwenn Englebienne. Fast and discriminative semantic embedding. Proceedings of The 13th International Conference on Computational Semantics (IWCS 2019). To appear.

  14. Automatic subject prediction • Naive similarity-based method • Subjects and documents are embedded in the same semantic space • A document is likely to be indexed with the subjects most related to it (those with the highest cosine similarities)
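The naive method reduces to a nearest-neighbour ranking; a sketch under the assumption that subject and document vectors are already available (the names and vectors below are placeholders):

```python
# Rank all subject vectors by cosine similarity to a document vector, return top n.
import numpy as np

def predict_subjects(doc_vec, subject_vectors, n=5):
    names = list(subject_vectors)
    matrix = np.stack([subject_vectors[s] for s in names])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    doc = doc_vec / np.linalg.norm(doc_vec)
    scores = matrix @ doc
    top = np.argsort(-scores)[:n]
    return [(names[i], float(scores[i])) for i in top]

rng = np.random.default_rng(3)
subjects = {f"subject {i}": rng.normal(size=256) for i in range(100)}
document = rng.normal(size=256)
print(predict_subjects(document, subjects, n=3))
```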

  15. Experiment: Predicting MeSH subjects • Metadata of one million Medline articles with abstracts • The training set contains 147,837 unique MeSH subjects, on average 16 per article • 10,000 articles held out for testing • Measure precision/recall of the top N predictions
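For clarity, precision@N and recall@N as they would typically be computed per article; the example labels are invented:

```python
# Precision and recall of the top-N predicted subjects against the gold subjects.
def precision_recall_at_n(predicted, gold, n):
    """predicted: ranked list of subjects; gold: set of true subjects."""
    hits = sum(1 for s in predicted[:n] if s in gold)
    return hits / n, (hits / len(gold) if gold else 0.0)

predicted = ["Humans", "Aspirin", "Myocardial Infarction", "Jazz"]
gold = {"Humans", "Myocardial Infarction", "Risk Factors"}
print(precision_recall_at_n(predicted, gold, n=3))  # roughly (0.67, 0.67)
```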

  16. Recall and precision @ n: Medline Rob Koopman, Shenghui Wang and Gwenn Englebienne. Fast and discriminative semantic embedding. Proceedings of The 13th International Conference on Computational Semantics (IWCS 2019). To appear.

  17. Example https://www.ncbi.nlm.nih.gov/pubmed/14670424

  18. Comparing with FastText, Sent2Vec

  19. Summary • Alongside deep learning efforts, we can take an alternative, practical approach to fast and discriminative semantic embedding • We need human involvement as well as more appropriate evaluation measures

  20. Thank you Shenghui Wang (shenghui.wang@oclc.org) Rob Koopman (rob.koopman@oclc.org) Titia van der Werf (titia.vanderwerf@oclc.org) Jean Godby (godby@oclc.org)
