
WebSim: A Pathway to Unveiling Term Relationships using a Web Search Technology [2006] Seokkyung Chung, Jongeun Jun,

WebSim: A Pathway to Unveiling Term Relationships using a Web Search Technology [2006] Seokkyung Chung, Jongeun Jun, and Dennis McLeod. Presented By: Amandeep Singh. Overview. Focus WebSim (Web Based Similarity Matrix) Dealing with Ambiguities


Presentation Transcript


  1. WebSim: A Pathway to Unveiling Term Relationships using a Web Search Technology [2006]. Seokkyung Chung, Jongeun Jun, and Dennis McLeod. Presented by: Amandeep Singh

  2. Overview: Focus; WebSim (Web Based Similarity Matrix); Dealing with Ambiguities; Comparison with other methods

  3. Introduction. Key issue: representing and extracting meaning from information content. General-purpose vs. domain-specific ontologies. Bottleneck: how to build ontologies; building ontologies is error-prone and time-consuming. Focus of the paper: computing the similarity between terms for ontology creation.

  4. [Diagram; labels: Data Preprocessing, Data Gathering, Web (Google/Yahoo), Users, Feature Extraction, Information Delivery, Ontology Learning & Management, Vector Space Model]

  5. Feature Extraction for WebSim: 3-Phase Process

  6. term vs. word

  7. Phase 1 – Retrieval of Web documents for each term, using a Web search engine (Google Web APIs)

  8. Phase 2 – Preprocess retrieved documents: HTML preprocessing, tokenization, stopword removal, stemming
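
The four preprocessing steps named on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the stopword list is a tiny hypothetical sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from html import unescape

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}


def strip_html(html: str) -> str:
    """HTML preprocessing: drop script/style blocks and tags, decode entities."""
    html = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return unescape(text)


def tokenize(text: str) -> list[str]:
    """Tokenization: lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token: str) -> str:
    """Stemming stand-in: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(html: str) -> list[str]:
    """Run all four steps and return the resulting bag of stemmed tokens."""
    tokens = tokenize(strip_html(html))
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


page = "<html><body><h1>Playing songs</h1><p>The iPod plays the stored songs.</p></body></html>"
print(preprocess(page))
```

Stopwords are removed before stemming here so that the stopword list can be written with unstemmed words; the slide does not say whether the original system relied on that ordering.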

  9. Phase 3 – Represent each term in the Vector Space Model (bag-of-words)

  10. Calculating weight of each term

  11. Notations for WebSim: $t_i$ = $i$th term; $D_i$ = set of Web pages returned for $t_i$; $N_i = |D_i|$ = size of $D_i$; $d_k$ = $k$th document in $D_i$; $l_k = |d_k|$ = length of $d_k$; $f_{ij}$ = $j$th feature in the set of features of term $t_i$; $w_{ij}$ = weight of $f_{ij}$

  12. Heuristic 1 – Important words occur more frequently within a document than unimportant words. More notation: let $freq_{ij}$ be the number of occurrences of $f_{ij}$ in document $d_k$, where $d_k \in D_i$. The term frequency (TF) $tf_{ijk}$ of $f_{ij}$ in $d_k$ is then defined from this count [formula not captured in the transcript].

  13. Heuristic 2 – The more often a word occurs throughout the documents in $D_i$, the stronger its predictive power. This is related to the document frequency (DF) of the word, defined as the percentage of documents that contain it. Combining TF and DF gives a ranking scheme that computes the weight $w_{ij}$ of a feature [formula not captured in the transcript], where $n_{ij}$ is the number of documents in $D_i$ in which $f_{ij}$ occurs at least once. Only the top 80% of features by weight are kept, which reduces the number of features significantly while minimizing the loss of information.
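
The transcript does not preserve the TF and weight formulas, so the sketch below rests on two labeled assumptions: TF is the occurrence count divided by the document length $l_k$, and the weight is the mean TF over $D_i$ multiplied by the document frequency $n_{ij} / N_i$. The function names are hypothetical, and the paper's actual definitions may differ.

```python
import math
from collections import Counter


def feature_weights(docs: list[list[str]]) -> dict[str, float]:
    """Weight each feature of one term from its preprocessed documents D_i.

    Assumed formulas (not taken from the slides):
      tf = freq / l_k            length-normalized term frequency
      w  = mean(tf) * n_ij / N_i mean TF scaled by document frequency
    """
    n_docs = len(docs)
    tf_sum: Counter = Counter()     # sum of tf over all documents
    doc_count: Counter = Counter()  # n_ij: documents containing the feature
    for doc in docs:
        if not doc:
            continue
        for feature, freq in Counter(doc).items():
            tf_sum[feature] += freq / len(doc)
            doc_count[feature] += 1
    return {f: (tf_sum[f] / n_docs) * (doc_count[f] / n_docs) for f in tf_sum}


def keep_top(weights: dict[str, float], fraction: float = 0.8) -> dict[str, float]:
    """Keep the highest-weighted fraction of features (80% on the slide)."""
    keep = math.ceil(len(weights) * fraction)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])


docs = [
    ["apple", "ipod", "music"],
    ["apple", "apple", "store", "ipod"],
    ["apple", "player"],
]
weights = feature_weights(docs)
kept = keep_top(weights)
print(sorted(kept, key=kept.get, reverse=True))
```

Under these assumptions a word that is frequent inside documents and also present in most of them ranks highest, which matches the intent of the two heuristics.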

  14. Sample features of some terms

  15. Next Step – Dynamic Semantics: closeness between two terms. The cosine metric measures the similarity of two items according to the angle between them: $\mathrm{Sim}_1(t_i, t_j) = \mathrm{Cosine}(v_i, v_j) = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|}$, where $v_i$ and $v_j$ are the feature vectors of $t_i$ and $t_j$. Assumption: if terms $t_i$ (iPod) and $t_j$ (Apple) are related, then the Web pages returned for $t_i$ and $t_j$ will be similar.
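
As a small illustration of the cosine metric, the sketch below stores each term's feature vector as a sparse dictionary from feature to weight. The example vectors are made up for demonstration and are not taken from the paper.

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors for two terms.
v_ipod = {"music": 3.0, "apple": 4.0}
v_apple = {"apple": 4.0, "music": 3.0}
v_oil = {"energy": 2.0}

print(cosine(v_ipod, v_apple))
print(cosine(v_ipod, v_oil))
```

Because feature weights are non-negative, the result lies between 0 (no shared features) and 1 (same direction).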

  16. Clearly Ambiguous: multiple meanings lead to uncorrelated Web pages. Examples: "OIL" (ontology vs. source of energy); "No parking violators will be towed"

  17. Solution? Query expansion: combine $t_i$ and $t_j$ as a query. What happens if $t_i$ and $t_j$ are not related? What happens if $t_i$ and $t_j$ are related?

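  18. So, when the similarity between $t_i$ and $t_j$ is not high enough, it is refined as $\mathrm{Sim}_2(t_i, t_j) = \mathrm{Average}(\mathrm{Cosine}(t_i, t_j t_i), \mathrm{Cosine}(t_j, t_i t_j))$, where $t_j t_i$ and $t_i t_j$ denote the two terms combined into a single query.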
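
The refinement step can be sketched as below. The slides do not say what counts as "not high enough", so the `threshold` value is an assumption, and `vector_for` is a hypothetical callable that stands in for the whole retrieve-preprocess-weight pipeline (here backed by a hand-made lookup table).

```python
import math
from typing import Callable

Vector = dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse feature vectors."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def refined_similarity(
    t_i: str,
    t_j: str,
    vector_for: Callable[[str], Vector],
    threshold: float = 0.6,  # assumed cut-off; not given in the slides
) -> float:
    """Return Sim1 if it is high enough, otherwise the Sim2 refinement."""
    v_i, v_j = vector_for(t_i), vector_for(t_j)
    sim1 = cosine(v_i, v_j)
    if sim1 >= threshold:
        return sim1
    # Query expansion: submit both terms together as one query.
    v_ji = vector_for(f"{t_j} {t_i}")
    v_ij = vector_for(f"{t_i} {t_j}")
    return (cosine(v_i, v_ji) + cosine(v_j, v_ij)) / 2


# Hand-made vectors standing in for search results (illustrative only).
TOY_VECTORS: dict[str, Vector] = {
    "oil": {"energy": 1.0, "ontology": 1.0},
    "petroleum": {"energy": 1.0, "crude": 1.0},
    "petroleum oil": {"energy": 1.0, "crude": 1.0},
    "oil petroleum": {"energy": 1.0, "crude": 1.0},
}

print(refined_similarity("oil", "petroleum", TOY_VECTORS.__getitem__))
```

In the toy data the ambiguous term "oil" is pulled toward its energy sense once it is queried together with "petroleum", which is the effect query expansion is meant to produce.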

  19. Examples

  20. Candidate term derivation for ontology modification. How can candidate terms that should be added to an ontology be identified? Previous research: topic mining (the identified features can be given as input to WebSim for ontology enrichment). Feature extraction methodology: an existing term in the ontology is submitted as a query to a Web search engine, and the obtained features that do not already exist in the ontology become candidates for ontology enrichment.
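
The feature extraction methodology described on this slide amounts to a set difference between extracted features and existing ontology terms. A minimal sketch, with hypothetical names and made-up data:

```python
def candidate_terms(
    ontology_terms: set[str],
    features_by_term: dict[str, list[str]],
) -> dict[str, list[str]]:
    """For each ontology term, list its features that the ontology lacks."""
    known = {term.lower() for term in ontology_terms}
    candidates: dict[str, list[str]] = {}
    for term, features in features_by_term.items():
        new = [f for f in features if f.lower() not in known]
        if new:
            candidates[term] = new
    return candidates


# Illustrative data: features as they might come back for one query term.
ontology = {"iPod", "Apple", "music"}
features = {"iPod": ["apple", "music", "nano", "itunes"]}

print(candidate_terms(ontology, features))
```

The comparison is case-insensitive here, which is a design choice of this sketch rather than something the slides specify.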

  21. Restructuring Ontologies. Brute force: compute pairwise similarities between all terms, $O(m^2)$. WebSim approach: compute the similarity between a term and its features, $O(m)$.

  22. WebSim and Semantic Similarity. Investigate the relatedness between WebSim and existing ontologies such as WordNet. Approach 1: shortest/average distance in the graph between two terms $t_i$ and $t_j$. Assumption: links represent uniform distances.
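
Under the uniform-distance assumption, the shortest distance between two terms is simply the number of links on the shortest path, which breadth-first search finds directly. The small taxonomy below is hypothetical and is not taken from WordNet.

```python
from collections import deque
from typing import Optional


def shortest_distance(
    graph: dict[str, list[str]], start: str, goal: str
) -> Optional[int]:
    """Number of links on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy taxonomy stored as an undirected adjacency list (illustrative only).
TAXONOMY = {
    "entity": ["animal", "artifact"],
    "animal": ["entity", "dog", "cat"],
    "artifact": ["entity", "car"],
    "dog": ["animal"],
    "cat": ["animal"],
    "car": ["artifact"],
}

print(shortest_distance(TAXONOMY, "dog", "cat"))
print(shortest_distance(TAXONOMY, "dog", "car"))
```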

  23. Alternative Methods. Approach 2: incorporate empirical probability estimates into a taxonomical structure. What does this mean? The information content of a term $t_i$ is defined as $IC(t_i) = -\log p(t_i)$, where $p(t_i)$ is the probability of encountering term $t_i$. The equation states that informativeness decreases as concept probability increases. This quantification of information provides a new way to measure semantic similarity: the more information two terms share, the more similar they are.

  24. Resnik defines the information shared by two terms as the maximum information content of the common parents of the terms in the ontology: $\mathrm{Resnik}(t_i, t_j) = \max_{t \in CP(t_i, t_j)} [-\log p(t)]$, where $CP(t_i, t_j)$ is the set of parent terms shared by $t_i$ and $t_j$. Range: 0 to infinity.

  25. Lin's metric: $\mathrm{Lin}(t_i, t_j) = \frac{2 \cdot \max_{t \in CP(t_i, t_j)} [-\log p(t)]}{IC(t_i) + IC(t_j)}$. Range: 0 (dissimilarity) to 1 (similarity).
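
The information-content measures can be tried out on a toy taxonomy. Everything in the sketch below is illustrative: the taxonomy and probabilities are made up, the natural logarithm is assumed because the slides do not give a base, a term is counted among its own common parents so that a term compared with itself scores 1 under Lin, and Lin is written in its standard form, twice the shared information content divided by the sum of the two terms' information contents.

```python
import math

# Hypothetical taxonomy (child -> parent) and term probabilities.
PARENT = {
    "dog": "animal",
    "cat": "animal",
    "animal": "entity",
    "car": "artifact",
    "artifact": "entity",
    "entity": None,
}
P = {"entity": 1.0, "animal": 0.5, "artifact": 0.5, "dog": 0.25, "cat": 0.25, "car": 0.5}


def ic(term: str) -> float:
    """Information content: IC(t) = -log p(t)."""
    return -math.log(P[term])


def ancestors(term: str) -> set[str]:
    """The term itself plus every parent up to the root (an assumption)."""
    found = set()
    current = term
    while current is not None:
        found.add(current)
        current = PARENT[current]
    return found


def resnik(t_i: str, t_j: str) -> float:
    """Maximum information content over the shared parents."""
    return max(ic(t) for t in ancestors(t_i) & ancestors(t_j))


def lin(t_i: str, t_j: str) -> float:
    """Shared information normalized by the terms' own information."""
    total = ic(t_i) + ic(t_j)
    return 2 * resnik(t_i, t_j) / total if total else 0.0


print(resnik("dog", "cat"), lin("dog", "cat"))
print(resnik("dog", "car"), lin("dog", "car"))
```

Terms whose only shared parent is the root share no information, so both measures drop to 0 for them, while Lin stays within 0 to 1 regardless of how deep the taxonomy is.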

  26. WebSim vs. Semantic Similarity

  27. It’s Over
