WebSim: A Pathway to Unveiling Term Relationships using a Web Search Technology [2006] Seokkyung Chung, Jongeun Jun,

WebSim: A Pathway to Unveiling Term Relationships using a Web Search Technology [2006]. Seokkyung Chung, Jongeun Jun, and Dennis McLeod. Presented by: Amandeep Singh

Overview: Focus; WebSim (Web Based Similarity Matrix); Dealing with Ambiguities; Comparison with other methods

Introduction. Key issues: represent and extract meaning from information content; general-purpose vs. domain-specific ontologies. Bottleneck: how to build ontologies. Building ontologies is error-prone and time-consuming. Focus of paper: computation of similarity between terms for ontology creation.

System diagram (labels only): Data Preprocessing; Data Gathering; Web (Google/Yahoo); Users; Feature Extraction; Information Delivery; Ontology Learning & Management; Vector Space Model

Feature Extraction for WebSim: 3-Phase Process

term vs. word

Phase 1 – Retrieval of Web documents for each term: Web search engine; Google Web APIs

Phase 2 – Preprocess retrieved documents: HTML preprocessing; tokenization; stopword removal; stemming
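The four Phase 2 steps can be sketched roughly as below. This is only an illustration: the stopword list is a tiny hypothetical one, and `crude_stem` is a naive suffix stripper standing in for whatever real stemmer is used; none of these names come from the slides.

```python
import re
from html import unescape

# Tiny illustrative stopword list (hypothetical; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "the", "of", "to", "in", "is", "for", "on"}


def strip_html(html: str) -> str:
    """HTML preprocessing: drop script/style blocks and tags, decode entities."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return unescape(text)


def tokenize(text: str) -> list[str]:
    """Tokenization: lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token: str) -> str:
    """Stemming stand-in: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(html: str) -> list[str]:
    """Run all four steps and return the stemmed, stopword-free tokens."""
    tokens = tokenize(strip_html(html))
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```

Calling `preprocess` on each retrieved page yields the token lists that Phase 3 turns into a bag-of-words vector.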

Phase 3 – Represent each term in the Vector Space Model (bag-of-words)

Calculating weight of each term

Notations for WebSim
ti: ith term
Di: set of Web pages returned for ti
Ni = |Di|: size of Di
dk: kth document in Di
lk = |dk|: length of dk
fij: jth feature of term ti (an element of the set of features of ti)
wij: weight of fij

Heuristic 1 – Important words occur more frequently within a document than unimportant words. More notation: let freqijk = number of occurrences of fij in document dk, where dk ∈ Di. The term frequency (TF) tfijk of fij in dk is then defined from freqijk [equation not preserved in the transcript].

Heuristic 2 – The more often a word occurs throughout the documents in Di, the stronger its predictive power. This relates to the document frequency (DF) of the word, defined as the percentage of documents that contain the word. Combining TF and DF gives a ranking scheme that calculates the weight wij of a feature [equation not preserved in the transcript], where nij = number of documents in Di in which fij occurs at least once. We keep only the highest-weighted 80% of the features, which reduces the number of features significantly while minimizing the loss of information.
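The TF and weight equations did not survive in this transcript, so the sketch below fills them with explicit assumptions rather than the authors' formulas: TF is taken as freq / lk (length-normalized, since lk is defined in the notation), and the weight as the mean TF over Di multiplied by the DF fraction nij / Ni. The function name and the `keep_fraction` parameter are hypothetical.

```python
from collections import Counter


def feature_weights(docs: list[list[str]], keep_fraction: float = 0.8) -> dict[str, float]:
    """Weight the features of one term from its preprocessed pages (D_i).

    ASSUMPTIONS (not confirmed by the slides):
      tf_ijk = freq_ijk / l_k
      w_ij   = (sum_k tf_ijk / N_i) * (n_ij / N_i)
    """
    docs = [d for d in docs if d]          # ignore empty pages
    n_docs = len(docs)                     # N_i
    if n_docs == 0:
        return {}
    tf_sum: Counter = Counter()            # sum over k of tf_ijk
    doc_count: Counter = Counter()         # n_ij
    for doc in docs:
        for feat, freq in Counter(doc).items():
            tf_sum[feat] += freq / len(doc)
            doc_count[feat] += 1
    weights = {f: (tf_sum[f] / n_docs) * (doc_count[f] / n_docs) for f in tf_sum}
    # Keep the highest-weighted 80% of features; ties are broken by name.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = max(1, int(keep_fraction * len(ranked)))
    return dict(ranked[:keep])
```

Under these assumptions a feature scores highly only if it is both frequent within pages and present in many of them, which is the combination the two heuristics describe.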

Sample features of some terms

Next Step: Dynamic Semantics. Closeness between 2 terms. Cosine metric: measures similarity of 2 items according to the angle between them. Sim1(ti, tj) = Cosine(vi, vj) = (vi · vj) / (|vi| |vj|), where vi and vj are the feature vectors of ti and tj. Assumption: if terms ti (iPod) and tj (Apple) have some relationship, then the Web pages returned for ti and tj would be similar.
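A minimal sketch of the cosine metric over sparse feature vectors, assuming each term's vector is stored as a feature-to-weight dictionary (that representation is an assumption, not something the slides specify):

```python
import math


def cosine(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v2[f] for f, w in v1.items() if f in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Vectors with no shared features score 0, and vectors pointing in the same direction score 1 regardless of their lengths.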

Clearly Ambiguous: multiple meanings lead to uncorrelated Web pages. Example: OIL as an ontology term vs. oil as a source of energy. Example: "No parking violators will be towed"

Solution? Query expansion: combine ti and tj as a query. What happens if ti and tj are not related? What happens if ti and tj are related?

So, when the similarity between ti and tj is not high enough, it is refined as Sim2(ti, tj) = Average(Cosine(ti, tj ti), Cosine(tj, ti tj)), where "tj ti" and "ti tj" denote the combined queries.
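The refinement step might look like the sketch below. The `features` callable (query string to feature vector) and the `threshold` value are hypothetical; the slides say only that refinement applies when the first similarity is "not high enough".

```python
import math
from typing import Callable

Vector = dict[str, float]


def cosine(v1: Vector, v2: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v2[f] for f, w in v1.items() if f in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def websim(ti: str, tj: str, features: Callable[[str], Vector],
           threshold: float = 0.5) -> float:
    """Sim1 if it is high enough, otherwise the query-expansion refinement Sim2."""
    vi, vj = features(ti), features(tj)
    sim1 = cosine(vi, vj)
    if sim1 >= threshold:                  # threshold is an assumed cut-off
        return sim1
    # Sim2 = Average(Cosine(ti, "tj ti"), Cosine(tj, "ti tj"))
    sim_i = cosine(vi, features(f"{tj} {ti}"))
    sim_j = cosine(vj, features(f"{ti} {tj}"))
    return (sim_i + sim_j) / 2
```

The idea is that a combined query steers the search engine toward the sense the two terms share, if there is one, so related but ambiguous terms recover some similarity.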

Examples

Candidate term derivation for ontology modification. How to identify candidate terms that should be added to an ontology? Previous research: topic mining (identified features can be given as input to WebSim for ontology enrichment). Feature extraction methodology: an existing term in the ontology can be submitted as a query to a Web search engine; the obtained features that do not exist in the ontology are candidates for ontology enrichment.
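The feature-extraction route to candidate terms amounts to a filtered ranking. A small sketch, assuming features arrive as a feature-to-weight dictionary and the ontology is a set of term strings; `top_n` is a hypothetical cut-off not mentioned in the slides:

```python
def candidate_terms(term_features: dict[str, float],
                    ontology_terms: set[str],
                    top_n: int = 10) -> list[str]:
    """Features of an existing term that are not yet in the ontology,
    highest weight first (ties broken alphabetically)."""
    ranked = sorted(term_features.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feat for feat, _ in ranked if feat not in ontology_terms][:top_n]
```

For example, with features `{"apple": 0.5, "music": 0.4, "nano": 0.3}` and an ontology that already contains `"apple"`, the candidates are `["music", "nano"]`.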

Restructuring Ontologies. Brute force: compute pairwise similarities between all terms, O(m^2). WebSim approach: compute similarity between a term and its features, O(m).

WebSim and Semantic Similarity. Investigate relatedness between WebSim and existing ontologies like WordNet. Approach 1: shortest/average distance in the graph between two terms ti and tj. Assumption: links represent uniform distances.
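Approach 1 can be illustrated with a breadth-first search that counts links, which matches the uniform-distance assumption. The edge list format and the function names are hypothetical; a real experiment would walk WordNet's own graph instead.

```python
from collections import deque
from typing import Optional


def build_graph(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Undirected adjacency sets from (parent, child) links."""
    graph: dict[str, set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph: dict[str, set[str]], start: str, goal: str) -> Optional[int]:
    """Number of links on the shortest path, or None if no path exists."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Because every link counts as 1, two terms deep in a dense part of the taxonomy can appear as far apart as two very general terms, which is the weakness that motivates the information-content approach.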

Alternative Methods. Approach 2: incorporate empirical probability estimates into a taxonomical structure. What does it mean? The information content of a term ti, IC(ti), is defined as IC(ti) = -log(p(ti)), where p(ti) is the probability that term ti occurs. The equation states that informativeness decreases as concept probability increases. This quantification of information provides a new approach to measuring semantic similarity: the more information two terms share, the more similar they are.

Resnik defines the information shared by two terms as the maximum information content of the common parents of the terms in the ontology: Resnik(ti, tj) = max over t ∈ CP(ti, tj) of [-log(p(t))], where CP(ti, tj) represents the set of parent terms shared by ti and tj. Range: 0 to infinity.

Lin's metric: Lin(ti, tj) = 2 × Resnik(ti, tj) / (IC(ti) + IC(tj)). Range: 0 (dissimilarity) to 1 (similarity).
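The information-content measures can be tried on a toy taxonomy as sketched below. The `parent` map and the probabilities in the usage note are made up for illustration. Lin's metric is written in its standard form, 2 × Resnik / (IC(ti) + IC(tj)), and, as an assumption that goes slightly beyond "common parents", a term is treated as its own subsumer so that Lin(t, t) = 1.

```python
import math
from typing import Optional


def information_content(p: float) -> float:
    """IC = -log(p), written as log(1/p) so that p = 1 gives 0.0, not -0.0."""
    return math.log(1.0 / p)


def subsumers(term: str, parent: dict[str, Optional[str]]) -> set[str]:
    """The term itself plus all of its ancestors (ASSUMPTION: self included)."""
    found = set()
    node: Optional[str] = term
    while node is not None:
        found.add(node)
        node = parent.get(node)
    return found


def resnik(ti: str, tj: str, parent: dict[str, Optional[str]],
           prob: dict[str, float]) -> float:
    """Maximum information content over the shared subsumers (0.0 if none)."""
    common = subsumers(ti, parent) & subsumers(tj, parent)
    return max((information_content(prob[t]) for t in common), default=0.0)


def lin(ti: str, tj: str, parent: dict[str, Optional[str]],
        prob: dict[str, float]) -> float:
    """Shared information normalized by the terms' own information content."""
    denom = information_content(prob[ti]) + information_content(prob[tj])
    if denom == 0.0:                      # degenerate: both terms carry no information
        return 0.0
    return 2.0 * resnik(ti, tj, parent, prob) / denom
```

With `parent = {"ipod": "player", "walkman": "player", "player": "device", "phone": "device", "device": None}` and probabilities 1.0, 0.5, 0.5, 0.25, 0.25 for device, player, phone, ipod, walkman, the pair (ipod, walkman) shares "player", while (ipod, phone) shares only the root and so scores zero on both measures.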

WebSim vs. Semantic Similarity

It’s Over
