

Model of Web Clustering Engine Enrichment with a Taxonomy, Ontologies and User Information

Carlos Cobos-Lozada MSc. Ph.D. (c)

[email protected] / [email protected]

Advisor: Elizabeth León Ph.D.

[email protected]

Visiting scholar, Modern Heuristic Research Group

LISI-MIDAS: Universidad Nacional de Colombia, Sede Bogotá

GTI: Universidad del Cauca

Idaho Falls, October 5, 2011




Agenda

  • Preliminaries

  • Latent Semantic Indexing

  • Web Clustering Engines

  • Proposed Model


Preliminaries

Information Retrieval System

[Diagram: the user issues a query with auto-complete support; the query is extended and passed to the retrieval process, which matches it against the indexes built by the indexing process over the document collection; the results are presented through visualization and browsing, and user feedback flows back into the query.]


Preliminaries

Information Retrieval Models

  • Classic Models: Boolean, Vector Space, Probabilistic
  • Set Theoretic: Fuzzy, Extended Boolean
  • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  • Probabilistic: Inference Network, Belief Network
  • Structured Models: Non-Overlapping Lists, Proximal Nodes


Preliminaries

Classic Models – Basic Concepts

  • Each document is represented by a set of representative keywords or index terms
  • An index term is a document word useful for recalling the document's main themes
  • Usually, index terms are nouns, because nouns have meaning by themselves
  • However, some search engines assume that all words are index terms (full-text representation)
  • Not all terms are equally useful for representing the document contents; e.g., less frequent terms identify a narrower set of documents
  • The importance of an index term is represented by a weight associated with it


Preliminaries

Indexing Process

[Pipeline: document → recognition of structure → full-text representation → tokenization → filters → stop-word removal → noun-group removal → vocabulary restriction → stemming → keywords]

Preliminaries

Indexing Process – Sample

Original: WASHINGTON - The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again.

Tokens: WASHINGTON The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again

Filters: washington the house of representatives on tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again

Stop words: washington house representatives tuesday passed bill puts government stable financial footing weeks resolve battle spending flare

Stemming: washington hous repres tuesdai pass bill put govern stabl financi foot week resolv battl spend flare


Preliminaries

Indexing Process – Sample

Original: TRENTON, New Jersey - New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race, in a move that sets up a battle between Mitt Romney and Rick Perry.

Tokens: TRENTON New Jersey New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race in a move that sets up a battle between Mitt Romney and Rick Perry

Filters: trenton new jersey new jersey governor chris christie dashed hopes on tuesday he might make a late leap into the 2012 republican presidential race in a move that sets up a battle between mitt romney and rick perry

Stop words: trenton jersey jersey governor chris christie dashed hopes tuesday make late leap 2012 republican presidential race move sets battle mitt romney rick perry

Stemming: trenton jersei jersei governor chri christi dash hope tuesdai make late leap 2012 republican presidenti race move set battl mitt romnei rick perri
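The slides do not name a toolkit; as a minimal sketch of the same tokenize → filter → stop-word removal → stemming steps, here is a Python version using NLTK's stop-word list and Porter stemmer (both assumptions, chosen because the stemmed sample above matches Porter-style output):

```python
# Minimal sketch of the indexing pipeline illustrated above.
# NLTK is an assumption; requires nltk.download('stopwords') once.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def index_terms(text: str) -> list[str]:
    tokens = re.findall(r"[A-Za-z0-9]+", text)      # tokenization (drop punctuation)
    lowered = [t.lower() for t in tokens]           # filters: lower-casing
    kept = [t for t in lowered if t not in STOP]    # stop-word removal
    return [STEMMER.stem(t) for t in kept]          # stemming

print(index_terms("The House of Representatives on Tuesday passed a bill"))
# e.g. ['hous', 'repres', 'tuesdai', 'pass', 'bill'] (cf. the stemmed sample;
# exact stems depend on the stemmer variant)
```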


Preliminaries

TF-IDF and the Term-Document Matrix

The observed frequency of each term in each document is accumulated into a Term-Document Matrix (TDM), weighted with TF-IDF (a sketch follows below), and stored in an inverted index.
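The weighting formula itself is not shown on the slide; a common TF-IDF variant computed from the observed frequencies might look like this (an illustration, not necessarily the exact scheme used in the model):

```python
# Sketch: build a TF-IDF weighted term-document matrix from tokenized documents,
# using the common weighting tf(t, d) * log(N / df(t)).
import math
from collections import Counter

def tfidf_matrix(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))     # document frequency of each term
    matrix = []
    for term in vocab:
        idf = math.log(n_docs / df[term])
        matrix.append([d.count(term) * idf for d in docs])  # one row per term
    return vocab, matrix

terms, tdm = tfidf_matrix([["washington", "hous", "bill"],
                           ["trenton", "governor", "bill"]])
```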



Preliminaries

Cosine Similarity
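The formula appears only as an image in the original slide; the standard cosine similarity between a query q and a document d_j, with term weights w, is presumably what was shown:

$$\mathrm{sim}(q, d_j) = \cos\theta = \frac{\vec{q}\cdot\vec{d_j}}{\lVert\vec{q}\rVert\,\lVert\vec{d_j}\rVert} = \frac{\sum_{i=1}^{t} w_{i,q}\, w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}}$$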


Preliminaries

Sample 1: Vector Space Model

[Figure: query q and documents d1–d7 plotted as vectors in a three-term space with axes t1, t2 and t3; the documents whose vectors form the smallest angle with q are the most similar.]


Preliminaries

Sample 2: Vector Space Model

[Figure: a second configuration of query q and documents d1–d7 in the same term space (t1, t2, t3), illustrating ranking by cosine similarity.]


Preliminaries

Vector Space Model

  • Advantages:
    • Simple model based on linear algebra
    • Supports non-binary term weights
    • Allows computing a continuous degree of similarity between queries and documents
    • Allows ranking documents according to their possible relevance
    • Allows partial matching
  • Limitations:
    • Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
    • Word substrings might result in a "false-positive match"
    • Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false-negative match"
    • The order in which the terms appear in the document is lost in the vector space representation
    • Assumes terms are statistically independent


Latent Semantic Indexing

  • Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
  • SVD:
    • Factors the term-document matrix into singular vectors and singular values (defined on the next slide)
    • Can also be used to reduce noise in the data (SVD projects the data onto a reduced dimension)


Latent Semantic Indexing

SVD

  • Let A denote an m × n matrix of real-valued data with rank r, where without loss of generality m ≥ n, and therefore r ≤ n. Its singular value decomposition is A = U Σ Vᵀ (see the NumPy sketch below), where:
    • The columns of U are the left singular vectors and form an orthonormal basis for the column space of A
      • U holds the eigenvectors of AAᵀ (orthogonal)
    • The rows of Vᵀ contain the right singular vectors and form an orthonormal basis for the row space of A
      • V holds the eigenvectors of AᵀA (orthogonal)
    • Σ is a diagonal matrix whose entries are the square roots of the (shared) eigenvalues of AAᵀ and AᵀA, sorted in decreasing order: Σᵢ,ᵢ ≥ Σⱼ,ⱼ for i < j, and Σᵢ,ᵢ = 0 for i ≥ r (r ≤ n)
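A minimal numerical sketch with NumPy (an illustration; the worked example on the next slide was produced with some SVD implementation, not necessarily this one):

```python
import numpy as np

A = np.random.rand(10, 8)                  # m x n term-document matrix (m >= n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)          # (10, 8) (8,) (8, 8)
Sigma = np.diag(s)                         # singular values, sorted in decreasing order
print(np.allclose(A, U @ Sigma @ Vt))      # True: A = U * Sigma * V^T
```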


Latent Semantic Indexing

A = U Σ Vᵀ for a sample term-document matrix (rows = terms, columns = documents):

A (m×n = 10×8, observed frequencies):
3 5 5 5 4 3 1 2
3 4 5 4 3 3 4 3
3 4 5 3 3 4 3 2
5 5 3 5 4 5 5 5
3 4 4 4 5 4 3 4
4 5 4 5 4 3 5 2
2 5 4 3 3 3 3 2
5 5 4 5 3 5 5 4
5 5 5 5 4 5 4 4
4 3 4 4 1 2 4 3

U (m×n, left singular vectors):
0.29 0.64 -0.01 -0.29 -0.56 -0.19 0.12 0.15
0.30 0.06 0.28 -0.02 0.24 0.48 0.12 0.47
0.28 0.24 0.13 -0.12 0.59 -0.24 -0.42 -0.08
0.37 -0.44 -0.44 0.09 -0.14 -0.08 0.24 -0.15
0.31 0.15 -0.52 -0.03 0.08 0.61 -0.09 -0.01
0.33 -0.01 0.27 0.71 -0.35 0.06 -0.43 -0.07
0.26 0.28 0.09 0.39 0.33 -0.18 0.66 -0.27
0.37 -0.35 -0.02 -0.06 0.04 -0.41 0.04 0.62
0.38 -0.04 -0.12 -0.30 0.04 -0.18 -0.27 -0.40
0.26 -0.33 0.58 -0.38 -0.16 0.25 0.18 -0.32

Σ (n×n, singular values on the diagonal):
diag(34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35)

Vᵀ (n×n, right singular vectors):
0.34 0.41 0.38 0.39 0.31 0.34 0.34 0.29
-0.35 0.28 0.48 0.03 0.38 -0.06 -0.55 -0.35
0.10 0.05 0.51 0.13 -0.53 -0.42 0.33 -0.38
-0.28 0.36 -0.38 -0.09 0.36 -0.16 0.56 -0.41
-0.30 -0.04 0.38 -0.68 -0.11 0.45 0.29 0.08
-0.29 -0.43 0.26 0.05 0.39 -0.50 0.24 0.44
-0.40 0.60 -0.10 0.02 -0.38 -0.22 -0.10 0.52
-0.58 -0.27 -0.04 0.60 0.20 0.41 0.12 -0.11


Latent Semantic Indexing

  • Using SVD to reduce noise:
    • Keep only the first r singular values of Σ (and the corresponding r columns of U and r rows of Vᵀ) instead of all n
    • What value of r? e.g., the smallest r that preserves 90% of the Frobenius norm
  • In this case r = 5, with r < n (n = 8)


Latent Semantic Indexing

The same decomposition truncated to rank r = 5: A ≈ Uᵣ Σᵣ Vᵣᵀ, where Uᵣ (m×r) keeps the first r columns of U, Σᵣ (r×r) keeps the top-left r×r block of Σ, and Vᵣᵀ (r×n) keeps the first r rows of Vᵀ (same matrices as on the previous slide).


Latent Semantic Indexing

Value of r? (smallest r whose leading singular values reach 90% of the total)

Sum ← 0
For i ← 1 to n do
    Sum ← Sum + Σ(i, i)
End for
Threshold ← Sum * 0.9    // 90% of the total (the slides' Frobenius-norm criterion)
r ← 0
Temp ← 0
For i ← 1 to n do
    Temp ← Temp + Σ(i, i)
    r ← r + 1
    If Temp ≥ Threshold then
        break
    End if
End for
Return r
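A NumPy equivalent of the pseudocode above (a sketch; like the slides, it accumulates the singular values themselves rather than their squares):

```python
import numpy as np

def choose_rank(singular_values: np.ndarray, fraction: float = 0.9) -> int:
    """Smallest r whose leading singular values reach `fraction` of the total,
    mirroring the pseudocode above."""
    total = singular_values.sum()
    running = 0.0
    for r, sigma in enumerate(singular_values, start=1):
        running += sigma
        if running >= fraction * total:
            return r
    return len(singular_values)

s = np.array([34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35])  # from the example above
print(choose_rank(s))   # -> 5, matching the slide's r = 5
```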


Latent Semantic Indexing

  • Retrieving documents in the latent space
    • Documents in the latent space:
    • Terms in the latent space:
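The projection formulas appear only as images in the original slides; under the truncated decomposition A ≈ Uᵣ Σᵣ Vᵣᵀ used above, the standard LSI forms are presumably:

$$\hat{d}_j = \Sigma_r^{-1} U_r^{T} d_j \qquad\qquad \hat{t}_i = \Sigma_r^{-1} V_r^{T} t_i$$

where d_j is the j-th column of A (a document vector) and t_i is the i-th row of A (a term vector), taken as a column.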



Latent Semantic Indexing

  • Query in the latent space:

  • Cosine similarity
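Likewise, the query formula was an image in the original; folding a query q into the latent space and ranking by cosine similarity is standardly done as:

$$\hat{q} = \Sigma_r^{-1} U_r^{T} q, \qquad \mathrm{sim}(\hat{q}, \hat{d}_j) = \frac{\hat{q}\cdot\hat{d}_j}{\lVert\hat{q}\rVert\,\lVert\hat{d}_j\rVert}$$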



Web Clustering Engines


Web Clustering Engines

  • The search aspects where a Web Clustering Engine (WCE) can be most useful in complementing the output of plain search engines are:
    • Fast subtopic retrieval: documents in a subtopic can be accessed in logarithmic rather than linear time
    • Topic exploration: clusters provide a high-level view of the whole query topic, including terms for query reformulation (particularly useful for informational searches in unknown or dynamic domains)
    • Alleviating information overlook: users may review hundreds of potentially relevant results without the need to download and scroll through subsequent pages


Web Clustering Engines

  • Web Document Clustering (WDC) poses new requirements and challenges for clustering technology:
    • Meaningful labels
    • Computational efficiency (response time)
    • Short input data descriptions (snippets)
    • Unknown number of clusters
    • Ability to work with noisy data
    • Overlapping clusters


General Model

[Pipeline: query → search results acquisition (snippets) → preprocessing (features) → cluster construction and labeling (clusters) → visualization; a skeletal sketch follows below.]
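A skeletal sketch of this general pipeline (the function and class names are hypothetical; they only mirror the four stages in the diagram):

```python
# Hypothetical skeleton of the general web-clustering-engine pipeline above.
from dataclasses import dataclass

@dataclass
class Snippet:
    title: str
    text: str
    url: str

def acquire_results(query: str) -> list[Snippet]:
    """Search results acquisition: query the search APIs and collect snippets."""
    ...

def preprocess(snippets: list[Snippet]) -> list[list[str]]:
    """Preprocessing: turn each snippet into features (e.g. stemmed terms)."""
    ...

def cluster_and_label(features, snippets):
    """Cluster construction and labeling: group snippets and label each group."""
    ...

def run(query: str):
    snippets = acquire_results(query)
    features = preprocess(snippets)
    clusters = cluster_and_label(features, snippets)
    return clusters        # handed to the visualization stage
```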


Proposed Model

[Pipeline: the same stages as the general model (query → search results acquisition → preprocessing → cluster construction and labeling → visualization), enriched with query expansion, concepts instead of terms, an evolutionary clustering approach (online and offline), user feedback, and a general taxonomy of knowledge, ontologies and user information.]


Query Expansion Process

A registered user issues a keyword query in a common graphical interface (like Google's) and receives online help (auto-complete, shown in a dropdown list) based on his/her user profile.

[Diagram: (1) pre-processing and semantic relationship against the General Taxonomy of Knowledge, which holds concepts, relations (is-a, is-part-of) and instances; (2) concepts related to the user profile, drawn from the Inverted Index of Concepts and the specific ontologies; (3) an external search service. The keyword query becomes an extended query; a user profile is linked to zero or more (0..*) concepts. A simplified sketch of the expansion step follows below.]
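A simplified sketch of the expansion step (the data structures are hypothetical stand-ins for the Inverted Index of Concepts and the user profile):

```python
# Hypothetical sketch: expand a keyword query with concepts related to the
# user's profile. `concept_index` stands in for the Inverted Index of Concepts.
def expand_query(keywords: list[str],
                 concept_index: dict[str, list[str]],
                 user_profile_concepts: set[str]) -> list[str]:
    expanded = list(keywords)
    for kw in keywords:
        for concept in concept_index.get(kw, []):
            # prefer concepts the user has already used or rated positively
            if concept in user_profile_concepts and concept not in expanded:
                expanded.append(concept)
    return expanded

print(expand_query(["jaguar"],
                   {"jaguar": ["animal", "car brand"]},
                   {"animal"}))
# -> ['jaguar', 'animal']
```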


Query Expansion Process (B)

  • The General Taxonomy of Knowledge (GTK) and the specific ontologies are multilingual (maintained through a collaborative editing process)
  • The user profile holds:
    • The GTK nodes used by the user
    • A relation to the Inverted Index of Concepts (ontologies), to support the rating process:
      • It keeps the concepts the user has previously rated (good/bad) for a specific ontology


Term-Document Matrix (Observed Frequency): TDM-OF Building Process

  • Extended query: original keywords + other concepts + selected GTK nodes (ontologies)
  • The query is sent to the Google, Yahoo! and Bing APIs in independent threads, and each web search result is processed in parallel:
    • Pre-processing:
      • Tokenization
      • Filters (special characters and lower-casing)
      • Stop-word removal
      • Language detection
      • Stemming (English/Spanish)
    • For each document, accumulate the observed frequency of each term
    • Mark the document as processed
  • The output is the Term-Document Matrix with observed frequencies (TDM-OF); a sketch follows below
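A simplified sketch of the parallel TDM-OF construction (the two helpers are hypothetical stand-ins for the real search-API calls and the preprocessing pipeline sketched earlier):

```python
# Hypothetical sketch: build the observed-frequency term-document matrix
# (TDM-OF) from snippets fetched in parallel.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fetch_results(api_name: str, query: str) -> list[str]:
    # placeholder for the Google / Yahoo! / Bing API calls
    return [f"snippet about {query} from {api_name}"]

def preprocess(text: str) -> list[str]:
    # placeholder for tokenization, filters, stop-word removal and stemming
    return text.lower().split()

def build_tdm_of(query: str) -> list[Counter]:
    apis = ["google", "yahoo", "bing"]
    with ThreadPoolExecutor() as pool:                  # independent threads, one per API
        per_api = pool.map(lambda api: fetch_results(api, query), apis)
    docs = [snippet for results in per_api for snippet in results]
    # one Counter of observed term frequencies per processed document
    return [Counter(preprocess(doc)) for doc in docs]

tdm_of = build_tdm_of("web clustering")
```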


Concept-Document Matrix (Observed Frequency): CDM-OF Building Process

In parallel, for each document marked as processed:

  • Join the terms that belong to the same concept in the selected specific ontologies (taken from the extended query)
  • Accumulate the observed frequencies of the terms joined into the same concept

The process ends when all web search results have been processed (thread synchronization); the output is the Concept-Document Matrix with observed frequencies (CDM-OF). A sketch of the term-to-concept merge follows below.
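A simplified sketch of the term-to-concept merge (term_to_concept is a hypothetical stand-in for the mapping defined by the selected specific ontologies):

```python
# Hypothetical sketch: collapse term frequencies into concept frequencies using
# a term -> concept mapping drawn from the selected specific ontologies.
from collections import Counter

def build_cdm_of(tdm_of: list[Counter], term_to_concept: dict[str, str]) -> list[Counter]:
    cdm_of = []
    for doc_freqs in tdm_of:                       # one document at a time
        concept_freqs = Counter()
        for term, freq in doc_freqs.items():
            # terms mapped to the same concept are joined; unmapped terms kept as-is
            concept_freqs[term_to_concept.get(term, term)] += freq
        cdm_of.append(concept_freqs)
    return cdm_of

print(build_cdm_of([Counter({"car": 2, "automobile": 1})],
                   {"car": "vehicle", "automobile": "vehicle"}))
# -> [Counter({'vehicle': 3})]
```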


Concept-Document Matrix (CDM) Building Process

Calculate the weight (TF-IDF) of each concept in each document, turning the CDM-OF into the Concept-Document Matrix (CDM).


Clustering Process

Three algorithms of our own:

  • A hybridization of Global-Best Harmony Search with the K-means algorithm
  • A memetic algorithm with niching techniques (restricted competition replacement and restrictive mating)
  • A memetic algorithm (roulette-wheel selection, K-means, and replace-the-worst)

All of the algorithms:

  • Define the number of clusters automatically (using BIC); a sketch of BIC-based selection follows below
  • Can use a standard Term-Document Matrix (TDM), Frequent Term-Document Matrix (FTDM), Concept-Document Matrix (CDM) or Frequent Concept-Document Matrix (FCDM)
  • Are tested with data sets based on Reuters-21578 and DMOZ, and evaluated by users

The output is the set of clustered documents.
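The harmony-search and memetic algorithms themselves are not reproduced here; purely as an illustration of selecting the number of clusters with BIC, here is a sketch using scikit-learn's Gaussian mixture models (a stand-in technique, not the model's actual clustering algorithm):

```python
# Illustration only: choose the number of clusters by minimizing BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X: np.ndarray, k_max: int = 10) -> int:
    best_k, best_bic = 1, float("inf")
    for k in range(1, min(k_max, len(X)) + 1):
        gm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bic = gm.bic(X)                 # lower BIC = better fit/complexity trade-off
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(choose_k_by_bic(X))               # typically 2 for two well-separated blobs
```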


Labeling Process

Two labeling strategies are used (a simplified sketch of candidate-label induction follows below):

Statistically representative terms:
  • Initialize the algorithm parameters
  • Build the "Others" label and cluster
  • Induce candidate labels
  • Eliminate repeated terms
  • Visually improve the labels

Frequent phrases:
  • Convert the representation
  • Concatenate the documents
  • Discover complete phrases
  • Make the final selection
  • Build the "Others" label and cluster
  • Induce the cluster labels

Clusters may overlap. The output is the set of clustered and labeled documents.
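A much-simplified sketch of inducing candidate labels from statistically representative terms (top-weighted terms per cluster; an illustration only, not the presentation's full procedure):

```python
# Illustration only: pick the top-weighted terms of each cluster as candidate
# labels. The full labeling process (repeated-term elimination, "Others"
# cluster, visual improvement) is omitted here.
from collections import Counter

def candidate_labels(cluster_docs: list[Counter], top_n: int = 3) -> list[str]:
    totals = Counter()
    for doc in cluster_docs:               # aggregate term weights over the cluster
        totals.update(doc)
    return [term for term, _ in totals.most_common(top_n)]

print(candidate_labels([Counter({"jaguar": 3, "car": 2}),
                        Counter({"car": 4, "speed": 1})]))
# -> ['car', 'jaguar', 'speed']
```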


Visualization and Rating Process

  • During experimentation, for each cluster the user answered:
    • (Q1) whether the cluster label is representative of the cluster (much, little, or nothing)
    • (Q2) whether the cluster is useful, moderately useful, or useless
  • Then, for each document in each cluster, the user answered:
    • (Q3) whether the document matches the cluster (matches very well, matches moderately, or does not match)
    • (Q4) whether the document's relevance (position) in the cluster was adequate (adequate, moderately suitable, or inadequate)
  • The ratings are linked to the user profile


Visualization and Rating Process

In production, the user can mark each document as useful (relevant) or not.

[Diagram: the rating process is connected to the user profile, the Inverted Index of Concepts (0..*), the specific ontologies and the General Taxonomy of Knowledge.]


Proposed Model


Collaborative Editing Process of Ontologies

1. The editor selects a node of the General Taxonomy of Knowledge (with its associated specific ontology)
2. The editor edits the ontology: concepts, synonyms in different languages, relations, and instances
3. The editing is supported by general ontologies (e.g. WordNet)
4. The editing is also supported by the concepts already used by users (their profiles, linked 0..* to the Inverted Index of Concepts)
5. When the ontology is saved, the Inverted Index of Concepts is updated automatically (this update can be fully automatic)



Carlos Cobos-Lozada MSc. Ph.D. (c)

[email protected] / [email protected]

Questions?

Model of Web Clustering Engine Enrichment with a Taxonomy, Ontologies and User Information

