Document Categorization and Related Concepts Prediction using Wikipedia Articles, Category Network and Page Links Graph

Zareen Saba Syed


Wikipedia: The Free Encyclopedia

Wikipedia is a free online encyclopedia that has grown exponentially and has developed into probably the largest freely available knowledge base. Its size and coverage have reached a point where it can be used to identify the topics discussed in a document. Research has shown that a simple algorithm using only Wikipedia categories and document titles can characterize documents quite well.

Problem Statement:

To Categorize and Predict the Concepts Related to Documents using a General Ontology.


1. To investigate the use of the Wikipedia category network as a general ontology to categorize documents.

2. To investigate the use of Wikipedia articles as concepts and article links as relations between concepts for concept prediction.


Given a set of documents, the Wikipedia article index is used to find the top 'n' similar documents in the index. These top 'n' documents serve as the initial set of activated nodes for Spreading Activation on the Wikipedia Category Graph and the Wikipedia Page Links Graph. The output is the title of the highest activated node, i.e., the “Category” in the case of the Wikipedia Category Graph and the “Related Concept” in the case of the Wikipedia Page Links Graph.
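
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the additive activation update, the fixed number of pulses, and the toy category graph are all assumptions made for the example. The output function (activation divided by out-degree and pulse number) is one reading of the symbols used in this presentation.

```python
from collections import defaultdict


def spread_activation(graph, initial, pulses=2):
    """Run a fixed number of spreading-activation pulses.

    graph:   dict mapping node -> {successor: edge_weight}
    initial: dict mapping node -> starting activation (e.g. a retrieval score)
    Returns a dict with the final activation of every node reached.
    """
    activation = defaultdict(float, initial)
    frontier = set(initial)  # nodes that fire during the current pulse
    for k in range(1, pulses + 1):
        incoming = defaultdict(float)
        for node in frontier:
            successors = graph.get(node, {})
            if not successors:
                continue
            # Assumed output function: damp by out-degree and pulse number.
            output = activation[node] / (len(successors) * k)
            for succ, weight in successors.items():
                # Node input: sum over predecessors of output * edge weight.
                incoming[succ] += output * weight
        for node, value in incoming.items():
            activation[node] += value  # assumed additive activation update
        frontier = set(incoming)
    return dict(activation)


def predict(graph, initial, pulses=2):
    """Return the title of the highest activated node."""
    final = spread_activation(graph, initial, pulses)
    return max(final, key=final.get)


# Hypothetical category graph with unit edge weights.
graph = {
    "Machine learning": {"Artificial intelligence": 1.0},
    "Data mining": {"Artificial intelligence": 1.0, "Databases": 1.0},
    "Artificial intelligence": {"Computer science": 1.0},
    "Databases": {"Computer science": 1.0},
}
# Hypothetical similarity scores of the top-n matching nodes.
initial = {"Machine learning": 0.9, "Data mining": 0.8}
predicted = predict(graph, initial, pulses=2)
```

In this toy graph the two seed nodes share a parent, so that parent accumulates activation from both and becomes the prediction. The number of pulses controls how far up the graph the prediction generalizes.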

Spreading Activation:

Node Input Function:


Oi : Output of Node i connected to node j

Wij : Weight on edge from node i to node j

Node Output Function:


Aj : Activation of Node j

k : Pulse No.

Dj : Out Degree of Node j
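
Written out with the symbols listed above, one consistent form of the two functions is given below. It is a reconstruction under assumptions rather than a quotation: the symbol I_j (total input to node j) is introduced here for illustration, and the exact form of the output function is inferred from the listed quantities A_j, D_j and k.

```latex
I_j = \sum_i O_i \, W_{ij}
\qquad\qquad
O_j = \frac{A_j}{D_j \cdot k}
```

Under this reading, a node passes on less activation the more outgoing links it has and the later the pulse, which keeps heavily linked nodes from dominating the spread.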

Edge Weights for Wikipedia Category Links Graph:

For the Wikipedia category links graph, unit edge weights are used.

Edge Weights for Wikipedia Page Links Graph:

Wikipedia articles may be heavily linked, and an article may contain links to pages that are not relevant to its topic. For example, an article in which the name of a country appears may link that name to the Wikipedia page for the country, and an article that mentions a term may link to a page defining that term; such links may not be directly related to the title or concept of the original article. Therefore, we use the Lucene similarity score between each pair of linked articles as the edge weight for spreading activation, and we also use it to filter out links whose similarity falls below a threshold.
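
The weighting and filtering step can be illustrated as follows. This sketch swaps in a plain cosine similarity over term counts as a stand-in for the Lucene similarity score, which is computed differently. The function names, the threshold value and the sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of term-count vectors (stand-in for a Lucene score)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_links(articles, links, threshold=0.2):
    """Weight each page link by similarity and drop links below the threshold.

    articles: dict mapping title -> article text
    links:    iterable of (source_title, target_title) page links
    Returns a dict mapping source -> {target: weight}.
    """
    graph = {}
    for src, dst in links:
        score = cosine_similarity(articles[src], articles[dst])
        if score >= threshold:
            graph.setdefault(src, {})[dst] = score
    return graph


# Hypothetical article texts and page links.
articles = {
    "Neural network": "neural network learning model layers",
    "Deep learning": "deep learning neural network layers model",
    "France": "france country europe paris",
}
links = [("Neural network", "Deep learning"), ("Neural network", "France")]
page_graph = weighted_links(articles, links)
```

Here the link to the country page shares no terms with the source article, so it is filtered out, while the topically related link is kept with its similarity as the edge weight.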

Use Cases

1. Improved Information Retrieval

Categorizing corpus documents based on a general, user-developed folksonomy (the Wikipedia category network) would improve information retrieval tasks for common users.

2. Business Intelligence and Advertising

Knowing which web pages a user has looked at can give an idea of the user's general interests and aid in targeting.

3. Enterprise Content Management

Organizing documents using Wikipedia concepts and its ontology can help improve existing content management systems.

4. Aid in User Collaborations

Information about the articles that a user has accessed can help direct that user to others with similar interests and aid collaboration.


Spreading Activation

Spreading activation is a technique for retrieving relevant information that is associated with information already known to be relevant. It is based on the Spreading Activation Model, which is inspired by the way human memory operates.

Wikipedia Category Network is a Thesaurus

Wikipedia Article Network resembles the WWW Network

Prediction Algorithm

[Flowchart. Box labels: Input Documents; Wikipedia Articles Index; Lucene Index; Matching Documents; Activate Category Nodes; Activate Document Nodes; Compute Edge Weights; Calculating Edge Weights; Spreading Activation Category Links Graph; Spreading Activation Page Links Graph; Category Links Graph; Page Links Graph; Map Predicted Node to Category Title; Map Predicted Node to Page Title; Get Category Title; Get Page Title; Wikipedia Database]

Expensive Computations

Computing edge weights dynamically, using the Lucene similarity score between each pair of linked documents, is computationally expensive. One approach is to run the spreading activation algorithm in parallel: each node that gets activated could compute its edge weights dynamically and activate its successors in parallel. In addition, Spreading Activation involves matrix operations, which could be performed more efficiently on a Cell processor.
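
As an illustration of the first idea, the out-links of one activated node can be scored concurrently. The sketch below uses a thread pool from the Python standard library. The function name, the `similarity` callback and the sample scores are hypothetical, and the callback stands in for whatever similarity lookup the system actually performs.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_edge_weights(node, successors, similarity, max_workers=4):
    """Compute the weights of all out-links of one activated node concurrently.

    node:       the activated node
    successors: nodes that `node` links to
    similarity: callable (source, target) -> float, assumed safe to call
                from several threads at once
    Returns a dict mapping successor -> edge weight.
    """
    successors = list(successors)  # fix the order so results line up
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        scores = pool.map(lambda dst: similarity(node, dst), successors)
        return dict(zip(successors, scores))


# Hypothetical precomputed scores standing in for an index lookup.
sample_scores = {("A", "B"): 0.7, ("A", "C"): 0.1}
weights = parallel_edge_weights(
    "A", ["B", "C"], lambda src, dst: sample_scores[(src, dst)]
)
```

Threads suit this step when each similarity call mostly waits on an index or database. If the scoring is CPU-bound, a process pool or a native implementation would likely be needed to see a real speedup in Python.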

Opportunities for Parallelism

[Flowchart, final steps: Display Predicted Category; Display Predicted Concept]