Vritti Cognitive Search – Discovering concepts and trends in large body of text MS Computer Science Project, Mid term presentation. Under the guidance of Dr. R. Bhaskaran Head of Department - School of Mathematics Madurai Kamaraj University, Madurai. S Gopi 092504174


Vritti Cognitive Search – Discovering concepts and trends in large body of text

MS Computer Science Project, Mid term presentation

Under the guidance of

Dr. R. Bhaskaran

Head of Department - School of Mathematics

Madurai Kamaraj University, Madurai

S Gopi

092504174

Course: MS (Computer Science)

MS Computer Science, Manipal University

Table of Contents
  • Introduction
  • Literature Survey
  • Methodology and Progress
  • Proposed Future work and timeline

Introduction
  • Objective
  • Motivation
  • Goal

Objective
  • Develop a system named Vritti for extracting concepts and trends from a large body of text.
  • Vritti follows the “open literature discovery process” to mine documents and extract the meaningful information buried inside them.
  • Augment the existing keyword search framework with a range of text mining and information retrieval algorithms.

Concepts and Trends
  • We define a concept as a word or a phrase that describes a meaningful subject within a particular field.
  • Vritti discovers concepts within the context of the corpus under consideration.
  • Trends are defined as concepts that recur across multiple documents in the corpus.

Vritti Text Exploration
  • The first step of text exploration is search, followed by discovering concepts and their associated relationships.
  • Equipped with these concepts, which present a high-level view of the underlying documents, users should be able to search and infer information from a large body of text with ease.

Motivation
  • Subsumption: a learner, supported by an appropriate environment, is able to attach a new concept to those already existing in his or her cognitive structure.
  • Vritti aims to apply the same idea to searching and text exploration, making search a more natural process by enhancing the search experience of the information seeker.

Joseph D. Novak & Alberto J. Cañas, Florida Institute for Human and Machine Cognition. Technical Report IHMC CmapTools 2006-01 Rev 2008-01.

Goal
  • Increase search effectiveness by presenting users with concepts in addition to the documents matching the given query.
  • Concepts will be derived from the search results.
  • Allow users to move between the search results and the discovered concepts through query expansion or modification.

Literature Survey
  • Text exploration
    • Literature Based Discovery (LBD)
    • Berry picking
  • IR Models and Weighting Schemes
    • Vector space models
    • Term weighting schemes
    • Search Ranking Schemes
  • Concept Definition and Discovery
    • Word space models
    • Random Projections
    • Document Clustering
      • Lingo
      • Non Negative Matrix Factorization
    • Scalar Clustering

Literature Based Discovery (LBD)
  • Concept discovery in text was popularized by the work of Dr. Swanson, who tried to identify the relationship between fish oil and Raynaud’s syndrome.
  • The focus of Dr. Swanson’s work was to identify concepts and their relationships in bibliographic databases. His technique is known as Literature Based Discovery (LBD), and he defines it as a process of finding complementary structures in disjoint scientific literatures.

Janneck, M. C. (2006). Recent Advances in Literature Based Discovery. Journal of the American Society for Information Science and Technology, JASIST.

LBD Open discovery process

Berry Picking
  • Why is it necessary for the searcher to find a way to represent the information need in a query that the system can understand?
  • Why should the system not make it possible for searchers to express the need directly, as they ordinarily would, instead of in an artificial query representation for the system's consumption?

Berry picking challenges current keyword search methodology in four areas:

1. Nature of the query

2. Nature of the overall search process

3. Range of search techniques used

4. Information domain or territory where the search is conducted

Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Information Retrieval, 407-424.

Traditional Search vs. Berry Picking

IR Models and Weighting Schemes

Information Retrieval Model

  • The central premise of any information retrieval system is to distinguish relevant from irrelevant documents for a given query.
  • Relevance is decided by a ranking algorithm. Ranking algorithms use index terms. An index term is simply a word whose semantics helps in remembering the document’s main theme.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Association for Computing Machinery Inc (ACM).

Vector Space Model
  • In the vector space model (Salton, Wong & Yang, 1975), every document is represented by a multidimensional vector.
  • Each component of the vector corresponds to a particular keyword in the document.
  • The value of the component depends on the degree of relationship between the term and the underlying document. Term weighting schemes decide the relationship between the term and the document.
  • Vector cosine similarity decides document-query or document-document similarity.

Salton, G., Wong, A., & Yang, C. S. (1975, Nov). A vector space model for automatic indexing. Communications of the ACM.
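
As an illustration of the cosine comparison described on this slide, here is a minimal Python sketch. It assumes raw term counts as component values and whitespace tokenization; the function names and sample texts are hypothetical and are not taken from Vritti.

```python
import math
from collections import Counter

def to_vector(text):
    """Map a text to a sparse term-count vector (one component per keyword)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

doc = to_vector("fish oil reduces blood viscosity")
query = to_vector("fish oil")
unrelated = to_vector("stock exchange trade")

print(cosine_similarity(doc, query))      # shares two terms with the document
print(cosine_similarity(doc, unrelated))  # shares no terms with the document
```

The same function serves for document-query and document-document comparison, since both are just vectors in the same term space.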

IR Model Math Schemes
  • Several mathematical schemes, based on the type of IR model, have been developed to identify index terms.
    • Spärck Jones developed IDF, the inverse document frequency weighting.
    • A probabilistic IDF, called IDFP, was developed by Robertson.
    • All of the above weighting schemes decide the weight of a term based on its presence in the document.

Robertson, S. (2004). Understanding Inverse Document Frequency: On Theoretical Arguments for IDF. Journal of Documentation, 503-520.

Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 11-21.

Term Weighting
  • Binary: in the simplest case the association is binary: $a_{ij} = 1$ when keyword $i$ occurs in document $j$, and $a_{ij} = 0$ otherwise.
  • Term frequency: $a_{ij} = tf_{ij}$, where $tf_{ij}$ denotes how many times term $i$ occurs in document $j$.
  • TF-IDF: $a_{ij} = tf_{ij} \cdot \log(N / df_i)$, where $df_i$ denotes the number of documents in which term $i$ appears and $N$ represents the total number of documents in the collection.

Manning, C. D., Raghavan, P., & Schütze, H. Introduction to Information Retrieval. Cambridge University Press.
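
The three weighting schemes above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the `scheme` parameter and the toy corpus are assumptions, tokenization is a plain whitespace split, and the natural logarithm is used.

```python
import math
from collections import Counter

def term_document_weights(docs, scheme="tfidf"):
    """Return {term: [weight in each document]} for the chosen scheme."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    vocabulary = sorted(set().union(*counts))
    # df[t]: number of documents in which term t appears
    df = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    weights = {}
    for t in vocabulary:
        row = []
        for c in counts:
            tf = c[t]  # how many times term t occurs in this document
            if scheme == "binary":
                row.append(1 if tf > 0 else 0)
            elif scheme == "tf":
                row.append(tf)
            else:  # "tfidf": tf * log(N / df)
                row.append(tf * math.log(n_docs / df[t]))
        weights[t] = row
    return weights

docs = ["research award research", "award abstract", "basic research"]
for term, row in term_document_weights(docs).items():
    print(term, [round(w, 3) for w in row])
```

A term that occurs in every document gets a TF-IDF weight of zero, because log(N/N) = 0; this is how the scheme discounts uninformative words.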

Search Ranking Schemes
  • A combination of the Vector Space Model and the Boolean model determines how relevant a given document is to a user's query.
  • The Boolean model first narrows down the documents that need to be scored, based on the Boolean logic used in the query specification.
  • The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
  • The score of query q for document d correlates to the cosine distance or dot product between the document and query vectors in a Vector Space Model (VSM). A document whose vector is closer to the query vector in that model is scored higher.

Apache Lucene, Search Ranking Scheme
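
The two-stage idea (Boolean filtering, then vector space scoring) can be sketched as follows. This is a simplified illustration and not Lucene's actual scoring formula or API: the `rank` function is hypothetical, the Boolean query is assumed to be a plain AND of all query terms, and the score is a TF-IDF dot product divided by the document vector length.

```python
import math
from collections import Counter

def rank(query, docs):
    """Boolean AND filter followed by a length-normalized TF-IDF score."""
    counts = [Counter(d.lower().split()) for d in docs]
    terms = query.lower().split()
    n_docs = len(docs)
    # Step 1 (Boolean model): keep only documents containing every query term.
    candidates = [i for i, c in enumerate(counts) if all(t in c for t in terms)]
    # Step 2 (vector space model): score only the surviving documents.
    scores = {}
    for i in candidates:
        dot = 0.0
        for t in terms:
            df = sum(1 for c in counts if t in c)  # at least 1 for candidates
            dot += counts[i][t] * math.log(n_docs / df)
        length = math.sqrt(sum(v * v for v in counts[i].values()))
        scores[i] = dot / length  # normalization is an assumption of this sketch
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = [
    "neural network training",
    "neural network neural models",
    "protein folding",
]
for doc_id, score in rank("neural network", docs):
    print(doc_id, round(score, 3))
```

The third document is never scored, because the Boolean step removes it before any vector arithmetic takes place.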

Concept Definition and Discovery
  • A concept is a word or a phrase that describes a meaningful subject within a particular field.
  • Principal orthogonal vectors in the VSM are good concept candidates.
  • Words or co-occurring words that do not follow a Poisson distribution are good concept candidates.

Srinivasan, P. (1992). Thesaurus Construction. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval: Data Structures & Algorithms (pp. 161-218). Englewood Cliffs: Prentice Hall.
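
One common way to test how far a word departs from a Poisson model is residual IDF: the difference between the observed IDF and the IDF a Poisson model would predict from the word's total count. The sketch below uses that measure as an assumption; the slides do not say which test Vritti applies, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def residual_idf(docs):
    """Observed IDF minus the IDF predicted by a Poisson model, per term."""
    n_docs = len(docs)
    collection_freq = Counter()  # total occurrences of each term
    doc_freq = Counter()         # number of documents containing each term
    for doc in docs:
        counts = Counter(doc.lower().split())
        collection_freq.update(counts)
        doc_freq.update(counts.keys())
    scores = {}
    for term, cf in collection_freq.items():
        rate = cf / n_docs  # Poisson rate: average occurrences per document
        # Under Poisson, P(term occurs at least once in a document) = 1 - e^(-rate)
        predicted_idf = -math.log(1 - math.exp(-rate))
        observed_idf = math.log(n_docs / doc_freq[term])
        scores[term] = observed_idf - predicted_idf
    return scores

docs = [
    "the lucene lucene lucene lucene index",
    "the search",
    "the query",
    "the corpus",
]
scores = residual_idf(docs)
print(round(scores["lucene"], 3), round(scores["the"], 3))
```

Both "the" and "lucene" occur four times in this toy corpus, but "lucene" is concentrated in one document, so it deviates more from the Poisson prediction and is the stronger concept candidate.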

Word Space Models
  • VSMs treat words as indicators of content; there is no exact mapping from words to concepts.
  • In a word space model, a high-dimensional vector space is produced by collecting the data in a co-occurrence matrix F, such that each row Fw represents a unique word w and each column Fc represents a context c, typically a multi-word segment such as a document, or another word.
    • Latent Semantic Analysis (LSA) is an example of a word space model that uses document-based co-occurrences.
    • Hyperspace Analogue to Language (HAL) is an example of a model that uses word-based co-occurrences.

Asoh, L. S. (2001). Computing with Large Random Patterns.

Random Projections
  • Accumulate context vectors based on the occurrence of words in context.
  • Two step operation
    • First, each context (e.g. each document or each word) in the data is assigned a unique, randomly generated representation called an index vector. These index vectors are sparse, high-dimensional and ternary: their dimensionality is on the order of thousands, and they consist of a small number of randomly distributed +1s and -1s, with the rest of the elements set to 0.
    • Then, context vectors are produced by scanning through the text. Each time a word occurs in a context, that context’s d-dimensional index vector is added to the context vector for the word in question. Words are thus represented by d-dimensional context vectors that are effectively the sum of the words’ contexts.

Kanerva, P. (1988). Sparse Distributed Memory. The MIT Press.

Sahlgren, M. (2005). An Introduction to Random Indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005.
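
The two-step operation can be sketched in Python as follows. This is a toy illustration, not the package used in the project: documents are taken as the contexts, and the function names, the dimensionality of 1000 and the 10 non-zero entries per index vector are all assumptions.

```python
import random
from collections import defaultdict

def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector, stored as {position: +1 or -1}."""
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((1, -1)) for pos in positions}

def random_indexing(docs, dim=1000, nonzeros=10, seed=0):
    """Return {word: context vector} with documents used as contexts."""
    rng = random.Random(seed)
    context_vectors = defaultdict(lambda: [0] * dim)
    for doc in docs:
        # Step 1: every context gets its own random index vector.
        index_vector = make_index_vector(dim, nonzeros, rng)
        # Step 2: add that index vector to the context vector of each word
        # occurrence found in the context.
        for word in doc.lower().split():
            vector = context_vectors[word]
            for pos, sign in index_vector.items():
                vector[pos] += sign
    return dict(context_vectors)

vectors = random_indexing(["fish oil blood", "fish oil viscosity", "stock trade"])
print(vectors["fish"] == vectors["oil"])  # both words occur in the same contexts
```

Adding a new document only adds one more index vector into the existing context vectors, so the dimensionality stays fixed as the corpus grows.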

Random Projections
  • Every word or document is represented as a vector that is the sum of all the corresponding context vectors.
  • Searching for a word can be performed at a context or concept level.
  • The method is incremental: context vectors can be used for similarity computations even after only a few examples.
  • The dimensionality d does not change when new examples are added, hence the method is scalable to large data sets.

Document Clustering
  • Documents tend to cluster around underlying concepts they represent
  • Clustering search results is a way of discovering concepts in a document corpus
  • Vritti implements two document clustering algorithms
    • Lingo
    • Non Negative Matrix Factorization

Lingo Algorithm

Lingo combines common phrase discovery and Latent semantic indexing to separate documents into meaningful groups

  • Input text
  • Build a term-document matrix of index terms with TF-IDF weighting
  • Extract frequent phrases; a frequent phrase
    • appears a specified number of times
    • does not cross a sentence boundary
    • neither begins nor ends with a stop word
  • For every frequent phrase, build a column vector over the term space (terms vs. frequent phrases matrix)
  • Singular value decomposition, USVt, keeping k singular values (abstract concepts vs. terms matrix)
  • Concepts vs. frequent phrases matrix
  • Apply VSM to find the documents matching each frequent phrase for clustering

Osiński, S., & Weiss, D. (2005). A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems.
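
The frequent phrase step can be illustrated with a brute-force Python sketch. It only shows the three phrase conditions listed for Lingo; it is not the published implementation, and the stop word list, the `min_count` and `max_len` parameters and the sample text are assumptions.

```python
import re
from collections import Counter

# Illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "the", "to"}

def frequent_phrases(text, min_count=2, max_len=3):
    """Count phrases of 2..max_len words that satisfy the three conditions."""
    counts = Counter()
    # Split into sentences first, so that no phrase crosses a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                # A phrase may neither begin nor end with a stop word.
                if phrase[0] in STOP_WORDS or phrase[-1] in STOP_WORDS:
                    continue
                counts[" ".join(phrase)] += 1
    # Keep only phrases that appear at least min_count times.
    return {p: c for p, c in counts.items() if c >= min_count}

text = ("Clustering of search results helps users. "
        "Search results are grouped by topic. "
        "The topic of search results matters.")
print(frequent_phrases(text))
```

Each surviving phrase would then become one column vector over the term space, which is the input to the SVD step.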

NMF Clustering
  • Unsupervised learning algorithms such as principal-component analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties.
  • In Vritti, we try to leverage this idea to factorize a term-document matrix and find the underlying semantic representation. Like SVD, NMF yields a low-rank factorization; its factors are constrained to be non-negative rather than orthogonal, and they can be good candidates for concepts.
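
As a rough illustration of factorizing a term-document matrix V into non-negative factors W (terms by concepts) and H (concepts by documents), the sketch below uses the multiplicative update rules of Lee and Seung in plain Python. The slides do not state which NMF variant Vritti uses, so the update rule, the helper names, the iteration count and the toy matrix are all assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate v (m x n) by w (m x k) times h (k x n), all non-negative."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h

def frobenius_error(v, w, h):
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

# Rows: fish, oil, stock, trade. Columns: four documents.
v = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]]
w, h = nmf(v, k=2)
print(round(frobenius_error(v, w, h), 3))
```

Each column of W can be read as a candidate concept: the terms with the largest entries in a column are the words that characterize it.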

Scalar Clustering

From the input text, create a term-document matrix, A.

Multiply A by transpose(A) to get the term-term matrix, B.

Compute Jaccard’s coefficient for each entry in B, $P(X \cap Y) / (P(X) + P(Y) - P(X \cap Y))$ for terms X and Y, and store the result as C.

Normalize the rows of C to unit length and multiply C by transpose(C) to get the cosine similarity; the new matrix is D.

Matrix D is a term-term matrix holding an analogy score for each term and its associated term.

Scalar clustering will be used to find analogous words for every word.
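
The matrix steps above can be sketched in plain Python. The sketch assumes a binary term-document matrix, so that the entries of B are document counts and Jaccard's coefficient can be read directly from B; the function name and the toy matrix are hypothetical.

```python
import math

def scalar_clustering(a):
    """a: binary term-document matrix (rows are terms). Returns matrix D."""
    n_terms = len(a)
    # B = A * transpose(A): b[i][j] counts documents containing both terms.
    b = [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(n_terms)]
         for i in range(n_terms)]
    # C: Jaccard coefficient, shared documents over documents with either term.
    c = []
    for i in range(n_terms):
        row = []
        for j in range(n_terms):
            union = b[i][i] + b[j][j] - b[i][j]
            row.append(b[i][j] / union if union else 0.0)
        c.append(row)
    # Scale every row of C to unit length.
    unit = []
    for row in c:
        norm = math.sqrt(sum(x * x for x in row))
        unit.append([x / norm if norm else 0.0 for x in row])
    # D = C * transpose(C) on unit rows, i.e. the cosine between rows of C.
    return [[sum(x * y for x, y in zip(unit[i], unit[j])) for j in range(n_terms)]
            for i in range(n_terms)]

# Rows: fish, oil, stock. Columns: three documents.
a = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
d = scalar_clustering(a)
print([round(x, 3) for x in d[0]])  # scores of "fish" against fish, oil, stock
```

Two terms score highly in D when they have similar Jaccard profiles against the whole vocabulary, even if they rarely occur together themselves.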

Methodology and Progress
  • Literature Survey
  • Data Preparation
  • Algorithm Selection and Validation
  • System Use Cases / Story Boards
  • User Interface Design
  • High Level System Design
  • System Build and Unit Testing
  • System testing
  • Documentation and Final write up

* Green color represents completed tasks

2. Data Preparation
  • Data Source
    • For building and testing Vritti we will use the National Science Foundation (NSF) Research Award Abstracts 1990-2003 data set. This data set contains 129,000 abstracts describing NSF awards for basic research.
  • Index Creation
    • Ingested data will be stored as inverted indices for faster search performance.
    • Apache Lucene will be used for storing the inverted index.
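
To show why an inverted index speeds up search, here is a toy Python sketch of the data structure. It is not Lucene's API or storage format; the function names and the sample abstracts are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Ids of documents containing every query term (Boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

docs = ["nsf award abstract", "basic research award", "research abstract"]
index = build_inverted_index(docs)
print(search_all(index, "research award"))
```

A query only touches the posting sets of its own terms, so the cost depends on how many documents contain those terms rather than on the size of the whole corpus.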

Inverted index data dictionary

2. Data Preparation
  • Random Projection will be implemented using the Semantic Vectors package, a project by the University of Pittsburgh Office of Technology Management.
  • Word statistics will be stored in Redis, an in-memory database.

3. Algorithm Selection and Validation
  • Term Weighting – TFIDF
  • Search Ranking – VSM and Boolean Model
  • Concept Extraction
    • Words ranked by IDF
    • Words not following a Poisson distribution
    • Word co-occurrence pattern for key phrase extraction
    • First order, second order and third order word associations
    • Scalar clustering for word analogue extractions
    • Random projection index vectors for concept searching
    • Lingo document clustering
    • NMF Clustering

4. System Use Cases and Story Boards
  • Story board 1 – Data Loading
    • Users can load documents from files, a POP3 account or a URL. Vritti will create the inverted index, random projections and word statistics of the corpus.
  • Story board 2 – High Level view of corpus
    • Keyword and key phrase display
    • For every keyword and key phrase, associated words up to the third level of association will be displayed.
    • The user can start a concept search or a keyword search based on these keywords.


  • Story Board 3 – Search
    • Search displays the top N matching documents.
    • For each search, the top N words that distinguish the search result from the rest of the corpus are displayed. Associations and analogues of these words can be viewed.
    • Based on these words, the user can refine the search query and search again. Vritti appends the selected keywords to the search string.
  • Story Board 4 – Search Result Clustering
    • The search result is clustered; the user can select either the Lingo or the NMF clustering algorithm.
    • Every cluster is labeled with a theme extracted from the documents in the cluster.
    • Associations and analogues for these cluster labels can be found.
    • The user can start a new search by modifying the query based on these cluster labels.

5. User Interface Design

Navigation tabs (shown on every screen): Start, Setup, Analyze, Search, Themes

Start screen banner: Vritti, Mining concepts from large body of text

Setup screen:

  • Select Data Source
    • Directory / File
    • URL
    • POP3
  • Stop words
  • Advanced Parameters


Search

Search result displayed here

Themes screen:

Select Algorithm

For a selected theme, the associated themes and the strength of association are displayed here

Themes discovered are displayed here

6. High Level Design


Implementation Overview

Future Proposed Work (3 Months)
  • Algorithm design and validation
    • The algorithms listed in the technical and literature survey section will be implemented and validated against the data source.
  • System building and testing
    • As part of this task, the Vritti system will be developed in Java using the Spring Framework. Apache Lucene will be used for the inverted index. The front end will be created in HTML5/JavaScript. Individual components of Vritti will be tested, and unit test classes for them will be written and documented. Code documentation in the form of API help will be generated using either Javadoc or Doxygen.
  • Documentation and Final write up

Vritti Applications
  • CRM – Analyze customer responses
  • Ticketing systems – Mining to find frequently occurring problems / themes
  • Stock exchange trade chats – Find suspicious transactions
  • Extending to social network applications – Understanding discussions among members

Thank You
