
Presentation Transcript


  1. Vritti: Cognitive Search – Discovering concepts and trends in a large body of text. MS Computer Science Project, Final Presentation. Under the guidance of Dr. R. Bhaskaran, Head of Department, School of Mathematics, Madurai Kamaraj University, Madurai. S Gopi, 092504174. Course: MS (Computer Science), Manipal University.

  2. Table of Contents
  • Project Objective and Goal
  • Methodology and Design
  • Algorithms
  • System Design and Implementation
  • Results
  • Conclusion
  • Proposed Future Work

  3. Project Objective
  • Develop a system named Vritti for extracting concepts and trends from a large body of text.
  • Enable users to search through a large body of text/documents with ease.
  • Leverage a keyword-based search framework by augmenting it with text mining algorithms.

  4. Project Goal – In Technical Terms
  Precision = Relevant Retrieved / Total Retrieved
  Recall = Relevant Retrieved / Total Relevant
  • Traditional search focuses on precision.
  • Vritti focuses on recall.
  • Vritti uses berry picking and the open literature discovery process. A sketch of the two measures follows.
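  A minimal sketch of how the two measures are computed per query; the document-ID sets in the example are hypothetical.

    def precision_recall(retrieved, relevant):
        """Compute precision and recall for one query.

        retrieved -- set of document IDs returned by the search engine
        relevant  -- set of document IDs judged relevant for the query
        """
        relevant_retrieved = retrieved & relevant
        precision = len(relevant_retrieved) / float(len(retrieved)) if retrieved else 0.0
        recall = len(relevant_retrieved) / float(len(relevant)) if relevant else 0.0
        return precision, recall

    # Hypothetical example: 3 of the 4 retrieved documents are relevant,
    # but only 3 of the 6 relevant documents were found.
    p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
    print("precision=%.2f recall=%.2f" % (p, r))  # precision=0.75 recall=0.50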

  5. Vritti Search vs. Traditional Search
  (Diagram: traditional search is precision-focused at each iteration; Vritti search increases recall at every cycle.)
  • The first step of text exploration is search, followed by discovering concepts and their associated relationships.
  • Equipped with these concepts, which present a high-level view of the underlying documents, users should be able to search and infer information from a large body of text with ease.

  6. Project Goal
  • Algorithms for increased search effectiveness, in terms of recall, by presenting users with concepts in addition to documents matching the given query.
  • Allow users to interact between the search results and discovered concepts in the form of query expansion or modification.

  7. Project Methodology
  • Literature Survey
  • Data Preparation
  • Algorithm Selection/Creation and Validation
  • System Use Cases / Story Boards
  • User Interface Design
  • High-Level System Design
  • System Build and Unit Testing
  • System Testing
  • Documentation and Final Write-up

  8. Vritti Algorithms
  • Search Result Ranking
  • Keyword Weighting Scheme
  • Unigram Discovery
  • Bigram and Trigram Discovery
  • Collocation Algorithm
  • Association Discovery Algorithm
  • Search Result Clustering
  • NMF Clustering

  9. 1. Search Result Ranking
  Search result ranking uses the vector space model (Salton, Wong & Yang, 1975):
  • Every document is represented by a multidimensional vector.
  • Each component of the vector is a particular keyword in the document.
  • The value of the component depends on the degree of relationship between the term and the underlying document. Term weighting schemes decide this relationship.
  • Vector cosine similarity decides document–query or document–document similarity, as in the sketch below.
  Salton, G., Wong, A., & Yang, C. S. (1975, Nov). A vector space model for automatic indexing. Communications of the ACM.
  http://lucene.apache.org/java/docs/index.html
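  A minimal sketch of cosine-similarity ranking over raw term-frequency vectors (a full system would substitute a term weighting such as TF-IDF); the documents and query are hypothetical.

    import math
    from collections import Counter

    def cosine_similarity(doc_a, doc_b):
        """Cosine of the angle between two bag-of-words vectors."""
        va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
        dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
        norm_a = math.sqrt(sum(c * c for c in va.values()))
        norm_b = math.sqrt(sum(c * c for c in vb.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Rank documents against a query by treating the query itself as a tiny document.
    docs = ["gene expression in yeast", "protein folding dynamics", "yeast gene regulation"]
    query = "yeast gene"
    for score, d in sorted(((cosine_similarity(query, d), d) for d in docs), reverse=True):
        print("%.3f  %s" % (score, d))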

  10. 2. Keyword Weighting Scheme
  Ordering principles:
  O1 – Probable relevance is based only on the presence of query terms in the documents.
  O2 – Probable relevance is based on both the presence and absence of query terms in the documents.
  Independence assumptions:
  I1 – The distribution of terms in relevant documents is independent, and the distribution in all documents is independent.
  I2 – The distribution of terms in relevant documents is independent, and the distribution in non-relevant documents is independent.
  The weight of a term is based on the odds ratio:
  (relevant documents having the term / relevant documents not having the term) ÷ (non-relevant documents having the term / non-relevant documents not having the term)
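  A sketch of the presence/absence odds weight described above, in the Robertson/Spärck Jones form; the variable names and the 0.5 smoothing term are my additions.

    import math

    def relevance_weight(r, R, n, N):
        """Relevance-odds term weight.

        r -- relevant documents containing the term
        R -- total relevant documents
        n -- all documents containing the term
        N -- total documents in the collection

        Odds of the term in relevant documents divided by its odds in
        non-relevant documents; 0.5 is added everywhere to avoid log(0).
        """
        rel_odds = (r + 0.5) / (R - r + 0.5)
        nonrel_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
        return math.log(rel_odds / nonrel_odds)

    # A term concentrated in the relevant set gets a large positive weight.
    print(relevance_weight(r=8, R=10, n=20, N=1000))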


  12. 3. Unigram Discovery
  Pipeline: Search → Candidate Unigrams → Filters → Weighting
  • Search the document collection for the given query.
  • Extract terms of length one (unigrams) from the search result.
  • The terms should have a minimum frequency count; Vritti uses 3 as the threshold.
  • The terms should be alphanumeric.
  • The terms should not be English stop words.
  • Apply the weighting scheme to these terms; a weight is derived for each term.
  • Select the top N terms based on a threshold. A sketch of this pipeline follows.
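  A sketch of the unigram pipeline under the stated filters; the function names are illustrative, raw frequency stands in for Vritti's weighting scheme, and NLTK's stop-word list requires a one-time nltk.download('stopwords').

    import re
    from collections import Counter

    from nltk.corpus import stopwords  # requires nltk.download('stopwords')

    MIN_FREQUENCY = 3  # threshold used by Vritti per the slide
    STOPWORDS = set(stopwords.words('english'))

    def candidate_unigrams(search_results, top_n=20, weight=None):
        """Extract candidate unigrams from a list of result documents."""
        counts = Counter()
        for text in search_results:
            counts.update(re.findall(r'[a-z0-9]+', text.lower()))  # alphanumeric only
        # Apply the frequency and stop-word filters.
        terms = {t: c for t, c in counts.items()
                 if c >= MIN_FREQUENCY and t not in STOPWORDS}
        # Weight the surviving terms (raw frequency here; Vritti applies
        # its relevance-based weighting scheme at this step).
        weight = weight or (lambda term: terms[term])
        return sorted(terms, key=weight, reverse=True)[:top_n]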

  13. 4. Bigram and Trigram Discovery
  Pipeline: Search → Candidate Bigrams/Trigrams → Apply Collocation → Weighting (flow similar to unigrams)
  Suppose a search yields M documents. If we split these documents into n (non-unique) words, we can eventually have C(n,2) bigrams and C(n,3) trigrams. It is computationally infeasible to process all of these, as most of them do not make any sense. Hence we apply a collocation algorithm to extract only meaningful bigrams and trigrams.

  14. 5. Collocation Algorithm
  • Likelihood ratios are used for collocation.
  • Given two hypotheses, a likelihood ratio is a number that tells how much more likely one hypothesis is than the other.
  Hypothesis 1 is a formalization of independence: the occurrence of w2 is independent of the previous occurrence of w1, i.e. P(w2 | w1) = p = P(w2 | ¬w1).
  Hypothesis 2 is a formalization of dependence, which serves as good evidence for an interesting collocation: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1).
  Let c1, c2 and c12 be the number of occurrences of w1, w2 and w1w2 in the document collection, and N the total number of words. We can derive p, p1 and p2 as
  p = c2 / N,  p1 = c12 / c1,  p2 = (c2 − c12) / (N − c1)

  15. Assuming a binomial distribution of words,
  b(k; n, x) = C(n, k) x^k (1 − x)^(n − k),
  the likelihood of getting the counts for w1, w2 and w1w2 that we actually observe is then
  L(H1) = b(c12; c1, p) · b(c2 − c12; N − c1, p)
  L(H2) = b(c12; c1, p1) · b(c2 − c12; N − c1, p2)
  Given the above, the log likelihood ratio is defined as
  log λ = log L(H1) − log L(H2)
  The bigrams are finally ranked based on their likelihood ratios, and the top N among them are selected. A sketch of this computation follows.
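  A sketch of the likelihood-ratio computation, dropping the binomial coefficients C(n, k) since they cancel in the ratio; the counts in the example are made up.

    import math

    def log_l(k, n, x):
        """Log of the binomial likelihood x^k (1-x)^(n-k), without the C(n,k) term."""
        x = min(max(x, 1e-12), 1 - 1e-12)  # guard the degenerate probabilities 0 and 1
        return k * math.log(x) + (n - k) * math.log(1 - x)

    def bigram_llr(c1, c2, c12, N):
        """Log likelihood ratio statistic for the bigram 'w1 w2'.

        c1, c2 -- occurrences of w1 and w2; c12 -- occurrences of 'w1 w2';
        N -- total number of words in the collection.
        """
        p = c2 / float(N)                 # H1: P(w2|w1) = P(w2|~w1) = p
        p1 = c12 / float(c1)              # H2: P(w2|w1) = p1
        p2 = (c2 - c12) / float(N - c1)   # H2: P(w2|~w1) = p2
        log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
                      - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
        return -2 * log_lambda  # larger means a stronger collocation

    print(bigram_llr(c1=300, c2=250, c12=40, N=100000))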

  16. 6. Association Discovery
  • Given a term-document matrix A, compute the transpose of A.
  • Compute the co-weight matrix B by multiplying AT by A.
  • Compute matrix C by transforming the co-weights into pairwise similarities using the Jaccard coefficient.
  • Transform C into a row-normalized matrix D by converting row vectors into unit vectors.
  • Compute the transpose of D by changing rows into columns and columns into rows.
  • Compute the cosine similarity matrix E by multiplying DT with D.
  Since row i of E represents the neighborhood of term i, for a given row the nearest neighbor of term i is the term other than itself with the largest similarity value. Thus, for every term, the associated terms are discovered in Vritti. A numpy sketch of these steps follows.
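  A numpy sketch of the steps above. It assumes A has terms as rows, so the co-weight matrix comes out term-by-term (whether that is A·AT or AT·A depends on the chosen orientation); the epsilon guard and the toy matrix are my additions.

    import numpy as np

    def term_associations(A):
        """Association discovery over a term-document matrix A (terms x documents).

        Returns E, where row i holds the neighborhood similarities of term i.
        """
        B = A.dot(A.T)                                  # co-weight (term-term) matrix
        diag = np.diag(B)
        # Jaccard coefficient: shared weight over the total weight of the pair.
        C = B / (diag[:, None] + diag[None, :] - B + 1e-12)
        D = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)  # unit row vectors
        E = D.dot(D.T)                                  # cosine similarity of neighborhoods
        return E

    A = np.array([[2., 0., 1.],   # toy matrix: 4 terms x 3 documents
                  [1., 1., 0.],
                  [0., 3., 1.],
                  [1., 0., 2.]])
    E = term_associations(A)
    np.fill_diagonal(E, -1)       # a term is not its own neighbor
    print(E.argmax(axis=1))       # nearest associated term for each term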

  17. 7. Search Result Clustering – NMF
  The generic non-negative matrix factorization problem can be stated as follows: given a non-negative matrix A ∈ R^(m×n) and a positive integer k < min{m, n}, find non-negative matrices W ∈ R^(m×k) and H ∈ R^(k×n) that minimize the function f(W, H) = ½ ||A − WH||F² (the Frobenius-norm reconstruction error).
  The multiplicative update algorithm is as follows:
  W = rand(m, k)  (initialize W as a random dense matrix)
  H = rand(k, n)  (initialize H as a random dense matrix)
  for i = 1 : max iterations
      H = H .* (W^T A) ./ (W^T W H)
      W = W .* (A H^T) ./ (W H H^T)
  Pipeline: Input Text → Term-document matrix of index terms with TF-IDF weighting → NMF Clustering → Weight and Feature Matrices
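  A numpy sketch of the multiplicative updates above; the iteration count, the epsilon guard against division by zero, and the random demo matrix are my choices.

    import numpy as np

    def nmf(A, k, max_iterations=200, eps=1e-9):
        """Multiplicative-update NMF as on the slide: A (m x n) ~ W (m x k) H (k x n)."""
        m, n = A.shape
        W = np.random.rand(m, k)   # initialize W as a random dense matrix
        H = np.random.rand(k, n)   # initialize H as a random dense matrix
        for _ in range(max_iterations):
            H *= W.T.dot(A) / (W.T.dot(W).dot(H) + eps)   # H = H .* (W^T A) ./ (W^T W H)
            W *= A.dot(H.T) / (W.dot(H).dot(H.T) + eps)   # W = W .* (A H^T) ./ (W H H^T)
        return W, H

    # Cluster assignment: document j goes to the cluster with the largest H[:, j].
    A = np.abs(np.random.rand(50, 30))  # stand-in for a TF-IDF term-document matrix
    W, H = nmf(A, k=4)
    print(H.argmax(axis=0))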

  18. High Level Design – Technology
  • Python 2.7
  • Web.py
  • PyLucene
  • XAMPP / Apache HTTP
  • NLTK (Natural Language Toolkit)

  19. (Architecture diagrams: Indexing Module; Search and Ranking Module.)

  20. (Architecture diagram: Text Mining Module.)

  21. Workflow Module
  The workflow module follows the chain-of-command design pattern.
  (Diagram: Context → Search → Unigram Discovery → Weight Assignment → Association, linked in a Chain.)
  The Command class is the basic processing unit. The Chain class links all the command classes together. When a list of command objects is passed to the chain object, they are executed in serial fashion, as in the sketch below.
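  A minimal sketch of the pattern; the class names and context keys are illustrative, not Vritti's actual code.

    class Command(object):
        """Basic processing unit; subclasses implement execute(context)."""
        def execute(self, context):
            raise NotImplementedError

    class SearchCommand(Command):
        def execute(self, context):
            # Stand-in for the Lucene search step.
            context['documents'] = ['doc about genes', 'doc about yeast genes']

    class UnigramCommand(Command):
        def execute(self, context):
            words = ' '.join(context['documents']).split()
            context['unigrams'] = sorted(set(words))

    class Chain(object):
        """Links command objects together and executes them serially."""
        def __init__(self, commands):
            self.commands = list(commands)

        def run(self, context):
            for command in self.commands:  # each command reads/writes the shared context
                command.execute(context)
            return context

    context = Chain([SearchCommand(), UnigramCommand()]).run({'query': 'gene'})
    print(context['unigrams'])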

  22. Deployment Overview
  Web browser → XAMPP / Apache HTTP → Web.py (port 8080) → PyLucene
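  A minimal web.py sketch of the serving side, assuming web.py's built-in development server (which listens on port 8080 by default); the handler and response text are illustrative.

    import web

    urls = ('/', 'Index')

    class Index:
        def GET(self):
            # A real deployment would dispatch the query to the PyLucene searcher.
            return "Vritti search"

    if __name__ == '__main__':
        app = web.application(urls, globals())
        app.run()  # serves on port 8080 by default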

  23. (Screenshots: User Interface – Landing Page, Search Screen.)

  24. (Screenshots: Unigram Discovery, Association Analysis.)

  25. (Screenshot: Search Result Clustering.)

  26. Inverted Index
  • Data Source: For building and testing Vritti we use the National Science Foundation (NSF) Research Award Abstracts 1990–2003 data set. This dataset contains 129,000 abstracts describing NSF awards for basic research.
  • Index Creation: Ingested data are stored as inverted indices for faster search performance. Apache Lucene is used for storing the inverted index, as in the sketch below.
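  A minimal PyLucene indexing sketch in the flat-namespace style of the PyLucene 3.x releases contemporary with Python 2.7; the field name, directory path, and sample abstract are illustrative, and the exact classes vary across Lucene versions.

    import lucene

    lucene.initVM()

    # Open (or create) an on-disk index directory.
    directory = lucene.SimpleFSDirectory(lucene.File("vritti_index"))
    analyzer = lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT)
    writer = lucene.IndexWriter(directory, analyzer, True,
                                lucene.IndexWriter.MaxFieldLength.UNLIMITED)

    # Each NSF abstract becomes one document in the inverted index.
    abstract = "This award supports basic research on ..."
    doc = lucene.Document()
    doc.add(lucene.Field("abstract", abstract,
                         lucene.Field.Store.YES, lucene.Field.Index.ANALYZED))
    writer.addDocument(doc)
    writer.close()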

  27. (Table: Inverted index data dictionary.)

  28. Results
  • Vritti performs consistently on the recall parameter.
  • The aim of Vritti is to achieve a good recall rate without worrying about precision. However, if we remove the cap on the number of documents returned for a search result, the precision measure also increases considerably.

  29. Conclusion
  • By focusing on recall and providing users with sophisticated text mining and query expansion capabilities, Vritti carves out a niche for itself within the information retrieval systems available today.
  • In addition to being a stand-alone system, Vritti can serve as a platform for text mining professionals to jump-start their analysis. Vritti can be expanded by augmenting a range of other technologies, including:
  • Document polarity discovery
  • Text sentiment analysis
  • Markov chain models for automatic sentence construction
  • Language models for spell check and query expansion, and many others.
  • Vritti project documents and source code have been uploaded to the Google Code project http://code.google.com/p/vritti/. Under the Apache License 2.0, Vritti is now open source software, allowing students, researchers and fellow programmers to use, develop and maintain Vritti going forward.

  30. Vritti Commercial Applications
  • CRM – Analyze customer responses
  • Ticketing systems – Mine for frequently occurring problems/themes
  • Stock exchange trade chats – Find suspicious transactions
  • Extending to social network applications – Understand discussions among members

  31. Thank You

  32. Backup Slides

  33. Concepts and Trends
  • We define a concept as a word or phrase that describes a meaningful subject within a particular field.
  • Vritti discovers concepts within the context of the corpus under consideration.
  • Trends are defined as concepts recurring across multiple documents inside the corpus.

  34. Vritti Text Exploration
  • The first step of text exploration is search, followed by discovering concepts and their associated relationships.
  • Equipped with these concepts, which present a high-level view of the underlying documents, users should be able to search and infer information from a large body of text with ease.

  35. Motivation
  • Subsumption – A learner, supported by an appropriate environment, is able to attach a new concept to those existing inside his or her cognitive structure.
  • Vritti aims to apply the same principle to searching and text exploration: make search a more natural phenomenon by enhancing the search experience of the information seeker.
  Joseph D. Novak & Alberto J. Cañas, Florida Institute for Human and Machine Cognition, Technical Report IHMC CmapTools 2006-01 Rev 2008-01.

  36. Literature Survey
  • Text Exploration
    • Literature Based Discovery (LBD)
    • Berry Picking
  • IR Models and Weighting Schemes
    • Vector space models
    • Term weighting schemes
    • Search ranking schemes
  • Concept Definition and Discovery
    • Word space models
    • Random projections
  • Document Clustering
    • Lingo
    • Non-Negative Matrix Factorization
    • Scalar clustering

  37. Literature Based Discovery (LBD)
  • Concept discovery in text was hugely popularized by the work of Dr. Swanson in trying to identify the relationship between fish oil and Raynaud's syndrome.
  • The focus of Dr. Swanson's work was to identify concepts and their relationships in bibliographic databases. His technique is known as Literature Based Discovery (LBD), and he defines it as a process of finding complementary structures in disjoint science literatures.
  Janneck, M. C. (2006). Recent Advances in Literature Based Discovery. Journal of the American Society for Information Science and Technology (JASIST).

  38. (Diagram: LBD open discovery process.)

  39. Berry Picking
  • Why is it necessary for the searcher to find a way to represent the information need in a query understandable by the system?
  • Why not have the system make it possible for searchers to express the need directly, as they ordinarily would, instead of in an artificial query representation for the system's consumption?
  Berry picking challenges the current keyword search methodology in four areas:
  1. Nature of the query
  2. Nature of the overall search process
  3. Range of search techniques used
  4. Information domain or territory where the search is conducted
  Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Review, 407-424.

  40. (Diagram: Traditional Search vs. Berry Picking.)

  41. IR Models and Weighting Schemes
  • The central premise of any information retrieval system is to distinguish relevant from irrelevant documents for a given query.
  • Systems determine this relevance using a ranking algorithm. Ranking algorithms use index terms; an index term is simply a word whose semantics helps capture the document's main theme.
  Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Association for Computing Machinery (ACM) Press.

  42. Vector Space Model
  • In the vector space model (Salton, Wong & Yang, 1975), every document is represented by a multidimensional vector.
  • Each component of the vector is a particular keyword in the document.
  • The value of the component depends on the degree of relationship between the term and the underlying document. Term weighting schemes decide this relationship.
  • Vector cosine similarity decides document–query or document–document similarity.
  Salton, G., Wong, A., & Yang, C. S. (1975, Nov). A vector space model for automatic indexing. Communications of the ACM.

  43. IR Model Math Schemes
  • Several mathematical schemes, based on the type of IR model, have been developed to identify index terms.
  • Spärck Jones developed IDF, the inverse document frequency weighting.
  • Probabilistic IDF, called IDFP, was developed by Robertson.
  • All of the above weighting schemes decide the weight of a term based on its presence in the document.
  Robertson, S. (2004). Understanding Inverse Document Frequency: On Theoretical Arguments for IDF. Journal of Documentation, 503-520.
  Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 11-21.

  44. Term Weighting
  • Binary: in the simplest case the association is binary: a_ij = 1 when keyword i occurs in document j, a_ij = 0 otherwise.
  • Term frequency: a_ij = tf_ij, where tf_ij denotes how many times term i occurs in document j.
  • TF-IDF: a_ij = tf_ij · log(N / df_i), where df_i denotes the number of documents in which term i appears and N represents the total number of documents in the collection. A sketch of this weighting follows.
  Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Cambridge University Press.
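  A small sketch that builds the term-document matrix with exactly the TF-IDF weighting defined above; the toy documents are made up.

    import math

    def tfidf_matrix(docs):
        """Build the TF-IDF weighted term-document matrix a_ij = tf_ij * log(N / df_i)."""
        N = len(docs)
        tokenized = [d.lower().split() for d in docs]
        vocab = sorted(set(w for d in tokenized for w in d))
        df = {t: sum(1 for d in tokenized if t in d) for t in vocab}
        return vocab, [[d.count(t) * math.log(N / float(df[t])) for d in tokenized]
                       for t in vocab]  # rows = terms, columns = documents

    vocab, A = tfidf_matrix(["gene expression yeast", "protein folding", "yeast gene"])
    for t, row in zip(vocab, A):
        print(t, [round(x, 2) for x in row])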

  45. Search Ranking Schemes
  • A combination of the Vector Space Model and the Boolean model determines how relevant a given document is to a user's query.
  • The Boolean model first narrows down the documents that need to be scored, based on the use of Boolean logic in the query specification.
  • The more times a query term appears in a document relative to the number of times the term appears in all documents in the collection, the more relevant that document is to the query.
  • The score of query q for document d correlates to the cosine distance or dot product between the document and query vectors in a Vector Space Model (VSM). A document whose vector is closer to the query vector in that model is scored higher.
  Apache Lucene, Search Ranking Scheme.

  46. Concept Definition and Discovery
  • A concept is a word or phrase that describes a meaningful subject within a particular field.
  • Principal orthogonal vectors in the VSM are good concept candidates.
  • Non-Poisson-distributed words or co-occurring words are good concept candidates.
  Srinivasan, P. (1992). Thesaurus Construction. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval: Data Structures & Algorithms (pp. 161-218). Englewood Cliffs: Prentice Hall.

  47. Word Space Models
  • VSMs treat words as indicators of content; there is no exact mapping from words to concepts.
  • In a word space model, a high-dimensional vector space is produced by collecting the data in a co-occurrence matrix F, such that each row Fw represents a unique word w and each column Fc represents a context c, typically a multi-word segment such as a document, or a word.
  • Latent Semantic Analysis (LSA) is an example of a word space model that uses document-based co-occurrences.
  • Hyperspace Analogue to Language (HAL) is an example of a model that uses word-based co-occurrences.
  Asoh, L. S. (2001). Computing with Large Random Patterns.

  48. Random Projections
  • Accumulate context vectors based on the occurrence of words in context. This is a two-step operation:
  • First, each context (e.g. each document or each word) in the data is assigned a unique, randomly generated representation called an index vector. These index vectors are sparse, high-dimensional and ternary: their dimensionality d is on the order of thousands, and they consist of a small number of randomly distributed +1s and -1s, with the rest of the elements set to 0.
  • Then, context vectors are produced by scanning through the text; each time a word occurs in a context, that context's d-dimensional index vector is added to the context vector for the word in question. Words are thus represented by d-dimensional context vectors that are effectively the sum of the words' contexts. A sketch of both steps follows.
  Kanerva, P. (1988). Sparse Distributed Memory. The MIT Press.
  Sahlgren, M. (2005). An Introduction to Random Indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005.
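  A numpy sketch of both steps, using documents as contexts; the dimensionality, the number of non-zero entries, and the toy documents are illustrative choices.

    import numpy as np

    def index_vector(d=2000, nonzero=10, rng=None):
        """Sparse ternary index vector: a few random +1/-1 entries, the rest 0."""
        rng = rng or np.random.default_rng()
        v = np.zeros(d)
        pos = rng.choice(d, size=nonzero, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=nonzero)
        return v

    def random_indexing(documents, d=2000):
        """Each word's context vector = sum of index vectors of documents it occurs in."""
        rng = np.random.default_rng(0)
        doc_index = [index_vector(d, rng=rng) for _ in documents]  # one per context
        context = {}
        for text, ivec in zip(documents, doc_index):
            for word in set(text.lower().split()):
                context[word] = context.get(word, np.zeros(d)) + ivec
        return context

    docs = ["yeast gene expression", "gene regulation in yeast", "protein folding"]
    ctx = random_indexing(docs)
    cos = ctx['gene'].dot(ctx['yeast']) / (np.linalg.norm(ctx['gene'])
                                           * np.linalg.norm(ctx['yeast']))
    print(round(cos, 3))  # words sharing contexts get similar vectors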

  49. Random Projections
  • Every word/document is represented as a vector, which is the sum of all the corresponding context vectors.
  • Searching for a word can be performed at a context/concept level.
  • The method is incremental: context vectors can be used for similarity computations even after only a few examples.
  • The dimensionality d does not change. New examples do not change d, hence the method is scalable to large data sets.

  50. Document Clustering
  • Documents tend to cluster around the underlying concepts they represent.
  • Clustering search results is a way of discovering concepts in a document corpus.
  • Vritti implements two document clustering algorithms: Lingo and Non-Negative Matrix Factorization.
