
Knowledge and Information Retrieval


Presentation Transcript


  1. Knowledge and Information Retrieval Session 6 Dictionary, Retrieval Evaluation and Text Classification

  2. Agenda • Dictionary compression and retrieval evaluation • Machine learning in text classification • Unsupervised vs supervised learning • Deep learning methods in text classification • Feature selection • Evaluation and metrics

  3. Sec. 5.2 Why compress the dictionary? • Search begins with the dictionary. • We want to keep it in main memory to support high query throughput. • Embedded/mobile devices may have very little memory. • Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time. • So, compressing the dictionary is important.

  4. Ch. 5 Why Compression for inverted indexes? Dictionary • Make it small enough to keep in main memory • Make it so small that you can keep some postings lists in main memory too Postings file(s) • Reduce disk space needed • Decrease time needed to read postings lists from disk • Large search engines keep a significant part of the postings in memory. • Compression lets you keep more in memory

  5. Blocked Sort-Based Indexing (BSBI)

  6. Sec. 5.2 Dictionary storage Compression Array of fixed-width entries • ~400,000 terms; 28 bytes/term = 11.2 MB.

  7. Dictionary as a string storage Compression • Pointer to next word shows end of current word • Hope to save up to 60% of dictionary space.

  8. Sec. 5.2 Fixed-width terms are wasteful Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms. • And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons. Written English averages ~4.5 characters/word. • Exercise: Why is/isn’t this the number to use for estimating the dictionary size? Ave. dictionary word in English: ~8 characters • How do we use ~8 characters per dictionary term? Short words dominate token counts but not type average.

  9. Sec. 5.2 Space for dictionary as a string • 4 bytes per term for frequency • 4 bytes per term for the pointer to postings • 3 bytes per term pointer into the string • Avg. 8 bytes per term in the term string • Term storage is now avg. 11 bytes/term (8 in the string + 3 for the term pointer), not 20 as with fixed width. • 400K terms × 19 bytes ≈ 7.6 MB (against 11.2 MB for fixed width). Source: https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

  10. Blocked storage • We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block. • For k = 4, we save (k-1) × 3 = 9 bytes of term pointers, but need an additional k = 4 bytes for term lengths: a net saving of 5 bytes per block. • For Reuters-RCV1, saving 5 bytes per four-term block reduces the dictionary by about 0.5 MB, bringing us down to 7.1 MB.
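  As an illustration (not from the slides), here is a minimal Python sketch of the dictionary-as-a-string idea with blocked storage: term pointers are kept only for the first term of each block, and a one-byte length prefix replaces the per-term pointers inside a block. The term list and lookup helper are invented for illustration.

  # terms are assumed to be sorted, as in the compressed dictionary
  k = 4
  terms = ["automata", "automate", "automatic", "automation",
           "border", "bore", "boring", "botany"]

  string_parts, block_pointers, pos = [], [], 0
  for i, t in enumerate(terms):
      if i % k == 0:
          block_pointers.append(pos)      # one pointer per block, not per term
      entry = chr(len(t)) + t             # 1-byte length prefix replaces a term pointer
      string_parts.append(entry)
      pos += len(entry)
  dictionary_string = "".join(string_parts)

  def lookup(term):
      # linear scan over blocks for simplicity; a real implementation would first
      # binary-search the block heads, then scan only within one block
      for b, start in enumerate(block_pointers):
          end = block_pointers[b + 1] if b + 1 < len(block_pointers) else len(dictionary_string)
          p = start
          while p < end:
              length = ord(dictionary_string[p])
              candidate = dictionary_string[p + 1:p + 1 + length]
              if candidate == term:
                  return True
              p += 1 + length
      return False

  print(lookup("boring"))   # True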

  11. Sec. 5.2 Front coding • Sorted words commonly share a long common prefix – store differences only (for the last k-1 terms in a block of k). • E.g. automata, automate, automatic, automation share the prefix automat-, which can be stored once for the block while only the endings a, e, ic, ion are stored per term.

  12. Sec. 5.2 RCV1 dictionary compression summary: fixed-width entries 11.2 MB; dictionary as a string with term pointers 7.6 MB; + blocking, k = 4: 7.1 MB; + blocking and front coding: about 5.9 MB (per the IR book).

  13. Regular Expression The simplest kind of regular expression is a sequence of simple characters. To search for woodchuck, we type /woodchuck/. The expression /Buttercup/ matches any string containing the substring Buttercup; grep with that expression would return the line I’m called little Buttercup.
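  A small Python illustration of the same searches (the slide's /.../ notation is grep-style; here re.search plays the role of grep, and the strings are just the slide's examples):

  import re

  line = "I'm called little Buttercup"
  if re.search(r"Buttercup", line):       # like grep /Buttercup/
      print(line)                         # grep would print the matching line

  print(bool(re.search(r"woodchuck", "How much wood would a woodchuck chuck?")))  # True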

  14. N-gram Language Models Probabilities are essential in any task in which we have to identify words in noisy, ambiguous input, like speech recognition or handwriting recognition. In the movie Take the Money and Run, Woody Allen tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as “I have a gub”. As Russell and Norvig (2002) point out, a language processing system could avoid making this mistake by using the knowledge that the sequence “I have a gun” is far more probable than the non-word “I have a gub” or even “I have a gull”.

  15. N-gram In spelling correction, we need to find and correct spelling errors like Their are two midterms in this class, in which There was mistyped as Their. A sentence starting with the phrase There are will be much more probable than one starting with Their are, allowing a spellchecker to both detect and correct these errors. Models that assign probabilities to sequences of words are called language models or LMs. We introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram.

  16. N-gram An n-gram is a sequence of n words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. Let’s begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is “its water is so transparent that” and we want to know the probability that the next word is the: P(the | its water is so transparent that)
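  A toy maximum-likelihood bigram estimate of such a probability (the corpus below is invented for illustration; a bigram model approximates the full history by the single previous word):

  from collections import Counter

  corpus = ("its water is so transparent that the water is clear "
            "and the water is so transparent that the fish are visible").split()

  bigrams  = Counter(zip(corpus, corpus[1:]))
  unigrams = Counter(corpus)

  def p_bigram(w, prev):
      # maximum-likelihood estimate P(w | prev) = C(prev w) / C(prev)
      return bigrams[(prev, w)] / unigrams[prev]

  # bigram approximation of P(the | its water is so transparent that)
  print(p_bigram("the", "that"))   # 1.0 in this toy corpus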

  17. Perplexity The perplexity (sometimes abbreviated PP) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. For a test set W = w1 w2 ... wN: PP(W) = P(w1 w2 ... wN)^(-1/N), i.e. the N-th root of 1 / P(w1 w2 ... wN).
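  A minimal numeric sketch of that formula (the per-word probabilities below are invented; computing in log space avoids underflow for long test sets):

  import math

  # per-word probabilities P(w_i | w_1 .. w_{i-1}) for a toy 4-word test set
  word_probs = [0.2, 0.5, 0.1, 0.25]

  N = len(word_probs)
  log_prob = sum(math.log(p) for p in word_probs)    # log P(w_1 .. w_N)
  perplexity = math.exp(-log_prob / N)               # = P(W)^(-1/N)
  print(perplexity)                                  # about 4.47 here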

  18. Naïve Bayes conditional probability P(x|y) Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document: ĉ = argmax over c ∈ C of P(c | d).

  19. Affix Concept in Bahasa Source: Reina et al, Flexible Affix Classification for Stemming Indonesian Language, 2016 Indonesian language (Bahasa) affix consists of prefix, infix, suffix, and confix

  20. Stemming Indonesian Language

  21. Method prefix: {basic prefix} + {root word} The morpheme of several basic prefixes can change into another form, called an allomorph; this process is called a morphophonemic process. {allomorph prefix} + {root word}

  22. Method The wide variety of affixes, especially prefixes and suffixes, motivates our approach of classifying the basic prefixes (including allomorph prefixes) and the suffixes by the number of letters in each prefix or suffix. We therefore use the number of letters to name each class. E.g. the prefix2 class consists of the two-letter prefixes ‘ke’, ‘di’, ‘se’, ‘me’, ‘pe’, ’be’, ’te’.

  23. Method The advantage of our approach is the flexibility of adding further affixes. E.g. prefixes such as ‘kese’, ‘sepe’, ‘keber’, ‘keter’, ‘teper’, ‘berse’, ‘seper’, ‘pemer’, ‘pember’, ‘berpen’ are additional prefixes.
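  A rough Python sketch of this length-based classification (not the authors' code: the prefix lists come from the slides, while the root-word dictionary and the stripping helper are invented for illustration):

  # Prefix classes named by letter count; longer classes are tried first.
  PREFIX_CLASSES = {
      2: ["ke", "di", "se", "me", "pe", "be", "te"],              # the "prefix2" class
      4: ["kese", "sepe"],                                        # some additional prefixes
      5: ["keber", "keter", "teper", "berse", "seper", "pemer"],
  }
  ROOT_WORDS = {"ajar", "main", "baca"}                           # toy root-word dictionary

  def strip_prefix(word):
      # Try prefix classes from longest to shortest; accept a candidate prefix
      # only if the remainder is a known root word.
      for length in sorted(PREFIX_CLASSES, reverse=True):
          for prefix in PREFIX_CLASSES[length]:
              if word.startswith(prefix) and word[len(prefix):] in ROOT_WORDS:
                  return word[len(prefix):]
      return word

  print(strip_prefix("dibaca"))   # -> "baca"  ("di" + root "baca")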

  24. Result Source: Reina et al, Flexible Affix Classification for Stemming Indonesian Language, 2016

  25. Retrieval Evaluation The performance of IR systems is measured by efficiency (query time, index space, response time) and effectiveness (relevance of the results, as judged by human reviewers).

  26. Review Precision vs Recall Precision: % of selected items that are correct Recall: % of correct items that are selected
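  A minimal numeric sketch of these two measures (the relevance judgments and system output below are invented for illustration):

  # 1 = relevant/correct, 0 = not
  y_true = [1, 1, 0, 1, 0, 1]     # what the assessors say
  y_pred = [1, 0, 0, 1, 1, 1]     # what the system selected

  tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
  fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
  fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

  precision = tp / (tp + fp)      # % of selected items that are correct
  recall    = tp / (tp + fn)      # % of correct items that are selected
  print(precision, recall)        # 0.75 0.75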

  27. Machine Learning Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn from training data and improve from experience without being explicitly programmed. For example, a machine learning system can be trained on email messages to learn to distinguish between spam and non-spam messages.

  28. AI and Machine Learning

  29. Machine Learning and Classification Text classification Classification tasks Machine learning is a broad area of AI concerned with the design and development of algorithms that learn patterns present in the data provided as input.

  30. Given input: A document d A fixed set of classes: C = {c1, c2,…, cJ} A training set of m hand-labeled documents (d1,c1),....,(dm,cm) Determine: A learning method or algorithm which will enable us to learn a classifier γ For a test document d, we assign it the class γ(d) ∈ C
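  As a concrete toy instance of this setup (a hedged sketch using scikit-learn; the documents, labels, and class names below are invented for illustration):

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline

  # toy hand-labeled training set (d_1, c_1), ..., (d_m, c_m)
  docs   = ["cheap viagra buy now", "meeting agenda attached",
            "win money today", "lunch at noon tomorrow"]
  labels = ["spam", "ham", "spam", "ham"]

  clf = make_pipeline(CountVectorizer(), MultinomialNB())   # the learned classifier γ
  clf.fit(docs, labels)
  print(clf.predict(["free money now"]))                    # γ(d) for a test document d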

  31. Example Problem Neural Network A typical problem that might be solved by a NN: the goal is to predict a person's political inclination based on his or her gender, age, home location, and annual income.

  32. Solution Neural Network (cont'd) Represents a neural network that predicts the political inclination of a male who is 35 years old, lives in a rural area, and has an annual income of $49,000.00

  33. Text Classification • The process of assigning documents to classes, i.e. of associating one or more class labels with each document, is commonly referred to as text classification. Remember Bayes’ rule? P(c | d) = P(d | c) P(c) / P(d) https://id.wikipedia.org/wiki/Teorema_Bayes

  34. Text classification

  35. QnA chatbot with Facebook fastText • fastText is a library for efficient learning of word representations and sentence classification https://github.com/cedextech/fastText
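  A minimal sketch of supervised classification with fastText's Python bindings, assuming the fasttext package is installed and a training file in fastText's "__label__<class> text..." format exists (the file name and query are hypothetical):

  import fasttext

  # each line of qa_train.txt looks like: __label__password_reset how do I change my password
  model = fasttext.train_supervised(input="qa_train.txt")
  labels, probs = model.predict("how do I reset my password")
  print(labels, probs)    # predicted label(s) and their probabilities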

  36. Spam filtering: another text classification task From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm =================================================

  37. Sec. 13.1 Document Classification (figure) A test document “planning language proof intelligence” is to be assigned to one of the classes under AI (ML, Planning, Semantics), Programming (Garb.Coll., Multimedia), or HCI (GUI, ...). The training data for each class is a set of example documents, e.g. “learning intelligence algorithm reinforcement network...” for ML, “planning temporal reasoning plan language...” for Planning, “programming semantics language proof...” for Semantics, and “garbage collection memory optimization region...” for Garb.Coll.

  38. Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.

  39. Source: Data Science & Big Data Analytics, 2015

  40. Source: Data Science & Big Data Analytics, 2015

  41. An algorithm is said to be unsupervised when no information on training examples, i.e. examples of documents that belong to pre-specified classes, is given as input.

  42. Supervised learning • Naive Bayes (simple, common) • k-Nearest Neighbors (simple, powerful) • Support-vector machines (new, generally more powerful) • Multi Layer Perceptron (MLP) Many commercial systems use a mixture of methods

  43. Training Usually, the larger the number of training examples, the better the fine tuning of the classifier. If the classifier fits the training examples so closely that it cannot be used to predict the classes of new objects, this is commonly referred to as overfitting. To evaluate the classifier, we apply it to a set of unseen objects – the test set.
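  A minimal held-out-evaluation sketch with scikit-learn (the dataset and split ratio are just for illustration): comparing training accuracy with test-set accuracy is a simple way to spot overfitting.

  from sklearn.model_selection import train_test_split
  from sklearn.naive_bayes import GaussianNB
  from sklearn import datasets

  iris = datasets.load_iris()
  X_train, X_test, y_train, y_test = train_test_split(
      iris.data, iris.target, test_size=0.3, random_state=0)   # hold out unseen objects

  clf = GaussianNB().fit(X_train, y_train)
  print("train accuracy:", clf.score(X_train, y_train))   # accuracy on seen data
  print("test accuracy:",  clf.score(X_test, y_test))     # accuracy on the test set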

  44. Naive Bayes Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.

  45. Gaussian Naïve Bayes
  from sklearn import datasets
  from sklearn.naive_bayes import GaussianNB

  iris = datasets.load_iris()
  gnb = GaussianNB()
  y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
  print("Number of mislabeled points out of a total %d points : %d"
        % (iris.data.shape[0], (iris.target != y_pred).sum()))
  # Output: Number of mislabeled points out of a total 150 points : 6
  The parameters σy and μy are estimated using maximum likelihood.

  46. Gender classification using Naive Bayes Classifier

  47. Sentence Classification using Naive Bayes Classifier
  from textblob.classifiers import NaiveBayesClassifier

  train = [
      ('I love this sandwich.', 'pos'),
      ('this is an amazing place!', 'pos'),
      ('I feel very good about these beers.', 'pos'),
      ('this is my best work.', 'pos'),
      ('I am tired of this stuff.', 'neg'),
  ]
  cl = NaiveBayesClassifier(train)
  print(cl.classify("This is an amazing library!"))   # -> 'pos'
