Modeling the Internet and the Web: Text Analysis

Outline • Indexing • Lexical processing • Content-based ranking • Probabilistic retrieval • Latent semantic analysis • Text categorization • Exploiting hyperlinks • Document clustering • Information extraction

Information Retrieval • Analyzing the textual content of individual Web pages • given user’s query • determine a maximally related subset of documents • Retrieval • index a collection of documents (access efficiency) • rank documents by importance (accuracy) • Categorization (classification) • assign a document to one or more categories

Indexing • Inverted index • effective for very large collections of documents • associates lexical items to their occurrences in the collection • Terms ω • lexical items: words or expressions • Vocabulary V • the set of terms of interest

Inverted Index • The simplest example • a dictionary • each key is a term ω ∈ V • associated value b(ω) points to a bucket (posting list) • a bucket is a list of pointers marking all occurrences of ω in the text collection

Inverted Index • Bucket entries: • document identifier (DID) • the ordinal number within the collection • separate entry for each occurrence of the term • DID • offset (in characters) of term’s occurrence within this document • present a user with a short context • enables vicinity queries

Inverted Index Construction • Parse documents • Extract terms ω_i • if ω_i is not present • insert ω_i in the inverted index • Insert the occurrence in the bucket

Searching with Inverted Index • To find a term ω in an indexed collection of documents • obtain b(ω) from the inverted index • scan the bucket to obtain list of occurrences • To find k terms • get k lists of occurrences • combine lists by elementary set operations
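
The construction and search steps above can be sketched in a few lines. This is a minimal in-memory illustration, not the slides' implementation: the function names, whitespace tokenization, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its bucket: a list of (DID, character offset) entries."""
    index = defaultdict(list)
    for did, text in enumerate(docs):
        lowered = text.lower()
        position = 0
        for token in lowered.split():          # assumption: whitespace tokenization
            position = lowered.index(token, position)
            index[token].append((did, position))
            position += len(token)
    return index

def search_all(index, terms):
    """Conjunctive query: intersect the DID sets found in each term's bucket."""
    did_sets = [{did for did, _ in index.get(t, [])} for t in terms]
    return set.intersection(*did_sets) if did_sets else set()

docs = ["web graphs grow", "the web is large", "random graphs and the web"]
index = build_inverted_index(docs)
matches = search_all(index, ["web", "graphs"])
```

Storing the offset alongside the DID is what makes context snippets and vicinity queries possible; a disjunctive query would use a set union in place of the intersection.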

Inverted Index Implementation • Size = Θ(|V|) • Implemented using a hash table • Buckets stored in memory • construction algorithm is trivial • Buckets stored on disk • the trivial algorithm is impractical due to disk access time • use specialized secondary memory algorithms

Bucket Compression • Reduce memory for each pointer in the buckets: • for each term sort occurrences by DID • store as a list of gaps - the sequence of differences between successive DIDs • Advantage – significant memory saving • frequent terms produce many small gaps • small integers encoded by short variable-length codewords • Example: the sequence of DIDs (14, 22, 38, 42, 66, 122, 131, 226) is stored as the sequence of gaps (14, 8, 16, 4, 24, 56, 9, 95)
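
The gap example can be checked with a short sketch. The slides do not name a particular variable-length code; the variable-byte scheme below (7 data bits per byte, high bit marking the final byte of each number) is an assumption chosen for illustration, as are the function names.

```python
def to_gaps(dids):
    """Sorted DIDs -> first DID followed by differences between successive DIDs."""
    dids = list(dids)
    return [d - p for d, p in zip(dids, [0] + dids[:-1])]

def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    dids, total = [], 0
    for g in gaps:
        total += g
        dids.append(total)
    return dids

def vbyte_encode(numbers):
    """Variable-byte code (assumed scheme): small integers take a single byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80              # flag the lowest-order group, which is written last
        out.extend(reversed(chunk))
    return bytes(out)

dids = [14, 22, 38, 42, 66, 122, 131, 226]
gaps = to_gaps(dids)
```

Under this code the eight gaps fit in 8 bytes, while the raw DIDs need 10 because 131 and 226 no longer fit in 7 bits; the saving grows with longer buckets and larger DIDs.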

Lexical Processing • Performed prior to indexing or converting documents to vector representations • Tokenization • extraction of terms from a document • Text conflation and vocabulary reduction • Stemming • reducing words to their root forms • Removing stop words • common words, such as articles, prepositions, non-informative adverbs • 20-30% index size reduction

Tokenization • Extraction of terms from a document • stripping out • administrative metadata • structural or formatting elements • Example • removing HTML tags • removing punctuation and special characters • folding character case (e.g. all to lower case)
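
A minimal sketch of these tokenization steps, combined with stop-word removal, might look as follows. The regular expressions are a crude stand-in for a real HTML parser, and the stop-word list is a tiny illustrative sample; both are assumptions for the example.

```python
import re

# Illustrative stop-word sample only; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}

def tokenize(html, stop_words=STOP_WORDS):
    text = re.sub(r"<[^>]+>", " ", html)      # remove HTML tags
    text = text.lower()                        # fold character case
    tokens = re.findall(r"[a-z0-9]+", text)    # drop punctuation and special characters
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("<p>The <b>Web</b> is a graph, of pages!</p>")
```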

Stemming • Want to reduce all morphological variants of a word to a single index term • e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document) • Stemming - reduce words to their root form • e.g. fish – becomes a new index term • Porter stemming algorithm (1980) • relies on a preconstructed suffix list with associated rules • e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE • BINARIZATION => BINARIZE
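
The IZATION rule quoted above can be sketched in isolation. This is only one rule of the Porter algorithm, and the vowel test is simplified (the letter y is always treated as a consonant); the function names are assumptions for the example.

```python
import re

def measure(stem):
    """Count vowel-run + consonant-run sequences in the stem (simplified: y is a consonant)."""
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def apply_ization_rule(word):
    """Replace the suffix IZATION by IZE when the remaining prefix has a vowel followed by a consonant."""
    w = word.lower()
    if w.endswith("ization"):
        stem = w[: -len("ization")]
        if measure(stem) > 0:
            return stem + "ize"
    return w

stemmed = apply_ization_rule("BINARIZATION")
```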

Content Based Ranking • A boolean query • results in several matching documents • e.g., a user query in Google: ‘Web AND graphs’, results in 4,040,000 matches • Problem • user can examine only a fraction of the results • Content based ranking • arrange results in the order of relevance to user

Choice of Weights What weights retrieve most relevant pages?

Vector-space Model • Text documents are mapped to a high-dimensional vector space • Each document d • represented as a sequence of terms ω(t): d = (ω(1), ω(2), ω(3), …, ω(|d|)) • Unique terms in a set of documents • determine the dimension of a vector space

Example • Boolean representation of vectors: • V = [ web, graph, net, page, complex ] • V1 = [1 1 0 0 0] • V2 = [1 1 1 0 0] • V3 = [1 0 0 1 1]

Vector-space Model • ω_1, ω_2 and ω_3 are terms in the document; x′ and x″ are document vectors • Vector-space representations are sparse, |V| >> |d|

Term frequency (TF) • A term that appears many times within a document is likely to be more important than a term that appears only once • n_ij - number of occurrences of a term ω_j in a document d_i • Term frequency

Inverse document frequency (IDF) • A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents • n_j - number of documents which contain the term ω_j • n - total number of documents in the set • Inverse document frequency

Full Weighting (TF-IDF) • The TF-IDF weight of a term ω_j in document d_i is the product of its term frequency and its inverse document frequency: x_ij = TF_ij · IDF_j
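
The weighting can be sketched as follows. The exact normalizations are assumptions: the code uses the common choices TF_ij = n_ij / |d_i| and IDF_j = log(n / n_j), which may differ from the variant intended on the slides. The sample documents reuse the vocabulary of the Boolean example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)                                              # total number of documents
    df = Counter(term for doc in docs for term in set(doc))    # n_j: documents containing term j
    vectors = []
    for doc in docs:
        counts = Counter(doc)                                  # n_ij: occurrences of term j in d_i
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])  # assumed TF and IDF definitions
            for term, count in counts.items()
        })
    return vectors

docs = [["web", "graph"], ["web", "graph", "net"], ["web", "page", "complex"]]
vectors = tf_idf(docs)
```

With these definitions a term that occurs in every document, such as "web" here, gets weight zero, which matches the intuition that it does not discriminate between documents.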

Document Similarity • Ranks documents by measuring the similarity between each document and the query • Similarity between two documents d′ and d″ is a function s(d′, d″) ∈ ℝ • In a vector-space representation the cosine coefficient of two document vectors is a measure of similarity

Cosine Coefficient • The cosine of the angle formed by two document vectors x′ and x″ is cos(x′, x″) = (x′ · x″) / (‖x′‖ ‖x″‖) • Documents with many common terms will have vectors closer to each other than documents with fewer overlapping terms
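
As a small worked example, the cosine coefficient can be computed for the three Boolean vectors from the earlier example. The function name and the convention of returning 0 for an all-zero vector are assumptions.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # assumed convention for empty documents
    return dot / (norm_x * norm_y)

# Boolean vectors over [web, graph, net, page, complex]
V1 = [1, 1, 0, 0, 0]
V2 = [1, 1, 1, 0, 0]
V3 = [1, 0, 0, 1, 1]
sim_12 = cosine(V1, V2)   # two shared terms
sim_13 = cosine(V1, V3)   # one shared term
```

V1 and V2 share two terms and V1 and V3 share only one, so sim_12 comes out larger than sim_13.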

Retrieval and Evaluation • Compute document vectors for a set of documents D • Find the vector associated with the user query q • Using s(x_i, q), i = 1, …, n, assign a similarity score for each document • Retrieve top ranking documents R • Compare R with R* - documents actually relevant to the query

Retrieval and Evaluation Measures • Precision (π) - fraction of retrieved documents that are actually relevant: π = |R ∩ R*| / |R| • Recall (ρ) - fraction of relevant documents that are retrieved: ρ = |R ∩ R*| / |R*|
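
Both measures follow directly from the retrieved set R and the relevant set R*. A minimal sketch, with hypothetical document identifiers and an assumed convention of returning 0 when a set is empty:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set R and a relevant set R*."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 actually relevant, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```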

Probabilistic Retrieval • Probabilistic Ranking Principle (PRP) (Robertson, 1977) • ranking of the documents in the order of decreasing probability of relevance to the user query • probabilities are estimated as accurately as possible on the basis of available data • overall effectiveness of such a system will be the best obtainable

Probabilistic Model • PRP can be stated by introducing a Boolean variable R (relevance) for a document d, for a given user query q as P(R | d,q) • Documents should be retrieved in order of decreasing probability • d - document that has not yet been retrieved

Latent Semantic Analysis • Why is it needed? • serious problems for retrieval methods based on term matching • vector-space similarity approach works only if the terms of the query are explicitly present in the relevant documents • rich expressive power of natural language • often queries contain terms that express concepts related to the text to be retrieved

Synonymy and Polysemy • Synonymy • the same concept can be expressed using different sets of terms • e.g. bandit, brigand, thief • negatively affects recall • Polysemy • identical terms can be used in very different semantic contexts • e.g. bank • repository where important material is saved • the slope beside a body of water • negatively affects precision

Latent Semantic Indexing (LSI) • A statistical technique • Uses linear algebra technique called singular value decomposition (SVD) • attempts to estimate the hidden structure • discovers the most important associative patterns between words and concepts • Data driven

LSI and Text Documents • Let X denote a term-document matrix X = [x_1 … x_n]^T • each row is the vector-space representation of a document • each column contains occurrences of a term in each document in the dataset • Latent semantic indexing • compute the SVD of X: X = U Σ V^T • Σ - singular value matrix • set to zero all but the largest K singular values, giving Σ̂ • obtain the reconstruction of X by: X̂ = U Σ̂ V^T

LSI Example • A collection of documents:
d1: Indian government goes for open-source software
d2: Debian 3.0 Woody released
d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0
d4: gnuPOD released: iPOD on Linux… with GPLed software
d5: Gentoo servers running at open-source mySQL database
d6: Dolly the sheep not totally identical clone
d7: DNA news: introduced low-cost human genome DNA chip
d8: Malaria-parasite genome database on the Web
d9: UK sets up genome bank to protect rare sheep breeds
d10: Dolly’s DNA damaged

LSI Example • The term-document matrix X^T
              d1   d2   d3   d4   d5   d6   d7   d8   d9  d10
open-source    1    0    0    0    1    0    0    0    0    0
software       1    0    0    1    0    0    0    0    0    0
Linux          0    0    0    1    0    0    0    0    0    0
released       0    1    1    1    0    0    0    0    0    0
Debian         0    1    1    0    0    0    0    0    0    0
Gentoo         0    0    1    0    1    0    0    0    0    0
database       0    0    0    0    1    0    0    1    0    0
Dolly          0    0    0    0    0    1    0    0    0    1
sheep          0    0    0    0    0    1    0    0    0    0
genome         0    0    0    0    0    0    1    1    1    0
DNA            0    0    0    0    0    0    2    0    0    1

LSI Example • The reconstructed term-document matrix after projecting on a subspace of dimension K=2 • Σ = diag(2.57, 2.49, 1.99, 1.9, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)
                d1     d2     d3     d4     d5     d6     d7     d8     d9    d10
open-source   0.34   0.28   0.38   0.42   0.24   0.00   0.04   0.07   0.02   0.01
software      0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
Linux         0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
released      0.63   0.53   0.72   0.79   0.45  -0.01  -0.05   0.09  -0.00  -0.04
Debian        0.39   0.33   0.44   0.48   0.28  -0.01  -0.03   0.06   0.00  -0.02
Gentoo        0.36   0.30   0.41   0.45   0.26   0.00   0.03   0.07   0.02   0.01
database      0.17   0.14   0.19   0.21   0.14   0.04   0.25   0.11   0.09   0.12
Dolly        -0.01  -0.01  -0.01  -0.02   0.03   0.08   0.45   0.13   0.14   0.21
sheep        -0.00  -0.00  -0.00  -0.01   0.03   0.06   0.34   0.10   0.11   0.16
genome        0.02   0.01   0.02   0.01   0.10   0.19   1.11   0.34   0.36   0.53
DNA          -0.03  -0.04  -0.04  -0.06   0.11   0.30   1.70   0.51   0.55   0.81
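
The rank-K reconstruction can be sketched without a linear algebra library. The code below substitutes power iteration with deflation for a full SVD routine (a simple but slow stand-in), applied to the 0/1 term-document matrix listed in the example. Function and variable names are assumptions, and the output is not claimed to reproduce the figures in the table above digit for digit.

```python
import math

# Term-document matrix from the example (rows: terms, columns: d1..d10).
X_T = [
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # open-source
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # software
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # Linux
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # released
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # Debian
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],  # Gentoo
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],  # database
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],  # Dolly
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # sheep
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],  # genome
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 1],  # DNA
]

def rank_k_approximation(A, k, iters=1000):
    """Sum of the k leading singular triplets, found by power iteration with deflation."""
    rows, cols = len(A), len(A[0])
    residual = [[float(a) for a in row] for row in A]
    approx = [[0.0] * cols for _ in range(rows)]
    sigmas = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(cols)]   # fixed, generic start vector
        for _ in range(iters):
            u = [sum(r[j] * v[j] for j in range(cols)) for r in residual]               # u = R v
            v = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = R^T u
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        u = [sum(r[j] * v[j] for j in range(cols)) for r in residual]
        sigma = math.sqrt(sum(x * x for x in u))   # singular value estimate
        u = [x / sigma for x in u]
        sigmas.append(sigma)
        for i in range(rows):
            for j in range(cols):
                approx[i][j] += sigma * u[i] * v[j]
                residual[i][j] -= sigma * u[i] * v[j]   # deflate the component just found
    return approx, sigmas

X_hat, sigmas = rank_k_approximation(X_T, k=2)
```

In the reconstruction, terms that never co-occur directly can still receive nonzero weight in a document when they share a latent concept with that document's terms, which is the effect the table illustrates.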

Probabilistic LSA • Aspect model (aggregate Markov model) • let an event be the occurrence of a term ω in a document d • let z ∈ {z_1, …, z_K} be a latent (hidden) variable associated with each event • the probability of each event (ω, d) is P(ω, d) = P(d) Σ_z P(ω | z) P(z | d) • select a document from a density P(d) • select a latent concept z with probability P(z | d) • choose a term ω, sampling from P(ω | z)

Aspect Model Interpretation • In a probabilistic latent semantic space • each document is a vector • uniquely determined by the mixing coordinates P(z_k | d), k = 1, …, K • i.e., rather than being represented through terms, a document is represented through latent variables that in turn are responsible for generating terms.

Analogy with LSI • all n x m document-term joint probabilities: P = U Σ V^T • u_ik = P(d_i | z_k) • v_jk = P(ω_j | z_k) • σ_kk = P(z_k) • P is a properly normalized probability distribution • entries are nonnegative

Fitting the Parameters • Parameters estimated by maximum likelihood using EM • E step • M step
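
For reference, the two steps can be written out as below. This is a sketch of the usual EM updates for the aspect model in the P(d) P(z | d) P(ω | z) parametrization; the count notation n(d, ω), the number of occurrences of term ω in document d, is an assumption introduced here.

```latex
\begin{align*}
\text{E step:}\quad
  & P(z \mid d, \omega) =
    \frac{P(\omega \mid z)\, P(z \mid d)}
         {\sum_{z'} P(\omega \mid z')\, P(z' \mid d)} \\[4pt]
\text{M step:}\quad
  & P(\omega \mid z) =
    \frac{\sum_{d} n(d, \omega)\, P(z \mid d, \omega)}
         {\sum_{d} \sum_{\omega'} n(d, \omega')\, P(z \mid d, \omega')} \\[4pt]
  & P(z \mid d) =
    \frac{\sum_{\omega} n(d, \omega)\, P(z \mid d, \omega)}
         {\sum_{\omega} n(d, \omega)}
\end{align*}
```

The E step computes the posterior of the latent concept for each observed (d, ω) pair; the M step re-estimates the parameters from counts weighted by those posteriors.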

Text Categorization • Grouping textual documents into different fixed classes • Examples • predict a topic of a Web page • decide whether a Web page is relevant with respect to the interests of a given user • Machine learning techniques • k nearest neighbors (k-NN) • Naïve Bayes • support vector machines

k Nearest Neighbors • Memory based • learns by memorizing all the training instances • Prediction of x’s class • measure distances between x and all training instances • return a set N(x, D, k) of the k points closest to x • predict a class for x by majority voting • Performs well in many domains • asymptotic error rate of the 1-NN classifier is at most twice the optimal Bayes error
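
The prediction procedure can be sketched as follows. Euclidean distance and the toy two-dimensional training set are assumptions; for text, the cosine coefficient on document vectors would be a natural alternative.

```python
import math
from collections import Counter

def knn_predict(x, training, k=3):
    """training: list of (vector, label) pairs. Majority vote among the k nearest points."""
    neighbors = sorted(training, key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training data with two well-separated classes.
training = [([0, 0], "a"), ([0, 1], "a"), ([5, 5], "b"), ([6, 5], "b"), ([5, 6], "b")]
label_near_a = knn_predict([1, 0], training, k=3)
label_near_b = knn_predict([5, 5], training, k=3)
```

There is no training phase: all the cost is paid at prediction time, when the distance to every stored instance is computed.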

Naïve Bayes • Estimates the conditional probability of the class given the document: P(c | d, θ) = P(d | c, θ) P(c | θ) / P(d) • θ - parameters of the model • P(d) – normalization factor (Σ_c P(c | d) = 1) • classes are assumed to be mutually exclusive • Assumption: the terms in a document are conditionally independent given the class • false, but often adequate • gives reasonable approximation • interested in discrimination among classes

Bernoulli Model • An event – a document as a whole • a bag of words • words are attributes of the event • vocabulary term ω is a Bernoulli attribute • 1, if ω is in the document • 0, otherwise • binary attributes are mutually independent given the class • the class is the only cause of appearance of each word in a document

Bernoulli Model • Generating a document • tossing |V| independent coins • the occurrence of each word in a document is a Bernoulli event • x_j = 1 [0] - ω_j does [does not] occur in d • P(ω_j | c) – probability of observing ω_j in documents of class c

Multinomial Model • Document – a sequence of events W_1, …, W_|d| • Take into account • number of occurrences of each word • length of the document • serial order among words • if significant, model with a Markov chain • otherwise assume word occurrences are independent – bag-of-words representation

Learning Naïve Bayes • Estimate parameters θ from the available data • Training data set is a collection of labeled documents { (d_i, c_i), i = 1, …, n }

Learning Bernoulli Model • θ_{c,j} = P(ω_j | c), j = 1, …, |V|, c = 1, …, K • estimated as • N_c = |{ i : c_i = c }| • x_ij = 1 if ω_j occurs in d_i • class prior probabilities θ_c = P(c) • estimated as
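
A minimal sketch of learning and applying the Bernoulli model follows. The smoothed estimate θ_{c,j} = (1 + Σ x_ij) / (2 + N_c) and the prior θ_c = N_c / n are assumed here as common choices, not necessarily the estimators used on the slides; the function names and toy data are also assumptions.

```python
import math

def train_bernoulli_nb(docs, labels, vocabulary):
    """docs: list of token lists. Returns (priors, theta) using add-one smoothing (assumed)."""
    n = len(docs)
    priors, theta = {}, {}
    for c in sorted(set(labels)):
        members = [set(d) for d, y in zip(docs, labels) if y == c]
        n_c = len(members)                                   # N_c
        priors[c] = n_c / n                                  # theta_c
        theta[c] = {w: (1 + sum(w in d for d in members)) / (2 + n_c) for w in vocabulary}
    return priors, theta

def predict_bernoulli_nb(doc, priors, theta):
    """Pick the class maximizing log P(c) + sum over ALL vocabulary terms, present or absent."""
    present = set(doc)
    def log_posterior(c):
        score = math.log(priors[c])
        for w, p in theta[c].items():
            score += math.log(p if w in present else 1 - p)
        return score
    return max(priors, key=log_posterior)

docs = [["linux", "software"], ["debian", "released"], ["dna", "genome"], ["sheep", "dna"]]
labels = ["tech", "tech", "bio", "bio"]
vocabulary = sorted({w for d in docs for w in d})
priors, theta = train_bernoulli_nb(docs, labels, vocabulary)
predicted = predict_bernoulli_nb(["dna", "chip"], priors, theta)
```

Unlike the multinomial model, absent vocabulary terms contribute to the score through the factor 1 − θ_{c,j}, and terms outside the vocabulary ("chip" here) are ignored.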

Learning Multinomial Model • Generative parameters θ_{c,j} = P(ω_j | c) • must satisfy Σ_j θ_{c,j} = 1 for each class c • Distributions of terms given the class • q_j and α are hyperparameters of Dirichlet prior • n_ij is the number of occurrences of ω_j in d_i • Unconditional class probabilities
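
The multinomial estimates can be sketched in the same style. The formula used, θ_{c,j} = (α q_j + Σ n_ij) / (α + Σ_l Σ n_il) with sums over the documents of class c, is an assumed form of Dirichlet smoothing; the defaults q_j = 1/|V| and α = |V| reduce it to add-one smoothing. Names and toy data are assumptions.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, vocabulary, alpha=None, q=None):
    """Returns (priors, theta); theta[c] sums to 1 over the vocabulary for each class c."""
    vocab = list(vocabulary)
    alpha = len(vocab) if alpha is None else alpha              # assumed default: alpha = |V|
    q = {w: 1 / len(vocab) for w in vocab} if q is None else q  # assumed default: uniform q_j
    priors, theta = {}, {}
    for c in sorted(set(labels)):
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d if w in q)
        total = sum(counts.values())
        priors[c] = labels.count(c) / len(labels)
        theta[c] = {w: (alpha * q[w] + counts[w]) / (alpha + total) for w in vocab}
    return priors, theta

def predict_multinomial_nb(doc, priors, theta):
    """Each occurrence of a vocabulary term contributes log theta[c][term] to the score."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(theta[c][w]) for w in doc if w in theta[c])
    return max(priors, key=log_posterior)

docs = [["dna", "dna", "genome"], ["linux", "software", "linux"]]
labels = ["bio", "tech"]
vocabulary = ["dna", "genome", "linux", "software"]
mn_priors, mn_theta = train_multinomial_nb(docs, labels, vocabulary)
mn_predicted = predict_multinomial_nb(["dna", "linux", "dna"], mn_priors, mn_theta)
```

Because repeated words are counted, the two occurrences of "dna" in the test document outweigh the single "linux".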
