INSYS 300 Text Analysis



Presentation Transcript


  1. INSYS 300 Text Analysis Dr. Xia Lin, Associate Professor, College of Information Science and Technology, Drexel University

  2. Improving the Indexing • So far we have treated words simply as tokens when creating the inverted index • To improve the indexing, we should also consider • meanings of words • structures of language • word usage

  3. Text Analysis • Word (token) extraction • Stop words • Stemming • Word frequency counts • Inverse document frequency • Zipf’s Law

  4. Stop words • Many of the most frequently used words in English are worthless for indexing – these words are called stop words. • the, of, and, to, … • There are typically about 400 to 500 such words • Why do we need to remove stop words? • To reduce indexing file size • stop words account for 20-30% of total word counts • To improve efficiency • stop words are not useful for searching • stop words always return a large number of hits

  5. Stop words • Potential problems of removing stop words • a small stop list does not improve indexing much • a large stop list may eliminate words that are useful for someone or for some purpose • stop words might be part of phrases • stop-word removal has to be applied to both indexing and queries (see the sketch below).
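
A minimal Python sketch (not from the slides) of tokenization with stop-word removal; the stop list here is a tiny illustrative sample rather than the 400-500 word lists the lecture refers to.

import re

STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "for"}

def tokenize(text):
    # lowercase and split on non-letters to get word tokens
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The removal of stop words reduces the size of the index.")
print(remove_stop_words(tokens))
# ['removal', 'stop', 'words', 'reduces', 'size', 'index']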

  6. Stemming • Techniques used to find the root/stem of a word • Example: index lookups for “user” and “engineering” forms, with their counts: • user 15, engineering 12 • users 4, engineered 23 • used 5, engineer 12 • using 5 • stems: use, engineer

  7. Advantages of stemming • improving effectiveness • matching similar words • reducing indexing size • combining words with the same root may reduce indexing size by as much as 40-50% • Criteria for stemming • correctness • retrieval effectiveness • compression performance

  8. Basic stemming methods • Use tables and rules • remove endings: • if a word ends with a consonant other than s, followed by an s, then delete the s. • if a word ends in es, drop the s. • if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th. • if a word ends in ed, preceded by a consonant, delete the ed unless this leaves only a single letter. • … (see the sketch after slide 9)

  9. transform the remaining word • if a word ends in “ies”, but not “eies” or “aies”, then change “ies” to “y”.
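
A minimal Python sketch (not from the slides) of the ending-removal and transformation rules listed on slides 8 and 9; it implements only the rules shown, so it is far from a complete stemmer.

VOWELS = "aeiou"

def simple_stem(word):
    # slide 9: "ies" -> "y", unless the word ends in "eies" or "aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # slide 8: if the word ends in "es", drop the s
    if word.endswith("es"):
        return word[:-1]
    # slide 8: a consonant other than s, followed by s -> delete the s
    if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS + "s":
        return word[:-1]
    # slide 8: ends in "ing" -> delete it, unless one letter or "th" remains
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # slide 8: ends in "ed" preceded by a consonant -> delete it,
    # unless only a single letter would remain
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]
    return word

for w in ["queries", "users", "engineering", "engineered"]:
    print(w, "->", simple_stem(w))
# queries -> query, users -> user, engineering -> engineer, engineered -> engineer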

  10. Example 1: The Porter stemming algorithm • A set of condition/action rules • conditions on the stem • conditions on the suffix • conditions on the rules • different combinations of conditions activate different rules. • Implementation (stem.c): • Stem(word) • …….. • ReplaceEnd(word, step1a_rule); • rule = ReplaceEnd(word, step1b_rule); • if ((rule == 106) || (rule == 107)) • ReplaceEnd(word, 1b1_rule); • … …
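
In practice the Porter algorithm is available in ready-made libraries; here is a short usage sketch with NLTK's PorterStemmer (NLTK is not mentioned in the slides, it is simply one convenient implementation).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "engineering", "users"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni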

  11. Example 2: Sound-based stemming • Soundex rules (letter → numeric equivalent): • B, F, P, V → 1 • C, G, J, K, Q, S, X, Z → 2 • D, T → 3 • L → 4 • M, N → 5 • R → 6 • A, E, I, O, U, W, Y → not coded • Words that sound similar often have the same code • The code is not unique • high compression rate
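
A minimal Python sketch (not from the slides) of the coding table above; it is a simplified Soundex that glosses over details of the full algorithm, such as the special treatment of H and W.

SOUNDEX = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX[ch] = digit

def soundex(word):
    word = word.upper()
    code = word[0]                    # keep the first letter
    prev = SOUNDEX.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX.get(ch, "")   # A, E, I, O, U, W, Y (and H) are not coded
        if digit and digit != prev:   # skip adjacent duplicates
            code += digit
        prev = digit
    return (code + "000")[:4]         # pad/truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))   # both R163: similar sounds, same code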

  12. Example 3: N-gram stemmers • An n-gram is n consecutive letters • a digram is 2 consecutive letters • a trigram is 3 consecutive letters • All digrams of the word “statistics” are • st ta at ti is st ti ic cs • unique: at cs ic is st ta ti • All digrams of “statistical” are • st ta at ti is st ti ic ca al • unique: al at ca ic is st ta ti

  13. The similarity of two words can be calculated from their unique digrams (Dice’s coefficient): • S = 2C / (A + B) • where • A is the number of unique digrams in the first word • B is the number of unique digrams in the second word • C is the number of unique digrams shared by the two words
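
A minimal Python sketch (not from the slides) computing this digram similarity for the two words from slide 12.

def digrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    d1, d2 = digrams(w1), digrams(w2)               # the A and B unique digram sets
    return 2 * len(d1 & d2) / (len(d1) + len(d2))   # S = 2C / (A + B)

print(digram_similarity("statistics", "statistical"))
# 7 and 8 unique digrams, 6 shared: 2*6 / (7+8) = 0.8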

  14. Frequency counts • The idea • What a computer does best is count • count the number of times a word occurs in a document • count the number of documents in a collection that contain a word • Use occurrence frequencies to indicate the relative importance of a word in a document • if a word appears often in a document, the document likely “deals with” subjects related to that word.

  15. Use occurrence frequencies to select the most useful words for indexing a document collection • if a word appears in every document, it is not a good indexing word • if a word appears in only one or two documents, it may not be a good indexing word • if a word appears in a title, each occurrence may be counted 5 (or 10) times.

  16. Salton’s Vector Space • A document is represented as a vector: • (W1, W2, … , Wn) • Binary: • Wi = 1 if the corresponding term is in the document • Wi = 0 if the term is not in the document • TF (Term Frequency): • Wi = tfi, where tfi is the number of times term i occurs in the document • TF*IDF (Term Frequency * Inverse Document Frequency): • Wi = tfi * idfi = tfi * (1 + log(N / dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection.
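
A minimal Python sketch (not from the slides) building the three kinds of document weights above over a fixed vocabulary, using Wi = tfi * (1 + log(N / dfi)).

import math

def doc_vectors(docs, vocab):
    # docs: list of token lists; vocab: ordered list of index terms
    N = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}   # document frequencies
    vectors = []
    for d in docs:
        tf = [d.count(t) for t in vocab]                      # TF weights
        binary = [1 if f > 0 else 0 for f in tf]              # binary weights
        tfidf = [f * (1 + math.log(N / df[t])) if df[t] else 0.0
                 for f, t in zip(tf, vocab)]                  # TF*IDF weights
        vectors.append({"binary": binary, "tf": tf, "tfidf": tfidf})
    return vectors

docs = ["a b c a".split(), "b b d".split(), "a d d".split()]
print(doc_vectors(docs, ["a", "b", "c", "d"]))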

  17. Inverse Document Frequency • idfk = 1 + log(N / Dk) (as in the TF*IDF weight on the previous slide) • where N is the total number of documents and Dk is the number of documents that contain the k-th term.

  18. IDF-based Indexing

  19. Example: • D1: a b c a f o n l p o f t y x • D2: a m o e e e n n n a n p l • D3: r a c e e f n l i f f f f x l • D4: a f f f f c d e e f g h l l x • Calculate the term frequencies of terms a, b, and c in each document. • Calculate the inverse document frequencies of a, b, and c.
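
A minimal Python sketch (not from the slides) working through this exercise: term frequencies of a, b, and c in D1-D4 and their document/inverse document frequencies.

import math

docs = {
    "D1": "a b c a f o n l p o f t y x".split(),
    "D2": "a m o e e e n n n a n p l".split(),
    "D3": "r a c e e f n l i f f f f x l".split(),
    "D4": "a f f f f c d e e f g h l l x".split(),
}
terms = ["a", "b", "c"]
N = len(docs)

for name, tokens in docs.items():
    print(name, {t: tokens.count(t) for t in terms})          # term frequencies

for t in terms:
    df = sum(1 for tokens in docs.values() if t in tokens)    # document frequency
    print(t, "df =", df, "idf =", round(1 + math.log(N / df), 2))
# a appears in all 4 documents (df = 4), b in 1, c in 3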

  20. Automatic indexing 1. Parse individual words (tokens) 2. Remove stop words 3. Stem the words 4. Use frequency data • decide the head (high-frequency) threshold • decide the tail (low-frequency) threshold • decide the variance of counting

  21. 5. Create the indexing structure • inverted index • other structures
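
A compact Python sketch (not from the slides) chaining steps 1-5 into an inverted index; the stop list and the one-character plural rule are toy stand-ins for the components described earlier.

import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "to", "a", "in"}

def index_terms(text):
    tokens = re.split(r"[^a-z]+", text.lower())                  # 1. parse tokens
    tokens = [t for t in tokens if t and t not in STOP_WORDS]    # 2. remove stop words
    return [t[:-1] if t.endswith("s") else t for t in tokens]    # 3. crude stemming

def build_inverted_index(docs, min_df=1, max_df_ratio=0.9):
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    N = len(docs)
    # 4. frequency thresholds: drop terms that are too rare or appear everywhere
    # 5. keep the surviving postings as the inverted index
    return {t: sorted(ids) for t, ids in postings.items()
            if len(ids) >= min_df and len(ids) / N <= max_df_ratio}

docs = {"d1": "Users of the indexing system", "d2": "The system indexes documents"}
print(build_inverted_index(docs))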

  22. More about Counting • Zipf’s Law: • in a large, well-written English document, r * f = c, where r is a word’s frequency rank, f is the number of times that word is used in the document, and c is a constant.

  23. Zipf’s Law is an approximate, empirical observation. • Examples: • word frequencies in Alice in Wonderland • Zipf’s Law has been verified over many years on many different collections. • There are also many revised versions of Zipf’s Law.
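
A minimal Python sketch (not from the slides) for checking Zipf's law on any plain-text file: rank words by frequency and print r * f, which should come out roughly constant. The file name is only a placeholder.

import re
from collections import Counter

def zipf_check(path, top=10):
    text = open(path, encoding="utf-8").read().lower()
    counts = Counter(re.findall(r"[a-z]+", text))
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, "r*f =", rank * freq)

# zipf_check("alice_in_wonderland.txt")   # hypothetical file name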

  24. More about Counting • English Letter Usage Statistics • Letter use frequencies: • E: 72881 12.4% • T: 52397 8.9% • A: 47072 8.0% • O: 45116 7.6% • N: 41316 7.0% • I: 39710 6.7% • H: 38334 6.5%

  25. Doubled letter frequencies: • LL: 2979 20.6% • EE: 2146 14.8% • SS: 2128 14.7% • OO: 2064 14.3% • TT: 1169 8.1% • RR: 1068 7.4% • --: 701 4.8% • PP: 628 4.3% • FF: 430 2.9%

  26. Initial letter frequencies: • T: 20665 15.2% • A: 15564 11.4% • H: 11623 8.5% • W: 9597 7.0% • I: 9468 6.9% • S: 9376 6.9% • O: 8205 6.0% • M: 6293 4.6% • B: 5831 4.2%

  27. Ending letter frequencies: • E: 26439 19.4% • D: 17313 12.7% • S: 14737 10.8% • T: 13685 10.0% • N: 10525 7.7% • R: 9491 6.9% • Y: 7915 5.8% • O: 6226 4.5%

  28. Term Associations • Counting word pairs • If two words appear together very often, they are likely to be a phrase • Counting document pairs • if two documents have many common words, they are likely related
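
A minimal Python sketch (not from the slides) of word-pair counting: adjacent pairs that occur very often are candidate phrases.

from collections import Counter

def count_word_pairs(tokens):
    return Counter(zip(tokens, tokens[1:]))

tokens = "information retrieval systems use information retrieval models".split()
print(count_word_pairs(tokens).most_common(2))
# [(('information', 'retrieval'), 2), ...]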

  29. More Counting • Counting citation pairs • If documents A and B both cite documents C and D, then A and B might be related. • If documents C and D are often cited together, they are likely related.

  30. Co-Citation • The college has a more than 20-year tradition of co-citation research. • Co-citation is the mentioning of any two earlier documents in the bibliographic references of a later third document. • Diagram: a later Document 3 cites both Document 1 and Document 2.
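
A minimal Python sketch (not from the slides) counting co-citations: within each later document's reference list, every pair of cited documents is co-cited once.

from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    counts = Counter()
    for refs in reference_lists:
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

refs = [["doc1", "doc2", "doc5"], ["doc1", "doc2"], ["doc2", "doc5"]]
print(cocitation_counts(refs).most_common(1))   # [(('doc1', 'doc2'), 2)]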

  31. Co-Citation Analysis • The count of mentions may grow over time as new writings appear. Thus, co-citation counts can reflect citers’ changing perceptions of documents as more or less strongly related. • Documents shown to be related by their co-citation counts can be mapped as proximate in intellectual space.

  32. Co-Citation Mapping • Detects patterns in the frequency with which any works by any two authors are jointly cited in later works. • Only recurrent co-citation is significant: The more times authors are cited together, the more strongly related they are in the eyes of citers.

  33. AuthorLinks

  34. Midterms • Concepts • What is information retrieval? • Data, information, text, and documents • What is a controlled vocabulary? • Two abstraction principles • Considerations of document representation • Queries and query formats • What is a document vector space? • What are tf and idf?

  35. Procedure & problem solving • steps of creating automatic indexing • creating vector spaces • calculating similarity • calculating tf and idf • Boolean query matching • Vector query matching • Discussions • Advantages and disadvantages of …. • What can we do to improve automatic indexing? • Why do we do … …
