INSYS 300 Text Analysis



Presentation Transcript


  1. INSYS 300 Text Analysis Dr. Xia Lin, Associate Professor, College of Information Science and Technology, Drexel University

  2. Improving the Indexing • So far we have treated words simply as tokens when creating the inverted index • To improve the indexing, we should also consider • meanings of words • structures of language • word usage

  3. Text Analysis • Word (token) extraction • Stop words • Stemming • Word frequency counts • Inverse document frequency • Zipf’s Law

  4. Stop words • Many of the most frequently used words in English are worthless for indexing – these words are called stop words. • the, of, and, to, … • There are typically about 400 to 500 such words • Why do we need to remove stop words? • To reduce indexing file size • stop words account for 20-30% of total word counts • To improve efficiency • stop words are not useful for searching • stop words always return a large number of hits

  5. Stop words • Potential problems of removing stop words • a small stop list does not improve indexing much • a large stop list may eliminate words that are useful for someone or for some purpose • stop words might be part of phrases • stop-word removal has to be applied to both indexing and queries (see the sketch below).
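
A minimal Python sketch (not from the slides) of tokenization with stop-word removal; the stop list here is a tiny illustrative sample rather than the 400-500 word lists the lecture refers to.

import re

STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "for"}

def tokenize(text):
    # lowercase and split on non-letters to get word tokens
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The removal of stop words reduces the size of the index.")
print(remove_stop_words(tokens))
# ['removal', 'stop', 'words', 'reduces', 'size', 'index']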

  6. Stemming • Techniques used to find the root/stem of a word • Example: index lookups for “user” and “engineering” forms, with their counts: • user 15, engineering 12 • users 4, engineered 23 • used 5, engineer 12 • using 5 • stems: use, engineer

  7. Advantages of stemming • improving effectiveness • matching similar words • reducing indexing size • combining words with the same root may reduce indexing size by as much as 40-50% • Criteria for stemming • correctness • retrieval effectiveness • compression performance

  8. Basic stemming methods • Use tables and rules • remove endings: • if a word ends with a consonant other than s, followed by an s, then delete the s. • if a word ends in es, drop the s. • if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th. • if a word ends in ed, preceded by a consonant, delete the ed unless this leaves only a single letter. • … (see the sketch after slide 9)

  9. transform the remaining word • if a word ends in “ies”, but not “eies” or “aies”, then change “ies” to “y”.
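
A minimal Python sketch (not from the slides) of the ending-removal and transformation rules listed on slides 8 and 9; it implements only the rules shown, so it is far from a complete stemmer.

VOWELS = "aeiou"

def simple_stem(word):
    # slide 9: "ies" -> "y", unless the word ends in "eies" or "aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # slide 8: if the word ends in "es", drop the s
    if word.endswith("es"):
        return word[:-1]
    # slide 8: a consonant other than s, followed by s -> delete the s
    if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS + "s":
        return word[:-1]
    # slide 8: ends in "ing" -> delete it, unless one letter or "th" remains
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # slide 8: ends in "ed" preceded by a consonant -> delete it,
    # unless only a single letter would remain
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]
    return word

for w in ["queries", "users", "engineering", "engineered"]:
    print(w, "->", simple_stem(w))
# queries -> query, users -> user, engineering -> engineer, engineered -> engineer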

  10. Example 1: The Porter stemming algorithm • A set of condition/action rules • conditions on the stem • conditions on the suffix • conditions on the rules • different combinations of conditions activate different rules. • Implementation (stem.c): • Stem(word) • …….. • ReplaceEnd(word, step1a_rule); • rule = ReplaceEnd(word, step1b_rule); • if ((rule == 106) || (rule == 107)) • ReplaceEnd(word, 1b1_rule); • … …
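
In practice the Porter algorithm is available in ready-made libraries; here is a short usage sketch with NLTK's PorterStemmer (NLTK is not mentioned in the slides, it is simply one convenient implementation).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "engineering", "users"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni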

  11. Example 2: Sound-based stemming • Soundex rules (letter → numeric equivalent): • B, F, P, V → 1 • C, G, J, K, Q, S, X, Z → 2 • D, T → 3 • L → 4 • M, N → 5 • R → 6 • A, E, I, O, U, W, Y → not coded • Words that sound similar often have the same code • The code is not unique • high compression rate
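
A minimal Python sketch (not from the slides) of the coding table above; it is a simplified Soundex that glosses over details of the full algorithm, such as the special treatment of H and W.

SOUNDEX = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX[ch] = digit

def soundex(word):
    word = word.upper()
    code = word[0]                    # keep the first letter
    prev = SOUNDEX.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX.get(ch, "")   # A, E, I, O, U, W, Y (and H) are not coded
        if digit and digit != prev:   # skip adjacent duplicates
            code += digit
        prev = digit
    return (code + "000")[:4]         # pad/truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))   # both R163: similar sounds, same code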

  12. Example 3: N-gram stemmers • An n-gram is n consecutive letters • a digram is 2 consecutive letters • a trigram is 3 consecutive letters • All digrams of the word “statistics” are • st ta at ti is st ti ic cs • unique: at cs ic is st ta ti • All digrams of “statistical” are • st ta at ti is st ti ic ca al • unique: al at ca ic is st ta ti

  13. The similarity of two words can be calculated from their unique digrams (Dice’s coefficient): • S = 2C / (A + B) • where • A is the number of unique digrams in the first word • B is the number of unique digrams in the second word • C is the number of unique digrams shared by the two words
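
A minimal Python sketch (not from the slides) computing this digram similarity for the two words from slide 12.

def digrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    d1, d2 = digrams(w1), digrams(w2)               # the A and B unique digram sets
    return 2 * len(d1 & d2) / (len(d1) + len(d2))   # S = 2C / (A + B)

print(digram_similarity("statistics", "statistical"))
# 7 and 8 unique digrams, 6 shared: 2*6 / (7+8) = 0.8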

  14. Frequency counts • The idea • What a computer does best is count • count the number of times a word occurs in a document • count the number of documents in a collection that contain a word • Use occurrence frequencies to indicate the relative importance of a word in a document • if a word appears often in a document, the document likely “deals with” subjects related to that word.

  15. Use occurrence frequencies to select the most useful words for indexing a document collection • if a word appears in every document, it is not a good indexing word • if a word appears in only one or two documents, it may not be a good indexing word • if a word appears in a title, each occurrence may be counted 5 (or 10) times.

  16. Salton’s Vector Space • A document is represented as a vector: • (W1, W2, … , Wn) • Binary: • Wi = 1 if the corresponding term is in the document • Wi = 0 if the term is not in the document • TF (Term Frequency): • Wi = tfi, where tfi is the number of times term i occurs in the document • TF*IDF (Term Frequency * Inverse Document Frequency): • Wi = tfi * idfi = tfi * (1 + log(N / dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection.
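
A minimal Python sketch (not from the slides) building the three kinds of document weights above over a fixed vocabulary, using Wi = tfi * (1 + log(N / dfi)).

import math

def doc_vectors(docs, vocab):
    # docs: list of token lists; vocab: ordered list of index terms
    N = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}   # document frequencies
    vectors = []
    for d in docs:
        tf = [d.count(t) for t in vocab]                      # TF weights
        binary = [1 if f > 0 else 0 for f in tf]              # binary weights
        tfidf = [f * (1 + math.log(N / df[t])) if df[t] else 0.0
                 for f, t in zip(tf, vocab)]                  # TF*IDF weights
        vectors.append({"binary": binary, "tf": tf, "tfidf": tfidf})
    return vectors

docs = ["a b c a".split(), "b b d".split(), "a d d".split()]
print(doc_vectors(docs, ["a", "b", "c", "d"]))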

  17. Inverse Document Frequency • idfk = 1 + log(N / Dk) (as in the TF*IDF weight on the previous slide) • where N is the total number of documents and Dk is the number of documents that contain the k-th term.

  18. IDF-based Indexing

  19. Example: • D1: a b c a f o n l p o f t y x • D2: a m o e e e n n n a n p l • D3: r a c e e f n l i f f f f x l • D4: a f f f f c d e e f g h l l x • Calculate the term frequencies of terms a, b, and c in each document. • Calculate the inverse document frequencies of a, b, and c.
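
A minimal Python sketch (not from the slides) working through this exercise: term frequencies of a, b, and c in D1-D4 and their document/inverse document frequencies.

import math

docs = {
    "D1": "a b c a f o n l p o f t y x".split(),
    "D2": "a m o e e e n n n a n p l".split(),
    "D3": "r a c e e f n l i f f f f x l".split(),
    "D4": "a f f f f c d e e f g h l l x".split(),
}
terms = ["a", "b", "c"]
N = len(docs)

for name, tokens in docs.items():
    print(name, {t: tokens.count(t) for t in terms})          # term frequencies

for t in terms:
    df = sum(1 for tokens in docs.values() if t in tokens)    # document frequency
    print(t, "df =", df, "idf =", round(1 + math.log(N / df), 2))
# a appears in all 4 documents (df = 4), b in 1, c in 3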

  20. Automatic indexing 1. Parse individual words (tokens) 2. Remove stop words 3. Stem the words 4. Use frequency data • decide the head (high-frequency) threshold • decide the tail (low-frequency) threshold • decide the variance of counting

  21. 5. Create the indexing structure • inverted index • other structures
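
A compact Python sketch (not from the slides) chaining steps 1-5 into an inverted index; the stop list and the one-character plural rule are toy stand-ins for the components described earlier.

import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "to", "a", "in"}

def index_terms(text):
    tokens = re.split(r"[^a-z]+", text.lower())                  # 1. parse tokens
    tokens = [t for t in tokens if t and t not in STOP_WORDS]    # 2. remove stop words
    return [t[:-1] if t.endswith("s") else t for t in tokens]    # 3. crude stemming

def build_inverted_index(docs, min_df=1, max_df_ratio=0.9):
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    N = len(docs)
    # 4. frequency thresholds: drop terms that are too rare or appear everywhere
    # 5. keep the surviving postings as the inverted index
    return {t: sorted(ids) for t, ids in postings.items()
            if len(ids) >= min_df and len(ids) / N <= max_df_ratio}

docs = {"d1": "Users of the indexing system", "d2": "The system indexes documents"}
print(build_inverted_index(docs))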

  22. More about Counting • Zipf’s Law: • in a large, well-written English document, r * f = c, where r is a word’s frequency rank, f is the number of times that word is used in the document, and c is a constant.

  23. Zipf’s Law is an approximate, empirical observation. • Examples: • word frequencies in Alice in Wonderland • Zipf’s Law has been verified over many years on many different collections. • There are also many revised versions of Zipf’s Law.
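
A minimal Python sketch (not from the slides) for checking Zipf's law on any plain-text file: rank words by frequency and print r * f, which should come out roughly constant. The file name is only a placeholder.

import re
from collections import Counter

def zipf_check(path, top=10):
    text = open(path, encoding="utf-8").read().lower()
    counts = Counter(re.findall(r"[a-z]+", text))
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, "r*f =", rank * freq)

# zipf_check("alice_in_wonderland.txt")   # hypothetical file name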

  24. More about Counting • English Letter Usage Statistics • Letter use frequencies: • E: 72881 12.4% • T: 52397 8.9% • A: 47072 8.0% • O: 45116 7.6% • N: 41316 7.0% • I: 39710 6.7% • H: 38334 6.5%

  25. Doubled letter frequencies: • LL: 2979 20.6% • EE: 2146 14.8% • SS: 2128 14.7% • OO: 2064 14.3% • TT: 1169 8.1% • RR: 1068 7.4% • --: 701 4.8% • PP: 628 4.3% • FF: 430 2.9%

  26. Initial letter frequencies: • T: 20665 15.2% • A: 15564 11.4% • H: 11623 8.5% • W: 9597 7.0% • I: 9468 6.9% • S: 9376 6.9% • O: 8205 6.0% • M: 6293 4.6% • B: 5831 4.2%

  27. Ending letter frequencies: • E: 26439 19.4% • D: 17313 12.7% • S: 14737 10.8% • T: 13685 10.0% • N: 10525 7.7% • R: 9491 6.9% • Y: 7915 5.8% • O: 6226 4.5%

  28. Term Associations • Counting word pairs • If two words appear together very often, they are likely to be a phrase • Counting document pairs • if two documents have many common words, they are likely related
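
A minimal Python sketch (not from the slides) of word-pair counting: adjacent pairs that occur very often are candidate phrases.

from collections import Counter

def count_word_pairs(tokens):
    return Counter(zip(tokens, tokens[1:]))

tokens = "information retrieval systems use information retrieval models".split()
print(count_word_pairs(tokens).most_common(2))
# [(('information', 'retrieval'), 2), ...]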

  29. More Counting • Counting citation pairs • If documents A and B both cite documents C and D, then A and B might be related. • If documents C and D are often cited together, they are likely related.

  30. Co-Citation • The college has a more than 20-year tradition of co-citation research. • Co-citation is the mentioning of any two earlier documents in the bibliographic references of a later third document. • Diagram: a later Document 3 cites both Document 1 and Document 2.
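
A minimal Python sketch (not from the slides) counting co-citations: within each later document's reference list, every pair of cited documents is co-cited once.

from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    counts = Counter()
    for refs in reference_lists:
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

refs = [["doc1", "doc2", "doc5"], ["doc1", "doc2"], ["doc2", "doc5"]]
print(cocitation_counts(refs).most_common(1))   # [(('doc1', 'doc2'), 2)]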

  31. Co-Citation Analysis • The count of mentions may grow over time as new writings appear. Thus, co-citation counts can reflect citers’ changing perceptions of documents as more or less strongly related. • Documents shown to be related by their co-citation counts can be mapped as proximate in intellectual space.

  32. Co-Citation Mapping • Detects patterns in the frequency with which any works by any two authors are jointly cited in later works. • Only recurrent co-citation is significant: The more times authors are cited together, the more strongly related they are in the eyes of citers.

  33. AuthorLinks

  34. Midterms • Concepts • What is information retrieval? • Data, information, text, and documents • What is a controlled vocabulary? • Two abstraction principles • Considerations of document representation • Queries and query formats • What is a document vector space? • What are tf and idf?

  35. Procedure & problem solving • steps of creating automatic indexing • creating vector spaces • calculating similarity • calculating tf and idf • Boolean query matching • Vector query matching • Discussions • Advantages and disadvantages of …. • What can we do to improve automatic indexing? • Why do we do … …
