
Information Representation: Vector Space and Beyond



  1. Information Representation: Vector Space and Beyond Lecture 5

  2. Outline • Representation • Big Picture • Beyond Boolean • Leveraging Natural Word Distribution • Standard Word Weighting • Vector-Space (Foundation) • Refinement • Relevance Feedback • Probabilistic Approach • Stop word list • Thesaurus • Clustering • Access • Public Web Resource (PubMed) • Customized Interfaces

  3. The Big Picture (TBP) • Users need information to satisfy some need …

  4. TBP • Research so far has focused on scholarly information need (to be satisfied through textual information) …

  5. TBP • Assumptions: • Documents can be converted to computable structures • Users will transform (with or without machine assistance) their information needs to queries • Machines can accurately match documents to queries

  6. A more detailed picture: the IR Process • [Diagram: the user formulates an information need statement, which becomes a query/profile; documents are transformed into surrogates; the query is matched against the surrogates]

  7. IR Process Decomposition: Documents • Documents cannot be searched as they are • Too many and too large • The transformation of documents into surrogates is a critical step: REPRESENTATION • Representation is generally conducted a priori • Aim: identify “key” terms or term-patterns and associate these with documents • Aim: make searching more efficient; this may involve grouping, clustering, or classification operations

  8. IR Process Decomposition: Users • User needs have to be constrained • They are too ambiguous as stated • They must be made isomorphic to the document representations • The transformation of user needs depends on the QUERY MODEL used • Numerous approaches have been proposed and tested: • Exact vs. Weighted • Keyword match • Boolean • Vector-space • Probabilistic

  9. Representation - Boolean • Boolean search is also called exact-match search • It matches words in queries with words in the documents based on morphological comparisons (word forms) • It permits use of basic logical operators: AND, OR, NOT, etc.

  10. Boolean • Exact-match search is easy to use! • User’s terms can directly be used and the response can be easily interpreted • But, exact-match searches have their limitations too ...

  11. Limitations of Boolean • Terms may have synonyms or closely related terms that are morphologically different • For example, the term anxiety may be closely related to the term depression in the literature (database)

  12. Limitations of Boolean • If the user is unaware of relationships or chooses to ignore the relationships then the retrieved set may be incomplete! • In full-text searching a single occurrence may trigger retrieval -- degree of relevance not accounted for

  13. Beyond Boolean • Identifying the “central” concepts in documents is difficult and may be inconsistent across indexers • Recognizing relationships among words is not always easy

  14. Distribution of Words • Automated approaches were developed to: • index directly on the content of documents (not an interpretation of it) • take advantage of patterns of word usage to automatically identify “relevant” terms • The multiple terms thus extracted can capture relationships • The degree of relevance can also be established if weighted schemes are used

  15. Word Distribution • Luhn (an IBM research scientist) proposed that documents should be indexed based on the words in the documents • He based this assertion on the notion that word distribution among documents is not “random” but rather deterministic (can be predicted)

  16. Word Distribution • According to Luhn, certain extremely low-frequency and high-frequency terms can be ignored • Terms with medium frequency that actually appear in documents can be selected as the index terms for individual documents

  17. Standard Word Weighting • Salton seized on the idea proposed by Luhn and extended it • He developed the vector-space model for document representation • Simply, in this model documents and queries are represented as an array of values (a vector) • Then document and query vectors are matched for retrieval purposes

  18. Vector-Space Model • Let us assume we have n index terms in a database; then all the documents in the database would be represented as vectors of the following form:

      Terms:   T1 T2 T3 T4 T5 … Tn
      Vector: [w1 w2 w3 w4 w5 … wn]

      Above, w1 is the weight corresponding to the term T1, w2 is the weight corresponding to the term T2, and so on.

  19. Sample Documents • Two abstracts:

      TI: The structure of negative emotions in a clinical sample of children and adolescents. SO: Journal of Abnormal Psychology PY: Feb98, Vol. 107 Issue 1, p74 IS: 12p NT: 0021843X AU: Chorpita, Bruce F.; Albano, Anne Marie; et al. AB: Presents a study which focuses on the factors associated with childhood anxiety and depression with the use of a structural equations/confirmatory factor-analytic approach. Reference to a sample of 216 children and adolescents with diagnoses of an anxiety disorder or comorbid anxiety and mood disorders; Suggestion of results; Discussion on the implications for the assessment of childhood negative emotions. CO: 276712

      TI: Depression: A family affair. SO: Lancet PY: 01/17/98, Vol. 351 Issue 9097, p158 IS: 1p NT: 00995355 AU: Faraone, Stephen V.; Biederman, Joseph AB: Considers the studies of major depression and anxiety disorders. The findings with regard to depression being familial and having a genetic component to its complex etiology; Discusses the continuity between child and adult psychiatric disorders, psychiatric comorbidity and the underidentification and treatment of juvenile depression. CO: 116735

  20. Simple Binary Representation • If we index using the two terms anxiety and depression, the representation for the previous two documents would be:

      T1 T2 T3 T4 T5
      [0  0  0  1  1] = Document Vector

      Assuming: 1) T4 = Anxiety and T5 = Depression; 2) terms T1, T2, and T3 are not present in the documents; 3) binary representation
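A minimal Python sketch of this binary representation; the first three vocabulary terms and the whitespace tokenizer are illustrative assumptions (only anxiety and depression come from the example above):

```python
# Minimal sketch: building a binary term vector over a fixed vocabulary.
# "t1", "t2", "t3" are hypothetical stand-ins for terms T1..T3.
vocabulary = ["t1", "t2", "t3", "anxiety", "depression"]

def binary_vector(text, vocab):
    """Return a 0/1 vector: 1 if the vocabulary term occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocab]

doc = "comorbid anxiety and depression in children and adolescents"
print(binary_vector(doc, vocabulary))  # [0, 0, 0, 1, 1]
```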

  21. Matching • When a user issues a query to the system, the query is also converted to a vector before matching is performed • Example: If the user enters the term anxiety as the query term, then the vector for this query would be:

      T1 T2 T3 T4 T5
      [0  0  0  1  0] = Query Vector

  22. Matching • A simple matching technique called the inner product can be applied to compute similarity between a query vector and document vectors • Let’s assume we have another document in which the terms anxiety and depression do not appear. Then the vector for that document is:

      T1 T2 T3 T4 T5
      [1  0  0  0  0] = Another Document

  23. Matching • The similarity computation for the other document produces a result of 0, as follows:

      Similarity(query, document) = Q × D
        [0 0 0 1 0] = Query Vector
      × [1 0 0 0 0] = Another Document
      ---------------------------------
        0 + 0 + 0 + 0 + 0 = 0

  24. Matching • The inner product similarity computation for the original document vector produces a result of 1, as follows:

      Similarity(query, document) = Q × D
        [0 0 0 1 0] = Query Vector
      × [0 0 0 1 1] = Original Document
      ---------------------------------
        0 + 0 + 0 + 1 + 0 = 1
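The two computations above translate directly into code. A minimal sketch reproducing both results:

```python
# Inner product similarity between a query vector and document vectors,
# reproducing the two computations above.
def inner_product(q, d):
    """Sum of element-wise products of two equal-length term vectors."""
    return sum(qi * di for qi, di in zip(q, d))

query        = [0, 0, 0, 1, 0]  # "anxiety"
original_doc = [0, 0, 0, 1, 1]  # indexed with both anxiety and depression
other_doc    = [1, 0, 0, 0, 0]  # contains neither query term

print(inner_product(query, original_doc))  # 1 -> retrieved (similarity > 0)
print(inner_product(query, other_doc))     # 0 -> not retrieved
```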

  25. Matching • If the retrieval rule is that the similarity result must be > 0 for a document to be considered relevant, the other document would not be retrieved • Note that if the user had entered depression instead of anxiety, the same two documents would have been retrieved • The two documents were “automatically” indexed using both terms

  26. Improving Matching • Note that in the binary representation, a document containing a single occurrence of a significant term is treated the same as a document containing that term multiple times • To distinguish better between documents based on word frequencies, a different representation is needed

  27. Improved Matching • Salton in fact suggested a weighting scheme more precise than the binary representation • He proposed using term frequencies as the initial weights for individual terms • He then suggested that each weight be calibrated using the inverse document frequency of the term

  28. Improved Matching • The formula for this approach is:

      w_t = (term frequency of t) × (inverse document frequency of t)

      inverse document frequency of t = log(N / n_t)

      where N is the number of documents in the database and n_t is the number of documents containing the term t.
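The formula maps directly onto code. A minimal sketch with a tiny toy corpus; the corpus and the base-10 logarithm are assumptions made for the example (no log base is fixed above):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """w_t = tf_t * log(N / n_t), following the formula above."""
    tf = doc_tokens.count(term)                # term frequency in this document
    n_t = sum(1 for d in corpus if term in d)  # documents containing the term
    if n_t == 0:
        return 0.0
    return tf * math.log10(len(corpus) / n_t)  # tf x idf

corpus = [
    ["anxiety", "depression", "children"],
    ["depression", "genetics", "family"],
    ["protein", "sequence", "database"],
]
print(tf_idf("depression", corpus[0], corpus))  # in 2 of 3 docs -> low weight
print(tf_idf("children", corpus[0], corpus))    # in 1 of 3 docs -> higher weight
```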

  29. Improved Matching • The idea behind the inverse document frequency is that the true discrimination value for a term should be based on • The size of the document set (database size) • Overall distribution of the term in a database

  30. Improved Matching • If a term appears many times in many documents, then it is considered to have “low” discrimination value • Conversely, if a term appears multiple times in a few documents and relatively few times in the other documents, then the term is said to have “high” discrimination value

  31. Ranking • Using the same inner product similarity computation, different similarity values are produced for documents when term weights (frequencies) are considered • The output of the system can then be ranked using the similarity values • Demo Break: SIFTER

  32. Refinement • One simple refinement is to divide the inner product result by a value based on the lengths of the query and document vectors (cosine similarity) -- this is a standard normalization approach that accounts for variability in document and query sizes
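A minimal sketch of this normalization; the vectors reuse the earlier anxiety/depression example:

```python
import math

def cosine_similarity(q, d):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(qi * di for qi, di in zip(q, d))
    q_len = math.sqrt(sum(qi * qi for qi in q))
    d_len = math.sqrt(sum(di * di for di in d))
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot / (q_len * d_len)

query = [0, 0, 0, 1, 0]
doc   = [0, 0, 0, 1, 1]
print(cosine_similarity(query, doc))  # ~0.707, now independent of vector length
```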

  33. Refinement - Relevance Feedback • After one iteration of retrieval cycle, when a document set is retrieved based on a given query additional information can be provided to the system as “feedback” • One common approach used is to ASK the USER to identify from the retrieved set documents that the USER considers as relevant

  34. Relevance Feedback • Based on the feedback the user provides, the query can be modified • New terms appearing in documents considered relevant are selected to be added to the query
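One classic way to fold relevant-document terms back into the query is the Rocchio method; a minimal sketch under that assumption, with arbitrary alpha/beta values:

```python
# Hedged sketch of Rocchio-style query modification: terms that appear in
# user-marked relevant documents receive positive weight in the new query.
def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """q' = alpha * q + beta * centroid(relevant document vectors)."""
    n = len(relevant_docs)
    centroid = [sum(d[i] for d in relevant_docs) / n for i in range(len(query))]
    return [alpha * qi + beta * ci for qi, ci in zip(query, centroid)]

query    = [0, 0, 0, 1, 0]                     # original query: "anxiety"
relevant = [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1]]  # user-marked relevant documents
print(rocchio(query, relevant))  # depression (T5) now carries positive weight
```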

  35. Relevance Feedback - Probabilistic Retrieval • Another approach involves re-weighting the terms in the query • The probabilistic retrieval formula takes into consideration not just the distribution of words in the overall collection, but the distribution of words in the relevant documents as well
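One well-known probabilistic term weight built on exactly these two distributions is the Robertson-Sparck Jones relevance weight; a hedged sketch with illustrative numbers:

```python
import math

def rsj_weight(N, n, R, r):
    """Robertson-Sparck Jones relevance weight.
    N: documents in the collection, n: documents containing the term,
    R: known relevant documents,   r: relevant documents containing the term."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term found in 8 of 10 relevant documents but only 50 of 10,000 overall
# receives a large positive weight (numbers are illustrative).
print(rsj_weight(N=10_000, n=50, R=10, r=8))  # ~6.7
```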

  36. Further Refinement - Search • Using a stop word list, certain words can be removed from queries or documents before vector generation • A hybrid approach toward indexing -- treating both the original content and human-assigned index terms as part of the document -- may improve the vector representation
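A minimal sketch of stop word removal before vector generation; the stop list here is a tiny illustrative sample, not a standard list:

```python
# Drop stop words from a token stream prior to building term vectors.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the structure of negative emotions in a clinical sample".split()))
# ['structure', 'negative', 'emotions', 'clinical', 'sample']
```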

  37. Further Refinement • It is possible to use a controlled vocabulary source such as a thesaurus to provide a “domain bound” on a document source and then use term-weighting to “customize” the document representations for the source • The controlled vocabulary list can also be extremely useful to the user as a search aid

  38. Further Refinement - Thesaurus • It is actually possible to automatically enhance/supplement the thesaurus itself in an on-going fashion

  39. Further Refinement - Thesaurus • Using a randomly sampled subset of the document source • TF-IDF can be used to select a subset of terms; then a document-document similarity measure can cluster the documents • Documents with related terms will cluster together -- each cluster may represent an area that may or may not be covered in the thesaurus (see the sketch below)
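No particular clustering algorithm is prescribed above; a simple threshold-based, single-link-style grouping over pairwise cosine similarity is one way to sketch the idea. The threshold and the toy vectors are assumptions:

```python
import math

def cosine(a, b):
    """Document-document cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def threshold_clusters(vectors, threshold=0.5):
    """Put each document in the first cluster holding a similar-enough member."""
    clusters = []
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if any(cosine(v, vectors[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters

docs = [[1.0, 1.0, 0.0], [1.0, 0.8, 0.0], [0.0, 0.0, 1.0]]  # toy TF-IDF vectors
print(threshold_clusters(docs))  # [[0, 1], [2]]
```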

  40. Diversity of resources - Problems on the Other Side • Many sources exist • Certain sources are extremely large • Search language and manipulation can be complex • Sources are dynamic

  41. Diversity of resources • To reduce the problems associated with source explosion, meta-search engines have been developed • It is possible to search multiple sources simultaneously using a meta-search engine

  42. Source Complexity • However, certain authoritative sources, such as PubMed or the Protein Sequence database (Human Genome Project), are large and offer many options

  43. Access - Metasearch • The Entrez project at the National Center for Biotechnology Information has taken a systematic approach toward developing a search engine supporting search customization and the development of innovative interfaces • Entrez site: http://www.ncbi.nlm.nih.gov/Entrez/

  44. Frontiers • Research is needed to accommodate diverse needs and user groups (e.g., Music IR is an emerging area) • More fundamental work is needed on representation (user needs as well as documents)

  45. Frontiers • The most popular (and most successful) approaches have been statistical rather than NLP-based • But statistical approaches are showing their age • There is evidence that NLP techniques can help improve performance … but with some caveats

  46. Frontiers • IR usually deals with large (gigabyte or million-document scale) collections that are heterogeneous • NLP is hard to apply to content from different domains • NLP resources (rather than techniques per se) appear to be more helpful in IR, e.g., dictionaries, lexicons, thesauri, ontologies, taxonomies, etc. • Approaches are needed to combine NLP with statistical techniques • Research is needed to deal with the dynamic nature of use and domain evolution

  47. IR resources • Journals • ACM Transactions on Information Systems • Journal of the American Society for Information Science & Technology • Information Processing & Management • Journal of Information Retrieval • Magazines • www.dlib.org

  48. IR Sources on the Web • ACM SIGIR • D-LIB • Digital Libraries Initiatives • Information Filtering
