
Conventional Text-Retrieval Systems

Conventional Text-Retrieval Systems. Hsin-Hsi Chen. Database Management. A specified set of attributes is used to characterize each item. EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)



Presentation Transcript


  1. Conventional Text-Retrieval Systems Hsin-Hsi Chen

  2. Database Management • A specified set of attributes is used to characterize each item. • EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO) • Exact match between the attributes used in query formulations and those attached to the record. • SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'

  3. Text-Retrieval Systems • Content identifiers (keywords, index terms, descriptors) characterize the stored texts. • Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries (produced by query formulation) and to documents (produced by content analysis).

  4. Possible Representation • document representation • unweighted index terms (term vectors) • weighted index terms • … • query • unweighted or weighted index terms • Boolean combinations (or, and, not) • … • search operation must be effective

  5. File Structures • Main requirements • fast-access for various kinds of searches • large number of indices • Alternatives • Inverted Files • Signature Files • PAT trees

  6. Inverted Files • File is represented as an array of indexed records.

  7. Inverted-file process • The record-term array is inverted (transposed).
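The inversion step can be sketched in a few lines of Python. The record-term array below is made up for illustration (the transcript does not preserve the slide's own table), and the names `record_term`, `invert`, and `postings` are hypothetical.

```python
# Hypothetical record-term array: one row per record, one column per term.
# A 1 means the term is assigned to the record. The values are illustrative.
records = ["R1", "R2", "R3", "R4"]
terms = ["T1", "T2", "T3", "T4"]
record_term = [
    [1, 1, 0, 0],  # R1
    [0, 1, 1, 0],  # R2
    [1, 0, 1, 1],  # R3
    [0, 0, 1, 1],  # R4
]


def invert(matrix):
    """Transpose the array so that each row describes one term."""
    return [list(column) for column in zip(*matrix)]


term_record = invert(record_term)


def postings(term):
    """List the records whose bit is set in the given term's row."""
    row = term_record[terms.index(term)]
    return [rec for rec, bit in zip(records, row) if bit]


print(postings("T3"))
```

After inversion, a lookup for one term reads a single row instead of scanning every record.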

  8. Inverted-file process (Continued) • Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers. • Query (term2 and term3):

      term2   1 1 0 0
      term3   0 1 1 1
      ---------------
                1       <-- R2
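As a rough sketch, the combination for an "and" query is a position-by-position AND over the selected rows. The code assumes the digits on the slide are two four-bit rows, 1 1 0 0 for term2 and 0 1 1 1 for term3, with positions numbered R1 to R4; the function names are hypothetical.

```python
def and_rows(*rows):
    """Combine term rows position by position; a 1 survives only if every row has it."""
    return [int(all(bits)) for bits in zip(*rows)]


def record_ids(row):
    """Turn a combined bit row into record identifiers R1, R2, ..."""
    return [f"R{i}" for i, bit in enumerate(row, start=1) if bit]


term2 = [1, 1, 0, 0]  # assumed reading of the slide's first row
term3 = [0, 1, 1, 1]  # assumed reading of the slide's second row

combined = and_rows(term2, term3)
print(record_ids(combined))
```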

  9. List-merging for two ordered lists • The inverted-index operations to obtain answers are based on a list-merging process. • Example • T1: {R1, R3} • T2: {R1, R2} • Merged(T1, T2): {R1, R1, R2, R3}

  10. List-merging Algorithm • Given two input lists of record identifiers in increasing record-number order: if both lists are empty, then stop; else if one of the input lists is empty, then transfer onto the output list all items from the other list in order and stop; else take the next item Ri from list 1 and the next item Rj from list 2

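  11. List-merging Algorithm (Continued) • if i < j, then transfer Ri onto the merged output list and read the next item from list 1 before repeating the process; else transfer Rj onto the merged output list and read the next item from list 2 before repeating the process. • When i = j, Rj is transferred first and Ri follows on a later step, which is why an identifier present in both lists appears twice in the merged list.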
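The merging procedure can be written as a short Python function. This is a minimal sketch that stores record identifiers as integers (R1 becomes 1, and so on); the name `merge_lists` is hypothetical.

```python
def merge_lists(list1, list2):
    """Merge two lists of record numbers, both in increasing order.

    Duplicates are kept: a record present in both inputs appears twice.
    """
    merged = []
    i = j = 0
    # Both lists still have items: compare the current heads.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one list is non-empty now: transfer its remaining items in order.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


print(merge_lists([1, 3], [1, 2]))
```

Keeping the duplicates is deliberate, since the Boolean operators are read off from how often each identifier occurs in the merged list.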

  12. ((T1 or T2) and not T3) • T1: {R1, R3} • T2: {R1, R2} • T3: {R2, R3, R4} • Merged(T1, T2): {R1, R1, R2, R3} • Output for (T1 or T2): {R1, R2, R3} (each identifier kept once) • Merged(T1 or T2, T3): {R1, R2, R2, R3, R3, R4} • Output for ((T1 or T2) and T3): {R2, R3} (identifiers that occur twice) • Merged((T1 or T2), ((T1 or T2) and T3)): {R1, R2, R2, R3, R3} • Output for ((T1 or T2) and not T3): {R1} (identifiers that occur only once)
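The worked example can be reproduced with a small sketch in which each Boolean operator is read off from the merged list by counting occurrences. It assumes each input list is sorted and free of duplicates, uses integers for record identifiers, and relies on the standard-library `heapq.merge` for the merge step; the function names are hypothetical.

```python
from heapq import merge
from itertools import groupby


def merged(a, b):
    """Merge two sorted lists, keeping duplicates."""
    return list(merge(a, b))


def or_(a, b):
    """OR: every identifier in the merged list, kept once."""
    return [rec for rec, _ in groupby(merged(a, b))]


def and_(a, b):
    """AND: identifiers that occur twice in the merged list."""
    return [rec for rec, run in groupby(merged(a, b)) if len(list(run)) == 2]


def and_not(a, b):
    """A AND NOT B: merge A with (A AND B), keep identifiers that occur once."""
    return [rec for rec, run in groupby(merged(a, and_(a, b))) if len(list(run)) == 1]


t1, t2, t3 = [1, 3], [1, 2], [2, 3, 4]
print(and_not(or_(t1, t2), t3))
```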

  13. Extensions of Inverted Index Operations (Distance Constraints) • Distance Constraints • (A within sentence B): terms A and B must co-occur in a common sentence • (A adjacent B): terms A and B must occur adjacently in the text

  14. Extensions of Inverted Index Operations (Distance Constraints) • Implementation • include term-location in the inverted indexes • information: {R345, R348, R350, …} • retrieval: {R123, R128, R345, …} • include sentence-location in the indexes • information: {R345, 25; R345, 37; R348, 10; R350, 8; …} • retrieval: {R123, 5; R128, 25; R345, 37; R345, 40; …}

  15. Extensions of Inverted Index Operations (Distance Constraints) • include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences • information: {R345, 2, 3, 5; …} • retrieval: {R345, 2, 3, 6; …} • query examples • (information adjacent retrieval) • (information within five words retrieval) • cost: the size of the indexes
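A minimal sketch of how such positional entries could answer the two example queries follows. It holds only the two postings given on the slide (the elided entries are left out), and it makes two assumptions of its own: "adjacent" means the second term directly follows the first, and "within n words" is checked inside a single sentence because word numbers restart per sentence. All names are hypothetical.

```python
# Postings: term -> list of (record, paragraph, sentence, word) entries.
index = {
    "information": [("R345", 2, 3, 5)],
    "retrieval": [("R345", 2, 3, 6)],
}


def same_sentence_pairs(index, a, b):
    """Yield (record, word_a, word_b) for occurrences of a and b in one sentence."""
    for rec_a, par_a, sen_a, word_a in index.get(a, []):
        for rec_b, par_b, sen_b, word_b in index.get(b, []):
            if (rec_a, par_a, sen_a) == (rec_b, par_b, sen_b):
                yield rec_a, word_a, word_b


def adjacent(index, a, b):
    """Records where b is the word immediately after a (assumed reading)."""
    return sorted({rec for rec, wa, wb in same_sentence_pairs(index, a, b) if wb - wa == 1})


def within_words(index, a, b, n):
    """Records where a and b are at most n word positions apart in a sentence."""
    return sorted({rec for rec, wa, wb in same_sentence_pairs(index, a, b) if abs(wb - wa) <= n})


print(adjacent(index, "information", "retrieval"))
print(within_words(index, "information", "retrieval", 5))
```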

  16. Extensions of Inverted Index Operations (Term Weights) • Term Weights • Ri = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6} • Issues • how to generate the term weights • how to apply the term weights • Sum the weights of all document terms that match the given query. • Rank the output documents in descending order of this summed weight.
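The sum-and-rank rule might look like this in Python. The two sample documents borrow the weights of D1 and D2 from the Boolean-query example later in the transcript; the query and the function name `rank_by_summed_weight` are assumptions made for illustration.

```python
def rank_by_summed_weight(documents, query_terms):
    """Score each document by the summed weight of its terms that match the query."""
    scores = {}
    for doc_id, weights in documents.items():
        score = sum(w for term, w in weights.items() if term in query_terms)
        if score > 0:  # documents with no matching term are not retrieved
            scores[doc_id] = score
    # Highest summed weight first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


documents = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}

ranking = rank_by_summed_weight(documents, {"T1", "T2"})
print([doc_id for doc_id, _ in ranking])
```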

  17. Boolean Query • Transform a Boolean expression into disjunctive normal form. T1 and (T2 or T3) = (T1 and T2) or (T1 and T3) • For each conjunct, compute the minimum term weight of any document term in that conjunct. • The document weight is the maximum of all the conjunct weights.

  18. Boolean Query • Example: (T1 and T2) or T3

      Document vector                      Conjunct weights        Query weight
                                           (T1 and T2)   (T3)      (T1 and T2) or T3
      D1 = (T1, 0.2; T2, 0.5; T3, 0.6)     0.2           0.6       0.6
      D2 = (T1, 0.7; T2, 0.2; T3, 0.1)     0.2           0.1       0.2

      D1 is preferred.
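The min/max rule is short enough to check by code. In this sketch the query is already in disjunctive normal form, written as a list of conjuncts; treating a term that is absent from a document as weight 0.0 is an assumption, and the name `query_weight` is hypothetical.

```python
def query_weight(doc, dnf):
    """Weight of a document for a query in disjunctive normal form.

    Each conjunct scores the minimum of its term weights;
    the document scores the maximum over all conjuncts.
    """
    return max(min(doc.get(term, 0.0) for term in conjunct) for conjunct in dnf)


# (T1 and T2) or T3
dnf = [("T1", "T2"), ("T3",)]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(query_weight(d1, dnf), query_weight(d2, dnf))
```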

  19. Synonym Specification • Original Query • (T1 and T2) or T3 • Assume S1 is a synonym of T1. • Assume S3 is a synonym of T3. • Broader Query • ((T1 or S1) and T2) or (T3 or S3) • The number of relevant items retrieved may be larger.

  20. Term Truncation • Term Truncation • Remove suffixes and/or prefixes from context terms. • Example • PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …

  21. Term Truncation • Implementation • Only suffix truncation: the conventional inverted-index methodology can be maintained unchanged. • Only prefix truncation: the term entries in the inverted index are inversely alphabetized, that is, sorted on the reversed spelling of each term. • antisymmetry --> yrtemmysitna
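Both cases reduce to a prefix search over a sorted term list, which the following sketch illustrates with the standard-library `bisect` module. The five-word vocabulary and all function names are hypothetical; only the reversal of "antisymmetry" comes from the slide.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return the contiguous run of entries in a sorted list that start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


# Hypothetical vocabulary, for illustration only.
vocabulary = ["antisymmetry", "psychiatry", "psychology", "symmetry", "system"]
forward = sorted(vocabulary)                     # ordinary alphabetical index
backward = sorted(t[::-1] for t in vocabulary)   # inversely alphabetized index


def suffix_truncation(stem):
    """Query such as PSYCH*: search the ordinary index by its leading characters."""
    return prefix_range(forward, stem)


def prefix_truncation(ending):
    """Query such as *SYMMETRY: search the reversed index with the reversed ending."""
    hits = prefix_range(backward, ending[::-1])
    return sorted(t[::-1] for t in hits)


print(suffix_truncation("psych"))
print(prefix_truncation("symmetry"))
```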

  22. Term Truncation • Both prefix and suffix truncation • *SYMM*: antisymmetric, asymmetry • inverted-index entries that are alphabetized both forward and backward • infix truncation • wom*n: woman, women • inverted index with entries for all possible “rotated” word forms

  23. Term Truncation • Each term entry X = x1, x2, …, xn with individual characters xi is augmented by adding a special terminal character /. • ABC --> ABC/, BABC --> BABC/, BCAB --> BCAB/ • Each augmented term x1, x2, …, xn/ is rotated cyclically by wrapping the term around itself n+1 times. • ABC/ --> /ABC, C/AB, BC/A, ABC/

  24. Term Truncation • Each resulting word form is then augmented by appending a blank character ^. • The resulting file of word forms is sorted alphabetically, using the collating order ^, /, a, b, c, …, Z (low to high).

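  25. Rotated word forms for the terms ABC, BABC, and BCAB (the sorted file lists all fourteen forms in collating order, independent of the rows beside it):

      Term   Augmented   Rotated forms   Sorted file
      ABC    ABC/        /ABC^           /ABC^
                         C/AB^           /BABC^
                         BC/A^           /BCAB^
                         ABC/^           AB/BC^
      BABC   BABC/       /BABC^          ABC/^
                         C/BAB^          ABC/B^
                         BC/BA^          B/BCA^
                         ABC/B^          BABC/^
                         BABC/^          BC/A^
      BCAB   BCAB/       /BCAB^          BC/BA^
                         B/BCA^          BCAB/^
                         AB/BC^          C/AB^
                         CAB/B^          C/BAB^
                         BCAB/^          CAB/B^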
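The construction of the sorted file can be sketched as follows. The collation function places ^ below / and both below every letter, as the slides specify; ordering the letters themselves by character code is an assumption, and the function names are hypothetical.

```python
def collation_key(form):
    """Sort key with ^ lowest, then /, then letters (by character code, an assumption)."""
    rank = {"^": 0, "/": 1}
    return [rank.get(ch, 2 + ord(ch)) for ch in form]


def rotated_forms(term):
    """Append /, produce all n+1 cyclic rotations, and end each with the blank ^."""
    augmented = term + "/"
    return [augmented[i:] + augmented[:i] + "^" for i in range(len(augmented))]


def build_rotated_file(terms):
    """Collect the rotated forms of every term and sort them in collating order."""
    forms = [form for term in terms for form in rotated_forms(term)]
    return sorted(forms, key=collation_key)


for form in build_rotated_file(["ABC", "BABC", "BCAB"]):
    print(form)
```

A term of n characters contributes n+1 entries, which is the index-size cost of supporting every truncation pattern.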

  26. Retrieval Strategies • Query term X: look for index entries /X^ or X/^. • Query term X*: look for /X*. • Query term *X: look for X/^, X/Y1, …, X/Yn. Original patterns: X, Y1X, …, YnX • Query term *X*: look for XY1/Z1, …, XYn/Zn. Original patterns: Z1XY1, …, ZnXYn

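  27. Lookup for the query *B*: every sorted-file entry that begins with B matches, and the last column gives the original term recovered from it.

      Term   Augmented   Rotated forms   Sorted file   *B*
      ABC    ABC/        /ABC^           /ABC^
                         C/AB^           /BABC^
                         BC/A^           /BCAB^
                         ABC/^           AB/BC^
      BABC   BABC/       /BABC^          ABC/^
                         C/BAB^          ABC/B^
                         BC/BA^          B/BCA^        BCAB
                         ABC/B^          BABC/^        BABC
                         BABC/^          BC/A^         ABC
      BCAB   BCAB/       /BCAB^          BC/BA^        BABC
                         B/BCA^          BCAB/^        BCAB
                         AB/BC^          C/AB^
                         CAB/B^          C/BAB^
                         BCAB/^          CAB/B^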

  28. Retrieval Strategies • Query term X*Y: look for Y/XZ1, …, Y/XZm. Original patterns: XZ1Y, …, XZmY • Cost: the number of index entries increases.
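All five query forms (X, X*, *X, *X*, X*Y) come down to one prefix search in the sorted file of rotated forms. The sketch below shows one way to do that; it assumes terms contain neither / nor ^, orders letters by character code, and uses hypothetical function names throughout. The keys are recomputed on every lookup for brevity, whereas a real index would store them once.

```python
from bisect import bisect_left


def collation_key(form):
    """Sort key with ^ lowest, then /, then letters by character code (assumed)."""
    rank = {"^": 0, "/": 1}
    return [rank.get(ch, 2 + ord(ch)) for ch in form]


def build_rotated_file(terms):
    """All rotations of term + '/', each ending in ^, sorted in collating order."""
    forms = []
    for term in terms:
        augmented = term + "/"
        for i in range(len(augmented)):
            forms.append(augmented[i:] + augmented[:i] + "^")
    return sorted(forms, key=collation_key)


def rotated_prefix(query):
    """Translate a query with at most one infix * into the prefix to search for."""
    if "*" not in query:
        return "/" + query + "^"              # X    -> /X^
    if query.startswith("*") and query.endswith("*"):
        return query.strip("*")               # *X*  -> X
    if query.endswith("*"):
        return "/" + query[:-1]               # X*   -> /X
    if query.startswith("*"):
        return query[1:] + "/"                # *X   -> X/
    head, tail = query.split("*")
    return tail + "/" + head                  # X*Y  -> Y/X


def lookup(rotated_file, query):
    """Return the original term behind every rotated form that matches the query."""
    prefix = rotated_prefix(query)
    keys = [collation_key(form) for form in rotated_file]
    start = bisect_left(keys, collation_key(prefix))
    matches = []
    for form in rotated_file[start:]:
        if not form.startswith(prefix):
            break
        before, after = form[:-1].split("/")  # drop ^, then undo the rotation
        matches.append(after + before)
    return matches


abc_file = build_rotated_file(["ABC", "BABC", "BCAB"])
wom_file = build_rotated_file(["woman", "women"])
print(lookup(abc_file, "*B*"))
print(lookup(wom_file, "wom*n"))
```

A term can match *X* more than once (BABC contains B twice), so a caller that wants distinct terms would deduplicate the result.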
