Conventional Text-Retrieval Systems

Conventional Text-Retrieval Systems Hsin-Hsi Chen

Database Management • A specified set of attributes is used to characterize each item: EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO) • Retrieval requires an exact match between the attributes used in query formulations and those attached to the records: SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'

Text-Retrieval Systems • Content identifiers (keywords, index terms, descriptors) characterize the stored texts. • Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries (through query formulation) and to documents (through content analysis).

Possible Representation • document representation • unweighted index terms (term vectors) • weighted index terms • … • query • unweighted or weighted index terms • Boolean combinations (or, and, not) • … • search operation must be effective

File Structures • Main requirements • fast-access for various kinds of searches • large number of indices • Alternatives • Inverted Files • Signature Files • PAT trees

Inverted Files • File is represented as an array of indexed records.

Inverted-file process • The record-term array is inverted (transposed).

Inverted-file process (Continued) • Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers. • Query (term2 and term3)
term2:   1 1 0 0
term3:   0 1 1 1
------------------
result:  0 1 0 0   <-- R2
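The inversion and the row combination can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: the term2 and term3 columns follow the query example, while the term1 column and the names `invert` and `and_rows` are assumptions made for the sketch.

```python
# Record-term array: one row per record, one column per term (1 = term present).
# The term2 and term3 columns follow the slide; the term1 column is made up.
TERMS = ["term1", "term2", "term3"]
RECORD_TERM = {
    "R1": [1, 1, 0],
    "R2": [0, 1, 1],
    "R3": [1, 0, 1],
    "R4": [0, 0, 1],
}


def invert(record_term, terms):
    """Transpose the record-term array into a term-record array."""
    records = sorted(record_term)
    inverted = {
        term: [record_term[rec][col] for rec in records]
        for col, term in enumerate(terms)
    }
    return records, inverted


def and_rows(records, inverted, *query_terms):
    """Combine the rows of the query terms with a bitwise AND."""
    rows = [inverted[term] for term in query_terms]
    combined = [int(all(bits)) for bits in zip(*rows)]
    return [rec for rec, bit in zip(records, combined) if bit]


records, inverted = invert(RECORD_TERM, TERMS)
print(inverted["term2"], inverted["term3"])
print(and_rows(records, inverted, "term2", "term3"))
```

Storing each term's row as a bit vector keeps the AND cheap, at the price of one bit per record for every term.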

List-merging for two ordered lists • The inverted-index operations that produce answers are based on a list-merging process. • Example
T1: {R1, R3}
T2: {R1, R2}
Merged(T1, T2): {R1, R1, R2, R3}

List-merging Algorithm • Given two input lists of record identifiers in increasing record-number order:
if both lists are empty then stop;
else if one of the input lists is empty
    then transfer onto the output list all items from the other list in order and stop;
else take the next item Ri from list 1 and the next item Rj from list 2

    if i < j then transfer Ri onto the merged output list and read the next item from list 1 before repeating the process;
    else transfer Rj onto the merged output list and read the next item from list 2 before repeating the process.
(When i = j, both copies reach the output list in successive steps, so duplicates are kept.)
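The merging steps can be written as a short Python function. This is a sketch under the assumption that record identifiers are plain integers (R1 becomes 1, and so on); the name `merge_lists` is chosen here for illustration.

```python
def merge_lists(list1, list2):
    """Merge two lists of record numbers given in increasing order.

    Duplicates are kept, so a record present in both inputs appears twice.
    """
    merged = []
    i = j = 0
    while i < len(list1) or j < len(list2):
        if i == len(list1):
            # List 1 is exhausted: copy the rest of list 2 and stop.
            merged.extend(list2[j:])
            break
        if j == len(list2):
            # List 2 is exhausted: copy the rest of list 1 and stop.
            merged.extend(list1[i:])
            break
        if list1[i] < list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    return merged


print(merge_lists([1, 3], [1, 2]))  # T1 = {R1, R3}, T2 = {R1, R2}
```

Each step advances one of the two lists, so the merge takes time proportional to the combined length of the inputs.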

((T1 or T2) and not T3)
T1: {R1, R3}
T2: {R1, R2}
T3: {R2, R3, R4}
Merged(T1, T2): {R1, R1, R2, R3}
Output for (T1 or T2): {R1, R2, R3} (duplicates removed)
Merged(T1 or T2, T3): {R1, R2, R2, R3, R3, R4}
Output for ((T1 or T2) and T3): {R2, R3} (only items that appear twice)
Merged((T1 or T2), ((T1 or T2) and T3)): {R1, R2, R2, R3, R3}
Output for ((T1 or T2) and not T3): {R1} (only items that appear once)
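The three Boolean operators can all be read off a merged list by counting how often each record occurs. The sketch below assumes integer record numbers and duplicate-free, sorted input lists; it uses the standard-library `heapq.merge` for the merge step, and the helper names are illustrative only.

```python
from heapq import merge
from itertools import groupby


def occurrences(list_a, list_b):
    """Merge two sorted lists and count how often each record occurs."""
    return [(rec, len(list(group))) for rec, group in groupby(merge(list_a, list_b))]


def op_or(list_a, list_b):
    # OR: every record of the merged list, with duplicates removed.
    return [rec for rec, _ in occurrences(list_a, list_b)]


def op_and(list_a, list_b):
    # AND: only records that occur twice in the merged list.
    return [rec for rec, count in occurrences(list_a, list_b) if count == 2]


def op_and_not(list_a, list_b):
    # A AND NOT B: merge A with (A AND B) and keep records that occur once.
    return [rec for rec, count in occurrences(list_a, op_and(list_a, list_b)) if count == 1]


t1, t2, t3 = [1, 3], [1, 2], [2, 3, 4]
t1_or_t2 = op_or(t1, t2)
print(t1_or_t2, op_and(t1_or_t2, t3), op_and_not(t1_or_t2, t3))
```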

Extensions of Inverted Index Operations (Distance Constraints) • Distance Constraints • (A within sentence B): terms A and B must co-occur in a common sentence • (A adjacent B): terms A and B must occur adjacently in the text

Extensions of Inverted Index Operations (Distance Constraints) • Implementation • include term-location in the inverted indexes
information: {R345, R348, R350, …}
retrieval: {R123, R128, R345, …}
• include sentence-location in the indexes
information: {R345, 25; R345, 37; R348, 10; R350, 8; …}
retrieval: {R123, 5; R128, 25; R345, 37; R345, 40; …}

Extensions of Inverted Index Operations (Distance Constraints) • include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
information: {R345, 2, 3, 5; …}
retrieval: {R345, 2, 3, 6; …}
• query examples
(information adjacent retrieval)
(information within five words retrieval)
• cost: the size of indexes
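A small Python sketch of how such location-bearing postings might be checked. The R345 entries come from the example above; the R123 entry, the function names, and the reading of "within k words" as "same sentence, word numbers at most k apart, in either order" are assumptions made for illustration.

```python
# Postings: term -> list of (record, paragraph, sentence, word) locations.
# The R345 entries follow the slide; the R123 entry is made up for contrast.
INDEX = {
    "information": [("R345", 2, 3, 5)],
    "retrieval": [("R123", 1, 4, 2), ("R345", 2, 3, 6)],
}


def within_sentence(index, term_a, term_b):
    """Records where both terms share record, paragraph and sentence numbers."""
    hits = {
        loc_a[0]
        for loc_a in index.get(term_a, [])
        for loc_b in index.get(term_b, [])
        if loc_a[:3] == loc_b[:3]
    }
    return sorted(hits)


def within_words(index, term_a, term_b, k):
    """Records where both terms occur in one sentence, at most k words apart."""
    hits = {
        loc_a[0]
        for loc_a in index.get(term_a, [])
        for loc_b in index.get(term_b, [])
        if loc_a[:3] == loc_b[:3] and 0 < abs(loc_a[3] - loc_b[3]) <= k
    }
    return sorted(hits)


def adjacent(index, term_a, term_b):
    """Adjacency treated as a word distance of exactly one (either order)."""
    return within_words(index, term_a, term_b, 1)


print(adjacent(INDEX, "information", "retrieval"))
print(within_words(INDEX, "information", "retrieval", 5))
```

The nested loops compare every pair of postings for clarity; because postings are stored in order, a production system could instead walk both lists in a single merge pass.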

Extensions of Inverted Index Operations (Term Weights) • Term Weights: Ri = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6} • Issues • how to generate the term weights • how to apply the term weights • Sum the weights of all document terms that match the given query. • Rank the output documents in descending order of the summed weight.
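The sum-and-rank step might look as follows in Python. This is a sketch only: record R1 reuses the weights shown above with simplified term names, while record R2, the query, and the function name `rank_by_weight_sum` are hypothetical.

```python
def rank_by_weight_sum(records, query_terms):
    """Score each record by the summed weights of its terms that match the query."""
    scores = {}
    for rec_id, weights in records.items():
        score = sum(w for term, w in weights.items() if term in query_terms)
        if score > 0:
            scores[rec_id] = score
    # Highest summed weight first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


RECORDS = {
    "R1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},  # weights as on the slide
    "R2": {"T1": 0.7, "T4": 0.3},             # hypothetical second record
}

for rec_id, score in rank_by_weight_sum(RECORDS, {"T1", "T3"}):
    print(rec_id, round(score, 2))
```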

Boolean Query • Transform a Boolean expression into disjunctive normal form. T1 and (T2 or T3) = (T1 and T2) or (T1 and T3) • For each conjunct, compute the minimum term weight of any document term in that conjunct. • The document weight is the maximum of all the conjunct weights.

Boolean Query • Example: (T1 and T2) or T3
Document vectors              Conjunct weights         Query weight
                              (T1 and T2)    (T3)      (T1 and T2) or T3
D1=(T1,0.2;T2,0.5;T3,0.6)     0.2            0.6       0.6
D2=(T1,0.7;T2,0.2;T3,0.1)     0.2            0.1       0.2
D1 is preferred.
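The min/max rule can be checked with a few lines of Python. The query is assumed to be already in disjunctive normal form, written as a list of conjuncts; the function name `boolean_weight` and the choice of weight 0 for absent terms are assumptions of this sketch.

```python
def boolean_weight(doc, dnf_query):
    """Weight of a document for a query in disjunctive normal form.

    Each conjunct scores the minimum weight of its terms (0 if a term is
    absent); the document weight is the maximum over all conjuncts.
    """
    return max(
        min(doc.get(term, 0.0) for term in conjunct)
        for conjunct in dnf_query
    )


QUERY = [("T1", "T2"), ("T3",)]  # (T1 and T2) or T3
D1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
D2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(boolean_weight(D1, QUERY), boolean_weight(D2, QUERY))
```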

Synonym Specification • Original Query: (T1 and T2) or T3. Assume S1 is a synonym of T1 and S3 is a synonym of T3. • Broader Query: ((T1 or S1) and T2) or (T3 or S3) • The number of relevant items retrieved may be larger.

Term Truncation • Term Truncation • Remove suffixes and/or prefixes from context terms. • Example: PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …

Term Truncation • Implementation • Only suffix truncation: the conventional inverted-index methodology can be maintained unchanged. • Only prefix truncation: the term entries in the inverted index are alphabetized by their reversed spelling, e.g. antisymmetry --> yrtemmysitna
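Both cases reduce to a prefix search in a sorted term list, which the following Python sketch illustrates with the standard `bisect` module. The term list and function names are invented for the example; a real inverted index would attach a postings list to each term.

```python
from bisect import bisect_left

TERMS = sorted([
    "antisymmetry", "asymmetry", "psychiatrist",
    "psychiatry", "psychology", "symmetry",
])
# For prefix truncation (*X), keep a second list sorted on reversed spellings.
REVERSED_TERMS = sorted(term[::-1] for term in TERMS)


def starts_with(sorted_terms, prefix):
    """All entries of a sorted list that begin with the given prefix."""
    matches = []
    pos = bisect_left(sorted_terms, prefix)
    while pos < len(sorted_terms) and sorted_terms[pos].startswith(prefix):
        matches.append(sorted_terms[pos])
        pos += 1
    return matches


def suffix_truncation(stem):
    """Query STEM*: an ordinary prefix search in the forward index."""
    return starts_with(TERMS, stem)


def prefix_truncation(ending):
    """Query *ENDING: prefix search on reversed spellings, then reverse back."""
    return sorted(t[::-1] for t in starts_with(REVERSED_TERMS, ending[::-1]))


print(suffix_truncation("psych"))
print(prefix_truncation("symmetry"))
```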

Term Truncation • Both prefix and suffix truncation
*SYMM*: antisymmetric, asymmetry
requires inverted-index entries that are alphabetized both forward and backward
• Infix truncation
wom*n: woman, women
requires an inverted index with entries for all possible “rotated” word forms

Term Truncation • Each term entry X = x1, x2, …, xn with individual characters xi is augmented by adding a special terminal character /.
ABC --> ABC/
BABC --> BABC/
BCAB --> BCAB/
• Each augmented term x1, x2, …, xn/ is rotated cyclically by wrapping the term around itself n+1 times.
ABC/ --> /ABC, C/AB, BC/A, ABC/

Term Truncation • Each resulting word form is then augmented by appending a blank character ^. • The resulting file of word forms is sorted alphabetically, using the collating order ^, /, a, b, c, …, Z (low to high).

Term    Augmented   Rotated forms   Sorted file
ABC     ABC/        /ABC^           /ABC^
                    C/AB^           /BABC^
                    BC/A^           /BCAB^
                    ABC/^           AB/BC^
BABC    BABC/       /BABC^          ABC/^
                    C/BAB^          ABC/B^
                    BC/BA^          B/BCA^
                    ABC/B^          BABC/^
                    BABC/^          BC/A^
BCAB    BCAB/       /BCAB^          BC/BA^
                    B/BCA^          BCAB/^
                    AB/BC^          C/AB^
                    CAB/B^          C/BAB^
                    BCAB/^          CAB/B^
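The construction of the sorted file can be reproduced with a short Python sketch. It assumes upper-case terms over A to Z and encodes the collating order (^ lowest, then /, then the letters) explicitly, because plain ASCII ordering would place ^ after the letters; the function names are illustrative.

```python
# Collating order of the slides: "^" lowest, then "/", then the letters.
ORDER = "^/ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def rotated_forms(term):
    """Append "/", form all n+1 cyclic rotations, and end each with "^"."""
    augmented = term + "/"
    return [augmented[k:] + augmented[:k] + "^" for k in range(len(augmented))]


def collation_key(form):
    return [ORDER.index(ch) for ch in form]


def build_rotated_file(terms):
    forms = [form for term in terms for form in rotated_forms(term)]
    return sorted(forms, key=collation_key)


for form in build_rotated_file(["ABC", "BABC", "BCAB"]):
    print(form)
```

A term of n characters contributes n+1 entries, which is where the index growth comes from.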

Retrieval Strategies • Query term X: look for index entries /X^ or X/^. • Query term X*: look for /X*. • Query term *X: look for X/^, X/Y1, …, X/Yn (original patterns: X, Y1X, …, YnX). • Query term *X*: look for XY1/Z1, …, XYn/Zn (original patterns: Z1XY1, …, ZnXYn).

Term    Augmented   Rotated forms   Sorted file   Matches for *B*
ABC     ABC/        /ABC^           /ABC^
                    C/AB^           /BABC^
                    BC/A^           /BCAB^
                    ABC/^           AB/BC^
BABC    BABC/       /BABC^          ABC/^
                    C/BAB^          ABC/B^
                    BC/BA^          B/BCA^        BCAB
                    ABC/B^          BABC/^        BABC
                    BABC/^          BC/A^         ABC
BCAB    BCAB/       /BCAB^          BC/BA^        BABC
                    B/BCA^          BCAB/^        BCAB
                    AB/BC^          C/AB^
                    CAB/B^          C/BAB^
                    BCAB/^          CAB/B^

Retrieval Strategies • Query term X*Y: look for Y/XZ1, …, Y/XZm (original patterns: XZ1Y, …, XZmY). • Cost: the number of index entries increases.
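The lookup rules for X, X*, *X, *X* and X*Y can be combined into one routine over the rotated index. The Python sketch below is an illustration under stated assumptions: it omits the trailing ^ marker (Python already sorts a string before its extensions, and / sorts before the letters in ASCII), it supports at most the pattern shapes listed above, and all names are hypothetical.

```python
from bisect import bisect_left


def build_index(terms):
    """Sorted list of all rotations of each term augmented with "/"."""
    forms = set()
    for term in terms:
        augmented = term + "/"
        for k in range(len(augmented)):
            forms.add(augmented[k:] + augmented[:k])
    return sorted(forms)


def original(form):
    """Undo the rotation: "BC/A" -> "ABC"."""
    head, tail = form.split("/")
    return tail + head


def lookup(index, pattern):
    if "*" not in pattern:
        key = pattern + "/"                 # X    -> exact entry X/
        pos = bisect_left(index, key)
        return [pattern] if pos < len(index) and index[pos] == key else []
    if pattern.startswith("*") and pattern.endswith("*"):
        key = pattern.strip("*")            # *X*  -> entries starting with X
    elif pattern.startswith("*"):
        key = pattern[1:] + "/"             # *X   -> entries starting with X/
    elif pattern.endswith("*"):
        key = "/" + pattern[:-1]            # X*   -> entries starting with /X
    else:
        left, right = pattern.split("*")    # X*Y  -> entries starting with Y/X
        key = right + "/" + left
    matches = set()
    pos = bisect_left(index, key)
    while pos < len(index) and index[pos].startswith(key):
        matches.add(original(index[pos]))
        pos += 1
    return sorted(matches)


INDEX = build_index(["ABC", "BABC", "BCAB"])
for query in ["ABC", "B*", "*AB", "*B*", "B*C"]:
    print(query, lookup(INDEX, query))
```

Every pattern shape becomes a single prefix search, which is the point of storing all rotations.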
