This chapter explores the fundamentals of information retrieval (IR), focusing on the specific user need for NCAA college tennis team information. It outlines the essential steps in effective searching, including document representation, filtering, and classification, while distinguishing between information and data retrieval. The narrative emphasizes the importance of accurate team rankings and coach contact details. Additionally, it discusses IR history, conventional text-retrieval systems, and advanced topics such as inverted indexing and term weighting, offering insights for practitioners and researchers alike.
Modern Information Retrieval, Chapter 1: Introduction • Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Motivation • Example of a user information need • Topic: NCAA college tennis team • Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. • Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.
IR Research • Information retrieval vs Data retrieval • Research • information search • information filtering (routing) • document classification and categorization • user interfaces and data visualization • cross-language retrieval
IR History • 1970 • 1990, WWW
The User Task • Retrieval (Searching) • classic information search process where clear objectives are defined • Browsing • a process where one’s main objectives are not clearly defined and might change during the interaction with the system
Logical View of the Documents • Text Operations • reduce the complexity of the document representation • a full text → a set of index terms • Steps 1. Stopword removal 2. Stemming 3. Noun groups 4. ...
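The first two text operations can be sketched as follows. This is only an illustration: the stopword list, the suffix list, and the function names `naive_stem` and `index_terms` are assumptions made for the example, and the suffix stripping is far cruder than a real stemmer.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to"}

# Suffixes tried in order; a real stemmer applies much more careful rules.
SUFFIXES = ("ing", "ies", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce a full text to a sorted set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return sorted({naive_stem(t) for t in kept})


print(index_terms("The ranking of the teams in the tournament"))
```

The output is a much smaller representation than the full text, which is the point of the logical view: the document is reduced to the terms that are likely to carry its content.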
Past, Present, and Future • Early Development • Index • Library • Author name, title, subject headings, keywords • The Web and Digital Libraries • Hyperlinks
Resources • Journals • Journal of the American Society for Information Science • ACM Transactions on Information Systems • Information Processing and Management • Information Systems (Elsevier) • Knowledge and Information Systems (Springer) • Conferences • ACM SIGIR, DL, CIKM, CHI, etc. • Text Retrieval Conference (TREC)
Conventional Text-Retrieval Systems • Automatic Text Processing, G. Salton, Addison-Wesley, 1989 (Chapter 9)
Data Retrieval • A specified set of attributes is used to characterize each record. EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO) • Exact match between the attributes used in query formulations and those attached to the document. SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
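The exact-match behaviour can be tried with Python's built-in `sqlite3` module. The schema and the query come from the slide; the column types, the inserted row, and the helper names `build_employee_db` and `find_birthdate_and_address` are invented purely for illustration.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory EMPLOYEE table with one made-up record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EMPLOYEE ("
        "NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, "
        "SEX TEXT, SALARY REAL, DNO INTEGER)"
    )
    # Placeholder values, not real data.
    conn.execute(
        "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("John Smith", "000-00-0000", "1970-01-01",
         "1 Example Street", "M", 30000.0, 5),
    )
    return conn


def find_birthdate_and_address(conn, name):
    """Run the slide's query; only an exact NAME match returns a row."""
    cursor = conn.execute(
        "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,)
    )
    return cursor.fetchall()


conn = build_employee_db()
print(find_birthdate_and_address(conn, "John Smith"))
print(find_birthdate_and_address(conn, "Jon Smith"))  # near miss: no rows
```

The second call shows the contrast with text retrieval: a near miss in the attribute value retrieves nothing, because data retrieval has no notion of partial match.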
Text-Retrieval Systems • Content identifiers (keywords, index terms, descriptors) characterize the stored texts. • Degrees of coincidence between the sets of identifiers attached to queries (through query formulation) and to documents (through content analysis)
Possible Representation • Document representation • unweighted index terms (term vectors) • weighted index terms • … • Query • unweighted or weighted index terms • Boolean combinations (or, and, not) • … • Search operation must be effective
File Structures • Main requirements • fast-access for various kinds of searches • large number of indices • Alternatives • Inverted Files • Signature Files • PAT trees
Inverted Files • File is represented as an array of indexed documents.
Inverted-file process • The document-term array is inverted (transposed).
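A minimal sketch of the inversion, assuming a small 0/1 document-term array. The documents, terms, and values below are made up for the example, and `invert` and `posting_lists` are hypothetical helper names.

```python
# Rows are documents, columns are terms (1 = term occurs in the document).
TERMS = ["term1", "term2", "term3"]
DOC_TERM = {
    "D1": [1, 1, 0],
    "D2": [0, 1, 1],
    "D3": [1, 0, 1],
    "D4": [0, 0, 1],
}


def invert(doc_term, terms):
    """Transpose the document-term array into a term-document array."""
    columns = zip(*doc_term.values())  # one tuple per term
    return {term: list(col) for term, col in zip(terms, columns)}


def posting_lists(doc_term, terms):
    """Keep only the identifiers of documents that contain each term."""
    doc_ids = list(doc_term)
    inverted = invert(doc_term, terms)
    return {
        term: [d for d, bit in zip(doc_ids, row) if bit]
        for term, row in inverted.items()
    }


print(invert(DOC_TERM, TERMS))
print(posting_lists(DOC_TERM, TERMS))
```

After inversion each term has its own row, so answering a query only requires reading the rows of the query terms rather than scanning every document.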
Inverted-file process (Continued) • Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers. • Ex: Query = (term2 and term3)
term2    1 1 0 0
term3    0 1 1 1
----------------
result   0 1 0 0  <-- D2
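The row combination for an `and` query can be sketched as a column-wise AND over the rows. The rows are the ones from the example; the document labels D1 to D4 and the function names `and_rows` and `matching_docs` are assumptions for the illustration.

```python
DOC_IDS = ["D1", "D2", "D3", "D4"]
term2 = [1, 1, 0, 0]
term3 = [0, 1, 1, 1]


def and_rows(*rows):
    """Column-wise AND of two or more term rows."""
    return [int(all(bits)) for bits in zip(*rows)]


def matching_docs(row, doc_ids):
    """Translate a combined 0/1 row back into document identifiers."""
    return [d for d, bit in zip(doc_ids, row) if bit]


combined = and_rows(term2, term3)
print(combined)                          # [0, 1, 0, 0]
print(matching_docs(combined, DOC_IDS))  # ['D2']
```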
List-merging for two ordered lists • The inverted-index operations to obtain answers are based on a list-merging process. • Example: T1: {D1, D3}; T2: {D1, D2}; Merged(T1, T2): {D1, D1, D2, D3}
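A sketch of the merge on the example lists, with one plausible way of reading Boolean answers off the merged list: an identifier that appears twice satisfies `and`, and any identifier that appears at all satisfies `or`. The function names are hypothetical, and comparing identifiers as strings is an assumption that only works here because each has a single digit.

```python
def merge_ordered(a, b):
    """Two-pointer merge of two sorted lists, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def and_from_merged(merged):
    """Identifiers that occur twice in a row were in both input lists."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def or_from_merged(merged):
    """Collapse duplicates to obtain the union of both input lists."""
    result = []
    for doc_id in merged:
        if not result or result[-1] != doc_id:
            result.append(doc_id)
    return result


merged = merge_ordered(["D1", "D3"], ["D1", "D2"])
print(merged)                   # ['D1', 'D1', 'D2', 'D3']
print(and_from_merged(merged))  # ['D1']
print(or_from_merged(merged))   # ['D1', 'D2', 'D3']
```

Because both inputs are already sorted, the merge touches each entry once, which is why ordered posting lists are attractive for this operation.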
Extensions of Inverted Index Operations (Distance Constraints) • Distance Constraints • (A within sentence B): terms A and B must co-occur in a common sentence • (A adjacent B): terms A and B must occur adjacently in the text
Extensions of Inverted Index Operations (Distance Constraints) • Implementation • include term-location in the inverted indexes: information: {P345, P348, P350, …}; retrieval: {P123, P128, P345, …} • include sentence-location in the indexes: information: {P345, 25; P345, 37; P348, 10; P350, 8; …}; retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}
Extensions of Inverted Index Operations (Distance Constraints) • Include in the indexes: paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences: information: {P345, 2, 3, 5; …}; retrieval: {P345, 2, 3, 6; …} • Query examples: (information adjacent retrieval), (information within five words retrieval) • Cost: the size of indexes
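A minimal sketch of how such positional entries could answer the two query examples. Each posting is assumed to be a tuple (document, paragraph, sentence, word), matching the example entries. Two simplifications are assumptions of this sketch, not statements from the slides: word distance is only measured inside a single sentence, and adjacency is accepted in either order.

```python
# Postings taken from the example: (document, paragraph, sentence, word).
information = [("P345", 2, 3, 5)]
retrieval = [("P345", 2, 3, 6)]


def within_words(postings_a, postings_b, max_gap):
    """Documents where A and B share a sentence, at most max_gap words apart."""
    hits = set()
    for doc_a, par_a, sen_a, word_a in postings_a:
        for doc_b, par_b, sen_b, word_b in postings_b:
            same_sentence = (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b)
            if same_sentence and 1 <= abs(word_b - word_a) <= max_gap:
                hits.add(doc_a)
    return sorted(hits)


def adjacent(postings_a, postings_b):
    """Adjacency treated as a word gap of exactly one."""
    return within_words(postings_a, postings_b, 1)


print(adjacent(information, retrieval))         # ['P345']
print(within_words(information, retrieval, 5))  # ['P345']
```

The nested loop is quadratic in the posting-list lengths and is kept only for readability; the extra position fields it relies on are exactly the index-size cost the slide mentions.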
Term Weights • Term Weights: Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6} • Issues • How to generate the term weights? • How to apply the term weights? • Sum the weights of all document terms that match the given query. • Rank the output documents in descending order of the summed weight.
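The sum-and-rank procedure can be sketched as follows. The two weighted documents and the query are made up for the example, and `score` and `rank` are hypothetical names.

```python
# Each document maps its index terms to weights (illustrative values).
DOCS = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}


def score(doc_weights, query_terms):
    """Sum the weights of the document terms that match the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


def rank(docs, query_terms):
    """Return (document, score) pairs in descending order of score."""
    scored = [(doc_id, score(weights, query_terms))
              for doc_id, weights in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for doc_id, total in rank(DOCS, {"T1", "T2"}):
    print(doc_id, round(total, 2))
```

For the query {T1, T2}, D2 (0.7 + 0.2) ranks ahead of D1 (0.2 + 0.5), even though D1 has the larger weights overall, because only matching terms contribute.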
Boolean Query with Term Weights • Transform a Boolean expression into disjunctive normal form: T1 and (T2 or T3) = (T1 and T2) or (T1 and T3) • For each conjunct, compute the minimum term weight of any document term in that conjunct. • The document weight is the maximum of all the conjunct weights.
Boolean Query with Term Weights • Example: Q = (T1 and T2) or T3
Document vector               Conjunct weights        Query weight
                              (T1 and T2)   (T3)      (T1 and T2) or T3
D1=(T1,0.2;T2,0.5;T3,0.6)     0.2           0.6       0.6
D2=(T1,0.7;T2,0.2;T3,0.1)     0.2           0.1       0.2
D1 is preferred.
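The min/max rule can be checked on this example with a short sketch. The query is assumed to be already in disjunctive normal form and is written as a list of conjuncts, each a list of terms. That representation, the function names, and treating a missing term as weight 0.0 are assumptions of the sketch.

```python
# Q = (T1 and T2) or T3, in disjunctive normal form.
QUERY_DNF = [["T1", "T2"], ["T3"]]

DOCS = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}


def conjunct_weight(doc_weights, conjunct):
    """Minimum weight over the terms of one conjunct (absent term = 0.0)."""
    return min(doc_weights.get(term, 0.0) for term in conjunct)


def document_weight(doc_weights, query_dnf):
    """Maximum over all conjunct weights."""
    return max(conjunct_weight(doc_weights, c) for c in query_dnf)


for doc_id, weights in DOCS.items():
    print(doc_id, document_weight(weights, QUERY_DNF))
```

This reproduces the query weights 0.6 for D1 and 0.2 for D2, so D1 is ranked first.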
Stemming • Term Truncation • Remove suffixes and/or prefixes from context terms. • Example: PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …
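Right-hand truncation of a query term can be sketched as a prefix match against the index vocabulary. The vocabulary below reuses the words from the example plus two unrelated terms added for contrast, and `expand_truncation` is a hypothetical helper name.

```python
VOCABULARY = [
    "psychiatrist", "psychiatry", "psychiatric",
    "psychology", "psychological",
    "physics", "retrieval",  # added only to show non-matches
]


def expand_truncation(pattern, vocabulary):
    """Expand a pattern such as 'PSYCH*' into the matching index terms."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [t for t in vocabulary if t.lower().startswith(prefix)]
    # Without the truncation marker, fall back to an exact match.
    return [t for t in vocabulary if t.lower() == pattern]


print(expand_truncation("PSYCH*", VOCABULARY))
```

A linear scan is enough for a sketch; over a large sorted vocabulary the same expansion could be done with a binary search for the prefix range.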