
  1. CS 430: Information Discovery Lecture 19 Non-Textual Materials 1

  2. Course Administration Discussion classes • Attend! • Speak! Midterm Examination Mail has been sent to everybody. Contact cs430 if you have any outstanding questions. Assignment 2 Mail has been sent to everybody. Apologies to the four people who were asked to resubmit. Contact cs430 if you have any outstanding questions.

  3. Course Administration Submission of Programs Extra instructions have been added to the Assignments web page. The graders will run your programs from a command prompt. Your program must run from there with no additional work from the graders. Your ReadMe file must specify any command line parameters. If you have any questions, send email to cs430@cs.cornell.edu.

  4. Midterm Examination: Question 1 (a) Define the terms recall and precision. (b) D is a collection of 1,000,000 documents. Q is a query. When the query Q is run using Boolean retrieval, a set of 200 documents is returned. Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant. (i) What is the precision? (ii) What is the recall?
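For reference (the slide poses the question without giving the answer), the numbers follow directly from the definitions:

```latex
\text{precision} = \frac{\text{relevant retrieved}}{\text{retrieved}} = \frac{50}{200} = 0.25,
\qquad
\text{recall} = \frac{\text{relevant retrieved}}{\text{relevant}} = \frac{50}{100} = 0.50
```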

  5. Midterm Examination: Question 1 (continued) (c) D is a collection of 1,000,000 documents. Q is a query. When the query Q is run using ranked retrieval, a set of 500 documents is returned, ranked by the search system. (i) Explain the function of a recall-precision graph in representing this data. For each of k = 1 to 500, consider the set consisting of the k most highly ranked documents. Calculate the recall and precision for this set. The recall-precision graph is a plot of these 500 points. [Note that recall is monotonically non-decreasing in k. It is useful to label the points to indicate k and to link them in order, to trace how recall and precision vary with k.]
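A minimal sketch (not part of the exam answer) of how the 500 recall-precision points could be computed; `ranked` and `relevant` are hypothetical inputs:

```python
def recall_precision_points(ranked, relevant):
    """ranked: the returned documents in rank order;
    relevant: the set of documents known to be relevant."""
    points = []
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        # recall and precision over the k most highly ranked documents
        points.append((k, hits / len(relevant), hits / k))
    return points
```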

  6. Recall-precision graph (Lecture 8, Slide 17) [Figure: a recall-precision plot, with precision on the horizontal axis and recall on the vertical axis (ticks at 0.25, 0.5, 0.75, 1.0); the plotted points are labeled with the rank k and linked in order.]

  7. Midterm Examination: Question 1 (continued) (ii) In a practical experiment, how would you estimate the precision? (iii) In a practical experiment, how would you estimate the recall? (ii) To estimate the precision, have an expert examine each of the 500 hits and decide which are relevant. (iii) It is impractical to examine 1,000,000 documents. Therefore some form of sampling is needed. [Options are statistical sampling of the entire set, or (TREC-like) generating a candidate set by searching with many queries and methods.]

  8. Midterm Examination: Question 2 (a) The aggregate term weighting is sometimes written: w = tf * idf. Explain the purpose of tf and of idf. Term frequency (tf) assumes that a term that appears frequently in a document is a good discriminator and should be given greater weight. Inverse document frequency (idf) assumes that a term that appears in many documents is a poor discriminator and should be given less weight.

  9. Midterm Examination: Question 2 (Notation)
  • wik is the weight given to term k in document i
  • fik is the frequency with which term k appears in document i
  • dk is the number of documents that contain term k
  • N is the total number of documents in the collection
  • ni is the total number of occurrences of term i in the collection
  • maxn is the maximum frequency of any term in the collection

  10. Midterm Examination: Question 2 (continued) (b) In class, we first introduced Salton's original term weighting: wik = fik * N / dk. Later, we discussed Sparck Jones's version of the inverse document frequency: idfi = log2(maxn / ni) + 1.  What is the relationship between these alternatives? Because of the logarithm, Sparck Jones's version gives less than proportional weight to a term that appears many times, whereas Salton's N / dk is directly proportional. This is particularly helpful when the documents vary in length. Her method also uses ni, the total number of occurrences, not dk, the number of documents in which a term occurs.
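A small numeric sketch (all figures invented) showing why the logarithmic form grows less than proportionally:

```python
import math

# Invented numbers, using the notation of the previous slide.
N    = 1_000_000   # total documents in the collection
f_ik = 10          # frequency of term k in document i
d_k  = 5_000       # documents containing term k
maxn = 200_000     # maximum frequency of any term in the collection
n_i  = 5_000       # total occurrences of term i in the collection

salton       = f_ik * N / d_k              # proportional to N / d_k
sparck_jones = math.log2(maxn / n_i) + 1   # grows only logarithmically

print(salton, sparck_jones)                # 2000.0  ~6.32
```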

  11. Midterm Examination: Question 2 (continued) (c) Consider the query: Q: cat cat elk and the following set of documents: D1: bee cat dog bee cat bee fox elk D2: bee cat dog hog hog dog ant (i) With no term weighting, what is the similarity between this query and each of the documents? (ii) Using Salton's original form of term frequency, but not weighting for inverse document frequency, what is the similarity between this query and each of the documents?

  12. Midterm Examination: Question 2 (no term weighting)

  terms   ant   bee   cat   dog   elk   fox   hog   length
  Q                    1           1                 √2
  D1             1     1     1     1     1           √5
  D2       1     1     1     1                 1     √5

  S(Q, D1) = (1·1 + 1·1) / (√2 · √5) = 2 / √10
  S(Q, D2) = (1·1) / (√2 · √5) = 1 / √10

  13. Midterm Examination: Question 2 (tf weighting, no idf)

  terms   ant   bee   cat   dog   elk   fox   hog   length
  Q                    2           1                 √5
  D1             3     2     1     1     1           √16 = 4
  D2       1     1     1     2                 2     √11

  S(Q, D1) = (2·2 + 1·1) / (√5 · 4) = 5 / (4·√5)
  S(Q, D2) = (2·1) / (√5 · √11) = 2 / √55
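A sketch (not from the exam) that reproduces the numbers above using raw term frequencies and the cosine measure; for part (i), replace the counts with 0/1 weights:

```python
import math
from collections import Counter

def cosine(q, d):
    # dot product over the query's terms, divided by the vector lengths
    dot = sum(q[t] * d[t] for t in q)
    length = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (length(q) * length(d))

Q  = Counter("cat cat elk".split())
D1 = Counter("bee cat dog bee cat bee fox elk".split())
D2 = Counter("bee cat dog hog hog dog ant".split())

print(cosine(Q, D1))   # 5 / (sqrt(5) * 4) ~ 0.56
print(cosine(Q, D2))   # 2 / sqrt(55)      ~ 0.27
```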

  14. Midterm Examination: Question 3 The following figure is taken from: Scott Deerwester, et al., "Indexing by latent semantic analysis". [See next slide for figure.] It shows a two-dimensional plot of 12 terms, 9 documents (labeled c1-c5 and m1-m4), and a query, q.

  15. Midterm Examination: Question 3 (continued) (a) In this figure, the axes are labeled "Dimension 1" and "Dimension 2". What do these dimensions represent? They are intended to represent concepts. (b) Explain how this graph can be used to measure how near document c2 is to the query q. Each point on the graph represents the end point of a vector from the origin to that point. The cosine of the angle between the vector from the origin to c2 and the vector from the origin to q is used to measure their similarity.

  16. Midterm Examination: Question 3 (continued) (c) The dotted lines are described as, "The dotted cone represents the region whose points are within a cosine of 0.9 from the query q." All the documents labeled c1-c5 are within this cone, but none of the documents labeled m1-m4. What does this imply? The angles between the vectors representing q and each of c1 to c5 are small (cosine > 0.9). Hence they represent concepts similar to q. The angles between the vectors representing q and each of m1 to m4 are large (cosine < 0.9). Hence they represent concepts different from q.

  17. Examples of Non-textual Materials

  Content                   Attributes
  maps                      lat. and long., content
  photograph                subject, date and place
  bird songs and images     field mark, bird song
  software                  task, algorithm
  data set                  survey characteristics
  video                     subject, date, etc.

  18. Possible Approaches to Information Discovery for Non-text Materials
  • Human indexing: manually created metadata records
  • Automated information retrieval: automatically created metadata records (e.g., image recognition)
  • Context: associated text, links, etc. (e.g., Google image search)
  • Multimodal: combine information from several sources
  • User expertise
  • Browsing: user interface design

  19. Surrogates Surrogates for searching • Catalog records • Finding aids • Classification schemes Surrogates for browsing • Summaries (thumbnails, titles, skims, etc.)

  20. Catalog Records for Non-Textual Materials • General metadata standards, such as Dublin Core and MARC, can be used to create a textual catalog record of non-textual items. • Subject based metadata standards apply to specific categories of materials, e.g., FGDC for geospatial materials. • Text-based searching methods can be used to search these catalog records.
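As an illustration of such a textual catalog record (not from the lecture; the element names are Dublin Core, the values are invented), expressed here as a plain dictionary:

```python
# A Dublin Core-style record for a photograph; values are invented.
photo_record = {
    "Title":       "Sod house on the northern plains",
    "Creator":     "Fred Hultstrand",
    "Subject":     "Dwellings; Pioneers",
    "Description": "A family standing outside a sod house.",
    "Date":        "1895",
    "Type":        "Image",
    "Coverage":    "North Dakota",
}

# Ordinary text-based searching then applies to the record's values:
hits = [k for k, v in photo_record.items() if "sod house" in v.lower()]
```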

  21. Automated Creation of Metadata Records Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field. Examples • Images -- characteristics of color, texture, shape, etc. (crude) • Music -- optical recognition of score (good) • Bird song -- spectral analysis of sounds (good) • Fingerprints (good)
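A minimal sketch of the first example, extracting crude color characteristics from an image; it assumes the Pillow library, and the file name is hypothetical:

```python
from PIL import Image

def color_histogram(path, bins_per_channel=4):
    """Coarse RGB color histogram, usable as crude index metadata."""
    img = Image.open(path).convert("RGB")
    step = 256 // bins_per_channel
    hist = {}
    for r, g, b in img.getdata():
        key = (r // step, g // step, b // step)
        hist[key] = hist.get(key, 0) + 1
    return hist

# hist = color_histogram("photo.jpg")   # hypothetical file
```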

  22. Image Retrieval: Blobworld

  23. Example 1: Photographs Photographs in the Library of Congress's American Memory collections In American Memory, each photograph is described by a MARC record. The photographs are grouped into collections, e.g., The Northern Great Plains, 1880-1920: Photographs from the Fred Hultstrand and F.A. Pazandak Photograph Collections Information discovery is by: • searching the catalog records • browsing the collections

  24. Photographs: Cataloguing Difficulties Automatic • Image recognition methods are very primitive Manual • Photographic collections can be very large • Many photographs may show the same subject • Photographs have little or no internal metadata (no title page) • The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)

  25. Photographs: Difficulties for Users Searching • Often difficult to narrow the selection down by searching -- browsing is required • Criteria may be different from those in catalog (e.g., graphical characteristics) Browsing • Offline. Handling many photographs is tedious. Photographs can be damaged by repeated handling • Online. Viewing many images can be tedious. Screen quality may be inadequate.

  26. Example 2: Mathematical Software Netlib • A digital library of mathematical software (Jack Dongarra and Eric Grosse). • Exchange of software in numerical analysis, especially for supercomputers with vector or parallel architectures. • Organization of material assumes that users are mathematicians and scientists who will incorporate the software into their own computer programs. • The collections are arranged in a hierarchy. The editors use their knowledge of the specific field to decide the method of organization.

  27. GAMS: Guide to Available Mathematical Software

  28. Example 3: Geospatial Information Example: Alexandria Digital Library at the University of California, Santa Barbara • Funded by the NSF Digital Libraries Initiative since 1994. • Collections include any data referenced by a geographical footprint: terrestrial maps, aerial and satellite photographs, astronomical maps, databases, related textual information. • Program of research with practical implementation at the university's map library.

  29. Alexandria User Interface

  30. Alexandria: Computer Systems and User Interfaces
  Computer systems
  • Digitized maps and geospatial information -- large files
  • Wavelets provide a multi-level decomposition of the image:
    -> the first level is a small, coarse image
    -> extra levels provide greater detail (see the sketch below)
  User interfaces
  • Small size of computer displays
  • Slow performance of the Internet in delivering large files
    -> retain state throughout a session
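A minimal sketch of the multi-level decomposition idea; the lecture does not name an implementation, so this assumes the PyWavelets package and a made-up image:

```python
import numpy as np
import pywt

image = np.random.rand(512, 512)      # stand-in for a digitized map

# Three-level 2-D wavelet decomposition: coeffs[0] is the small coarse
# image; each later entry adds a band of finer detail.
coeffs = pywt.wavedec2(image, wavelet="haar", level=3)
coarse = coeffs[0]
print(coarse.shape)                   # (64, 64) -- deliver this first
```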

  31. Alexandria: Information Discovery • Metadata for information discovery • Coverage: geographical area covered, such as the city of Santa Barbara or the Pacific Ocean. • Scope: varieties of information, such as topographical features, political boundaries, or population density. • Latitude and longitude provide basic metadata for maps and for geographical features.
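A sketch (not from the slide) of the basic footprint query this metadata supports: does an item's bounding box overlap the user's query box? The boxes below are invented:

```python
# Bounding boxes as (min_lat, min_lon, max_lat, max_lon).
def overlaps(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or   # disjoint in latitude
                a[3] < b[1] or b[3] < a[1])     # disjoint in longitude

santa_barbara = (34.3, -120.0, 34.5, -119.6)
query_box     = (34.0, -121.0, 35.0, -119.0)
print(overlaps(santa_barbara, query_box))       # True
```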

  32. Gazetteer • Gazetteer: database and a set of procedures that translate representations of geospatial references: • place names, geographic features, coordinates • postal codes, census tracts • Search engine tailored to peculiarities of searching for place names. • Research is making steady progress at feature extraction, using automatic programs to identify objects in aerial photographs or printed maps -- topic for long-term research.
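The core translation a gazetteer performs, sketched with two invented entries; as the slide notes, a real place-name search engine must also cope with variants and ambiguity:

```python
GAZETTEER = {
    "Santa Barbara": (34.42, -119.70),   # (latitude, longitude)
    "Ithaca":        (42.44, -76.50),
}

def lookup(name):
    # Toy exact-match lookup; real gazetteers handle spelling variants,
    # historical names, and ambiguous place names.
    return GAZETTEER.get(name)
```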

  33. Collections: Finding Aids and the EAD Finding aid • A list, inventory, index or other textual document created by an archive, library or museum to describe holdings. • May provide fuller information than is normally contained within a catalog record or be less specific. • Does not necessarily have a detailed record for every item. The Encoded Archival Description (EAD) • A format (XML DTD) used to encode electronic versions of finding aids. • Heavily structured -- much of the information is derived from hierarchical relationships.

  34. Collection-Level Metadata Collection-level metadata is used to describe a group of items. For example, one record might describe all the images in a photographic collection. Note: There are proposals to add collection-level metadata records to Dublin Core. However, a collection is not a document-like object.

  35. Collection-Level Metadata

  36. Data Mining • Extraction of information from online data. • Not a topic of this course.
