Information Management

Information Management
Lecture 3: Cataloging, Indexing, Searching
J. Michael Moshell
University of Central Florida
Original image* by Moshell et al.

Cataloging and Indexing
• Why are we discussing this?
I don't believe in memorizing a bunch of soon-obsolete facts. I DO believe that many of you will have to solve info-management problems. You will probably invent ways of doing it. So you should "steal from the best", not reinvent the wheelbarrow.
www.joe-ks.com

How do we find things?
• By starting in the neighborhood of similar things.
• By using the name of the thing, and asking an "expert" or "resource".

How do we find things? When reading a book:
• By starting in the neighborhood of similar things: look in the table of contents, for an ARTICLE.
• By using the name of the thing, and asking an "expert" or "resource": look in the index, for a TOPIC.

How do we find things? At the library:
• By starting in the neighborhood of similar things: go to the relevant section, browse shelves.
• By using the name of the thing, and asking an "expert" or "resource": use the (card) catalog (really an index).

How do we find things? On the Internet:
• By starting in the neighborhood of similar things: follow links from trusted sources (like cnet).
• By using the name of the thing, and asking an "expert" or "resource": use the indexes, e.g.
  - those provided by search engines
  - those provided by vendors (eBay, Amazon...)
  - those provided by facilitators (YouTube, craigslist)

What's an index?
• An index is a system that serves to optimize speed in finding relevant documents in a search.
• An index is a system that, given one or more search terms from either metadata or essence, efficiently reports the location of the essence.
What's fast? What's efficient? Here comes some math ... (how we all love it!)

Order statistics
A document contains k records (perhaps k = 1000). If you must examine EACH RECORD to find what you seek, the search is Order-k, written O(k).
For ancient records, this is usually the only way. For instance, the Archivo General de Indias in Seville, Spain.
On average, you would look at 500 records (0.5 * k) to find the one you are seeking.
Let's say we seek a ship named Nuestra Senora de Atocha.
www.learningcurve.gov.uk
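The "look at every record" search can be sketched in a few lines of Python. This is only an illustration: the archive contents and the name `linear_search` are made up for the example, not taken from any real catalog.

```python
def linear_search(records, target):
    """Return (position, records_examined); position is None if target is absent."""
    examined = 0
    for position, record in enumerate(records):
        examined += 1
        if record == target:
            return position, examined
    return None, examined

# Hypothetical archive of k = 1000 ship names, in no particular order.
records = [f"Ship {i}" for i in range(1000)]

# Average number of records examined, taken over every possible target.
total = sum(linear_search(records, name)[1] for name in records)
average = total / len(records)
print(average)  # 500.5, i.e. about 0.5 * k
```

A target that is not in the archive at all costs the full k examinations, which is why O(k) is the honest label.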

Indexing
To prepare an index of all ships' names, captains' names, owners and dates in the archive would take O(k) time. Why? Because every document would be visited.
Each index item contains a SEARCH TERM and a DOCUMENT NUMBER.
BUT now (if the index is sorted, which it is) we can find S = Nuestra Senora de Atocha much faster, by playing "binary search": pick an entry in the sorted index and ask "Is S > this?" If no, S is at or before that entry; if yes, S is further along, toward Z.
[Figure: a sorted index, with the question "S > this?" asked at one entry and Z at the end]
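A minimal sketch of such an index, assuming a tiny made-up archive: each index item pairs a search term with a document number, the list is sorted, and Python's standard `bisect_left` does the binary search. The document numbers and ship list are hypothetical.

```python
from bisect import bisect_left

# Hypothetical archive: document number -> ship name found in that document.
documents = {
    101: "Santa Margarita",
    102: "Nuestra Senora de Atocha",
    103: "San Jose",
    104: "Concepcion",
}

# One pass over every document (O(k)), then a sort so binary search works.
index = sorted((ship, doc_number) for doc_number, ship in documents.items())
terms = [ship for ship, _ in index]

def lookup(term):
    """Return the document number for term, or None if it is not indexed."""
    i = bisect_left(terms, term)  # binary search on the sorted terms
    if i < len(terms) and terms[i] == term:
        return index[i][1]
    return None

print(lookup("Nuestra Senora de Atocha"))  # 102
```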

Indexing and binary search
• 1 comparison distinguishes 2 records
• 2 comparisons distinguish 4 records
• 3 comparisons distinguish 8 records
• ...
• 10 comparisons distinguish 1024 records
• 20 comparisons distinguish over a million records
Each comparison cuts the search space in half, so searching a sorted index is O(log k).
[Figure: a sorted index ending at Z]
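To see the halving in action, here is a small sketch that counts the comparisons a binary search makes. The function name `binary_search_steps` is invented for this example, and a plain `range` stands in for a sorted index.

```python
def binary_search_steps(sorted_items, target):
    """Return how many halving steps a binary search takes to settle on target."""
    low, high = 0, len(sorted_items)
    steps = 0
    while high - low > 1:
        middle = (low + high) // 2
        steps += 1
        if target < sorted_items[middle]:  # "is the target before this entry?"
            high = middle
        else:
            low = middle
    return steps

print(binary_search_steps(range(1024), 700))       # 10
print(binary_search_steps(range(1024 * 1024), 5))  # 20
```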

OMG, a Log? Puleeeeez ....
Yep, this is college and you are a DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
...
2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand
2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million
2^30 = 2*2*...*2 (thirty twos) = 1024 * 1024 * 1024 = 1 gig, about a billion
k:        2    4    8    1024    1 meg    1 gig
log2(k):  1    2    3    10      20       30
You need to be able to tell me what log2(k) is for any k (power of two) between 1 and 1 meg.
Example: 256k? 256 = 2^8 and 1k = 2^10 (taking 1k as 1024), so 256k = 2^8 * 2^10 = 2^18. That's 2*2*...*2 (18 twos), so log2(256k) = 18.
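One way to check answers while practicing: log2 of a power of two is just the number of times you can halve it before reaching 1. The sketch below assumes "1k" means 1024; the function name is made up for the example.

```python
def log2_of_power_of_two(k):
    """Count how many times k can be halved before reaching 1."""
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of two")
    count = 0
    while k > 1:
        k //= 2
        count += 1
    return count

KILO = 1024  # assumption: "1k" means 2^10

print(log2_of_power_of_two(256 * KILO))  # 18
```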

OMG, a Log? Puleeeeez .... I will provide a Logarithm Practice Sheet on the website to help you study and practice for the midterm exam.

Indexing and binary search
Items        Linear search (0.5 * k on average)   Binary search
1000         about 500 steps                      10 steps
1 million    about 500,000 steps                  20 steps
1 billion    about 500 million steps              30 steps
Binary search on a sorted index is O(log2 k): each comparison cuts the search space in half.

Sorting N Objects
We will discuss sorting a bit later, after you recover from Math Anxiety.
Slcc.edu

Why not just keep books in order?
Could you do 'binary search' directly on the books ...? If they're on the shelf in that order, yes.
• Well, WHICH order?
  - by ship names?
  - by captains' names?
  - by year of construction?
  - by year of sinking or decommissioning?
• An index can be sorted on any data field, then searched.
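A sketch of why an index beats shelf order: the documents stay where they are, and you build one sorted index per field. The records, captains and years below are invented purely for illustration.

```python
# Hypothetical records: (ship name, captain, year built, document number).
records = [
    ("Santa Margarita", "Garcia", 1620, 7),
    ("Concepcion", "Alvarez", 1618, 3),
    ("San Jose", "Mendez", 1698, 9),
]

# One index per field: a list of (field value, document number), sorted by value.
by_ship = sorted((r[0], r[3]) for r in records)
by_captain = sorted((r[1], r[3]) for r in records)
by_year = sorted((r[2], r[3]) for r in records)

print(by_year[0])  # (1618, 3)
```

The same three documents now appear in three different orders at once, which a physical shelf cannot do.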

Why not just keep books in order?
• Sorting k objects takes O(k log k) time.
  - So sorting a billion objects takes 1 billion * log2(1 billion) = 1 billion * 30 = 30 billion steps.
  - This can be done overnight, when computers aren't busy.
• BUT once sorted, finding the right spot for new information takes O(log k) time.
• So you can place a new fact into our billion-item index in about 30 steps. Fast!
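The arithmetic and the insertion step can be tried out with Python's standard `bisect` module. This is a sketch with a made-up four-ship index; here "a billion" is taken as 2^30, so the total comes out slightly above 30 billion.

```python
from bisect import bisect_left, insort

# Rough step count: sorting k items takes about k * log2(k) steps.
k = 2 ** 30            # about a billion
sort_steps = k * 30    # log2(2^30) = 30
print(f"{sort_steps:,}")  # 32,212,254,720, i.e. roughly 30 billion

# Adding a new entry to a small, already-sorted index.
index = ["Concepcion", "San Jose", "Santa Margarita"]
position = bisect_left(index, "Nuestra Senora de Atocha")  # O(log k) comparisons
insort(index, "Nuestra Senora de Atocha")                  # finds the spot and inserts
print(position, index)
```

One caveat on the design: a Python list still has to shift the later entries over to make room, so only the search for the spot is logarithmic here. Index structures built as balanced trees are the usual way to keep the whole insertion cheap.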

What terms shall we index?
• For text, the essence yields keyword search.
• It is the dumbest but easiest kind of search, if essence = digital text.
• This was not true for traditional libraries.
  - Nobody had time to catalog every word of every book.
  - Professional catalogers had to develop techniques, recording:
    - Author
    - Title
    - Publication Date
    - Subject
    (METADATA!)
• And this last one, Subject, took more work than all the rest together.

What's so hard about subject indexing?
• The problem: restricting the vocabulary.
• Let's consider a fictional book: The Skills of a Nineteenth Century Bartender. Henry Macintosh, New York, 1889.
How might someone seek this book? Or: what metadata fields might the librarian use?
Occupations: bartender, barkeeper, barman, barkeep. (Are there others we forgot to search for?)
So catalogers established rules, based on precedent, to restrict vocabularies and establish standards.
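One way to picture a restricted vocabulary is as a lookup table in which every variant points to a single preferred heading. The table below is a hypothetical sketch using the bartender example; choosing "bartender" as the preferred term is an assumption made for illustration, not a real cataloging rule.

```python
# Hypothetical controlled vocabulary: every variant maps to one preferred term.
PREFERRED_TERM = {
    "bartender": "bartender",
    "barkeeper": "bartender",
    "barman": "bartender",
    "barkeep": "bartender",
}

def normalize_subject(term):
    """Return the preferred subject heading, or None if the term is not in the vocabulary."""
    return PREFERRED_TERM.get(term.strip().lower())

print(normalize_subject("Barkeep"))   # bartender
print(normalize_subject("publican"))  # None
```

The `None` result is the interesting case: it is exactly the "others we forgot" problem, and someone has to decide whether to extend the vocabulary.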

Cataloging an Item for a Library The card catalog at Yale University (of course, it's all computerized now)

Cataloging an Item for a Library
• Problem #1: What book (or other object) are we talking about?
  - Each item has an accession number (that's easy to issue).
  - Each title has a catalog number, shared with all instances (sometimes separate copies are called .c1, .c3 etc.).
• Problem #2: What catalog number should I give this item?
  - Did someone else catalog it already? If so, use that.
  - If not, follow the International Standard Bibliographic Description (ISBD).

International Standard Bibliographic Description (ISBD)
• title,
• statement of responsibility (author or editor),
• edition,
• material specific details (for example, the scale of a map),
• publication and distribution,
• physical description (for example, number of pages),
• series (e.g. this might be part 3 of a trilogy),
• notes,
• standard number (ISBN).

And then follow
• A complex set of rules
• Most English cataloging follows the Anglo-American Cataloging Rules (AACR2)
• Germans follow the Regeln für die alphabetische Katalogisierung
• Etc…

How to organize an index
• Step 1: Deciding what fields to include (the Ontology of the subject space).
• Step 2: Deciding if each metadata field is open or controlled (CV, a controlled vocabulary).
  - Open set: American family names
  - Closed set: Chinese family names
  - In software, CV fields are often presented as pulldown menus.
• Step 3: Establishing the controlled vocabulary, and rules for extending it.
• Step 4: Maintaining it (e.g. MIME types, subtypes).
http://www.kksou.com
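The four steps can be sketched as a tiny schema check. Everything here is hypothetical: the field names, the `SCHEMA` layout and the two helper functions are invented for illustration, and the media types are only loosely modeled on MIME top-level types.

```python
# Hypothetical schema: None marks an open field; a set marks a controlled vocabulary (CV).
SCHEMA = {
    "title": None,
    "family_name": None,
    "media_type": {"text", "image", "audio", "video"},
}

def validate(record):
    """Return a list of problems found in record; an empty list means it is acceptable."""
    problems = []
    for field, value in record.items():
        if field not in SCHEMA:
            problems.append(f"unknown field: {field}")
        elif SCHEMA[field] is not None and value not in SCHEMA[field]:
            problems.append(f"{field}: {value!r} is not in the controlled vocabulary")
    return problems

def extend_vocabulary(field, new_term):
    """A stand-in for 'rules for extending it': only CV fields can grow."""
    if SCHEMA.get(field) is None:
        raise ValueError(f"{field} is not a controlled field")
    SCHEMA[field].add(new_term)

print(validate({"title": "Atocha survey", "media_type": "hologram"}))
```

A pulldown menu is the user-interface version of the same idea: it simply never lets a value outside the set be typed in.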

Concept: "Low-hanging fruit"
• In any new domain, some ideas will come together that present opportunities not previously possible.
• Some of them will be easy to do.
• Get these first, and you may be rich.
• The cataloging of dynamic media such as video can take advantage of techniques for Content Logging.
• In this area, closed captions were a low-hanging fruit.
www.recipeforlowhangingfruit.com

Closed Captions for TV
• Originally for deaf viewers ... now for bars, etc.
• "Closed": not all viewers will see the captions.
• But they are built into most TV broadcasts.
• >> indicates a new speaker has begun to talk.
• But isn't speech recognition still hard?
  - Yes, but there are SCRIPTS and TELEPROMPTERS behind most TV programming. Live news feeds are a mix of scripted and unscripted.
• BBC developed a re-speak technology to maximize clarity.
• Sound effects and music are shown by # or notes.

Closed Captions for TV
• Now that CC exists, you can index it to produce metadata.
• Services monitor in real time for significant stories.
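Indexing caption text might look roughly like this. The caption lines and time codes are invented, and the function `build_caption_index` is a made-up name; real caption feeds have their own formats that this sketch does not attempt to parse.

```python
from collections import defaultdict

# Hypothetical caption lines: (time code, caption text). ">>" marks a new speaker.
captions = [
    ("00:00:05", ">> GOOD EVENING. A HURRICANE IS APPROACHING FLORIDA."),
    ("00:00:12", ">> THE STORM MAY REACH ORLANDO BY FRIDAY."),
    ("00:01:40", ">> IN SPORTS, ORLANDO WON AGAIN."),
]

def build_caption_index(lines):
    """Map each word to the sorted list of time codes where it appears."""
    index = defaultdict(set)
    for time_code, text in lines:
        for word in text.replace(">>", " ").split():
            word = word.strip(".,?!").lower()
            if word:
                index[word].add(time_code)
    return {word: sorted(codes) for word, codes in index.items()}

index = build_caption_index(captions)
print(index["orlando"])  # ['00:00:12', '00:01:40']
```

A monitoring service could then watch for chosen words (say, "hurricane") and jump straight to the matching time codes in the video.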

Can you think of another TV "LHF"?
• Where is another source of already-in-text-form metadata about TV program contents? (I can think of two.)
  - Electronic Program Guides, such as TiVo's TV programming schedule
  - Broadcasters' websites (e.g. www.cbs.com)

We've discussed third party logging
• But what about in-house logging (by the materials' own producers)?
• Static metadata (exists independently of the essence)
  - Production Notes, including original scripts
  - Edit Decision List (part of production notes)
  - Advanced Authoring Format (AAF)
  - News Feed rundowns (cues for local broadcasters)
  - Media Object Server (MOS) format
• Dynamic metadata (sampled from or derived from the essence)
  - A hierarchy of proxy representations:
    - time code (ties it all together)
    - proxy video (low res, maybe easier to scan, or harder!)
    - keyframes (still images for pattern recognition)
    - audio transcript
    - annotation, added by staff

Speech Analysis
• Phoneme: the smallest unit of sound that distinguishes one word from another. English has about 44.
• Phone: the 'rendering' of a phoneme by an individual speaker. There are infinitely many.
• Recognition of words: difficult under good conditions, nearly impossible under noisy conditions.
• However, you don't need to get ALL the words to make the document searchable. Even getting SOME of the words is better than none.
www.nuance.com

Indexing things that aren't words
• Built-in metadata (e.g. digital camera data, Adobe metadata)
• Image libraries, cataloged by human beings (We will study some of the metadata standards used.)
• Automatic pattern recognition
  - http://www.autonomy.com/content/Solutions/video-surveillance/index.en.html
• Assignment: Download ONE of the "Autonomy Virage" documents, read it and be prepared to give a one-minute summary of its claims.

Recognizing Faces
• FINDING a face in a scene is far easier than RECOGNIZING it.
• Nikon's cameras can now find faces and focus on them (Face-priority AF in Nikon Coolpix cameras).
• But it's a rough, rough world out there. The website listed below provides a list of vendors ... many of which are 'dead links' as companies come and go.
http://www.face-rec.org/vendors//

And ... where do we go from here? Go back through these slides. Make a list of the important words. If you can write a one-sentence explanation of every word on this list, AND answer logarithm questions, you're ready for the midterm. ... at least with regard to Searching and Pattern Recognition. But now let's go talk about SORTING.

Sorting
• Why are we discussing this? It's a good example of DUMB vs. SMART algorithms.
What's an algorithm? A systematic procedure for solving a problem.
Programs are built on the basis of algorithms. But so are
  * carpentry
  * medical diagnosis
  * electronic repair
  * etc.

Sorting and Ignorance
• Two thousand name-tags
• Printed in NAME order
• Needed in COMPANY order
• So… they put six temps to work … for HOURS…
Mnddc.org

Sorting the Hard Way
• Spread 'em all on a long table.
• Insert each one into the ordered pile.
• Problem: The pile gets bigger and bigger, so the insertion goes more & more slowly.
• This technique takes O(n^2), that's n squared.
  - Walk down the row (pass n badges), insert one.
  - Do this n times. You have n * n distance to walk.
  - 2000 * 2000 = 4 million operations!
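The "walk down the row, insert one" method is what programmers call insertion sort. The sketch below counts how many badges get walked past; badge numbers stand in for company names, and the function name is invented for the example.

```python
def insertion_sort_steps(badges):
    """Sort a copy of badges by insertion; return (sorted_list, badges_passed)."""
    ordered = []
    passed = 0
    for badge in badges:
        position = 0
        # Walk down the ordered pile until the right spot is found.
        while position < len(ordered) and ordered[position] <= badge:
            position += 1
            passed += 1
        ordered.insert(position, badge)
    return ordered, passed

# Worst case for this walk: badges that arrive already in order,
# so each new badge must pass every badge placed so far.
ordered, passed = insertion_sort_steps(list(range(2000)))
print(passed)  # 1999000
```

The exact count is about half of 2000 * 2000, because the pile is short at the start. The n * n figure is an upper bound, and halving it does not change the O(n^2) label.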

Sorting, a smart way
• 1. Grab 20 badges, and sort them in a small group. Create 100 small, sorted batches.
• 2. Combine the batches 2 by 2, like this:
  - 20 + 20 -> 40
  - 20 + 20 -> 40
  - 40 + 40 -> 80, etc.
• Reminds you of binary search? Yes: merging twice as many groups only takes one more step (layer).
  - 4 groups: 2 layers (3 operations)
  - 8 groups: 3 layers (7 operations), etc.
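The two steps can be sketched with the standard library's `heapq.merge`, which combines already-sorted sequences. The batch size of 20 comes from the slide; the badge numbers and the name `sort_in_batches` are assumptions made for this example.

```python
from heapq import merge

def sort_in_batches(badges, batch_size=20):
    """Sort small batches, then merge them two by two; return (sorted_list, layers)."""
    batches = [sorted(badges[i:i + batch_size])
               for i in range(0, len(badges), batch_size)]
    layers = 0
    while len(batches) > 1:
        # Merge neighbours pairwise; an odd batch at the end is carried up as-is.
        batches = [list(merge(*batches[i:i + 2]))
                   for i in range(0, len(batches), 2)]
        layers += 1
    return (batches[0] if batches else []), layers

badges = [(7919 * i) % 2000 for i in range(2000)]  # hypothetical shuffled badge numbers
ordered, layers = sort_in_batches(badges)
print(layers)  # 7 layers for 100 batches
```

Seven layers for 100 batches fits the pattern on the slide: 64 batches would need 6 layers and 128 would need 7.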

Sorting by 'merge-sort'
Merge-Sort requires O(n log2 n) operations to sort n objects.
For 2000 name badges, log2(2000) = log2(1000) + 1.
You recognize log2(1000) ~= log2(1k) = 10, so log2(2000) ~= 11.
So our total estimate for sorting 2000 name badges is approximately 2000 * 11, or 22,000 steps,
compared to 4 million steps (2000 * 2000) if doing the job the BFI (Brute Force & Ignorance) way!
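The estimate above is easy to reproduce. This sketch only redoes the slide's arithmetic (it does not time a real sort), and the two function names are invented for the example.

```python
import math

def merge_sort_steps(n):
    """Estimated steps for merge sort: n * log2(n), rounding the log up."""
    return n * math.ceil(math.log2(n))

def brute_force_steps(n):
    """Estimated steps for the insert-one-at-a-time approach: n * n."""
    return n * n

n = 2000
print(merge_sort_steps(n))   # 22000
print(brute_force_steps(n))  # 4000000
print(brute_force_steps(n) / merge_sort_steps(n))  # about 182 times more work
```

Try a larger n to see the gap widen: the ratio is n / log2(n), so it keeps growing as the pile of badges grows.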
