FRBR: Algorithms and Applications

FRBR: Algorithms and Applications. T. Hickey, J. Toves, D. Vizine-Goetz. Online Computer Library Center. CLA, November 2004. Outline: Algorithms (FRBR work matching, handling author-title variants); Hardware (Beowulf cluster); Applications (bookmarklets, FictionFinder); Future directions.


  1. FRBR: Algorithms and Applications • T. Hickey, J. Toves, D. Vizine-Goetz • Online Computer Library Center • CLA, November 2004

  2. Outline • Algorithms • FRBR work matching • Handling author-title variants • Hardware • Beowulf cluster • Applications • Bookmarklets • FictionFinder • Future directions

  3. Working with Group 1 Entities • WEMI: Work, Expression, Manifestation, Item • Strict expression-level determination is hard • We primarily divide by language • Manifestation is easier • We use the WorldCat master record

  4. Work Identification • Algorithm goals: • Efficient • Understandable • Controllable by catalogers • Uses existing WorldCat records

  5. The Algorithm • A key is generated for each record • Extract author, title • Look up in LC name authority file • Added entry information as needed • Form a key from bibliographic record • Author, title, added entry information • These can be sorted, compared

  6. Example

  7. Example (with authorities)

  8. More Detail • Extract author names • Look up in authority file • Currently only personal names • Subfields $abcdq • Extract title • Always use uniform titles if present • Look up author/short title (~$a) • Look up author/long title (~$abfgnp) • Prefer alternative title for non-English • Create key from author/title • Always do NACO normalization (has limitations) • Add information for uncontrolled title-main-entry
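The key-building steps in slides 5 and 8 can be sketched roughly as follows. This is a minimal illustration, not the OCLC implementation: `normalize` is a simplified stand-in for NACO normalization (it only handles Latin-script text), the `nonfiling` parameter mimics skipping a leading article, and the key layout (`author\dates/title`) is inferred from the examples in slide 28. The authority-file lookup is left out.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Simplified stand-in for NACO normalization (not the full rule set)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Keep letters, digits, '&' and commas; everything else becomes a space.
    cleaned = re.sub(r"[^a-z0-9&,]+", " ", lowered)
    # Keep only the first comma, as in "dickens, charles".
    head, sep, tail = cleaned.partition(",")
    cleaned = head + sep + tail.replace(",", " ")
    return re.sub(r"\s+", " ", cleaned).strip()


def build_key(author: str, dates: str, title: str, nonfiling: int = 0) -> str:
    """Form an author/title key shaped like the examples in slide 28."""
    title_part = normalize(title[nonfiling:])
    if not author:
        # Title main entry: the key starts with '/'.
        return "/" + title_part
    author_part = normalize(author)
    if dates:
        author_part += "\\" + normalize(dates)
    return author_part + "/" + title_part
```

Under these assumptions, `build_key("Dickens, Charles,", "1812-1870.", "A Tale of Two Cities", nonfiling=2)` yields `dickens, charles\1812 1870/tale of two cities`. Because the keys are plain strings, they can be sorted and compared directly.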

  9. Authority Files Rule! • Authors • Author/titles • Bring together variations • Allow override in difficult cases • Both splitting and joining groups • Especially important with xISBN matching • Especially important with non-English metadata

  10. Limitations of the Authority File • What’s missing: • Many uniform titles • Many author variants • Many title variants • Language of heading • Partial solution • Create auxiliary files of mechanically generated matches

  11. Results of FRBR Matching on WorldCat • 88% of works are ‘singletons’ (a single manifestation) • 30% of manifestations are in 12% of the works • Average size of multiple matches: 3.1 manifestations/work • 54 million manifestations in 43.1 million works • 54% of holdings on a FRBR work with >1 manifestation • WorldCat manifestations average about 20 holdings • FRBR helps where help is most needed
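The figures on slide 11 can be cross-checked with a little arithmetic. The sketch below assumes that a "singleton" is a work with exactly one manifestation and that the 3.1 average covers only works with more than one manifestation; both readings are assumptions, not statements from the slides. Under them, the 88% figure comes out as a share of works.

```python
works = 43.1e6           # total works, from the slide
manifestations = 54e6    # total manifestations, from the slide
avg_multi = 3.1          # manifestations per multi-manifestation work

# Two equations in two unknowns:
#   singletons + multi             = works
#   singletons + avg_multi * multi = manifestations
multi = (manifestations - works) / (avg_multi - 1)
singletons = works - multi

singleton_share_of_works = singletons / works
multi_share_of_works = multi / works
multi_share_of_manifestations = (manifestations - singletons) / manifestations

print(f"{singleton_share_of_works:.0%} of works are singletons")
print(f"{multi_share_of_manifestations:.0%} of manifestations sit in "
      f"{multi_share_of_works:.0%} of works")
```

The three shares come out at about 88%, 12% and 30%, which agrees with the percentages quoted on the slide.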

  12. More FRBR Results • 310,000 works have more than 5 manifestations • 1.7 million have more than 2 manifestations • Largest: 30,000+ for the Bible • 1,537 Shakespeare’s Macbeth • 1,026 Dickens’s Christmas Carol

  13. The Top 10 Works by Holdings

  14. The Top 10 Works Cataloged in 2003

  15. Top 1000 Publication Dates

  16. Top 1000 Languages

  17. Our Beowulf Cluster • 24 Nodes • Each with 2x2.6 GHz processors • 4 GBytes memory (96 GBytes total) • One ‘head’ node, 23 ‘compute’ nodes • 46x40 GBytes disk (~2 Terabytes total) • Gigabit switch

  18. What we are using it for • All our bibliographic processing • FRBR • Extractions • Searching • Matching

  19. Ganglia load visualization

  20. Starting point • FRBR key generation • 25 hours on a 3.00GHz workstation with 2GB of RAM • Generate two key files • sort by key, uniq by key, sort by occurrence • sort by key, post processing on keys, uniq by key, sort by occurrence • Merge key files
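The "sort by key, uniq by key, sort by occurrence" step has a simple in-memory analogue, sketched here for illustration. It assumes the keys fit in memory, which would not hold at WorldCat scale; the function name is hypothetical.

```python
from collections import Counter


def count_keys(keys):
    """Collapse duplicate keys and rank them by how often they occur."""
    counts = Counter(keys)
    # Most frequent first; ties broken by key so the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Each resulting pair is a FRBR key plus the number of records that produced it, so the largest worksets appear at the top.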

  21. FRBR on the Cluster • 44 minutes on the cluster • 69 key builders & 23 sort buckets with hyperthreading ON • Generate 23 radix-sorted, post-processed key files • Collapse and sort by occurrence in parallel • Also outputs additional files used by other jobs
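The slide does not say how keys are assigned to the 23 sort buckets. One possible scheme, shown purely as an illustration, partitions keys by their leading character so that each bucket can be sorted and collapsed independently and the buckets, read in order, are already globally sorted. The function names and the bucket rule are assumptions.

```python
from collections import Counter

N_BUCKETS = 23  # number of sort buckets quoted on the slide


def bucket_for(key: str, n_buckets: int = N_BUCKETS) -> int:
    """Map a key to a bucket by its first character (hypothetical scheme)."""
    if not key:
        return 0
    # Monotonic in the first character, so bucket order follows sort order.
    return min(ord(key[0]) * n_buckets // 256, n_buckets - 1)


def radix_collapse(keys, n_buckets: int = N_BUCKETS):
    """Partition keys, then sort and count each bucket independently."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[bucket_for(key, n_buckets)].append(key)
    # Each bucket could be handled by a separate node; here it is sequential.
    return [sorted(Counter(bucket).items()) for bucket in buckets]
```

For scale, going from 25 hours (1,500 minutes) on the workstation to 44 minutes on the cluster is roughly a 34-fold reduction in elapsed time.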

  22. Application: Preservation • Identify ‘final copy’ items • Do it at the work level • Single-singles • Single manifestations with single holding • Found 18 million in WorldCat
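The "single-single" test is easy to state in code. The sketch below is a minimal illustration that assumes records arrive as `(work_key, manifestation_id, holdings)` tuples; that record shape and the function name are hypothetical.

```python
from collections import defaultdict


def single_singles(records):
    """Yield work keys with exactly one manifestation held by one library.

    `records` is an iterable of (work_key, manifestation_id, holdings) tuples;
    this record shape is an assumption made for the sketch.
    """
    by_work = defaultdict(list)
    for work_key, manifestation_id, holdings in records:
        by_work[work_key].append((manifestation_id, holdings))
    for work_key, manifestations in by_work.items():
        if len(manifestations) == 1 and manifestations[0][1] == 1:
            yield work_key
```

Checking at the work level matters: a manifestation with one holding is not a "final copy" candidate if its work has other manifestations.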

  23. Application: xISBN • A simple Web service • Given an ISBN: • Identify the workset it is in • Return all other ISBNs in that workset • Results should be symmetrical! • Same group retrieved for each ISBN in group • ISBNs sorted by number of library holdings
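The symmetry requirement falls out naturally if every ISBN in a workset points at the same ranked list. The sketch below illustrates that idea; the data structures and function names are assumptions, and it assumes each ISBN belongs to a single workset. Since the example response on the next slide includes the queried ISBN, this sketch returns the whole group rather than only the other members.

```python
def build_index(worksets, holdings):
    """Map every ISBN to its workset's ISBNs, ordered by holdings.

    `worksets` is a list of ISBN lists and `holdings` maps ISBN -> count;
    both structures are assumptions made for this sketch.
    """
    index = {}
    for isbns in worksets:
        # Most widely held first; ties broken by ISBN for a stable order.
        ranked = sorted(set(isbns), key=lambda i: (-holdings.get(i, 0), i))
        for isbn in ranked:
            index[isbn] = ranked  # every member shares the same list
    return index


def xisbn(index, isbn):
    """Return the ISBNs in the same workset, or an empty list if unknown."""
    return index.get(isbn.replace("-", ""), [])
```

Because all members of a workset share one list, a lookup on any of them returns the same group in the same order.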

  24. xISBN Example http://labs.oclc.org/xisbn/0-19-281664-0 returns: <?xml version="1.0" encoding="UTF-8" ?> <idlist> <isbn>0192816640</isbn> <isbn>0820312037</isbn> <isbn>0820315370</isbn> <isbn>0393015920</isbn> <isbn>0393952274</isbn> <isbn>0393952835</isbn> <isbn>0140430210</isbn> <isbn>0192811320</isbn> <isbn>0192835947</isbn> <isbn>0460872885</isbn> <isbn>1853262706</isbn> <isbn>0874131219</isbn> </idlist>
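A client could read a response of this shape with the Python standard library. The sketch below parses an abbreviated copy of the example (first three ISBNs only) from a literal rather than calling the service, since the availability of that URL is not something the slides guarantee; `parse_idlist` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example response from the slide.
RESPONSE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
<isbn>0192816640</isbn>
<isbn>0820312037</isbn>
<isbn>0820315370</isbn>
</idlist>"""


def parse_idlist(payload: bytes) -> list:
    """Extract the ISBN strings from an <idlist> document, in order."""
    root = ET.fromstring(payload)
    return [el.text.strip() for el in root.findall("isbn") if el.text]


print(parse_idlist(RESPONSE))
```

The list order is preserved, so the first entry is the ISBN the service ranked highest.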

  25. Matching on ISBNs • ISBN additional information beyond Author/Title • Allows relaxation of matching • Introduces possible errors • Offers the possibility of substantial improvement of work matching

  26. Merging Worksets Using ISBN Matches • Pair ISBNs with FRBR keys (Starts with 10 million ISBNs) • Throw out ISBNs in single worksets • Throw out ISBNs in > 5 worksets (We now have 561,000 ISBNs left) • Are the titles similar enough? • Throw out large groups • Try to be very conservative • Authority file always overrides other matching
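The first two filters on slide 26 can be sketched as below. This is an illustration only: it covers the "single workset" and "more than 5 worksets" rules, and leaves out the title-similarity check, the large-group filter, and the authority-file override. The function name and input shape are assumptions.

```python
from collections import defaultdict

MAX_WORKSETS = 5  # ISBNs linked to more worksets than this are discarded


def candidate_merges(pairs, max_worksets: int = MAX_WORKSETS):
    """Return {isbn: set of FRBR keys} for ISBNs spanning 2..max worksets.

    `pairs` is an iterable of (isbn, frbr_key) tuples.
    """
    keys_by_isbn = defaultdict(set)
    for isbn, key in pairs:
        keys_by_isbn[isbn].add(key)
    return {
        isbn: keys
        for isbn, keys in keys_by_isbn.items()
        if 1 < len(keys) <= max_worksets
    }
```

An ISBN that survives both filters links a small number of worksets, which makes it a candidate for the more conservative checks that follow.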

  27. Matches from ISBN Matching • 74,000 author variants • ~200,000 title variants • These all create additional cross reference records • Automatically folded into FRBR matching • Kept separate from NACO file • Only used in research at this time

  28. Examples of Possible Matches • /mcgraw hill encyclopedia of science & technology • /mcgraw hill encyclopedia of science & technology\1\aar aor • /mcgraw hill encyclopedia of science & technology\2\apa boo • /mcgraw hill encyclopedia of science & technology\3\bor cle • /mcgraw hill encyclopedia of science & technology\4\cli cyt • … • dickens, charles\1812 1870/tale of two cities • dickens, charles\1812 1870/hard times • dickens, charles\1812 1870/sketches by boz • dickens, charles\1812 1870/martin chuzzlewit • dickens, charles\1812 1870/bleak house • dickens, charles\1812 1870/little dorrit • dickens, charles\1812 1870/oliver twist • …

  29. Application: Bookmarklets

  30. Clicking on Princeton

  31. FictionFinder • Indexes fiction from WorldCat • Uses FRBR workset algorithm • Focused on fiction • Searching and browsing by • Genre • Fictitious Characters • Imaginary Places • Literary Forms • Links to • Google • Open WorldCat • Diane Vizine-Goetz’s project

  32. ‘Humphry Clinker’ Search

  33. Work Display

  34. Detail of Language Display

  35. First Few English Manifestations

  36. Manifestation Display

  37. Open WorldCat Link

  38. Additional Matches • Match variant titles: • When the wind blows • When the wind blows: a novel • FictionFinder identified 10,000 of similar variations • novela, novella, roman, … • Created auxiliary authority records • Now automatically used when FRBR algorithm is run
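One way to picture this kind of variant matching is to strip a trailing generic subtitle before comparing titles. The sketch below is a hypothetical illustration, not FictionFinder's method; the term list is seeded from the words named on the slide.

```python
# Hypothetical list, seeded from the terms named on the slide.
GENERIC_SUBTITLES = ("a novel", "novela", "novella", "roman")


def strip_generic_subtitle(title: str) -> str:
    """Drop a trailing generic subtitle such as ': a novel'."""
    head, sep, tail = title.rpartition(":")
    if sep and tail.strip().lower().rstrip(".") in GENERIC_SUBTITLES:
        return head.strip()
    return title.strip()
```

With this rule, "When the wind blows" and "When the wind blows: a novel" reduce to the same string, while a meaningful subtitle is left untouched.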

  39. Future • Continued development of FictionFinder • Extending algorithm to serials? • FirstSearch displays • Additional matching criteria • Local authority files? • Integration of auxiliary files for production? • Exploring FRBRizing some European catalogs • Looking at extending beyond Roman characters

  40. Links • IFLA FRBR - Final Report • http://www.ifla.org/VII/s13/frbr/frbr.htm • Article in DLib • http://www.dlib.org/dlib/september02/hickey/09hickey.html • OCLC Research Activities with FRBR • http://www.oclc.org/research/projects/frbr/ • FictionFinder • http://fictionfinder.oclc.org/ • Top 1000 • http://www.oclc.org/research/top1000/
