FRBR: Algorithms and Applications

FRBR: Algorithms and Applications. T. Hickey, J. Toves, D. Vizine-Goetz, Online Computer Library Center. CLA, November 2004.


FRBR: Algorithms and Applications

T. Hickey

J. Toves

D. Vizine-Goetz

Online Computer Library Center

CLA November 2004

Outline
  • Algorithms
    • FRBR work matching
    • Handling author-title variants
  • Hardware
    • Beowulf cluster
  • Applications
    • Bookmarklets
    • FictionFinder
  • Future directions
Working with Group 1 Entities

WEMI: Work, Expression, Manifestation, Item

  • Strict expression-level determination is hard
    • We primarily divide by language
  • Manifestation is easier
    • We use the WorldCat master record
Work Identification
  • Algorithm goals:
    • Efficient
    • Understandable
    • Controllable by catalogers
    • Uses existing WorldCat records
The Algorithm
  • A key is generated for each record
  • Extract author, title
    • Look up in LC name authority file
    • Added entry information as needed
  • Form a key from bibliographic record
    • Author, title, added entry information
    • These can be sorted, compared
More Detail
  • Extract author names
    • Look up in authority file
      • Currently only personal names
      • Subfields $abcdq
  • Extract title
    • Always use uniform titles if present
    • Look up author/short title (~$a)
    • Look up author/long title (~$abfgnp)
    • Prefer alternative title for non-English
  • Create key from author/title
    • Always do NACO normalization (has limitations)
    • Add information for uncontrolled title-main-entry
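
The key-building steps above can be illustrated with a minimal sketch. It assumes the key layout seen in the example keys later in this talk (author, a backslash, dates, a slash, then title). The `normalize` function is only a rough stand-in for NACO normalization, not the full rule set, and the function names are hypothetical. Authority-file lookup and removal of leading articles are left out.

```python
import re
import unicodedata

def normalize(text):
    """Rough stand-in for NACO normalization (NOT the full rule set)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase; keep letters, digits, commas and '&'; other punctuation becomes a space.
    cleaned = re.sub(r"[^a-z0-9,& ]+", " ", stripped.lower())
    # Collapse runs of whitespace and trim stray spaces and commas at the ends.
    return re.sub(r"\s+", " ", cleaned).strip(" ,")

def work_key(author, dates, title):
    """Build a key laid out as: author, backslash, dates, slash, title."""
    title_part = normalize(title)
    if not author:
        # Title main entry: the key starts with the slash.
        return "/" + title_part
    return normalize(author) + "\\" + normalize(dates) + "/" + title_part

print(work_key("Dickens, Charles,", "1812-1870.", "Tale of Two Cities"))
print(work_key("", "", "McGraw-Hill Encyclopedia of Science & Technology"))
```

Because the keys are plain strings, records for the same work compare equal and sort next to each other, which is what the later sort-and-collapse steps rely on.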
Authority Files Rule!
  • Authors
  • Author/titles
  • Bring together variations
  • Allow override in difficult cases
    • Both splitting and joining groups
    • Especially important with xISBN matching
  • Especially important with non-English metadata
Limitations of the Authority File
  • What’s missing:
    • Many uniform titles
    • Many author variants
    • Many title variants
    • Language of heading
  • Partial solution
    • Create auxiliary files of mechanically generated matches
Results of FRBR Matching on WorldCat
  • 88% of works are ‘singletons’ (a single manifestation)
  • 30% of manifestations are in 12% of the works
  • Average size of multiple matches: 3.1 manifestations/work
  • 54 million manifestations group into 43.1 million works
  • 54% of holdings are on a FRBR work with more than one manifestation
  • WorldCat manifestations average about 20 holdings
  • FRBR helps where help is most needed
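
As a quick cross-check, the totals and percentages on this slide can be combined arithmetically. This is only a sketch built from the rounded figures quoted above, so small discrepancies are expected.

```python
works = 43.1e6           # total works
manifestations = 54e6    # total manifestations

multi_works = 0.12 * works             # works with more than one manifestation
multi_manifs = 0.30 * manifestations   # manifestations that fall in those works

# Average size of a multi-manifestation work; the slide quotes 3.1.
avg_size = multi_manifs / multi_works

# Whatever is left over should pair up one work to one manifestation.
singleton_works = works - multi_works
singleton_manifs = manifestations - multi_manifs

# Rough total holdings, using the "about 20 holdings" average.
total_holdings = manifestations * 20

print(round(avg_size, 2), round(singleton_works / 1e6, 1), round(singleton_manifs / 1e6, 1))
```

The leftover works and leftover manifestations agree to within about half a percent, which is what one would expect if every remaining work has exactly one manifestation.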
More FRBR Results
  • 310,000 works have more than 5 manifestations
  • 1.7 million have more than 2 manifestations
  • Largest: 30,000+ for the Bible
  • 1,537 for Shakespeare’s Macbeth
  • 1,026 for Dickens’s Christmas Carol
Our Beowulf Cluster
  • 24 Nodes
    • Each with 2x2.6 GHz processors
    • 4 GBytes memory (96 GBytes total)
  • One ‘head’ node, 23 ‘compute’ nodes
  • 46x40 GBytes disk (~2 Terabytes total)
  • Gigabit switch
What we are using it for
  • All our bibliographic processing
    • FRBR
    • Extractions
    • Searching
    • Matching
Starting point
  • FRBR key generation
  • 25 hours on a 3.00GHz workstation with 2GB of RAM
  • Generate two key files
    • sort by key, uniq by key, sort by occurrence
    • sort by key, post processing on keys, uniq by key, sort by occurrence
  • Merge key files
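
The "sort by key, uniq by key, sort by occurrence" step can be sketched in a few lines. This is an illustrative in-memory version with a hypothetical function name; the job described here worked on key files, not Python lists.

```python
from collections import Counter

def collapse_keys(keys):
    """In-memory equivalent of: sort by key, uniq with counts, sort by occurrence."""
    counts = Counter(keys)
    # Most frequent keys first; ties are broken by key so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

sample = ["b", "a", "b", "c", "a", "b"]
print(collapse_keys(sample))
```

Each distinct key in the output stands for one work, and its count is the number of manifestations matched to it.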
FRBR on the Cluster
  • 44 minutes on the cluster
  • 69 key builders & 23 sort buckets with hyperthreading ON
  • Generate 23 radix-sorted, post-processed key files
  • Collapse and sort by occurrence in parallel
  • Also outputs additional files used by other jobs
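
A minimal sketch of the bucketed approach follows. It swaps in a stable hash (CRC32) where the cluster job used a radix on the key, and a thread pool where the cluster used separate nodes; both are assumptions made to keep the example self-contained. The property that matters is that identical keys always land in the same bucket, so each bucket can be collapsed independently.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

N_BUCKETS = 23  # one sort bucket per compute node, as on the slide

def bucket_of(key):
    # Assumption: a stable hash stands in for the radix used on the cluster.
    return zlib.crc32(key.encode("utf-8")) % N_BUCKETS

def collapse_bucket(keys):
    # Count duplicates within one bucket, then order by occurrence.
    return sorted(Counter(keys).items(), key=lambda kv: (-kv[1], kv[0]))

def collapse_in_parallel(keys):
    buckets = [[] for _ in range(N_BUCKETS)]
    for key in keys:
        buckets[bucket_of(key)].append(key)
    # Threads only illustrate the structure; real speedup needs separate processes or nodes.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(collapse_bucket, buckets))

result = collapse_in_parallel(["a/x", "b/y", "a/x", "c/z", "a/x", "b/y"])
print([bucket for bucket in result if bucket])
```

For scale, 25 hours is 1,500 minutes, so the 44-minute cluster run is roughly a 34-fold reduction in elapsed time.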
Application: Preservation
  • Identify ‘final copy’ items
  • Do it at the work level
  • Single-singles
    • Single manifestations with single holding
    • Found 18 million in WorldCat
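
The single-single test can be sketched as below. The record layout (work key, manifestation id, holdings count) and the function name are hypothetical, chosen only to illustrate the rule.

```python
from collections import defaultdict

def single_singles(records):
    """records: iterable of (work_key, manifestation_id, holdings_count)."""
    by_work = defaultdict(list)
    for work_key, manif_id, holdings in records:
        by_work[work_key].append((manif_id, holdings))
    # Keep works with exactly one manifestation that is held by exactly one library.
    return sorted(
        manifs[0][0]
        for manifs in by_work.values()
        if len(manifs) == 1 and manifs[0][1] == 1
    )

records = [
    ("w1", "m1", 1),   # one manifestation, one holding: a single-single
    ("w2", "m2", 1),   # the work has a second manifestation, so not a single-single
    ("w2", "m3", 4),
    ("w3", "m4", 7),   # one manifestation, but widely held
]
print(single_singles(records))
```

Grouping by work first is what makes this a work-level check: a manifestation with a single holding only counts if no other manifestation of the same work exists.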
Application: xISBN
  • A simple Web service
  • Given an ISBN:
    • Identify the workset it is in
    • Return all other ISBNs in that workset
  • Results should be symmetrical!
    • Same group retrieved for each ISBN in group
  • ISBNs sorted by number of library holdings
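
The lookup described above can be sketched with two dictionaries. The data layout, function names, and holdings counts are invented for illustration; only the ISBNs are taken from this talk. The sketch returns the whole group, including the ISBN that was asked for, which is one simple way to guarantee symmetrical results.

```python
def build_index(worksets):
    """worksets: dict of workset id -> list of (isbn, holdings) pairs."""
    index = {}
    for ws_id, members in worksets.items():
        for isbn, _holdings in members:
            index[isbn] = ws_id
    return index

def xisbn(isbn, worksets, index):
    ws_id = index.get(isbn)
    if ws_id is None:
        return []
    # Most widely held ISBN first; ties are broken by ISBN for a stable order.
    members = sorted(worksets[ws_id], key=lambda m: (-m[1], m[0]))
    return [member_isbn for member_isbn, _holdings in members]

# Holdings counts below are made up for the example.
worksets = {"ws1": [("0820312037", 120), ("0192816640", 900), ("0140430210", 450)]}
index = build_index(worksets)
print(xisbn("0140430210", worksets, index))
```

Since every member of a workset maps to the same workset id, any ISBN in the group retrieves the same list in the same order.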
xISBN Example

http://labs.oclc.org/xisbn/0-19-281664-0 returns:

<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
  <isbn>0393015920</isbn>
  <isbn>0393952274</isbn>
  <isbn>0393952835</isbn>
  <isbn>0140430210</isbn>
  <isbn>0192811320</isbn>
  <isbn>0192835947</isbn>
  <isbn>0460872885</isbn>
  <isbn>1853262706</isbn>
  <isbn>0874131219</isbn>
</idlist>

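A client only needs to pull the `isbn` elements out of a response like this one. The sketch below parses a shortened copy of the response with Python's standard library. It does not call the service, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response shown in this talk.
RESPONSE = """<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
<isbn>0192816640</isbn>
<isbn>0820312037</isbn>
<isbn>0820315370</isbn>
</idlist>"""

def parse_idlist(xml_text):
    # Parse from bytes so the encoding declaration in the document is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [el.text.strip() for el in root.iter("isbn") if el.text]

print(parse_idlist(RESPONSE))
```

The list order is preserved, so the first entry remains the one the service ranked highest.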
Matching on ISBNs
  • ISBNs provide additional information beyond author/title
    • Allows relaxation of matching
    • Introduces possible errors
  • Offers the possibility of substantial improvement of work matching
Merging Worksets Using ISBN Matches
  • Pair ISBNs with FRBR keys (starts with 10 million ISBNs)
  • Throw out ISBNs in single worksets
  • Throw out ISBNs in > 5 worksets (561,000 ISBNs are left at this point)
  • Are the titles similar enough?
  • Throw out large groups
  • Try to be very conservative
  • Authority file always overrides other matching
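
The filtering steps above can be sketched as follows. The similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are assumptions for illustration, since the talk does not say how "similar enough" was decided. The large-group filter and the authority-file override are left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def candidate_merges(pairs, max_worksets=5, min_similarity=0.8):
    """pairs: iterable of (isbn, frbr_key). Returns sorted key pairs proposed for merging."""
    keys_by_isbn = defaultdict(set)
    for isbn, key in pairs:
        keys_by_isbn[isbn].add(key)

    proposals = set()
    for keys in keys_by_isbn.values():
        # An ISBN in one workset has nothing to merge; one in too many is suspect.
        if len(keys) < 2 or len(keys) > max_worksets:
            continue
        ordered = sorted(keys)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                # Compare only the title part of each key (the text after the slash).
                title_a = first.split("/", 1)[-1]
                title_b = second.split("/", 1)[-1]
                if SequenceMatcher(None, title_a, title_b).ratio() >= min_similarity:
                    proposals.add((first, second))
    return sorted(proposals)

pairs = [
    ("isbn-1", "/when the wind blows"),
    ("isbn-1", "/when the wind blows a novel"),
    ("isbn-2", "dickens, charles\\1812 1870/tale of two cities"),
    ("isbn-2", "dickens, charles\\1812 1870/hard times"),
]
print(candidate_merges(pairs))
```

Raising `min_similarity` makes the merge more conservative, at the cost of missing some genuine title variants.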
Matches from ISBN Matching
  • 74,000 author variants
  • ~200,000 title variants
  • These all create additional cross reference records
  • Automatically folded into FRBR matching
  • Kept separate from NACO file
    • Only used in research at this time
Examples of Possible Matches
  • /mcgraw hill encyclopedia of science & technology
  • /mcgraw hill encyclopedia of science & technology\1\aar aor
  • /mcgraw hill encyclopedia of science & technology\2\apa boo
  • /mcgraw hill encyclopedia of science & technology\3\bor cle
  • /mcgraw hill encyclopedia of science & technology\4\cli cyt
  • dickens, charles\1812 1870/tale of two cities
  • dickens, charles\1812 1870/hard times
  • dickens, charles\1812 1870/sketches by boz
  • dickens, charles\1812 1870/martin chuzzlewit
  • dickens, charles\1812 1870/bleak house
  • dickens, charles\1812 1870/little dorrit
  • dickens, charles\1812 1870/oliver twist
FictionFinder
  • Indexes fiction from WorldCat
  • Uses FRBR workset algorithm
  • Focused on fiction
  • Searching and browsing by
    • Genre
    • Fictitious Characters
    • Imaginary Places
    • Literary Forms
  • Links to
    • Google
    • Open WorldCat
  • Diane Vizine-Goetz’s project
Additional Matches
  • Match variant titles:
    • When the wind blows
    • When the wind blows: a novel
  • FictionFinder identified 10,000 similar variations
    • novela, novella, roman, …
  • Created auxiliary authority records
  • Now automatically used when FRBR algorithm is run
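
One way to detect this kind of variant is to strip a trailing genre subtitle before comparing titles. The sketch below is illustrative only: the word list contains just the terms named on this slide plus "novel", and the function name is hypothetical. It requires punctuation before the genre word, so a title that merely ends in one of these words is left alone.

```python
import re

# Assumption: a tiny illustrative list; the real set of variations was much larger.
GENRE_WORDS = ("novel", "novela", "novella", "roman")

# A colon, semicolon or comma, an optional "a"/"an", then a genre word at the end.
_SUFFIX = re.compile(
    r"\s*[:;,]\s*(?:an?\s+)?(?:%s)\s*$" % "|".join(GENRE_WORDS),
    re.IGNORECASE,
)

def strip_genre_subtitle(title):
    return _SUFFIX.sub("", title).strip()

print(strip_genre_subtitle("When the wind blows: a novel"))
print(strip_genre_subtitle("When the wind blows"))
```

Both calls print the same title, so the two records would end up with the same key.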
Future
  • Continued development of FictionFinder
  • Extending algorithm to serials?
  • FirstSearch displays
  • Additional matching criteria
  • Local authority files?
  • Integration of auxiliary files for production?
  • Exploring FRBRizing some European catalogs
  • Looking at extending beyond Roman characters
Links
  • IFLA FRBR - Final Report
    • http://www.ifla.org/VII/s13/frbr/frbr.htm
  • Article in D-Lib Magazine
    • http://www.dlib.org/dlib/september02/hickey/09hickey.html
  • OCLC Research Activities with FRBR
    • http://www.oclc.org/research/projects/frbr/
  • FictionFinder
    • http://fictionfinder.oclc.org/
  • Top 1000
    • http://www.oclc.org/research/top1000/