Frbr algorithms and applications


FRBR: Algorithms and Applications. T. Hickey J. Toves D. Vizine-Goetz Online Computer Library Center CLA November 2004. Outline. Algorithms FRBR work matching Handling author-title variants Hardware Beowulf cluster Applications Bookmarklets FictionFinder Future directions.


FRBR: Algorithms and Applications

T. Hickey

J. Toves

D. Vizine-Goetz

Online Computer Library Center

CLA November 2004


Outline

  • Algorithms

    • FRBR work matching

    • Handling author-title variants

  • Hardware

    • Beowulf cluster

  • Applications

    • Bookmarklets

    • FictionFinder

  • Future directions


Working with Group 1 Entities

WEMI:

Work

Expression

Manifestation

Item

  • Strict expression-level determination is hard

    • We primarily divide by language

  • Manifestation is easier

    • We use the WorldCat master record


Work Identification

  • Algorithm goals:

    • Efficient

    • Understandable

    • Controllable by catalogers

    • Uses existing WorldCat records


The Algorithm

  • A key is generated for each record

  • Extract author, title

    • Look up in LC name authority file

    • Added entry information as needed

  • Form a key from bibliographic record

    • Author, title, added entry information

    • These can be sorted, compared




More Detail

  • Extract author names

    • Look up in authority file

      • Currently only personal names

      • Subfields $abcdq

  • Extract title

    • Always use uniform titles if present

    • Look up author/short title (~$a)

    • Look up author/long title (~$abfgnp)

    • Prefer alternative title for non-English

  • Create key from author/title

    • Always do NACO normalization (has limitations)

    • Add information for uncontrolled title-main-entry
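
The key-building steps above can be sketched roughly as follows. This is a minimal illustration, not OCLC's implementation: `normalize` is a much-simplified stand-in for NACO normalization, and the function names, the `nonfiling` parameter, and the `author\dates/title` layout (modelled on the sample keys shown later in this presentation) are assumptions.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Simplified stand-in for NACO normalization (an assumption, not the full rules)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, then turn every run of characters other than
    # letters, digits, ',' and '&' into a single space.
    cleaned = re.sub(r"[^a-z0-9,&]+", " ", stripped.lower())
    return cleaned.strip()


def work_key(author: str, dates: str, title: str, nonfiling: int = 0) -> str:
    """Build a sortable author/title key; `nonfiling` skips a leading article."""
    author_part = normalize(author)
    if dates:
        author_part += "\\" + normalize(dates)
    return author_part + "/" + normalize(title[nonfiling:])


print(work_key("Dickens, Charles", "1812-1870.", "A Tale of Two Cities", nonfiling=2))
print(work_key("", "", "McGraw-Hill encyclopedia of science & technology"))
```

Because the keys are plain strings, records that share a key can be grouped simply by sorting and comparing adjacent entries.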


Authority Files Rule!

  • Authors

  • Author/titles

  • Bring together variations

  • Allow override in difficult cases

    • Both splitting and joining groups

    • Especially important with xISBN matching

  • Especially important with non-English metadata
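
As a rough illustration of how an authority file can bring variant headings together while still allowing manual overrides, consider the sketch below. The dictionaries, the `resolve_heading` function, and the sample entries are hypothetical; real authority records carry far more structure than a flat lookup table.

```python
# Variant form -> established heading (toy stand-in for an authority file).
AUTHORITY = {
    "clemens, samuel langhorne": "twain, mark",
    "twain, mark": "twain, mark",
}

# Hypothetical cataloger-supplied overrides; these take precedence.
OVERRIDES = {}


def resolve_heading(name, authority=AUTHORITY, overrides=OVERRIDES):
    """Return the established form of a normalized name, if one is known."""
    if name in overrides:
        return overrides[name]
    # Unknown names pass through unchanged.
    return authority.get(name, name)


print(resolve_heading("clemens, samuel langhorne"))
```

Checking the override table first is what lets a cataloger split or join groups that the automatic lookup would otherwise get wrong.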


Limitations of the Authority File

  • What’s missing:

    • Many uniform titles

    • Many author variants

    • Many title variants

    • Language of heading

  • Partial solution

    • Create auxiliary files of mechanically generated matches


Results of FRBR Matching on WorldCat

  • 88% of works are ‘singletons’ (a single manifestation)

  • 30% of manifestations are in 12% of the works

  • Average size of multiple matches: 3.1 manifestations/work

  • 43.1 million works from 54 million manifestations

  • 54% of holdings on a FRBR work with >1 manifestation

  • WorldCat manifestations average about 20 holdings

  • FRBR helps where help is most needed
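
A quick arithmetic check shows how these figures relate to one another. It assumes that the 12% refers to works with more than one manifestation, so that the remaining 88% of works hold exactly one manifestation each; that reading is an interpretation of the slide, not something it states outright.

```python
works = 43.1e6           # FRBR works (from the slide)
manifestations = 54e6    # WorldCat manifestations (from the slide)
multi_share = 0.12       # share of works with >1 manifestation (assumed reading)

multi_works = multi_share * works
singleton_works = works - multi_works
# Each singleton work accounts for exactly one manifestation.
multi_manifestations = manifestations - singleton_works

share_in_multi = multi_manifestations / manifestations
avg_per_multi = multi_manifestations / multi_works

print(f"{share_in_multi:.0%} of manifestations are in multi-manifestation works")
print(f"{avg_per_multi:.1f} manifestations per multi-manifestation work")
```

Under that assumption the calculation comes out at about 30% and about 3.1, in line with the figures quoted above.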


More FRBR Results

  • 310,000 works have more than 5 manifestations

  • 1.7 million have more than 2 manifestations

  • Largest: 30,000+ for the Bible

  • 1,537 for Shakespeare’s Macbeth

  • 1,026 for Dickens’s A Christmas Carol






Our Beowulf Cluster

  • 24 Nodes

    • Each with 2x2.6 GHz processors

    • 4 GBytes memory (96 GBytes total)

  • One ‘head’ node, 23 ‘compute’ nodes

  • 46x40 GBytes disk (~2 Terabytes total)

  • Gigabit switch


What we are using it for

  • All our bibliographic processing

    • FRBR

    • Extractions

    • Searching

    • Matching



Starting point

  • FRBR key generation

  • 25 hours on a 3.00GHz workstation with 2GB of RAM

  • Generate two key files

    • sort by key, uniq by key, sort by occurrence

    • sort by key, post processing on keys, uniq by key, sort by occurrence

  • Merge key files
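
The "sort by key, uniq by key, sort by occurrence" step is the classic sort/uniq/sort pipeline. A minimal in-memory sketch is shown below; the function name `collapse_keys` is hypothetical, and a real run over tens of millions of keys would work on files rather than Python lists.

```python
from collections import Counter


def collapse_keys(keys):
    """Count identical keys, then order by descending occurrence (ties by key)."""
    counts = Counter(keys)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = [
    "dickens, charles\\1812 1870/hard times",
    "dickens, charles\\1812 1870/bleak house",
    "dickens, charles\\1812 1870/hard times",
]
for key, occurrences in collapse_keys(sample):
    print(occurrences, key)
```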


FRBR on the Cluster

  • 44 minutes on the cluster

  • 69 key builders & 23 sort buckets with hyperthreading ON

  • Generate 23 radix-sorted, post-processed key files

  • Collapse and sort by occurrence in parallel

  • Also outputs additional files used by other jobs
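
One way to obtain independently sortable buckets is to partition keys by their leading byte, so that every key in bucket i sorts before every key in bucket i+1. The sketch below illustrates that idea only; the slides do not describe how the cluster job actually assigns keys to buckets, and `bucket_for` and `radix_partition` are hypothetical names. The bucket count of 23 is taken from the slide.

```python
def bucket_for(key: str, n_buckets: int = 23) -> int:
    """Map a key to a bucket using its first UTF-8 byte (monotonic in sort order)."""
    first_byte = key.encode("utf-8")[0] if key else 0
    return first_byte * n_buckets // 256


def radix_partition(keys, n_buckets: int = 23):
    """Distribute keys into buckets that can each be sorted on a separate node."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[bucket_for(key, n_buckets)].append(key)
    return buckets


keys = ["twain, mark/tom sawyer", "/mcgraw hill", "dickens, charles/hard times"]
# Sorting each bucket separately and concatenating yields a globally sorted list.
merged = [key for bucket in radix_partition(keys) for key in sorted(bucket)]
print(merged == sorted(keys))
```

A real job would also need to balance bucket sizes, since keys are far from uniformly distributed over leading characters.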


Application: Preservation

  • Identify ‘final copy’ items

  • Do it at the work level

  • Single-singles

    • Single manifestations with single holding

    • Found 18 million in WorldCat
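
Once manifestations are grouped into works, picking out "single-singles" is a simple filter. The sketch below assumes a hypothetical mapping from work key to the holdings count of each manifestation in that work; the data shown is invented for illustration.

```python
def single_singles(works: dict) -> list:
    """Return keys of works with exactly one manifestation held by exactly one library."""
    return [
        key
        for key, holdings_per_manifestation in works.items()
        if len(holdings_per_manifestation) == 1 and holdings_per_manifestation[0] == 1
    ]


# Invented toy data: work key -> holdings count per manifestation.
toy_works = {
    "work-a": [1],        # one manifestation, one holding: a single-single
    "work-b": [1, 4],     # two manifestations
    "work-c": [12],       # one manifestation, widely held
}
print(single_singles(toy_works))
```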


Application: xISBN

  • A simple Web service

  • Given an ISBN:

    • Identify the workset it is in

    • Return all other ISBNs in that workset

  • Results should be symmetrical!

    • Same group retrieved for each ISBN in group

  • ISBNs sorted by number of library holdings
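
The lookup described above can be sketched with two dictionaries. This is an illustration only, not the xISBN service's code: the function names and the toy records are hypothetical, and the sketch assumes each ISBN has already been assigned to exactly one workset, which is what makes the results symmetrical.

```python
from collections import defaultdict


def build_index(records):
    """records: iterable of (isbn, work_key, holdings) tuples."""
    groups = defaultdict(dict)   # work_key -> {isbn: total holdings}
    isbn_to_work = {}            # isbn -> work_key (assumes one workset per ISBN)
    for isbn, work_key, holdings in records:
        groups[work_key][isbn] = groups[work_key].get(isbn, 0) + holdings
        isbn_to_work[isbn] = work_key
    return groups, isbn_to_work


def xisbn(isbn, groups, isbn_to_work):
    """Return every ISBN in the same workset, most widely held first."""
    work_key = isbn_to_work.get(isbn)
    if work_key is None:
        return []
    members = groups[work_key]
    return sorted(members, key=lambda i: (-members[i], i))


# Invented toy records: (isbn, work_key, holdings).
toy_records = [("isbn-1", "work-a", 50), ("isbn-2", "work-a", 200), ("isbn-3", "work-b", 5)]
groups, isbn_to_work = build_index(toy_records)
print(xisbn("isbn-1", groups, isbn_to_work))
```

Because the result depends only on the workset, querying any member of a group returns the same list.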


xISBN Example

http://labs.oclc.org/xisbn/0-19-281664-0 returns:

<?xml version="1.0" encoding="UTF-8" ?>

<idlist>

<isbn>0192816640</isbn>

<isbn>0820312037</isbn>

<isbn>0820315370</isbn>

<isbn>0393015920</isbn>

<isbn>0393952274</isbn>

<isbn>0393952835</isbn>

<isbn>0140430210</isbn>

<isbn>0192811320</isbn>

<isbn>0192835947</isbn>

<isbn>0460872885</isbn>

<isbn>1853262706</isbn>

<isbn>0874131219</isbn>

</idlist>
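
A client could consume a response of this shape with the Python standard library, as sketched below. The helper names are hypothetical, the sample is the first three entries of the response above, and no network request is made, since the availability of the service URL is not something this sketch can assume.

```python
import string
import xml.etree.ElementTree as ET


def normalize_isbn(isbn: str) -> str:
    """Drop hyphens and spaces, keeping digits and a possible 'X' check character."""
    return "".join(c for c in isbn.upper() if c in string.digits + "X")


def isbn10_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 weighted checksum (weights 10 down to 1, modulo 11)."""
    chars = normalize_isbn(isbn)
    if len(chars) != 10 or "X" in chars[:9]:
        return False
    total = 0
    for position, char in enumerate(chars):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def parse_idlist(xml_bytes: bytes) -> list:
    """Extract the ISBN strings from an <idlist> response."""
    root = ET.fromstring(xml_bytes)
    return [element.text.strip() for element in root.iter("isbn") if element.text]


sample = b"""<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
<isbn>0192816640</isbn>
<isbn>0820312037</isbn>
<isbn>0820315370</isbn>
</idlist>"""

print(parse_idlist(sample))
print(normalize_isbn("0-19-281664-0"), isbn10_is_valid("0-19-281664-0"))
```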


Matching on ISBNs

  • ISBNs provide additional information beyond author/title

    • Allows relaxation of matching

    • Introduces possible errors

  • Offers the possibility of substantial improvement of work matching


Merging Worksets Using ISBN Matches

  • Pair ISBNs with FRBR keys

    (Starts with 10 million ISBNs)

  • Throw out ISBNs in single worksets

  • Throw out ISBNs in > 5 worksets

    (We now have 561,000 ISBNs left)

  • Are the titles similar enough?

  • Throw out large groups

  • Try to be very conservative

  • Authority file always overrides other matching
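
The filtering steps above can be sketched as follows. The slides do not say how title similarity is measured, so the use of `difflib.SequenceMatcher` and the 0.8 threshold are assumptions made purely for illustration, as are the function name and the toy data; the limit of 5 worksets comes from the slide.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def candidate_merges(pairs, max_worksets=5, min_similarity=0.8):
    """pairs: iterable of (isbn, frbr_key). Returns (isbn, key_a, key_b) candidates."""
    worksets = defaultdict(set)
    for isbn, key in pairs:
        worksets[isbn].add(key)

    candidates = []
    for isbn, keys in sorted(worksets.items()):
        # Drop ISBNs found in a single workset, or in suspiciously many.
        if not 2 <= len(keys) <= max_worksets:
            continue
        ordered = sorted(keys)
        for index, key_a in enumerate(ordered):
            for key_b in ordered[index + 1:]:
                # Compare only the title part of each author/title key.
                title_a = key_a.split("/", 1)[-1]
                title_b = key_b.split("/", 1)[-1]
                if SequenceMatcher(None, title_a, title_b).ratio() >= min_similarity:
                    candidates.append((isbn, key_a, key_b))
    return candidates


# Invented toy pairs of (isbn, frbr_key).
toy_pairs = [
    ("isbn-1", "/when the wind blows"),
    ("isbn-1", "/when the wind blows a novel"),
    ("isbn-2", "dickens, charles\\1812 1870/hard times"),
    ("isbn-2", "dickens, charles\\1812 1870/bleak house"),
    ("isbn-3", "/only one workset"),
]
print(candidate_merges(toy_pairs))
```

Erring on the side of rejecting a pair fits the "very conservative" goal: a missed merge leaves two worksets separate, while a wrong merge pollutes both.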


Matches from ISBN Matching

  • 74,000 author variants

  • ~200,000 title variants

  • These all create additional cross reference records

  • Automatically folded into FRBR matching

  • Kept separate from NACO file

    • Only used in research at this time


Examples of Possible Matches

  • /mcgraw hill encyclopedia of science & technology

  • /mcgraw hill encyclopedia of science & technology\1\aar aor

  • /mcgraw hill encyclopedia of science & technology\2\apa boo

  • /mcgraw hill encyclopedia of science & technology\3\bor cle

  • /mcgraw hill encyclopedia of science & technology\4\cli cyt

  • dickens, charles\1812 1870/tale of two cities

  • dickens, charles\1812 1870/hard times

  • dickens, charles\1812 1870/sketches by boz

  • dickens, charles\1812 1870/martin chuzzlewit

  • dickens, charles\1812 1870/bleak house

  • dickens, charles\1812 1870/little dorrit

  • dickens, charles\1812 1870/oliver twist




FictionFinder

  • Indexes fiction from WorldCat

  • Uses FRBR workset algorithm

  • Focused on fiction

  • Searching and browsing by

    • Genre

    • Fictitious Characters

    • Imaginary Places

    • Literary Forms

  • Links to

    • Google

    • Open WorldCat

  • Diane Vizine-Goetz’s project








Additional Matches

  • Match variant titles:

    • When the wind blows

    • When the wind blows: a novel

  • FictionFinder identified 10,000 similar variations

    • novela, novella, roman, …

  • Created auxiliary authority records

  • Now automatically used when FRBR algorithm is run
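
Variants of this kind could be detected by stripping a trailing genre qualifier before comparing titles. The sketch below is a hypothetical illustration rather than the FictionFinder code: the suffix list contains only the terms named above, and requiring a preceding colon, semicolon, or comma is an added assumption meant to avoid trimming titles that merely end in one of those words.

```python
import re

# Genre qualifiers named on the slide; a real list would be much longer.
GENRE_SUFFIXES = ("a novel", "novela", "novella", "roman")

_SUFFIX_RE = re.compile(
    r"\s*[:;,]\s*(?:" + "|".join(re.escape(s) for s in GENRE_SUFFIXES) + r")\s*$",
    re.IGNORECASE,
)


def strip_genre_suffix(title: str) -> str:
    """Remove a trailing ': a novel'-style qualifier, if present."""
    return _SUFFIX_RE.sub("", title).strip()


for title in ("When the wind blows", "When the wind blows: a novel"):
    print(strip_genre_suffix(title))
```

Titles that reduce to the same string are candidates for an auxiliary cross-reference record.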


Future

  • Continued development of FictionFinder

  • Extending algorithm to serials?

  • FirstSearch displays

  • Additional matching criteria

  • Local authority files?

  • Integration of auxiliary files for production?

  • Exploring FRBRizing some European catalogs

  • Looking at extending beyond Roman characters


Links

  • IFLA FRBR - Final Report

    • http://www.ifla.org/VII/s13/frbr/frbr.htm

  • Article in D-Lib Magazine

    • http://www.dlib.org/dlib/september02/hickey/09hickey.html

  • OCLC Research Activities with FRBR

    • http://www.oclc.org/research/projects/frbr/

  • FictionFinder

    • http://fictionfinder.oclc.org/

  • Top 1000

    • http://www.oclc.org/research/top1000/

