
Automated translation of LCSH

Automated translation of LCSH. Sirsi Unicorn API Summit 2004 Halifax Public Library October 17-18th, 2004. Benoit PAUWELS Université Libre de Bruxelles (ULB) Brussels. Automated translation of LCSH. Why? Overview of the solution Technically speaking … Unicorn configuration


Presentation Transcript


  1. Automated translation of LCSH Sirsi Unicorn API Summit 2004 Halifax Public Library October 17-18th, 2004 Benoit PAUWELS Université Libre de Bruxelles (ULB) Brussels

  2. Automated translation of LCSH
     • Why?
     • Overview of the solution
     • Technically speaking …
       • Unicorn configuration
       • Creation of an LCSH/RVM dictionary
       • AUTHTRAN report
       • Bulk translation of LCSH in the catalog
     • Future developments

  3. Why?
     • 1986: start of retrospective conversion of catalog cards through the RETROCON project with OCLC
       • introduction of LCSH
     • 1986 - political statement:
       • all bibliographic descriptions should be searchable through both the English and the French version of the corresponding LCSH subject entries
     • Possible solutions:
       • Duplication of English and French subject entries in the catalog record
       • Cross-referencing from French to English terms

  4. Why?
     • Cross-referencing is not good enough!
       • Impossible to navigate on French subject headings
       • Too complicated for the end user (needs to go through cross-references for every search)
       • Example
     • 1986 - adopted solution:
       • each bibliographic description contains all LCSH in English and French
       • Université de Laval systematically translates LCSH into French (RVM, Répertoire de vedettes-matière)
       • ULB decides to adopt RVM for its French translations

  5. Overview of the solution
     • Example
       • iLink@ULB
       • Corresponding catalog and authority records in Workflows

  6. Overview of the solution

  7. Overview of the solution

  8. Overview of the solution
     • Manually
       • Most of our catalog records are derived from Z39.50 sources (containing English LCSH)
       • Original cataloging: LCSH manually added to the catalog record
       • For every catalog record
         • For every LCSH
           • Validate LCSH
             • Browse/select from authority index
             • Create new authority record with RVM translation
           • Copy/paste RVM translation from authority record into catalog record
           • Validate RVM translation
             • Create new authority record
       • Workload very high: 15 min per catalog record
       • Very frustrating:
         • the same LCSH subfields need to be translated over and over again
         • the RVM heading gets entered 3 times (cat record; 2 x auth record)

  9. Overview of the solution
     • Automated
       • manually translate the English LCSH heading ONCE (in the LCSH authority record)
       • automatically generate the French RVM authority record
       • automatically generate the French RVM subject entries in the catalog record

  10. Unicorn configuration
      • Authority formats
        • Topical
          • ENGFRE
          • FREENG
        • Geographical
          • GEO-ENGFRE
          • GEO-FREENG
      • Authority indexes
        • English LCSH authority records are posted to ENGFRE
        • French RVM authority records are posted to FREENG
      • Authority index variations in catalog formats (MARC, SERIAL, …)
        • English LCSH entries -- ind2=0 (650-0; 651-0)
          • Validated against the ENGFRE authority index
        • French RVM entries -- ind2=6 (650-6; 651-6)
          • Validated against the FREENG authority index
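The validation rule on this slide can be restated as a small lookup. The following is only an illustrative sketch in Python, not Unicorn policy syntax; the names `INDEX_BY_IND2`, `SUBJECT_TAGS` and `authority_index` are hypothetical.

```python
# Hypothetical lookup tables; they only restate the rule from the slide.
INDEX_BY_IND2 = {"0": "ENGFRE", "6": "FREENG"}
SUBJECT_TAGS = {"650", "651"}


def authority_index(tag, ind2):
    """Return the authority index a subject entry is validated against.

    Returns None for tags other than 650/651 and for second indicators
    other than 0 (English LCSH) or 6 (French RVM).
    """
    if tag not in SUBJECT_TAGS:
        return None
    return INDEX_BY_IND2.get(ind2)
```

For example, a 651 entry with ind2=6 would be checked against FREENG, while a 245 field is not subject to this validation at all.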

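  11. Unicorn configuration
      [Diagram; tag groups assigned to formats in the order they appear]
      • Authority formats
        • ENGFRE: 150--, 750-6, 650
        • FREENG: 150--, 650
        • GEO-ENGFRE: 151--, 751-6, 651
        • GEO-FREENG: 151--, 651
      • Authority indexes: ENGFRE, FREENG
      • Catalog formats: 650-0, 650-6, 651-0, 651-6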

  12. Unicorn configuration Authority formats

  13. Unicorn configuration Catalog formats

  14. Creation of an LCSH/RVM dictionary
      • Since 1986 LCSH and RVM have been added to catalog records « in the same order »
      • Thanks to this work we can now build an LCSH/RVM dictionary
          245-- Title
          650-0 LCSH-1
          650-0 LCSH-2
          651-0 LCSH-3
          650-6 RVM-1
          650-6 RVM-2
          651-6 RVM-3
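A minimal sketch of how the « same order » convention could be exploited, assuming each record is available as a list of (tag, second indicator, heading) tuples. The function name `pair_subjects` and the decision to skip records whose English and French counts differ are assumptions, not details given in the presentation.

```python
def pair_subjects(fields):
    """Pair English (ind2=0) and French (ind2=6) subjects by position.

    fields: list of (tag, ind2, heading) tuples in record order.
    Returns lines of the form 'LCSH||RVM||tag'.
    """
    subject_tags = ("650", "651")
    english = [(t, h) for t, i, h in fields if t in subject_tags and i == "0"]
    french = [(t, h) for t, i, h in fields if t in subject_tags and i == "6"]
    # Assumption: if the counts differ, the order cannot be trusted.
    if len(english) != len(french):
        return []
    pairs = []
    for (etag, eheading), (ftag, fheading) in zip(english, french):
        # Assumption: a pair is only plausible when both sides use the same tag.
        if etag == ftag:
            pairs.append(f"{eheading}||{fheading}||{etag}")
    return pairs
```

Applied to the record shown on the slide, this yields LCSH-1||RVM-1||650, LCSH-2||RVM-2||650 and LCSH-3||RVM-3||651.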

  15. Creation of an LCSH/RVM dictionary
      • Step 1: empty subject authority database
          selauthority -f'ENGFRE,GEO-ENGFRE,FREENG,GEO-FREENG' | remauthority

  16. Creation of an LCSH/RVM dictionary
      • Step 2: recreate authority records
        • Dump subjects from all catalog records
            selcatalog -f'MARC,SERIAL,MAP,…' | catalogdump -of -ka888 | filtermarc -i'650,651,888' -od -Ds
        • Create English/French subject pairs
            LCSH-1||RVM-1||650
            LCSH-2||RVM-2||650
            LCSH-3||RVM-3||651
        • Popular subject pairs
          • Some English terms have been translated into different French terms during the 15 years of manual input; we get rid of the wrong translations by counting each translation and retaining only the one that occurs most often
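The « popular pairs » idea amounts to a frequency count per English heading. The sketch below assumes the pair lines use the `English||French||tag` layout shown above; the function name `popular_pairs` and the tie-breaking rule (first translation seen wins) are assumptions.

```python
from collections import Counter


def popular_pairs(pair_lines):
    """Keep, for each (English heading, tag), the most frequent French heading."""
    counts = Counter(pair_lines)
    best = {}
    for line, n in counts.items():
        english, french, tag = line.split("||")
        key = (english, tag)
        # Strictly greater: on a tie the translation seen first is kept.
        if key not in best or n > best[key][1]:
            best[key] = (french, n)
    return {key: french for key, (french, n) in best.items()}
```

With three occurrences of one translation and a single occurrence of a competing one, only the first survives in the resulting dictionary.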

  17. Creation of an LCSH/RVM dictionary
      • Step 2: recreate authority records
        • Create « extended » subject pairs
          • derived from the original subject pairs by iteratively chopping off the last subfield of both the English and the French part; an extended subject pair is retained only if its English part exists in the SUBJECT browse index
          • Original: Aids (disease)|xprevention||Sida|xprévention
          • Extended: Aids (disease)||Sida
        • Popular extended subject pairs
          • Merge subject pairs and extended subject pairs together, and retain the popular pairs
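A sketch of the chopping step, assuming subfields are separated by `|` as in the example and that the SUBJECT browse index can be represented as a plain set of headings. Both `extended_pairs` and `known_headings` are hypothetical names.

```python
def extended_pairs(english, french, known_headings):
    """Derive shorter pairs by repeatedly dropping the last subfield.

    known_headings stands in for the SUBJECT browse index: a pair is kept
    only when its English part is found there.
    """
    result = []
    while "|" in english and "|" in french:
        # Drop everything from the last subfield delimiter onwards.
        english = english.rsplit("|", 1)[0]
        french = french.rsplit("|", 1)[0]
        if english in known_headings:
            result.append((english, french))
    return result
```

Intermediate pairs whose English part is not in the index are skipped, but chopping continues until no subfield is left on one of the two sides.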

  18. Creation of an LCSH/RVM dictionary
      • Step 2: recreate authority records
        • Create ENGFRE, FREENG, GEO-ENGFRE and GEO-FREENG flat authority records
            Aids (disease)|xPrevention||Sida|xPrévention||650

            *** DOCUMENT BOUNDARY ***
            FORMAT=ENGFRE
            .150. Aids (disease)|xPrevention
            .750. Sida|xPrévention
            *** DOCUMENT BOUNDARY ***
            FORMAT=FREENG
            .150. Sida|xPrévention
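A sketch of how one pair line could be turned into the two flat records shown on this slide. The layout is copied from the slide; the use of 151/751 and the GEO- prefix for 651 pairs is inferred from the Unicorn configuration slides, and the function name `flat_records` is hypothetical.

```python
def flat_records(pair_line):
    """Build the English and French flat authority records for one pair line."""
    english, french, tag = pair_line.split("||")
    is_geo = tag == "651"
    prefix = "GEO-" if is_geo else ""
    lead = "151" if is_geo else "150"   # heading tag (assumed for GEO formats)
    link = "751" if is_geo else "750"   # linking tag holding the translation
    return [
        "*** DOCUMENT BOUNDARY ***",
        f"FORMAT={prefix}ENGFRE",
        f".{lead}. {english}",
        f".{link}. {french}",
        "*** DOCUMENT BOUNDARY ***",
        f"FORMAT={prefix}FREENG",
        f".{lead}. {french}",
    ]
```

Joining the returned lines with newlines gives text in the same shape as the slide's example, ready to be written to a load file.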

  19. Creation of an LCSH/RVM dictionary
      • Step 3: populate authority database
          cat subjects.authinput | authload -s subjects.authload.errors -fa -mc -q'TODAY'
          selcatalog | authcheck -m
          rebuildtext report
          rebldthesauri report
          correcthesauri report (several times)
      • Some figures
          # ENGFRE: 123016
          # FREENG: 123030
          # GEO-ENGFRE: 23069
          # GEO-FREENG: 23047

  20. AUTHTRAN report
      • Create RVM entries in touched catalog records.
        • A catalog record can be touched through:
          • an edit operation on the catalog record
          • a creation, modification or deletion of an authority record
      • Create a FREENG/GEO-FREENG authority record for every touched ENGFRE/GEO-ENGFRE authority record.
        • An authority record is touched through:
          • an edit operation on the authority record

  21. AUTHTRAN report

  22. AUTHTRAN report
      • Build authkeysfile from the ‘gpn authedit’ directory
          cat authkeysfile | authdump -ki | filtermarc -iALL -od -Ds > delimfile
      • Create flat authority records from the records in delimfile
      • Load new authority records
          cat newauthrecsfile | authload -mc -q'TODAY'
        • authload checks for already existing authority records
      • touchkeys the new authority records (for reindexing through adutext)

  23. AUTHTRAN report
      1. Catalog records that have been edited (through Workflows for example)
         • Catalog keys to be found in the ‘gpn textedit’ and ‘gpn browsedit’ directories
      2. Catalog records can be touched through the creation, modification or deletion of an authority record
         2.1. Creation of an authority record
              • Find catalog records that contain the new LeadTerm (LT)
                  echo 'authkey' | autheditor -e | seltext
         2.2. Modification of an authority record: change of LT
              • Find catalog records that contain the old LT
                  echo 'authkey' | autheditor -c | seltext
              • Find catalog records that contain the new LT
                  echo 'authkey' | autheditor -e | seltext

  24. AUTHTRAN report
         2.3. Modification of an authority record: other change
              • Find catalog records that are authorized against this authority record
                  echo '^Aauthkey' | seltext
              • Find catalog records that contain the LT and which are NOT authorized against this authority record
                • Construct heading from LT
                • Look up the heading key for this heading
                    echo 'constructed heading' | selheading -iT -oKTn -b'SUBJECT'
                • Look for catalog records with this heading key
                    headinginfo = '^G003heading'
                    echo headinginfo | seltext
         2.4. Deletion of an authority record
              • The authority record could have been modified before deletion; we therefore need to consider all cases under 2.2 and 2.3.

  25. AUTHTRAN report
      3. Look for lost catalog keys
         • will explain this later
      4. Merge and deduplicate all these catalog keys
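Merging and deduplicating the collected catalog keys is a plain set union. A minimal sketch, assuming each source delivers its keys as strings (one per line of output); `merge_catkeys` is a hypothetical name.

```python
def merge_catkeys(*sources):
    """Merge several iterables of catalog keys into one sorted, unique list."""
    keys = set()
    for source in sources:
        # Blank lines are ignored; keys are compared numerically.
        keys.update(int(k) for k in source if str(k).strip())
    return sorted(keys)
```

Numeric sorting here mirrors the `sort -n` that the report applies before dumping the touched catalog records.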

  26. AUTHTRAN report
      • Only validated LCSH get a chance of being translated, so first authority check the catalog records
          cat touched_catkeys | authcheck -m
      • Dump all touched catalog records
          cat touched_catkeys | sort -n |\
            catalogdump -of -ka888 -z -J |\
            filtermarc -iALL -od -Ds > dumpfile
      • Look up French RVM translations in the corresponding authority records
          perl authtran_4.pl dumpfile > translations
      • Recreate catalog records
        • original record without FREENG/GEO-FREENG subjects + add new translations
          perl authtran_5.pl dumpfile translations > newdump
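The « recreate catalog records » step can be pictured as follows. This is not the content of authtran_5.pl, which the slides do not show; it is a sketch assuming fields are (tag, ind2, heading) tuples and translations are held in a dictionary keyed by (tag, English heading). The name `rebuild_subjects` is hypothetical.

```python
def rebuild_subjects(fields, translations):
    """Drop existing French subjects and append fresh translations.

    fields: list of (tag, ind2, heading) tuples.
    translations: dict mapping (tag, English heading) to the French heading.
    """
    subject_tags = ("650", "651")
    # Original record without its French (ind2=6) subject entries.
    kept = [f for f in fields if not (f[0] in subject_tags and f[1] == "6")]
    # One new French entry per English subject that has a translation.
    added = [
        (tag, "6", translations[(tag, heading)])
        for tag, ind2, heading in kept
        if tag in subject_tags and ind2 == "0" and (tag, heading) in translations
    ]
    return kept + added
```

English subjects without a known translation are simply left untranslated, which matches the idea that only validated LCSH get a chance of being translated.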

  27. AUTHTRAN report
      • Split up the file of new catalog records according to format policy (-a is a mandatory option on catalogload)
      • Reload each of these format files
          cat fileforformatX |\
            catalogload -aX -if -bc -umy -j -r -mu -e/dev/null -4authtran_junktag -3> loadedcatkeys
      • Only reload if necessary
        • if there are any modifications in the catalog record
      • Don’t reload if there are too many new records (‘gpn custom’/authtran)
        • see authtranrbld report
        • will explain this later
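Splitting by format and skipping unchanged records could look like the sketch below. It assumes each item carries the record's format name together with its old and new field lists; `split_by_format` is a hypothetical name and nothing here calls catalogload.

```python
from collections import defaultdict


def split_by_format(records):
    """Group modified records by format; unchanged records are dropped.

    records: iterable of (format_name, old_fields, new_fields).
    Returns {format_name: [new_fields, ...]}.
    """
    by_format = defaultdict(list)
    for format_name, old_fields, new_fields in records:
        if old_fields != new_fields:  # only reload if something changed
            by_format[format_name].append(new_fields)
    return dict(by_format)
```

Each key of the result would correspond to one file fed to catalogload with the matching -a option.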

  28. AUTHTRAN report
      • Make sure the RVM entries in the reloaded catalog records are validated against the FREENG authority index
          cat loadedcatkeys | authcheck -m
      • Only touchkeys up to a limit (set in ‘gpn custom’/bulktext)
        • will explain this later

  29. AUTHTRAN report • Executed daily before adutext • Finished report - listing

  30. AUTHTRAN report
      • Recover « lost » catalog keys
        • Execution of the adutext report removes catalog keys from the ‘gpn textedit’ and ‘gpn browsedit’ directories. Executing adutext before authtran would therefore cause the keys of touched catalog records to get lost.
        • cadutext report
          • customized version of adutext
          • saves treated catalog keys to special directories
        • authtran knows how to recover these lost keys.

  31. AUTHTRAN report
      • Reloading « many » catalog records
        • Problem
          • could result in a long execution time of the authtran report, and hence jeopardize the execution of the (many) other daily reports, including the critical daily backup procedure of the Unicorn filesystems
        • if the number of catalog records to be reloaded exceeds a threshold (set in ‘gpn custom’/authtran), NO catalog records get reloaded
          • the file of catalog records gets saved to a separate directory
          • mail is sent to the « authtran administrator »
          • the file can be manually fed to the authtranrbld script
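The all-or-nothing threshold described here can be sketched as a small decision function. The names `plan_reload`, `deferred` and `notify` are hypothetical; saving the file and sending the mail are left to the caller.

```python
def plan_reload(catkeys, threshold):
    """Decide whether to reload now or defer the whole batch.

    If the batch exceeds the threshold, nothing is reloaded: all keys are
    deferred and the administrator should be notified.
    """
    keys = list(catkeys)
    if len(keys) > threshold:
        return {"reload": [], "deferred": keys, "notify": True}
    return {"reload": keys, "deferred": [], "notify": False}
```

Note that the batch is never split: a batch one record over the threshold is deferred in full, which keeps the daily run time predictable.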

  32. AUTHTRAN report
      • Reindexing « many » catalog records
        • Problem
          • running touchkeys on too many catalog keys could fill up the Unicorn filesystem, and hence jeopardize the correct functioning of Unicorn
          • could result in a long execution time of the adutext report, and could hence jeopardize the execution of the (many) other daily reports, including the critical daily backup procedure of the Unicorn filesystems
        • Only a limited number of catalog records get reindexed (limit set in ‘gpn custom’/bulktext)
        • Catalog keys of the other catalog records get saved to a separate directory
        • cadutext will automatically (through a call to the bulktext script) pick up the ‘limited’ number of catalog keys and reindex the corresponding catalog records
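Unlike the reload threshold, the reindexing limit works as a rolling queue: a capped batch is processed and the remainder waits for a later run. A minimal sketch, with the hypothetical name `take_batch` and an assumed numeric ordering of the queue:

```python
def take_batch(pending, new_keys, limit):
    """Return (keys to reindex now, keys to save for a later run).

    pending: keys left over from earlier runs.
    new_keys: keys touched in the current run.
    """
    queue = sorted(set(pending) | set(new_keys))
    return queue[:limit], queue[limit:]
```

Calling it once per daily run, feeding the second return value back in as `pending`, drains a large backlog over several days without ever exceeding the limit.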

  33. Bulk translation of LCSH in the catalog
      • authtranrbldfre script
        • Purpose: generate FREENG/GEO-FREENG authority records for a given set of ENGFRE/GEO-ENGFRE authority records
        • Syntax: authtranrbldfre authkeysfile reportfile [Y|N]
      • authtranrbld script
        • Purpose: generate RVM translations in a given set of catalog records
        • Syntax: authtranrbld catkeysfile reportfile [Y|N]
        • feed ALL catalog records to the script; the complete catalog gets translated
        • number of catalog records reloaded: 200923

  34. Future developments
      • Automatically generate the translations of LCSH
        • based on RVM translation tables for uniterms (subfields a, x, y, z and v), to be bought from Université de Laval
        • need to build a local dictionary of ‘uniterms’ (not all uniterms are translated in RVM)
      • 100% automated translation seems impossible
        • |aA|zZ => |aA’
          • |aFrench-Canadian literature|zQuebec (Province)
            • |aLittérature canadienne française|zQuébec (Province)
            • |aLittérature québécoise
        • |aA => {|aA’,|aA’’,…}
          • |aRight and left (Political science)
            • |aGauche (Science politique)
            • |aDroite (Science politique)
            • |aExtrême droite
            • |aNouvelle droite
            • |aExtrême gauche
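A sketch of subfield-by-subfield (« uniterm ») translation that also shows where it breaks down. The dictionary layout, keyed by (subfield code, English term) with a list of candidate French terms, and the name `translate_uniterms` are assumptions; the real RVM translation tables may be organised differently.

```python
def translate_uniterms(heading, dictionary):
    """Translate a heading one subfield at a time.

    Returns None as soon as a subfield has no translation or more than one
    candidate, because the choice then needs a human cataloger.
    """
    parts = [p for p in heading.split("|") if p]
    translated = []
    for part in parts:
        code, term = part[0], part[1:]
        candidates = dictionary.get((code, term), [])
        if len(candidates) != 1:
            return None
        translated.append("|" + code + candidates[0])
    return "".join(translated)
```

This approach handles the one-to-one case, but it returns None for the ambiguous « Right and left » example, and it can never collapse two subfields into the single heading « Littérature québécoise », which is the first difficulty listed above.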

  35. Automated translation of LCSH Available soon in Randy’s API repository

  36. Questions
