
Enhancing the Performance and Extensibility of the XC Metadata Services Toolkit



  1. Enhancing the Performance and Extensibility of the XC Metadata Services Toolkit. Ben Anderson, Software Engineer, XCO

  2. Download this presentation: www.extensiblecatalog.org/learnmore

  3. Timeline • Jennifer Bowen presented at code4lib 2/10 • I began at XCO 3/10 • work began on 0.3 4/10 • 0.3 released 1/11 • the slide's timeline also marks the 0.2 and 1.0 releases

  4. XC Software Components (architecture diagram)
  • XC Drupal Toolkit – user interface for searching and browsing, as part of a library website (on Drupal)
  • XC NCIP Toolkit – circulation status/requests and authentication; with the XC OAI Toolkit, provides connectivity between XC and an ILS
  • XC Metadata Services Toolkit – tools for automated processing of large batches of metadata, e.g. MARCXML (6M records) and DC-TERMS (13k records)
  • Also on the diagram: the Integrated Library System and the irplus repository

  5. Learn More about XC at www.extensiblecatalog.org

  6. One Example of Process Flow • a MARC BIB record comes from an external repository • the normalization service produces a normalized MARC BIB record • the transformation service produces FRBRized records (work, expression, manifestation)

  7. Logical Process • an OAI-PMH harvest pulls records from an external provider into a cache repo inside the MST • pseudo OAI-PMH harvests then feed the MARC Normalization Service and the MARC-XC Transformation Service, each with its own repo • the resulting records are OAI-PMH harvestable

  8. Add an External Repository

  9. Schedule a Harvest

  10. Configure Processing Rules

  11. Browse Records

  12. Goals for 0.3
  • Each service should process one million records per hour on an “average library server”: 1.5 GHz SPARC V9, 8G RAM (3G for the JVM), 10k RPM hard drive
  • Services should have little to no degradation as the size of a repository grows (the University of Rochester has 6M records)
  • Implementing a service should be easy: it should require no knowledge of MST internals, and it should not be up to the service implementer to figure out how to build and package their service

  13. Determine Throughput of 0.2
  • Using the MARC Normalization service as our metric, the first million records processed at an average speed of 29 ms/record = 120k/hr (the goal is 3.6 ms/record = 1M/hr)
  • Before the service processed 2 million records, the process crawled to a halt (the goal was little to no degradation through at least 6 million records)

  14. Determine Bottlenecks with TimingLogger (the slide showed the instrumentation code alongside the output it produces; a sketch follows)
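  The TimingLogger code and its output on this slide were screenshots that did not survive the transcript. Below is a minimal sketch of the idea, assuming hypothetical start/stop/reset method names rather than the MST's actual API: accumulate elapsed time per named section across many records, then print the totals once so per-record costs can be averaged.

      import java.util.HashMap;
      import java.util.LinkedHashMap;
      import java.util.Map;

      // Sketch of a timing logger (method names are assumptions, not the MST's
      // actual API). Each start/stop pair adds elapsed time to a named bucket.
      public final class TimingLogger {
          private static final Map<String, Long> totals = new LinkedHashMap<String, Long>();
          private static final Map<String, Long> starts = new HashMap<String, Long>();

          public static void start(String name) {
              starts.put(name, System.nanoTime());
          }

          public static void stop(String name) {
              long elapsed = System.nanoTime() - starts.remove(name);
              Long sum = totals.get(name);
              totals.put(name, sum == null ? elapsed : sum + elapsed);
          }

          // Print the accumulated totals (e.g. once per million records), then clear.
          public static void reset() {
              for (Map.Entry<String, Long> e : totals.entrySet())
                  System.out.printf("%s: %.1f ms%n", e.getKey(), e.getValue() / 1000000.0);
              totals.clear();
          }
      }

  Wrapping each phase of record processing in start("createDOM") … stop("createDOM") calls is what yields the per-phase breakdown on the next slide.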

  15. Bottleneck Breakdown
  • 29 ms per record: 2.5 ms to create the DOM, 5 ms for the actual service processing (the innards of the MARC Normalization service), 21 ms for querying Solr and inserting
  • These are averages; both querying and inserting are done in batch, and I had a hard time separating the two

  16. 0.2 Design
  • MySQL held all data needed for the UI (except for searching and browsing records) and all data needed for configuring harvests, services, processing rules, etc.
  • Solr held the text indexes necessary for searching and browsing records, plus all record/repository data

  17. 0.3 Design – Change to use MySQL
  • MySQL now holds the UI data, the configuration data (harvests, services, processing rules, etc.), and all record/repository data
  • Solr doesn't store any data; it is used only for indexing records to support searching in the UI

  18. 0.3 Design – Keep the table sizes small. Instead of one index covering all repositories, each external repository cache and each service gets its own set of database tables (an external provider repo, a normalization repo, a transformation repo, and so on)

  19. 0.3 Design – Yes, a boring ERD (its tables are keyed by record: one row per record in the core table, zero or more per record and one or more per record in the auxiliary tables)

  20. Did that improve things?
  • 11 ms per record (previously 29): 2.5 ms to create the DOM, 5 ms for the actual service processing (the innards of the MARC Normalization service), 3.5 ms (previously 21) for querying MySQL and inserting into MySQL
  • Again, both querying and inserting are done in batch; the query time is almost nil, and it's the inserting that takes the time
  • It's faster, but still nearly 3x slower than our goal
  • The performance showed little to no degradation

  21. Get rid of XPath. XPath isn't a bad technology, but when you're optimizing for performance, it can be beneficial to find other ways to accomplish the same task. So I changed the XPath-based lookup code to direct DOM traversal (the before and after code were shown on the slide; a sketch of the change follows).
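  The before and after screenshots are gone, so here is a hedged sketch of the kind of change involved. The MARC field/subfield targeted (035 $a) and the assumption of non-namespace-aware parsing are illustrative, not the MST's actual code.

      import java.util.ArrayList;
      import java.util.List;
      import javax.xml.xpath.XPath;
      import javax.xml.xpath.XPathConstants;
      import javax.xml.xpath.XPathFactory;
      import org.w3c.dom.Document;
      import org.w3c.dom.Element;
      import org.w3c.dom.NodeList;

      public final class MarcFieldLookup {

          // Before (sketch): evaluate an XPath expression for every lookup.
          public static List<String> viaXPath(Document doc) throws Exception {
              XPath xpath = XPathFactory.newInstance().newXPath();
              NodeList nodes = (NodeList) xpath.evaluate(
                      "//datafield[@tag='035']/subfield[@code='a']",
                      doc, XPathConstants.NODESET);
              List<String> values = new ArrayList<String>();
              for (int i = 0; i < nodes.getLength(); i++)
                  values.add(nodes.item(i).getTextContent());
              return values;
          }

          // After (sketch): walk the DOM directly. More verbose, but it skips
          // XPath parsing/evaluation overhead on every record.
          public static List<String> viaDomWalk(Document doc) {
              List<String> values = new ArrayList<String>();
              NodeList datafields = doc.getElementsByTagName("datafield");
              for (int i = 0; i < datafields.getLength(); i++) {
                  Element df = (Element) datafields.item(i);
                  if (!"035".equals(df.getAttribute("tag"))) continue;
                  NodeList subfields = df.getElementsByTagName("subfield");
                  for (int j = 0; j < subfields.getLength(); j++) {
                      Element sf = (Element) subfields.item(j);
                      if ("a".equals(sf.getAttribute("code")))
                          values.add(sf.getTextContent());
                  }
              }
              return values;
          }
      }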

  22. Did that improve things? • 7 ms per record (previously 11) • 2.5 ms to create DOM • 1.0 ms (previously 5) for actual service processing (the innards of the MARC-Normalization service) • 3.5 ms for MySQL inserts • It’s faster, but still nearly 2x slower than our goal

  23. Delayed Indexing in MySQL
  • MySQL modifies table indexes with each insert
  • It is faster to drop the indexes, insert lots of rows into the tables, and then add the indexes back; this is the way mysqldump works
  • This means you can't read the data while doing an insert; no big deal, we'll just do it during large loads
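  As a rough illustration of the technique (the table name is hypothetical, not the MST's schema): MySQL's ALTER TABLE … DISABLE KEYS defers non-unique index maintenance on MyISAM tables, and the same effect can be had by dropping and re-creating the indexes explicitly.

      import java.sql.Connection;
      import java.sql.SQLException;
      import java.sql.Statement;

      // Sketch of delayed indexing around a bulk load.
      public final class BulkLoader {
          public static void load(Connection conn, Runnable insertAllRows)
                  throws SQLException {
              Statement st = conn.createStatement();
              try {
                  st.execute("ALTER TABLE records DISABLE KEYS"); // stop index upkeep
                  insertAllRows.run();                            // the large batch of inserts
                  st.execute("ALTER TABLE records ENABLE KEYS");  // rebuild indexes once
              } finally {
                  st.close();
              }
          }
      }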

  24. Did that improve things?
  • 6 ms per record (previously 7): 2.5 ms to create the DOM, 1.0 ms for the actual service processing (the innards of the MARC Normalization service), 2.2 ms (previously 3.5) for MySQL inserts
  • It's faster, but still nearly 2x slower than our goal

  25. Batch Prepared Statements. Java/JDBC provides a highly performant way to send large chunks of data to the db at once: batch prepared statements. There's no way to speed this part up… or so I thought…
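  For reference, a sketch of what that looks like (table and column names are illustrative, not the MST's schema):

      import java.sql.Connection;
      import java.sql.PreparedStatement;
      import java.sql.SQLException;
      import java.util.List;

      // Sketch of a JDBC batch insert: queue many rows client-side, then send
      // them to MySQL in one round trip.
      public final class BatchInserter {
          public static void insert(Connection conn, List<String[]> rows)
                  throws SQLException {
              conn.setAutoCommit(false);
              PreparedStatement ps = conn.prepareStatement(
                      "INSERT INTO records (oai_id, xml) VALUES (?, ?)");
              try {
                  for (String[] row : rows) {
                      ps.setString(1, row[0]); // oai_id
                      ps.setString(2, row[1]); // record XML
                      ps.addBatch();
                  }
                  ps.executeBatch();
                  conn.commit();
              } finally {
                  ps.close();
              }
          }
      }

  With MySQL's Connector/J, adding rewriteBatchedStatements=true to the JDBC URL lets the driver rewrite such batches into multi-row INSERT statements, which matters a great deal for throughput.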

  26. LOAD DATA INFILE. When discussing db optimizations with XC's Drupal Toolkit developer, Peter Kiraly, he said PHP didn't have the same ability; instead he'd have to write out a CSV file and load that in. I figured I might as well try it.
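  A hedged sketch of the approach (table and column names are illustrative; real code must escape tabs and newlines in the data, and newer Connector/J versions require allowLoadLocalInfile=true on the connection):

      import java.io.PrintWriter;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.sql.Connection;
      import java.sql.Statement;
      import java.util.List;

      // Sketch: write the rows to a temp file, then bulk-load the file with a
      // single LOAD DATA statement instead of many INSERTs.
      public final class InfileLoader {
          public static void load(Connection conn, List<String[]> rows)
                  throws Exception {
              Path tmp = Files.createTempFile("records", ".tsv");
              PrintWriter out = new PrintWriter(Files.newBufferedWriter(tmp));
              try {
                  for (String[] row : rows)
                      out.println(row[0] + "\t" + row[1]); // oai_id <TAB> xml
              } finally {
                  out.close();
              }
              Statement st = conn.createStatement();
              try {
                  st.execute("LOAD DATA LOCAL INFILE '" + tmp.toAbsolutePath()
                          + "' INTO TABLE records (oai_id, xml)");
              } finally {
                  st.close();
              }
              Files.delete(tmp);
          }
      }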

  27. Did that improve things?
  • 4 ms per record (previously 6): 2.5 ms to create the DOM, 1.0 ms for the actual service processing (the innards of the MARC Normalization service), 0.6 ms (previously 2.2) for MySQL inserts
  • Pretty close, but still not there

  28. Sometimes it's the little things. I knew enough not to create the DocumentBuilderFactory each time, but didn't realize creating the DocumentBuilder each time would have that much of an effect; the code now creates the DocumentBuilder once and reuses it (before and after code were shown on the slide; a sketch follows).
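  A sketch of the change (the MST's real parsing code differs in detail):

      import java.io.InputStream;
      import javax.xml.parsers.DocumentBuilder;
      import javax.xml.parsers.DocumentBuilderFactory;
      import javax.xml.parsers.ParserConfigurationException;
      import org.w3c.dom.Document;

      public final class RecordParser {
          private final DocumentBuilder builder;

          public RecordParser() throws ParserConfigurationException {
              // After: create the DocumentBuilder once, up front.
              builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
          }

          // Before (sketch): a brand-new DocumentBuilder for every record.
          public static Document parseSlow(InputStream in) throws Exception {
              return DocumentBuilderFactory.newInstance()
                      .newDocumentBuilder().parse(in);
          }

          // After (sketch): reuse the builder, resetting it between parses.
          // (DocumentBuilder is not thread-safe; keep one per worker thread.)
          public Document parseFast(InputStream in) throws Exception {
              builder.reset();
              return builder.parse(in);
          }
      }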

  29. Did that improve things?
  • 3 ms per record (previously 4): 0.9 ms (previously 2.5) to create the DOM, 1.0 ms for the actual service processing (the innards of the MARC Normalization service), 0.6 ms for MySQL inserts
  • WE DID IT! We have exceeded our goal!

  30. 0.2 Service Development. Internals of the MST were exposed to the service developer, who was expected to re-implement much of this internal code.

  31. code.google.com/p/xcmetadataservicestoolkit/

  32. 0.3.x Service Development • Install Java, Ant, and MySQL, then:
  $ wget 'http://xcmetadataservicestoolkit.googlecode.com/files/example-0.3.0-dev-env.zip'
  $ unzip example-0.3.0-dev-env.zip
  $ cd example
  $ ant retrieve
  $ ant -Dtest=ProcessFiles test
  $ ls -ladh ./build/test/actual_output_records/1/*
  $ ant zip

  33. Input Files for Testing
  $ ls -1 ./test/input_records/1/* | xargs cat
  <records xmlns="http://www.openarchives.org/OAI/2.0/">
    <record>
      <header>
        <identifier>oai:mst.rochester.edu:bib:1</identifier>
      </header>
      <metadata>
        <foo xmlns="foo:bar">pb&amp;j</foo>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:mst.rochester.edu:bib:1</identifier>
      </header>
      <metadata>
        <foo xmlns="foo:bar">pb&amp;j 2</foo>
      </metadata>
    </record>
  </records>
  ...

  34. Output Files from Testing
  $ ls -1 ./build/test/actual_output_records/1/* | xargs cat
  <records xmlns="http://www.openarchives.org/OAI/2.0/">
    <record>
      <header status="replaced">
        <identifier>oai:mst.rochester.edu:example/1</identifier>
        <datestamp />
        <predecessors>
          <predecessor>oai:mst.rochester.edu:bib:1</predecessor>
        </predecessors>
      </header>
      <metadata>
        <foo xmlns="foo:bar">
          pb&amp;j
          <bar>you've been foobarred!</bar>
        </foo>
      </metadata>
    </record>

  35. Implementing in Code
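  The code on this slide was a screenshot. Based on the input and output examples on the previous two slides, here is a standalone sketch of the example service's transformation step only; how it plugs into the MST's service API (superclass, method signatures) is not reproduced here.

      import org.w3c.dom.Document;
      import org.w3c.dom.Element;

      // Sketch of the example transformation from the test slides: append a
      // <bar> child to the incoming <foo> element, producing the "foobarred"
      // output record shown above.
      public final class ExampleTransform {
          public static Document foobar(Document doc) {
              Element foo = doc.getDocumentElement(); // <foo xmlns="foo:bar">
              Element bar = doc.createElementNS(foo.getNamespaceURI(), "bar");
              bar.setTextContent("you've been foobarred!");
              foo.appendChild(bar);
              return doc;
          }
      }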

  36. More tidbits for interested implementers
  • The MST is now configured via Spring
  • Each service is given its own application context as well as its own classloader; this means it can use all the objects and services from the MST without worrying about name collisions (naming or dependencies) with other services
  • Each service is given its own db schema (again, so you don't have to worry about name collisions); the db schema is prefixed with “xc_”

  37. Other Services
  • MARC-XC-Transformation: just as fast as the MARC Normalization service
  • DC-XC-Transformation: initially contributed by Kyushu University (in Japan); now one of our core services

  38. Photo Credits • All photos taken from flickr.com • “Brick Wall” by somenametoforget • “Snail” by DRB62 • “Paris Train” by Pictr 30D • “Spaghetti with tomato sauce” by HatM • “Hawk in Flight” by Nick Chill • “Tortoise” by GraphicReality

  39. Final Numbers (1.5 GHz CPU)
  • 0.3: 1.2M records/hr (3.0 ms/record); processed 16M records with no degradation; easily extensible
  • 0.2: 125k records/hr (29 ms/record); fell down before 2M records processed; not easily extensible

  40. Download XC software at eXtensibleCatalog.org, and contact me at banderson@library.rochester.edu
