
Where’s My Data?



  1. Where’s My Data? Using MetriDoc to manage data integration headaches
     Joe Zucca – zucca@pobox.upenn.edu
     Tommy Barker – tbarker@pobox.upenn.edu
     Sponsored by

  2. The Problem
     • The request seems simple but the solution is complex
     • We are generally asked “who did / used x?”, which leads to other questions
        • Where’s the data?
        • What’s the grain of the answer?
     • So how do we answer these questions?
        • If lucky, run a script / query against a database and generate a report
        • If not lucky, build an application to answer the question
     • This is what MetriDoc is built for
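To make the "grain of the answer" point concrete, here is a minimal sketch in Python (MetriDoc itself is described later as Groovy/Java; Python is used here only for illustration). The table, columns and sample rows are hypothetical. The same "who used x?" request gives different numbers depending on whether the answer counts requests or distinct patrons.

```python
import sqlite3

# Hypothetical usage events: (patron, resource, day). Not real data.
EVENTS = [
    ("p1", "sciencedirect", "2011-01-19"),
    ("p1", "sciencedirect", "2011-01-19"),
    ("p2", "sciencedirect", "2011-01-19"),
    ("p1", "jstor", "2011-01-20"),
]


def build_db(events):
    """Create an in-memory database holding the sample events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE usage (patron TEXT, resource TEXT, day TEXT)")
    conn.executemany("INSERT INTO usage VALUES (?, ?, ?)", events)
    return conn


def hits_per_resource(conn):
    """Grain = request: every row counts, repeat visits included."""
    return conn.execute(
        "SELECT resource, COUNT(*) FROM usage "
        "GROUP BY resource ORDER BY resource"
    ).fetchall()


def distinct_patrons_per_resource(conn):
    """Grain = patron: each patron counts once per resource."""
    return conn.execute(
        "SELECT resource, COUNT(DISTINCT patron) FROM usage "
        "GROUP BY resource ORDER BY resource"
    ).fetchall()


conn = build_db(EVENTS)
print(hits_per_resource(conn))
print(distinct_patrons_per_resource(conn))
```

This is the "lucky" case from the slide, where the data already sits in one queryable database.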

  3. Current Solution - Datafarm
     Datafarm = Crontab + Perl + CGI = Spaghetti
     [Diagram labels: Gate Count, Voyager, Blackboard, Ezproxy, DLA logs, Penn Community, Borrow Direct, COUNTER; Datafarm; App 1, App 2, App 3, App n]

  4. Datafarm Shortcomings
     • Maintainability issues
     • Not shareable
     • Not reusable

  5. MetriDoc = Datafarm 2.0
     • As our system grew, we began creating MetriDoc to address Datafarm’s problems
        • Needed a scheduler that was more sophisticated than cron
        • Needed languages that were more maintainable than Perl
        • Needed integration tools to simplify data gathering across disparate systems
     • We built prototypes and services to help us evaluate technologies
     • Received a grant from IMLS to speed up development
     • Hired another programmer

  6. MetriDoc Philosophy
     • Keep it simple
        • Sometimes a script is all you need
        • Ease of use is more important than performance
     • Don’t reinvent the wheel
     • 100% open source
     • Shareable data

  7. MetriDoc – How it Works
     • MetriDoc’s core is built around database schemas
     • A MetriDoc implementation consists of loading tables and normalized tables
        • Loading tables prime the repository
           • The user is responsible for populating these tables
        • Normalized tables are built from the data in the loading tables
           • MetriDoc takes care of this
     • Conforming to similar schemas provides interesting possibilities
        • Sharing data is easy
        • Sharing a single repository is easy (think Amazon Web Services)
        • Easier to collaborate
     • From a user’s perspective
        • MetriDoc has tools to get your data into the loading tables
           • But ultimately you just need to get it in there, so you can use whatever tools you like
        • Use the MetriDoc tools to manage your integration needs
           • Useful for getting, transforming / resolving, moving and loading data
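The loading-table / normalized-table split can be sketched roughly as follows. This is an illustration in Python with SQLite, not MetriDoc's actual schema or code: every table name, column name and the host-based "resource" resolution are assumptions made for the example.

```python
import sqlite3
from urllib.parse import urlsplit


def create_schema(conn):
    # Hypothetical table and column names, chosen for illustration only.
    conn.executescript("""
        CREATE TABLE ezproxy_loading (patron_id TEXT, url TEXT, status TEXT);
        CREATE TABLE resource (id INTEGER PRIMARY KEY, host TEXT UNIQUE);
        CREATE TABLE ezproxy_hit (patron_id TEXT, resource_id INTEGER, status INTEGER);
    """)


def fill_loading(conn, rows):
    """The user's side: get raw rows into the loading table by any means."""
    conn.executemany("INSERT INTO ezproxy_loading VALUES (?, ?, ?)", rows)


def normalize(conn):
    """The tool's side: derive normalized rows from the loading table."""
    raw_rows = conn.execute(
        "SELECT patron_id, url, status FROM ezproxy_loading"
    ).fetchall()
    for patron_id, url, status in raw_rows:
        host = urlsplit(url).hostname
        # Register each host once; repeated hosts reuse the existing row.
        conn.execute("INSERT OR IGNORE INTO resource (host) VALUES (?)", (host,))
        (resource_id,) = conn.execute(
            "SELECT id FROM resource WHERE host = ?", (host,)
        ).fetchone()
        conn.execute(
            "INSERT INTO ezproxy_hit VALUES (?, ?, ?)",
            (patron_id, resource_id, int(status)),
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
fill_loading(conn, [
    ("jsmith", "http://www.sciencedirect.com/science?a=1", "302"),
    ("jdoe", "http://www.sciencedirect.com/other", "200"),
])
normalize(conn)
```

Because the loading table is plain text columns, anything that can write rows (a script, a bulk import, an integration tool) can prime the repository, which is the point the slide makes about using "whatever".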

  8. MetriDoc – Core Technologies
     • JVM
        • Java is used for infrastructure
        • Groovy is the primary language
     • Master Scheduler
        • Essentially the brains of MetriDoc
        • Using Hudson for now (http://hudson-ci.org/)
     • Integration Tooling
        • Tooling built on top of Apache Camel (http://camel.apache.org/)
        • Helps move data from one place to another
        • Really helpful for batch processing
     • Resolutions / Transformation Tools
        • Patron anonymization, text normalization, resource id to title resolutions, etc.
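As a rough idea of what the resolution / transformation tools might do, the sketch below shows one common way to anonymize a patron id (a keyed hash) and one simple way to normalize text. Both functions and the secret key are hypothetical; the slides do not say which techniques MetriDoc actually uses.

```python
import hashlib
import hmac
import unicodedata


def anonymize_patron(patron_id: str, secret: bytes) -> str:
    """Map a patron id to a stable keyed-hash pseudonym (HMAC-SHA256, hex)."""
    return hmac.new(secret, patron_id.encode("utf-8"), hashlib.sha256).hexdigest()


def normalize_title(text: str) -> str:
    """Unicode-normalize, collapse runs of whitespace, and case-fold a title."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).casefold()


SECRET = b"replace-with-a-real-secret"  # hypothetical key, for the example only
token = anonymize_patron("jsmith", SECRET)
print(token)
print(normalize_title("  The   Vaccine\tJournal "))
```

The same patron always maps to the same token, so usage can still be grouped per patron without storing the raw id in the repository.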

  9. The Metridoc Solution
     Metridoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
     Step 1 – Fill the loading tables
     [Diagram labels: Voyager, Ezproxy, COUNTER; Hudson; Load Ezproxy, Load Counter, Load Patron Info; Loading Tables]

  10. Loading Tables 00.000.000.000||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01:44 -0500]||GET||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=SFX&_method=citationSearch&_volkey=0264410X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0||http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X&title=Vaccine&volume=29&issue=2&date=20101216&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J.5550217620101216aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma=10244330.1344196133.1295210953.1295404568.1295411821.9; __utmc=10244330; __utmz=10244330.1295411821.9.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv=10244330.|1=User-Type=Current%20Students=1,; __utma=94565761.447912360.1295320755.1295404584.1295411882.4; __utmc=94565761; __utmz=94565761.1295320755.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma=261680716.1522407254.1295392237.1295404624.1295412044.3; __utmc=261680716; __utmz=261680716.1295412044.3.3.utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d33-5139-4dbd-b94f-5d76b01ffbdc@sessionmgr13&k2=dGJyMPGtr0iyqbVIrOPfgeyk44Dt6fIA&k3=dGJyMOPY8Xvt&k4=ehost&k6=en&k7=live&k8=DS:live; __utmb=10244330.4.10.1295411821; 
__utmb=94565761.6.9.1295413021459; __utmb=261680716.1.10.1295412044; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL
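A record like the one above could be split into fields before it goes into a loading table. The sketch below assumes the `||` delimiter seen in the sample and a field order inferred from that one line; the field names are guesses, not a documented Ezproxy or MetriDoc format, and the sample string is a shortened, made-up stand-in for the real record.

```python
from datetime import datetime

# Assumed field order, inferred from the sample line; not a documented format.
FIELDS = [
    "ip", "city", "state", "country", "groups", "username", "timestamp",
    "method", "url", "status", "bytes", "referer", "user_agent",
    "session", "cookies",
]


def parse_line(line: str) -> dict:
    """Split one '||'-delimited log record into a dict with typed values."""
    parts = line.rstrip("\n").split("||")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    # Timestamps in the sample look like [19/Jan/2011:00:01:44 -0500].
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "[%d/%b/%Y:%H:%M:%S %z]"
    )
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record


# Shortened, hypothetical record with the same shape as the slide's example.
SAMPLE = (
    "00.000.000.000||Philadelphia||PA||United States||Default||jsmith||"
    "[19/Jan/2011:00:01:44 -0500]||GET||http://www.example.org/article||302||0||"
    "http://www.example.org/referer||Mozilla/5.0||Re07OuEIyQo8X6w||"
    "ezproxy=Re07OuEIyQo8X6w"
)

record = parse_line(SAMPLE)
print(record["username"], record["status"], record["timestamp"].isoformat())
```

Rejecting lines with an unexpected field count keeps malformed records out of the loading table instead of silently shifting columns.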

  11. The Metridoc Solution
      Metridoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
      Step 2 – Populate the normalized tables
      [Diagram labels: Loading Tables; Hudson; Normalize Ezproxy, Normalize Counter, Normalize Patron Info; Repository]

  12. Jenkins – Death to Cron
      • Generally used for building software, but a fantastic cron replacement
      • Can run arbitrary scripts locally and remotely
      • Supports master / slave distribution model seamlessly
      • Can be managed entirely via REST
      • Extensible
      • Helps with job dependencies
      • It is simple and free
      • Active community with a huge collection of plugins
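The job-dependency point is where a scheduler goes beyond plain cron: a normalize job should only run after the load job it depends on. The sketch below illustrates that ordering idea with Python's standard `graphlib`; it does not use or imitate the Jenkins API. The job names echo the load / normalize steps on the earlier slides, and the specific dependencies between them are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs; each maps to the set of jobs that must finish first.
JOBS = {
    "load_ezproxy": set(),
    "load_counter": set(),
    "load_patron_info": set(),
    "normalize_patron_info": {"load_patron_info"},
    "normalize_ezproxy": {"load_ezproxy", "normalize_patron_info"},
    "normalize_counter": {"load_counter"},
}


def run_order(jobs):
    """Return one valid order in which every job follows its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(JOBS)
print(order)
```

With cron, this ordering has to be approximated by staggering start times; a dependency-aware scheduler can start each job as soon as its prerequisites have finished.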

  13. A Little Groovy

  14. The MetridocJob Framework

  15. The MetridocJob Framework

  16. The MetridocJob Framework

  17. Metrics on the Cheap

  18. Metrics on the Cheap

  19. Where we are….
