Extension of caGrid Federated Query for Large Heterogeneous Data Services

Eta S. Berner, EdD

Elliot Lefkowitz, PhD

John David Osborne, MS

Harsh Taneja, MS

Niveditha Thota, MS

Curtis Hendrickson

Don Dempsey, MS

Matthew Wyatt, MSHI

John-Paul Robinson

Poornima Pochana, MS

Shantanu Pavgi, MS

Geoff Gordon, MS

Tim Day, PhD

Greg Fuller


Objectives

  • Background

  • Customization of caGrid stack

    • Scaling for Large Dataset

    • Optimization of Query

    • Query Chunking in FQP

    • WS-Enumeration in Client (Controller), FQP & Data Services

  • Outstanding Issues

  • Summary


Background

  • UAB has developed a custom “Cohort Discovery” tool

    • Queries are based upon: age, race, gender, labs, diagnoses, procedures

    • Aggregate results (counts) are stratified by: age, race, and gender

  • Two caCORE SDK-generated data services

    • Administrative Data (demographics, etc.)

      • Patient table with simple demographics (~700K)

      • Diagnosis, Encounter, Procedures (~12 M)

    • Labs (Lab Results)

      • Patient table (~700K)

      • Lab Result table (~185 M)

  • Federated Query Processor (modified 1.3 Snapshot)

  • The Controller generates DCQL for FQP that always targets the Admin System’s patient table and (optionally) labs

    • The MRN (medical record number) is the identifier that links the Admin System’s patient data to lab results


Aggregate Cohort Estimator (ACE)

    Query Constraints could be:

    Age, Race, Gender

    Labs, Diagnosis, Procedures


ACE Result Screens

    Results can be grouped by:

    Counts

    Gender

    Race

    Age

    Race * Gender

    Race * Age

    Gender * Age

    Race * Gender * Age


Architectural Overview

    Diagram components:

    • UAB Data Center VLAN (private)

    • User Interface

    • Shibboleth (AuthN & AuthZ)

    • Controller (RESTful Web Service), with DCQL Generator and Controller DB

    • FQP (internal): Federated Query Processor

    • Grid Data Services: Admin System (~12 M) and Labs (~185 M)


Problem – Customization of caGrid Stack

    • Scaling for Large Dataset

    • Optimization of Query

    • Query Chunking in FQP

    • WS-Enumeration in Client (Controller), FQP & Data Services


Scaling for Large Dataset

    • The timeout was overridden to 24 hours in FQP and the Data Services

    • The row count limit was increased from 1K to 1M in the Data Services

    • DCQL was restructured in the Controller to avoid table space overflow errors caused by Cartesian joins

      • These errors occur only as a result of "AND" statements

      • They occur only when the row count is high

      • Restructuring was not required against the Admin System (12 M rows vs. 185 M in labs)

      • Nor was it required for “OR” queries against labs, which can run with a join-free SQL statement

    • Ideally, FQP would analyze DCQL and run it efficiently, much as a relational database query analyzer does


Before and After the Restructuring

    Before

    Attribute: Lab A

    Foreign AssociationGroup: AND

    Attribute: Lab B

    Attribute: Lab C

    After

    Association: Lab A

    Foreign AssociationGroup: AND

    Association: Lab B

    • Foreign AssociationGroup: AND

    • Foreign Association

    Association: Lab C


Query Optimization

    Exchange between the Federated Query Processor and the Grid Data Service before reordering:

    • Query 1 → Response 1 = 250K

    • Query 2 + 250K → Response 2 = 100K

    • Query 3 + 100K → Response 3 = 50K


Query Optimization

    Step 1: The Controller pre-runs count-only CQL queries, for example:

    Count(A) = 250K, Count(B) = 100K, and Count(C) = 50K

    Step 2: Reorder the DCQL query so that the most restrictive statements are executed first.
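
The two steps can be sketched as a sort over the pre-run counts. This is only an illustration in Python, not the Controller's actual code: the function name and the dictionary of counts are assumptions, and A, B, C stand for the three constraints in the example.

```python
from typing import Dict, List


def order_most_restrictive_first(counts: Dict[str, int]) -> List[str]:
    """Return constraint names sorted by ascending pre-run count."""
    # Ties are broken by name so the ordering is deterministic.
    return sorted(counts, key=lambda name: (counts[name], name))


# Counts from the slide's example (Step 1); the names are placeholders.
precounts = {"A": 250_000, "B": 100_000, "C": 50_000}
print(order_most_restrictive_first(precounts))  # ['C', 'B', 'A']
```

The constraint with the fewest matching rows comes first, so its small result set is the one carried into the dependent queries.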


Query Optimization

    Exchange between the Federated Query Processor and the Grid Data Service after reordering:

    • Query 1 → Response 1 = 50K

    • Query 2 with 50K → Response 2 = 50K

    • Query 3 with 50K → Response 3 = 50K

    Smallest-Data-Set-First reduces the size of all sub queries
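
To see why the order matters, the toy model below intersects identifier sets step by step and records how many identifiers are carried into the next sub query. It is a sketch under stated assumptions: the sets are scaled down 1000x from the slide figures, and C is assumed to be a subset of B, which is a subset of A, so that the numbers match the slides. The function name is hypothetical.

```python
from typing import List, Optional, Sequence, Set


def forwarded_sizes(result_sets: Sequence[Set[int]]) -> List[int]:
    """Run the constraints in order, intersecting as we go.

    Returns the number of identifiers carried out of each step.
    """
    sizes: List[int] = []
    carried: Optional[Set[int]] = None
    for ids in result_sets:
        carried = set(ids) if carried is None else carried & ids
        sizes.append(len(carried))
    return sizes


# Toy data, scaled down 1000x; assumes lab_c is within lab_b is within lab_a.
lab_a = set(range(250))
lab_b = set(range(100))
lab_c = set(range(50))

print(forwarded_sizes([lab_a, lab_b, lab_c]))  # [250, 100, 50]
print(forwarded_sizes([lab_c, lab_b, lab_a]))  # [50, 50, 50]
```

Both orders end with the same 50 identifiers, but the smallest-first order never forwards more than 50 between steps.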


Problem with Large Sub Queries

    • Problem: too many identifiers (>300K MRNs from Labs in our case)

      • FQP

        • Passes a huge OR clause down to the data service

      • Data Services

        • Use Hibernate, which parses the OR clause recursively and so overflows the stack for large results under typical JVM settings

        • Solution: fix both Hibernate and the JVM stack size setting

      • Database

        • Chokes on large queries consisting of

          • WHERE IN (MRN1, MRN2, …, MRNn) or

          • WHERE Attribute1 = value1 OR Attribute2 = value2 OR … AttributeN = valueN

        • No success with either Oracle or MySQL, even after adjusting settings such as max packet size
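
The stack problem can be reproduced in miniature. The sketch below is an analogy in Python, not Hibernate's parser: it builds a left-deep OR tree with one node per term and walks it recursively, which needs one stack frame per term. All function names are hypothetical.

```python
import sys


def build_or_chain(terms):
    """Build a left-deep tree (((t1 OR t2) OR t3) ...) iteratively."""
    node = None
    for term in terms:
        node = (node, term)
    return node


def count_terms_recursive(node):
    """Walk the tree as a naive recursive visitor would: one frame per term."""
    if node is None:
        return 0
    left, _term = node
    return count_terms_recursive(left) + 1


def count_terms_iterative(node):
    """Same walk with a loop, so stack depth stays constant."""
    n = 0
    while node is not None:
        node, _term = node
        n += 1
    return n


# Use ten times the interpreter's recursion limit so the recursive walk must fail.
depth = sys.getrecursionlimit() * 10
tree = build_or_chain(f"mrn = {i}" for i in range(depth))

try:
    count_terms_recursive(tree)
except RecursionError:
    print("recursive walk exceeded the stack limit")

print(count_terms_iterative(tree) == depth)  # True
```

Raising the stack limit only moves the failure point, which is one reason to also bound the number of terms sent in a single query.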


Solutions - Query Chunking in FQP

    • Introduced query chunking in FQP, which limits the number of MRNs in the WHERE clause of native queries at the database

    • Controlled by a new “chunk size” parameter in FQP

    • If any sub-CQLQuery returns more rows than the “chunk size”, the dependent query is run n times, once per chunk

      Number of CQL queries (n) = Result Size (c) / Chunk Size (d), rounded up

      e.g. with Chunk Size (d) = 1000 and Result Size (c) = 10096:

      n = 10096 / 1000, rounded up = 11 CQL queries (the smallest with 96 parameters)

      This allowed the complex query to complete in a finite amount of time.
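
The chunk arithmetic can be checked with a short sketch. This is an illustrative Python helper, not the FQP implementation; the function name and the generated MRN strings are assumptions.

```python
import math
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(ids), chunk_size):
        yield list(ids[start:start + chunk_size])


# Placeholder identifiers matching the example: c = 10096, d = 1000.
mrns = [f"MRN{i}" for i in range(10_096)]
chunks = list(chunked(mrns, 1_000))

print(len(chunks))                    # 11
print(len(chunks[-1]))                # 96
print(math.ceil(len(mrns) / 1_000))   # 11
```

Each chunk would become the identifier list of one dependent query, so no single native query carries more than the chunk size.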


Problem – XML Serialization and De-serialization is Expensive

    • XML is used to deliver the results of CQL queries

      • A single XML result file is generated

      • WS-Enumeration can break a result down into smaller pieces, but

        • It was not used by FQP to query the grid data services

        • The data service, grid and FQP all serve WS-Enumeration requests by de-serializing the entire object in memory

        • The entire object is then written to disk as a resource to serve the client


Solution: WS-Enumeration in Client (Controller), FQP & Data Services

    To utilize WS-Enumeration:

    • Grid Data Services were generated with caGrid WS-Enumeration enabled.

    • FQP: implemented new code to support WS-Enumeration.

    • Controller: used the Federated Query Results Client’s Enumerate method.

      Using WS-Enumeration end-to-end allowed transfer of larger data sets over SOAP from the Data Service to the ACE Controller.

    Data path: WS-Enumeration Enabled Grid Data Service → Federated Query Processor → Controller
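
The pull-based idea behind WS-Enumeration can be sketched with a toy in-memory model: the consumer asks for at most a fixed number of elements per request and stops when the producer signals end of sequence. This is not the caGrid or WS-Enumeration client API; the class, method and function names below are hypothetical stand-ins.

```python
from typing import Iterable, Iterator, List, Tuple


class ToyEnumeration:
    """In-memory stand-in for an enumeration context (hypothetical class)."""

    def __init__(self, items: Iterable[str]) -> None:
        self._items: Iterator[str] = iter(items)
        self._done = False

    def pull(self, max_elements: int) -> Tuple[List[str], bool]:
        """Return up to max_elements items and an end-of-sequence flag."""
        batch: List[str] = []
        for _ in range(max_elements):
            try:
                batch.append(next(self._items))
            except StopIteration:
                self._done = True
                break
        return batch, self._done


def drain(enumeration: ToyEnumeration, max_elements: int) -> int:
    """Pull batches until end of sequence; return the number of items seen."""
    total = 0
    while True:
        batch, end_of_sequence = enumeration.pull(max_elements)
        total += len(batch)  # a real consumer would process the batch here
        if end_of_sequence:
            return total


# Rows come from a generator, so they are produced lazily, batch by batch.
rows = (f"<row id='{i}'/>" for i in range(2_500))
print(drain(ToyEnumeration(rows), 1_000))  # 2500
```

Only one batch is held by the consumer at a time, in contrast to de-serializing a single large result document.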


Non-Standard Configuration Settings

    • WS-Enumeration services returned ALL associations on the target object and generated lazy load exceptions

    • David Ervin’s patch permitted lazy loading and prevented unwanted associations on the target object from being returned. This vastly reduced the size of the returned results and the subsequent network overhead.

    • Changed the default JVM memory sizes for the data services and FQP (currently 15G and 6G respectively)

    • Turned off Ehcache as unsuitable for our application: caches consume memory and disk space.


Outstanding Issues

    We did not resolve the inefficient translation of CQL containing Associations into SQL. We worked around it by joining with Foreign Associations, although fixing the CQL-to-SQL translation would (theoretically) have been the more appropriate solution.


Summary

    • After several bug fixes, FQP is able to handle extremely large data sets.

    • With customizations to the caGrid stack, we are able to use the technology to share information and analytical resources efficiently.

    • With the ACE application built on the caGrid stack, we are able to facilitate inter-departmental data sharing within UAB.


Acknowledgements

    Working with the caGrid Knowledge Center has been very helpful.

    • Justin D. Permar

      Senior Consultant, Biomedical Informatics; Director, Center for IT Innovations in Healthcare (CITIH)

    • David W. Ervin

      Biomedical Informatics Consultant; Center for IT Innovations in Healthcare, Team Manager

    • William Stephens

      Senior Biomedical Informatics Consultant; Center for IT Innovations in Healthcare, Team Manager


UAB Team

    CCTS (CTSA)

    Lisa Guay-Woodford, MD (PI)

    Eta S. Berner, EdD (Director)

    Elliot Lefkowitz, PhD (Director)

    Matthew Wyatt, MSHI

    John David Osborne, MS

    R. Curtis Hendrickson

    Harsh Taneja, MS

    Niveditha Thota, MS

    Don Dempsey, MS

    Health Systems Information Systems (HSIS)

    Geoff Gordon, MS (Web Development Director)

    Steve Osburne (IT)

    Terrell W Herzig (Data Security Officer)

    Tim Day, PhD

    Greg Fuller (GUI)

    Suresh Nair (DBA)

    UAB Health System Data Resources Group

    Andy Matthews

    Stephen W Duncan

    Darlene Green, RN, DSN

    UAB IT Research Computing

    John-Paul Robinson (Lead)

    Poornima Pochana, MS

    Shantanu Pavgi, MS

    Comprehensive Cancer Center:

    John Sandefur, MBA, CISSP

    FUNDING:

    UAB CCTS is funded through a CTSA grant (5UL1 RR025777)


    Thank you. Questions?


DCQL Structure Before Restructuring

    • Lab C

    • Lab A



DCQL Structure After Restructuring

    • Lab B

    • Lab C

    • Lab A

