
Life Sciences: Data Revolution

Session : 40382. Life Sciences: Data Revolution. Building Gene Expression Databases. Mahendra Navarange. Microarray Centre MRC Clinical Sciences Centre and Imperial College, UK. Agenda. What is Life Science? MiMiR : database for gene expression data


Presentation Transcript


  1. Session : 40382 Life Sciences: Data Revolution Building Gene Expression Databases Mahendra Navarange Microarray Centre MRC Clinical Sciences Centre and Imperial College, UK

  2. Agenda • What is Life Sciences? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets

  3. What is Life Sciences? • Includes • Biology • Biotechnology • Chemistry • Pharmaceuticals • Agriculture / Plant Science • Environmental Sciences • ???? • Objective • Understand the molecular and evolutionary basis of living organisms

  4. Focus Areas • Genomics • Human Genome Project • Draft published in 2000 • Finished version on 14 April 2003 • Sequencing data doubles every year • Transcriptomics • Study of transcription (gene expression) • Proteomics • Study of translation (protein synthesis) Courtesy F. Hoffmann-La Roche Ltd.

  5. Data…Data…Data • Sanger Centre: 5 TB • Celera: ~100 TB+ (2001)

  6. Data Revolution in Life Sciences • Impact of technology • High throughput platforms (HTP) • Robotics • Miniaturisation • Data driven science • Datawarehousing technologies • Data mining and visualisation software (Diagram labels: Information Technology, Life Sciences)

  7. Databases • Genomics • Sanger • NCBI • TIGR • KEGG • Transcriptomics • ArrayExpress • Proteomics • Protein Databank (PDB) • SWISSPROT • Entrez

  8. Target Validation Using Life Sciences Data • Identify causes of genetic diseases • Discover new drug compounds • Personalised medicine • Develop new diagnostics (Diagram: Drug Discovery Pipeline; labels: HTP Screening, Target Identification, Clinical Trials, Hits, Leads, FDA)

  9. Life Sciences: The Future • “…biology is changing from a purely laboratory-based science to an information based science.” Eric Lander, Director, Whitehead Institute, MIT

  10. Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets

  11. Transcriptomics • Comparing gene expression across databases • Collaborate to share expertise • Benefits • Diagnostics • Screen target drug compounds • Identify toxic side effects • Screen patients for clinical trials

  12. Workflow (diagram labels: Literature, Experiment design, Data, HTP, Preliminary Analysis, Further Analysis, Local DB, GO, NCBI, Collaboration)

  13. HTP Microarray Platform : Hardware Courtesy Affymetrix Inc., Dell Inc

  14. Microarray Data Acquisition Courtesy Fisher Scientific Courtesy Affymetrix Inc.

  15. Microarray Data • High density microarray • ~ 500,000 spots of ~18 µm size • >20,000 genes • Typical file size 45MB • No. of files produced in typical experiment 10-20. Courtesy Affymetrix Inc.
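The figures on this slide allow a rough per-experiment storage estimate. The sketch below only multiplies the slide's numbers (45 MB per file, 10 to 20 files per experiment); the class and method names are illustrative and do not come from MiMiR.

```java
// Back-of-the-envelope estimate of raw data volume per experiment.
// The constants come from the slide; everything else is a hypothetical helper.
class ExperimentVolume {
    static final int FILE_SIZE_MB = 45; // typical file size
    static final int MIN_FILES = 10;    // fewest files in a typical experiment
    static final int MAX_FILES = 20;    // most files in a typical experiment

    /** Raw data volume in MB for an experiment that produces fileCount files. */
    static int volumeMb(int fileCount) {
        return FILE_SIZE_MB * fileCount;
    }

    public static void main(String[] args) {
        System.out.println("Low estimate:  " + volumeMb(MIN_FILES) + " MB");  // 450 MB
        System.out.println("High estimate: " + volumeMb(MAX_FILES) + " MB");  // 900 MB
    }
}
```

On these numbers a typical experiment produces roughly 0.45 to 0.9 GB of raw files before any annotation or derived data is stored.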

  16. Life Sciences Data Explosion • Data Characteristics • Image data generated by HTP platforms, annotation by researchers • Large volume and size • Varied data types • Datawarehousing challenges • Non-summarisable • High dimensionality • Limited knowledge of underlying biological processes • No standard industry data models or best practices

  17. Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets

  18. System Requirements • Seamless data integration • Handle wide range of datatypes • Processor intensive and I/O intensive • Exponential growth in data storage • Open architecture, collaboration

  19. System Requirements • Rapid changes – new databases, technologies and instruments • Competitive pressures, quick response, low access times • Plug and play capability • Security

  20. MIcroarray Data MIning Resource • MiMiR – Microarray Datawarehouse • ~250GB. Expected to double in next few months • ~2500 images, over 1500 BioAssays • 52 tables, largest table 15GB • Infrastructure • Oracle 9i Release 1 on Windows 2000 • Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk • 1 TB NAS capacity

  21. Requirements vs. Solutions • Integrate different types of data sources: XML for data exchange; Oracle Ultra Search • Efficient data retrieval: stringent response-time standards on procedures; index-organized tables, partitioning • Security: firewall; Single Sign-On servers (in progress) • Rapid change management: BC4J framework, JDeveloper; extreme programming, prototyping

  22. MiMiR System Architecture (diagram labels: Annotation, MAGE-ML, Spot Info, Images, JDeveloper, Ext Ref, Blast, 9iAS, Admin, MiMiR, Application Server, XSQL, XSU, XDK, BC4J, JClient, JSP, ArrayExpress, Private)

  23. Oracle Products Used • Oracle 9i Database Server/Client (Release 1) • Partitioning • Join indexing • Oracle 9i JDeveloper (9.0.2) • Oracle 9i Application Server (BC4J) • Oracle XML features • Oracle PL/SQL packages for XML • Oracle XSQL publishing framework • XDK (DOMParser and SAXParser) • XSU • Oracle Data Mining (Future) • Oracle Collaboration Suite (Future)

  24. Why Oracle ? • Readily scalable • Manage wide variety of data types • Integrated development tools • Support XML and Java • High performance middleware • Secure collaboration

  25. Agenda • What is Life Sciences? • MiMiR: database for gene expression data • Data acquisition and profiling • System requirements • Design issues • Code snippets

  26. Oracle and XML: Design Issues • Storage: storing XML in tables, storing XML in CLOBs, hybrid • Generation: XDK for Java, PL/SQL, XSU • Transformation: XSL stylesheets, views • Processing: XDK DOMParser, XDK SAXParser • Searching: XPath, Oracle Text • Publishing: XSQL publishing framework, XSL
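The "Processing" choice between a DOM parser and a SAX parser matters most for large documents: DOM builds the whole tree in memory, while SAX streams through it and fires callbacks. The sketch below illustrates the streaming style using the JDK's built-in JAXP SAX parser, not the Oracle XDK SAXParser named on the slide. The `RowCounter` class is hypothetical, and the ROWSET/ROW element names follow the sample output shown later in the deck.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts ROW elements without building a DOM tree.
class RowCounter {
    static int countRows(String xml) {
        final int[] count = {0};
        try {
            // JAXP SAX parser from the JDK (an assumption; the slide uses Oracle XDK).
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        // Called once per opening tag; only the counter is kept in memory.
                        if ("ROW".equals(qName)) {
                            count[0]++;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<ROWSET><ROW num=\"1\"/><ROW num=\"2\"/></ROWSET>";
        System.out.println(countRows(xml)); // prints 2
    }
}
```

Because the handler keeps only a counter, memory use stays flat regardless of how many rows the document holds, which is the usual argument for SAX over DOM on very large result sets.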

  27. Oracle and XML: XSQL Example

      <?xml version="1.0" encoding='windows-1252'?>
      <!--
       | Uncomment the following processing instruction and replace
       | the stylesheet name to transform output of your XSQL Page using XSLT
       <?xml-stylesheet type="text/xsl" href="YourStylesheet.xsl" ?>
      -->
      <?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
      <xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
        select * from array
      </xsql:query>
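An XSQL page of this kind runs the query, produces ROWSET/ROW XML, and applies the referenced XSLT stylesheet to it. The transformation step can be sketched on its own with the JDK's `javax.xml.transform` API. This is an illustration only: the contents of `mimirArray.xsl` are not shown in the deck, so the stylesheet below, the `ARRAY_NAME` column, and the `RowsetToList` class are all assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-in for the XSLT step an XSQL page performs.
class RowsetToList {
    // Assumed stylesheet: renders each ROW's ARRAY_NAME (a made-up column) as a list item.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/ROWSET'>"
      + "<ul><xsl:for-each select='ROW'>"
      + "<li><xsl:value-of select='ARRAY_NAME'/></li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String rowsetXml) {
        StringWriter out = new StringWriter();
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            t.transform(new StreamSource(new StringReader(rowsetXml)),
                        new StreamResult(out));
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<ROWSET><ROW num=\"1\"><ARRAY_NAME>MG-U74Av2</ARRAY_NAME></ROW></ROWSET>";
        System.out.println(transform(xml));
    }
}
```

Keeping the presentation in a stylesheet means the same query output can be published in several formats by swapping the XSL file rather than changing the query.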

  28. Oracle and XML: Design Issues

  29. Agenda • What is Life Sciences? • MiMiR: database for gene expression data • Data profiling • System requirements • Design issues • Code snippets

  30. An Example • Creating XML from 500,000 records in the database

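  31. Solution 1 • Using the XSU Java API to get an XML DOM.

      import oracle.xml.parser.v2.XMLDocument;
      import oracle.xml.sql.query.OracleXMLQuery;

      // createConnection is the presenter's own helper class; it returns a JDBC Connection.
      conn = createConnection.createConnection();
      String query = "SELECT * FROM IMAGE_QUANTITATION i "
                   + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
      OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
      q1.keepCursorState(true);
      // Build the whole result set as an in-memory DOM tree.
      XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
      // Print the document instance; "out" is an output stream or writer defined elsewhere.
      xmlDoc.print(out);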

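  32. Solution 2 • Using the XSU Java API to get an XML string.

      import oracle.xml.sql.query.OracleXMLQuery;

      // createConnection is the presenter's own helper class; it returns a JDBC Connection.
      conn = createConnection.createConnection();
      String query = "SELECT * FROM IMAGE_QUANTITATION i "
                   + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
      OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
      q1.keepCursorState(true);
      // The DOM calls from Solution 1 are commented out:
      // XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
      // xmlDoc.print(out);
      // Fetch the result as a single XML string instead.
      System.out.println(q1.getXMLString());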

  33. Solution 3 • Using the dbms_xmlquery package to get XML output from SQL.

      select dbms_xmlquery.getXML('select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa''') from dual

      Output (truncated on the slide):

      <?xml version = '1.0'?>
      <ROWSET>
        <ROW num="1">
          <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
          <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
          <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
          <POSITIVE>2</POSITIVE>
          <NEGATIVE>5</NEGATIVE>
          <PAIRS>20</PAIRS>
          <PAIRS_USED>20</PAIRS_USED>
          <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
          ...
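All three solutions emit the same canonical shape: a ROWSET root, one ROW element per record with a `num` attribute, and one child element per column. As a rough illustration of that shape, the sketch below writes it with the JDK's streaming `XMLStreamWriter`, so no DOM tree is held in memory while rows are emitted. It does not use XSU or `dbms_xmlquery`; the `RowsetWriter` class and its in-memory row list are assumptions standing in for a real JDBC result set.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: serialises rows (column name -> value) as ROWSET/ROW XML.
class RowsetWriter {
    static String toRowsetXml(List<Map<String, String>> rows) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument("1.0");
            w.writeStartElement("ROWSET");
            int num = 1;
            for (Map<String, String> row : rows) {
                w.writeStartElement("ROW");
                w.writeAttribute("num", String.valueOf(num++));
                // One child element per column, in insertion order.
                for (Map.Entry<String, String> col : row.entrySet()) {
                    w.writeStartElement(col.getKey());
                    w.writeCharacters(col.getValue());
                    w.writeEndElement();
                }
                w.writeEndElement(); // ROW
            }
            w.writeEndElement(); // ROWSET
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values taken from the output shown on the slide.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("IMAGE_ID", "PMB2002011003Aaa");
        row.put("CHIP_TYPE", "MG-U74Av2");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row);
        System.out.println(toRowsetXml(rows));
    }
}
```

For a result set of 500,000 records, a writer like this would normally target a file or network stream rather than a `StringWriter`, so that output is flushed as rows are read.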

  34. Summary • Life sciences research is generating an enormous amount of data using HTP platforms • The data are non-summarisable, distributed and of varied types • Data integration and secure collaboration are key to success • MiMiR

  35. Acknowledgements • Dr. Helen Causton • Prof. Tim Aitman • Dr. Laurence Game • Helen Banks • Nicola Cooley • Vihar Wadekar • Helen Figueira • MGED Data Society (www.mged.org)

  36. Session : 40382 Life Sciences: Data Revolution Building Gene Expression Databases What Next : Opportunities for collaboration for development of Knowledge Management Systems for Drug Discovery Contact: mahendra.navarange@csc.mrc.ac.uk http://microarray.csc.mrc.ac.uk
