Searching Large Scientific Data

Searching Large Scientific Data. John Wu Scientific Data Management Lawrence Berkeley National Laboratory. Outline. Highlight of Accomplishments Grid Collector ( accelerate others’ work ) Query-Driven Visualization ( enabling new way of knowledge discovery )

Presentation Transcript


  1. Searching Large Scientific Data John Wu Scientific Data Management Lawrence Berkeley National Laboratory

  2. Outline • Highlight of Accomplishments • Grid Collector (accelerate others’ work) • Query-Driven Visualization (enabling a new way of knowledge discovery) • Molecular docking (enabling others to accomplish great things) • Outlook • More complex searches • Parallelization • Supporting more data formats • Integration with large frameworks

  3. FastBit In a Nutshell • FastBit is designed to search multi-dimensional append-only data • Conceptually in table format: rows → objects, columns → attributes • FastBit uses vertical (column-oriented) organization for the data • Efficient for searching • FastBit uses bitmap indices with our compression method • Proven in analysis to be optimal for one-dimensional queries • Faster than other optimal indexes for multi-dimensional queries [Wu, Otoo, Shoshani 2006]
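The idea of a bitmap index over column-oriented data can be illustrated with a minimal sketch. It is not FastBit's implementation: it assumes one uncompressed bitmap per distinct value, stored as a Python integer, and leaves out compression entirely. The column names `energy` and `charge` and the helper functions are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def range_query(index, low, high):
    """OR together the bitmaps of all indexed values in the closed range [low, high]."""
    hits = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            hits |= bitmap
    return hits


def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical toy table, stored column by column (vertical organization).
energy = [3, 7, 7, 1, 9, 4]
charge = [0, 1, 0, 0, 1, 1]

energy_index = build_bitmap_index(energy)
charge_index = build_bitmap_index(charge)

# "energy between 4 and 9 AND charge == 1": each condition yields one bitmap,
# and the multi-dimensional query is a single bitwise AND of the two.
matches = range_query(energy_index, 4, 9) & range_query(charge_index, 1, 1)
print(rows_of(matches))
```

Each condition touches only the column it refers to, which is why the column-oriented layout and the per-column indexes fit together.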

  4. Motivation • Scientific datasets are getting larger fast • Most data analysis algorithms cannot handle a whole dataset • Therefore, most data analysis tasks are performed on a subset of the data • Some examples of searches • Find the collision events with the most distinct features of Quark-Gluon Plasma from a high-energy physics experiment • Find and track ignition in a combustion simulation • Identify the puppet-master behind a distributed denial-of-service attack on a computer network

  5. Highlight 1 – Grid Collector • Searching over billions of objects with hundreds of attributes each: • Distributed analysis over the Grid • Make petabytes of raw data available for world wide analyses • Benefits of the Grid Collector: • Transparent object access, select objects based on their attributes • Improvement of analysis system’s throughput • Best Paper Award (ISC’05) [Wu, Gu, Lauret, Poskanzer, Shoshani, Sim and Zhang 2005]

  6. Grid Collector Speeds up Analyses • Test machine: 2.8 GHz Xeon, 27 MB/s read speed • When searching for rare events, say, selecting one event out of 1000, using GC is 20 to 50 times faster • Using GC to read 1/2 of the events, speedup > 1.5; reading 1/10 of the events, speedup > 2 • Bottom line – improve the throughput of data analyses!

  7. Highlight 2 – Visualization • Query-Driven Visualization – collaboration between SDM and VACET • Use FastBit indexes to efficiently select the most interesting data for visualization • Above example: laser wakefield accelerator simulation • VORPAL produces 2D and 3D simulations of particles in laser wakefield • Finding and tracking particles with large momentum is key to designing the accelerator • The brute-force algorithm is quadratic (taking 5 minutes on 0.5 million particles); FastBit time is linear in the number of results (taking 0.3 s, a 1000 X speedup)

  8. Bin-Based Parallel Coordinate Display • Integrate FastBit with H5Part, an HDF5 package for particle physics data • Use FastBit to compute histograms efficiently • Bin-based parallel coordinate display reduces the number of lines displayed on screen, reduces visual clutter, reduces response time • FastBit further speeds up the response time

  9. FastBit Speeds up Histogramming • Time needed to compute desired histograms • Custom code that uses the raw data directly • FastBit can be 1000 X faster than the custom code (left) • FastBit maintains the performance advantage on a parallel system • [Figure: timing plots, lower is better; annotation: ~ 104 X]
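One way to see why an index can speed up histogramming: if each bin of a variable already has a bitmap, a histogram is just the bit count of each bitmap, and a conditional histogram is the bit count after an AND with a selection mask. The sketch below assumes uncompressed bitmaps held in Python integers and hypothetical variables `px` and `x`; it shows the idea only, not FastBit's code.

```python
import bisect


def build_binned_index(values, edges):
    """Return one bitmap per bin; bin k covers edges[k] <= v < edges[k + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        k = bisect.bisect_right(edges, v) - 1
        if 0 <= k < len(bitmaps):  # values outside the edges are ignored
            bitmaps[k] |= 1 << row
    return bitmaps


def popcount(bitmap):
    """Number of set bits, i.e. number of rows in the bitmap."""
    return bin(bitmap).count("1")


def histogram(bitmaps, mask=None):
    """Count rows per bin, optionally restricted to the rows set in mask."""
    if mask is None:
        return [popcount(b) for b in bitmaps]
    return [popcount(b & mask) for b in bitmaps]


# Hypothetical particle data: momentum px and position x, one entry per particle.
px = [0.5, 2.5, 1.5, 3.5, 2.1, 0.1]
x = [10, 20, 30, 40, 50, 60]

px_bins = build_binned_index(px, [0, 1, 2, 3, 4])
x_bins = build_binned_index(x, [0, 30, 60, 90])

print(histogram(px_bins))                  # histogram of px over all particles
print(histogram(px_bins, mask=x_bins[1]))  # histogram of px where 30 <= x < 60
```

The raw values are read once, when the index is built; every later histogram works on the bitmaps alone.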

  10. Highlight 3 – Molecular Docking • Jochen Schlosser [schlosser@zbh.uni-hamburg.de], Center for Bioinformatics, University of Hamburg • Application: Structure-based virtual screening (ACS Fall 2007) • Standard approach: match every ligand with every target protein • New approach: using FastBit indexes to avoid brute-force matching • [Figure: n ligands and one target protein; n docking runs (match ligand with cavity); hit list of Name / Score: 1bef -16,4; 4dab -12,3; 4d2a -11,6; …]

  11. Use of FastBit for Molecular Docking Method • Specification of the descriptor as triangle geometry • Types of interaction centers • Triangle side lengths • Interaction directions • 80 bulk dimensions • Receptors • Receptor descriptors are generated similarly • Using complementary information where necessary • Use of pharmacophore constraints on receptor triangles • Reduces number of queries • Improved query selectivity because the pharmacophore tends to be inside the protein cavity

  12. Use of FastBit for Molecular Docking Method • Indexing system • Properties of the problem: • Billions of descriptors (~ 1,000 for each ligand) • High-dimensional queries • Properties of bitmap indexes • Well suited for this kind of query • Can be run stand-alone • Further compression possible • FastBit uses compression • Results • TrixX-BMI is an efficient tool for virtual screening with average runtime in the sub-second range • Screens libraries of ligands 12 times faster than FlexX without pharmacophore constraints • With pharmacophore constraints, the speedup is 140 to 250 • [Figure: bitmap index for attribute(i), a 0/1 matrix over descriptors desc1 … desc5 and values [0] … [n]]

  13. Outline • Highlight of Accomplishments • Grid Collector • Query-Driven Visualization • Molecular docking • Outlook • More complex searches • Parallelization • Supporting more data formats • Integration with large frameworks

  14. Complex Searches • So far, FastBit software primarily handles range queries of the form “pressure > 105 and temperature between 800 and 1000” • Need to support complex types of searches • GTC data analysis: find all particles with certain energy level that have passed through a region with specified properties on the electric field • Network security: find the hosts that have contacted all identified drones within an hour of the start of an attack • Protein sequences: Identify known proteins with specified molecular weight • Catalog matching: matching records of stars and galaxies from one survey / simulation to another one • Subqueries: searching the results of previous searches

  15. Complex Searches • Extending the histogramming functionality: group by, top-k, automatic computation of derived fields • Implement join algorithms • Existing bitmap indexes are efficient for filtering out the desired records for common join algorithms such as sort-merge join • Existing bitmap-index-based join algorithms appear promising from back-of-envelope calculation • A* algorithm: for problems such as neighborhood expansion, formulating them as joins may not be as efficient as using alternative search algorithms such as A*
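The "filter first, then sort-merge join" pattern mentioned above can be sketched as follows, using catalog matching as the example. This is a plain illustration under stated assumptions: the list comprehension stands in for an index-based filter, and the record fields (`id`, `cell`, `mag`) and the `sort_merge_join` helper are hypothetical, not part of FastBit.

```python
def sort_merge_join(left, right, key):
    """Join two lists of dict records on equal key values using sort-merge."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of right records sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ... and pair it with every left record sharing the same key.
            while i < len(left) and left[i][key] == lk:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical star catalogs; "cell" is a coarse sky-position bucket.
survey_a = [
    {"id": "a1", "cell": 7, "mag": 12.0},
    {"id": "a2", "cell": 3, "mag": 18.5},
    {"id": "a3", "cell": 3, "mag": 14.2},
    {"id": "a4", "cell": 9, "mag": 13.0},
]
survey_b = [
    {"id": "b1", "cell": 3, "mag": 14.0},
    {"id": "b2", "cell": 7, "mag": 11.9},
    {"id": "b3", "cell": 7, "mag": 16.0},
    {"id": "b4", "cell": 5, "mag": 10.0},
]

# Filtering step (an index would do this without scanning every record).
bright_a = [r for r in survey_a if r["mag"] < 15.0]
bright_b = [r for r in survey_b if r["mag"] < 15.0]

pairs = [(l["id"], r["id"]) for l, r in sort_merge_join(bright_a, bright_b, "cell")]
print(pairs)
```

The smaller the filtered inputs, the cheaper the sort and merge phases, which is where an efficient filter pays off.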

  16. Parallelization • For I/O dominated tasks, • Take advantage of parallel I/O system, PVFS • Better data layout to effectively utilize the I/O hardware • Active Storage, In-Situ data processing • For CPU dominated tasks, • Devise new algorithms, e.g., parallel join algorithms, new join indexes • Algorithms for GPU, Cell processor, and many-core architecture

  17. More Data Formats • Work with application specialists to integrate FastBit with their data libraries • H5Part: HDF5 • ROOT (?) • ADIOS • Restructure FastBit to make it easier to work with different data formats • Virtualize data sources

  18. Integrated Data Analysis Framework • Iterator for coarse grain data • Examples: ROOT and Map-Reduce • Indexing provides a way to implement a “smart iterator”, e.g., Grid Collector for STAR data analysis framework (using ROOT) • Framework for fine grain data • Tighter integration with programmatic API • Provide scripting support for productivity layer (end user)
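A "smart iterator" can be pictured as an iterator that consults an index and fetches only the events that satisfy a condition, so the analysis loop never reads the rest. The sketch below is a toy model of that idea; `EventFile`, `build_index`, `smart_iterator` and the `n_tracks` attribute are hypothetical names and do not reflect the Grid Collector or ROOT APIs.

```python
class EventFile:
    """Hypothetical stand-in for a file of events."""

    def __init__(self, name, events):
        self.name = name
        self.events = events
        self.reads = 0  # how many events were actually fetched

    def read(self, position):
        self.reads += 1
        return self.events[position]


def build_index(files, attribute):
    """Record (file number, position, attribute value) for every event, ahead of time."""
    return [
        (file_no, position, event[attribute])
        for file_no, f in enumerate(files)
        for position, event in enumerate(f.events)
    ]


def smart_iterator(files, index, predicate):
    """Yield only events whose indexed value satisfies predicate; skip the rest unread."""
    for file_no, position, value in index:
        if predicate(value):
            yield files[file_no].read(position)


files = [
    EventFile("run1", [{"n_tracks": 12}, {"n_tracks": 950}, {"n_tracks": 40}]),
    EventFile("run2", [{"n_tracks": 7}, {"n_tracks": 1200}]),
]
index = build_index(files, "n_tracks")

# The analysis loop sees only the selected events.
selected = list(smart_iterator(files, index, lambda n: n > 900))
print(selected, sum(f.reads for f in files))
```

From the analysis code's point of view this is still an ordinary iterator, which is what makes it easy to slot into an existing coarse-grain framework.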

  19. Indexes Facilitate Smart Analysis Indexes go here! Or How to make your system smarter!
