Using Bitmap Index to Speed up Analyses of High-Energy Physics Data


Presentation Transcript


  1. Using Bitmap Index to Speed up Analyses of High-Energy Physics Data • John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art Poskanzer (Lawrence Berkeley National Laboratory) • Wei-Ming Zhang (Kent State University) • Jerome Lauret (Brookhaven National Laboratory)

  2. Outline • Overview of bitmap indexes • Introduction to FastBit • Overview of the Grid Collector • Two use cases • “common” jobs • “exotic” jobs

  3. Basic Bitmap Index • Compact: one bit per distinct value per object • Easy to build: faster than common B-trees • Efficient to query: only bitwise logical operations • A < 2 ⇒ b0 OR b1 • 2 < A < 5 ⇒ b3 OR b4 • Efficient for multi-dimensional queries • Use bitwise operations to combine the partial results • [Figure: a column of data values (e.g. 5, 3, 1, 2, 0, 4, 1) next to six bitmaps b0–b5, one per distinct value =0 … =5; each row sets exactly one bit]
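
To make the bitwise evaluation concrete, here is a minimal Python sketch of an equality-encoded bitmap index. This is illustrative only, not FastBit itself; the data values are the example column quoted on this slide.

```python
# Minimal illustration of an equality-encoded bitmap index (not FastBit).
# One bitmap per distinct value; a range condition becomes a bitwise OR.

def build_bitmap_index(values, cardinality):
    """Return one Python integer per distinct value, used as a bit vector."""
    bitmaps = [0] * cardinality
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row          # set this row's bit in bitmap b_v
    return bitmaps

A = [5, 3, 1, 2, 0, 4, 1]               # example data values for attribute A
b = build_bitmap_index(A, cardinality=6)

# "A < 2"   ->  b0 OR b1
hits_lt2 = b[0] | b[1]
# "2 < A < 5"  ->  b3 OR b4
hits_mid = b[3] | b[4]

# A multi-dimensional condition would AND such partial results together.
rows = [r for r in range(len(A)) if (hits_lt2 >> r) & 1]
print("rows with A < 2:", rows)         # rows whose value is 0 or 1
```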

  4. An Efficient Compression Scheme: Word-Aligned Hybrid (WAH) Code • Example: a bitmap of 2015 bits, 100000000000000000000111000…000111…111 • Group the bits into 65 31-bit groups (31 bits each) • Merge neighboring groups with identical bits: the middle 63×31 bits collapse into one run • Encode each group using one word: a literal word, a fill word with run length 63, and a literal word • WAH represents the whole bitmap in just three words
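
A simplified sketch of the WAH idea in Python, assuming 32-bit words with 31 payload bits. This is an illustrative encoder, not the production FastBit code, and the bit pattern below only mimics the 2015-bit example on the slide.

```python
# Simplified WAH-style encoder for illustration (32-bit words, 31 payload bits).
# A literal word stores 31 bits verbatim; a fill word stores the fill bit and
# a run length counted in 31-bit groups.

def wah_encode(bits):
    """bits: string of '0'/'1'. Returns ('literal', group) and
    ('fill', fill_bit, run_length_in_groups) entries."""
    groups = [bits[i:i + 31] for i in range(0, len(bits), 31)]
    words = []
    for g in groups:
        if g in ('0' * 31, '1' * 31):
            fill_bit = g[0]
            if words and words[-1][0] == 'fill' and words[-1][1] == fill_bit:
                words[-1] = ('fill', fill_bit, words[-1][2] + 1)  # extend run
            else:
                words.append(('fill', fill_bit, 1))
        else:
            words.append(('literal', g))
    return words

# A 2015-bit bitmap shaped like the slide's example: 65 groups of 31 bits,
# with 63 identical all-zero groups in the middle.
example = ('1' + '0' * 20 + '111' + '0' * 7    # first 31-bit group (literal)
           + '0' * (63 * 31)                   # 63 all-zero groups (one fill)
           + '0' * 6 + '1' * 25)               # last 31-bit group (literal)
assert len(example) == 2015
print(wah_encode(example))                     # literal, fill (run 63), literal
```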

  5. Compressed Bitmap Index Is Compact • The expected index size for a uniform random attribute is at most about 2N words, smaller than typical B-trees (3N–4N) • N is the number of rows, w is the number of bits per word, c is the number of distinct values, i.e., the attribute cardinality • [Plots: index sizes for 100 M rows of synthetic data and 25 M rows of combustion data]
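
The formula behind this claim did not survive in the transcript; the sketch below uses an approximate word-count estimate in the spirit of the published WAH analysis (treat the exact expression and constants as an approximation, not the slide's own formula). N, w and c are as defined above.

```python
# Back-of-the-envelope estimate of compressed index size for a uniform random
# attribute. The per-bitmap word count is an approximation, not an exact result.

def wah_bitmap_words(n_rows, bit_density, word_bits=32):
    """Approximate number of words in one WAH-compressed bitmap."""
    w = word_bits - 1                    # payload bits per word
    d = bit_density
    return (n_rows / w) * (1.0 - (1.0 - d) ** (2 * w) - d ** (2 * w))

def index_words(n_rows, cardinality, word_bits=32):
    """Approximate total words for c equality bitmaps, each with density 1/c."""
    return cardinality * wah_bitmap_words(n_rows, 1.0 / cardinality, word_bits)

N = 100_000_000
for c in (10, 100, 10_000, 1_000_000):
    print(c, index_words(N, c) / N)      # stays at or below ~2 words per row,
                                         # versus 3N-4N words for a typical B-tree
```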

  6. Compressed Bitmap Index Is Optimal for One-Dimensional Queries • Compressed bitmap indices are optimal for one-attribute range conditions • Query processing time is at worst proportional to the number of hits • Only a small number of the most efficient indexing schemes, such as B-trees, have this property • Bitmap indices are also efficient for multi-dimensional queries

  7. Compressed Bitmap Index Is Efficient for Multi-Dimensional Queries • [Figure: log-log plot of query processing time for queries of different sizes] • The compressed bitmap index is at least 10X faster than the B-tree and 3X faster than the projection index

  8. Data Analysis Process In STAR • Users want to analyze “some” (not all) events • Events are stored in millions of files • Files distributed on many storage systems • To perform an analysis, a user needs to • Prepare an analysis • Write the analysis code • Specify the events of interest • Run an analysis • Locate the files containing the events of interest • Prepare disk space for the files • Transfer the files to the disks • Recover from any errors • Read the events of interest from files • Remove the files

  9. Components of the Grid Collector (legend: red = new components, purple = existing components) • Locate the files containing the events of interest • Event Catalog, file & replica catalogs • Prepare disk space and transfer • Prepare disk space for the files • Disk Resource Manager (DRM) • Transfer the files to the disks • Hierarchical Resource Manager (HRM) to access HPSS • On-demand transfers from HRM to DRM • Recover from any errors • HRM recovers from HPSS failures • DRM recovers from network transfer failures • Read the events of interest from files • Event Iterator with fast-forward capability • Remove the files • DRM performs garbage collection using pinning and lifetimes • Consistent with other SRM-based strategies and tools

  10. Grid Collector: Architecture • [Diagram: clients, servers, and the Grid Collector] • Index Builder (administrator side): in STAR tag files, out bitmap indexes • File Locator with Replica Catalog: in logical file name, out physical location • Event Catalog: in query conditions, out logical file names and event IDs • File Scheduler: in physical file; works with HRM 1, HRM 2 and the DRM (NFS, local disk) • Analysis code: issues a new query and reads the selected events through the event iterator • Other diagram labels: fetch tag file, load subset, rollback, commit
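
A hypothetical sketch of how these components fit together for one analysis job. The class and method names below are invented for illustration; only the component names (Event Catalog, File Locator, Replica Catalog, DRM, HRM, event iterator) come from slides 8-10.

```python
# Hypothetical end-to-end flow behind the architecture diagram. All method
# names are assumptions made for this sketch, not the real Grid Collector API.

def run_analysis(conditions, event_catalog, file_locator, drm, hrm,
                 read_events, analyze):
    # 1. Event Catalog: query conditions -> logical file names + event IDs
    #    (this is where the FastBit bitmap indices are used).
    for logical_file, event_ids in event_catalog.query(conditions):
        # 2. File Locator / Replica Catalog: logical name -> physical location.
        physical = file_locator.resolve(logical_file)

        # 3. DRM prepares disk space; if the file is only in HPSS, the HRM
        #    stages it on demand and hands it to the DRM.
        if drm.on_disk(physical):
            local_path = drm.pin(physical)
        else:
            local_path = hrm.stage(physical, into=drm)

        # 4. Event iterator with fast-forward: read only the selected events.
        for event in read_events(local_path, event_ids):
            analyze(event)

        # 5. The DRM garbage-collects files later, based on pins and lifetimes.
        drm.release(local_path)
```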

  11. FastBit Index for the Event Catalog • For 13 million events in a 62 GeV production (STAR 2004) • Event Catalog size (including base data and bitmap indices): 27 GB • tags: 6.0 GB (part of the base data of the Event Catalog) • MuDST: 4.1 TB • event: 8.6 TB • raw: 14.6 TB • Time to produce tags, MuDST and event files from the raw data: 3.5 months on 300+ CPUs • Time to build the catalog: 5 days on one CPU
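
To put these sizes in perspective, two ratios computed directly from the numbers above:

```python
# The catalog is a tiny fraction of the data it indexes (sizes from this slide).
catalog_gb = 27.0
mudst_gb = 4.1 * 1000
raw_gb = 14.6 * 1000

print("catalog / MuDST:", catalog_gb / mudst_gb)   # roughly 0.7% of the MuDST files
print("catalog / raw:  ", catalog_gb / raw_gb)     # roughly 0.2% of the raw data
```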

  12. Grid Collector Speeds Up Reading • Test machine: 2.8 GHz Xeon with 27 MB/s read speed • Without the Grid Collector, an analysis job reads all events • Speedup = time to read all events / time to read the selected events with the Grid Collector • Observed speedup ≥ 1 • When searching for rare events, say selecting one event out of 1000, the Grid Collector is 20 to 50 times faster
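
A toy model of the speedup definition above. The event size and per-event overhead are assumptions chosen only to illustrate why the observed speedup is large but well below 1000 when selecting one event in 1000; they are not STAR measurements.

```python
# Toy model of the speedup on this slide; event_kb and the per-event overhead
# are illustrative assumptions, only the 27 MB/s read rate comes from the slide.

def read_time(n_events, event_kb, read_mb_per_s=27.0, per_event_overhead_s=0.0):
    """Seconds to read n_events of event_kb kilobytes each at the given rate."""
    return n_events * (event_kb / 1024.0 / read_mb_per_s + per_event_overhead_s)

total_events = 1_000_000
selected = total_events // 1000          # "one event out of 1000"
event_kb = 100                           # assumed average event size

t_all = read_time(total_events, event_kb)
# With the Grid Collector only the selected events are read, but each one pays
# some staging / fast-forward cost (assumed here as 0.1 s per event).
t_sel = read_time(selected, event_kb, per_event_overhead_s=0.1)

print("speedup =", t_all / t_sel)        # large, but well below 1000
```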

  13. Grid Collector Speeds Up Actual Analysis • Real analysis jobs typically include their own filtering mechanisms • Real analysis jobs may also spend a significant amount of time performing computation • On a set of “real” analysis jobs that typically select about 10% of events, using the Grid Collector gives a speedup of 2 in CPU time and 1.4 in elapsed time • Speedup = time used with the existing filtering mechanism / time used with the Grid Collector selecting the same events • Tested on flow analysis jobs • Test data set: 51 MuDST files, 8 GB, 25,000 events (P04ij) • The test data uses an efficient organization that favors the existing filtering mechanism: only part of the event data is read for filtering • Speeding up all jobs by 1.4X means the same computer center can accommodate 40% more analysis jobs
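
The capacity claim in the last bullet is simple throughput arithmetic:

```python
# If every job finishes in 1/1.4 of the elapsed time, the same machines run
# 1.4x as many jobs in the same wall-clock budget, i.e. 40% more.
elapsed_speedup = 1.4
jobs_before = 100                                  # arbitrary baseline for illustration
jobs_after = jobs_before * elapsed_speedup
extra = (jobs_after - jobs_before) / jobs_before
print(f"{extra:.0%} more analysis jobs")           # -> 40%
```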

  14. Grid Collector Enables Hard Analysis Jobs • Searching for anti-3He (Lee Barnby, Birmingham): an initial study identified collision events that possibly contain anti-3He and need further analysis (2000) • Searching for strangelets (Aihong Tang, BNL): an initial study identified events that may indicate the existence of strangelets and need further investigation (2000) • Without the Grid Collector, one has to retrieve every file from HPSS and scan it for the wanted events, which may take weeks or months; NO ONE WANTS TO DO IT • With the Grid Collector, both analyses were completed in a day

  15. Summary • The Grid Collector makes use of two distinct technologies: FastBit and SRM (Storage Resource Managers) • It speeds up common analysis jobs where the files are already on disk • It enables difficult analysis jobs where some files may not be on disk • Contact information: John Wu John.Wu@nersc.gov, Jerome Lauret JLauret@bnl.gov
