
Optimizing XML Processing for Grid Applications Using an Emulation Framework



  1. Optimizing XML Processing for Grid Applications Using an Emulation Framework Rajdeep Bhowmik1, Chaitali Gupta1, Madhusudhan Govindaraju1, Aneesh Aggarwal2 1. Grid Computing Research Laboratory (GCRL), Department of Computer Science 2. Electrical and Computer Engineering State University of New York at Binghamton IPDPS 2008, Miami, Florida

  2. Motivation • Emergence of Chip Multiprocessors (CMPs) • Need to study XML-based grid middleware and applications for performance limitations, bottlenecks, and optimization opportunities • How should grid middleware and applications be re-structured and re-tooled for multi-core processors? • What designs will ensure that middleware and applications scale well with the increase in the number of processing cores?

  3. McGrid • McGrid: Multi-core Grid Emulator • An emulation framework for grid middleware • Built on top of SESC: a cycle-accurate multi-core architectural simulator • Configurable for system and micro-architectural parameters • Current focus • Obtain performance results for processing XML-based grid middleware documents on multi-core systems

  4. Grid Simulators • Many grid emulators and simulators exist • GridSim, GangSim, SimGrid, MicroGrid • These do not give feedback at the micro-architecture level • memory access patterns, cache coherency overheads, synchronization between the threads of the application • Some fundamental challenges for code on CMPs • fair and efficient allocation of shared resources between concurrent threads • automatic detection of independent modules • modules that can be executed in parallel

  5. McGrid Design Goals • Micro-architectural Simulator – • Designed on top of SESC • Allows pinning of threads to specific processing cores • Provide Micro-architectural Feedback – • cache access patterns of multiple threads • cache misses for different cache sizes • invalidations due to cache coherency protocol • conflicts in accesses to shared resources • CPU cycles wasted due to synchronization
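
McGrid allows pinning threads to specific processing cores. As a point of reference, the sketch below shows how a thread can be pinned to a core on a Linux host with the standard pthreads affinity call; this only illustrates the general technique, and McGrid/SESC may expose pinning through its own simulator interface rather than this API.

    // Minimal sketch: pin the calling thread to one core with the
    // standard Linux/pthreads affinity API (compile with -pthread).
    // McGrid/SESC may provide its own pinning interface; this only
    // illustrates the general technique.
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    static int pin_current_thread_to_core(int core_id) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core_id, &mask);   // restrict this thread to a single core
        return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    }

    int main() {
        if (pin_current_thread_to_core(0) != 0)
            std::perror("pthread_setaffinity_np");
        // ... XML processing work assigned to this core would go here ...
        return 0;
    }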

  6. McGrid Design Goals (2) • Configurable Design • allow analysis of grid-middleware performance for different processor types used in the heterogeneous grid environment. • Configuration options • Cache and physical memory size • Processor and memory speed • Number of on-chip cores • Pipeline and pre-fetch depth in each core • Execution width of each core
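
To make the list of configuration options concrete, the sketch below collects them into a single struct. The field names and default values are illustrative assumptions (the cache defaults echo the sizes reported later in the experimental setup), not McGrid's actual configuration format.

    // Illustrative only: the configurable McGrid parameters listed above,
    // gathered into one struct. Names and defaults are assumptions, not
    // the actual McGrid/SESC configuration format.
    struct McGridConfig {
        int      numCores      = 4;     // number of on-chip cores
        unsigned l1DataKB      = 32;    // L1 data cache size (Kbyte)
        unsigned l1InstKB      = 32;    // L1 instruction cache size (Kbyte)
        unsigned l2KB          = 512;   // L2 cache size (Kbyte)
        unsigned physMemMB     = 256;   // physical memory size (Mbyte)
        unsigned cpuMHz        = 1000;  // processor speed
        unsigned memMHz        = 400;   // memory speed
        int      pipelineDepth = 8;     // pipeline depth in each core
        int      prefetchDepth = 4;     // pre-fetch depth in each core
        int      issueWidth    = 4;     // execution width of each core
    };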

  7. Porting to Multi-core Systems • Initial analysis focus • XML-based documents for job submission • Event stream documents • Workflow specifications • SOAP messages with complex types • Serialized data formats • Decomposition • Parts that need to be thread-private • Parts that can be shared among threads • Scheduling • Mix of threads executing in parallel on CMPs • Choice of core for a particular thread

  8. XML-based Grid Middleware Design Considerations • Role of XML in Grid Middleware • Namespaces • XML Docs with Repetition of Elements • XML Docs without Repetition of Elements • Buffering • Scanning and Caching • Co-Referenced Objects and Graphs

  9. Bio-Medical Document • The element atom appears repeatedly • Each atom element shares namespaces defined at the top
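
For illustration, a made-up fragment with the shape this slide describes: the atom element repeats many times, and every occurrence reuses the namespace declared once on the root instead of redefining it. The element and namespace names are invented, not taken from the actual bio-medical document used in the paper.

    // Invented XML fragment illustrating the document shape described above:
    // <atom> repeats, and all atoms share the namespace declared on the root.
    const char* kBioMedicalSample = R"(
    <mol:molecule xmlns:mol="http://example.org/biomed">
      <mol:atom id="1" element="C" x="0.00" y="0.00" z="0.00"/>
      <mol:atom id="2" element="H" x="1.09" y="0.00" z="0.00"/>
      <mol:atom id="3" element="H" x="-0.36" y="1.03" z="0.00"/>
      <!-- ... thousands more atom elements of the same shape ... -->
    </mol:molecule>
    )";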

  10. WS-Security Document • Non-sequence-based • Some elements are more expensive to process

  11. Research Questions • How should namespaces be defined and used in XML processing to avoid triggering expensive synchronization algorithms between the cores? • What are the ways to cache frequently used namespaces that result in performance gains on a multi-core processor? • For what class of grid applications will the use of multiple threads on a multi-core processor provide significant speed-up compared to the serial processing model that is widely used for processing XML documents on a single-core processor?

  12. Research Questions (2) • What optimizations can be enabled when the size of sequence-based XML documents is known in advance? • What algorithms can detect the cache access pattern of the application and dynamically distribute the processing load evenly among the various cores? • This aspect of the research is part of future work

  13. Performance Results • Experimental Setup – • SESC – a cycle-accurate architectural simulator • Each core has • Private 32-Kbyte 4-way set-associative Level-1 data cache • Private 32-Kbyte 2-way set-associative Level-1 instruction cache • Private 512-Kbyte 8-way set-associative Level-2 cache • Cache Replacement Policy • LRU • Cache Coherence Protocol • MESI • Cache Line Size • 64 bytes • For our performance tests • MIPS cross-compiler built from the tool-chain: gcc 3.4, glibc 2.3.2, Linux kernel headers 2.4.15

  14. 3 Threading Approaches • Single-threaded • A single thread is used on a single core • Scanned-threaded • The first thread scans the document • determines points of parallelism • new threads then process the document in parallel • Direct-threaded • Same as scanned-threaded, except • the scanning part is skipped • parallel processing points are assumed to be known • based on processing in previous runs • same document size and type
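
A minimal sketch of the direct-threaded approach, under the stated assumption that the split points between elements are already known from a previous run on a document of the same size and type: each worker thread parses its own contiguous chunk with thread-private state. The chunk parser itself is a placeholder, not the middleware's actual parsing code.

    // Sketch of direct-threading: scanning is skipped because the chunk
    // boundaries are assumed known from earlier runs; one worker thread
    // parses each chunk using only thread-private state.
    #include <pthread.h>
    #include <vector>

    struct Chunk {
        const char* begin;   // first byte of this thread's elements
        const char* end;     // one past the last byte of this chunk
    };

    // Placeholder for the real per-chunk XML element parser.
    static void parseElements(const Chunk& c) { /* ... */ }

    static void* worker(void* arg) {
        parseElements(*static_cast<Chunk*>(arg));
        return nullptr;
    }

    // Launch one worker per known chunk and wait for all of them.
    void directThreadedParse(std::vector<Chunk>& chunks) {
        std::vector<pthread_t> tids(chunks.size());
        for (size_t i = 0; i < chunks.size(); ++i)
            pthread_create(&tids[i], nullptr, worker, &chunks[i]);
        for (size_t i = 0; i < chunks.size(); ++i)
            pthread_join(tids[i], nullptr);
    }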

  15. Threading Configuration Measurements • Speed-up of direct-threading over single-threading: 92% for all document sizes. • Speed-up of scanned-threading over single-threading: 20% for the 500-element document and about 12% for the 4000-element document.

  16. Direct-threading Performance • Performance almost doubles when the number of cores is doubled. Speed-up of about 92% for 2000 and 4000 elements.

  17. Performance Impact of Caching • Performance of direct-threading for varying number of elements per core. • Processing is done by two threads running on two different cores. • Elements are evenly divided between the threads. • Results for 3 cases – • Case 1 – Document preparation and processing is done on different cores. • Case 2 – Document is prepared in the core that processes the bottom half of the elements. • Case 3 – Document is prepared in the core that processes the top half of the elements.

  18. Performance Impact of Caching • Performance of the two processing cores for the three cases of direct-threading for various document sizes.

  19. Results for Even and Uneven Distribution of Elements with Direct-threading • With even distribution of elements – • Core 1 has the shortest running time among the cores • Core 3 has the longest running time among the cores • With uneven distribution of elements – • Best performance is obtained for the distribution in which the running times of all cores are equal

  20. Performance Impact of Cache Coherency • Configuration Details • Shared data structure for XML processing • Shared hash table to process a co-referenced object • Config 1 – Each write of an element is followed by a read of the element • Config 2 – Each write of an element is followed by three reads of the element
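
The sketch below spells out the two access patterns being compared, using a simplified stand-in for the shared hash table: each write of an element is followed by one read (Config 1) or by three reads (Config 2). The map, the lock, and the function names are illustrative assumptions, not the middleware's actual data structure.

    // Simplified stand-in for the shared hash table used to process a
    // co-referenced object. Reads issued from other cores after each
    // write are what exercise the MESI coherence traffic.
    #include <pthread.h>
    #include <map>
    #include <string>

    static std::map<std::string, std::string> sharedTable;   // shared across cores
    static pthread_mutex_t tableLock = PTHREAD_MUTEX_INITIALIZER;

    void writeElement(const std::string& id, const std::string& value) {
        pthread_mutex_lock(&tableLock);
        sharedTable[id] = value;            // write by the producing core
        pthread_mutex_unlock(&tableLock);
    }

    std::string readElement(const std::string& id) {
        pthread_mutex_lock(&tableLock);
        std::string v = sharedTable[id];    // read, possibly from another core
        pthread_mutex_unlock(&tableLock);
        return v;
    }

    // readsPerWrite = 1 reproduces Config 1; readsPerWrite = 3 reproduces Config 2.
    void processElement(const std::string& id, const std::string& value, int readsPerWrite) {
        writeElement(id, value);
        for (int i = 0; i < readsPerWrite; ++i)
            readElement(id);
    }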

  21. Performance Impact of Cache Coherency • Performance for the two configurations of the shared hash table for various application document sizes and number of cores.

  22. Table-lookup and Shared Stack based Namespace Implementations • Performance of the two configurations of the shared namespace stack for various document sizes and cores.
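
To contrast the two namespace implementations named in the slide title, the sketch below shows a table-lookup resolver next to a stack-based resolver in simplified, single-threaded form; the data structures and names are illustrative assumptions, and the real middleware versions are shared between threads.

    // Illustrative contrast of the two namespace implementations: a
    // prefix -> URI table probed once per lookup, and a stack of scoped
    // bindings scanned from the innermost scope outward.
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct NamespaceTable {
        std::map<std::string, std::string> bindings;   // prefix -> URI
        const std::string* resolve(const std::string& prefix) const {
            auto it = bindings.find(prefix);
            return it == bindings.end() ? nullptr : &it->second;
        }
    };

    struct NamespaceStack {
        std::vector<std::pair<std::string, std::string>> scopes;  // (prefix, URI)
        const std::string* resolve(const std::string& prefix) const {
            for (size_t i = scopes.size(); i-- > 0; )              // innermost scope first
                if (scopes[i].first == prefix) return &scopes[i].second;
            return nullptr;
        }
    };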

  23. Conclusions • XML docs should avoid redefinition of namespaces in inner elements • to prevent expensive synchronization algorithms between the various cores. • The number of elements in an XML doc may have to be unevenly divided among the multiple cores • taking into account the cache access patterns of the threads. • When the size of the sequence-based document is known or can be guessed accurately, a simple threading approach of equal distribution of the elements between the threads performs best • because the processing of the document is equally divided between the threads. • Threads should be scheduled on cores that have already cached all or part of the data. • Non-sequence-based documents should be scanned first. • The processing load should then be balanced among the different cores.

  24. Future Work Future work includes – • Run the emulator for a larger number of representative XML documents and grid middleware services. • Run the emulator for representative grid applications. • Study the effect of different thread scheduling schemes on cache access patterns for each core. • Quantify the benefits of parallel XML parsing techniques for different document types and sizes. • Use the network simulator from the MicroGrid project to simulate inter-node communication between various grid nodes.

  25. THANK YOU !
