
A New Approach to Parallelising Tracing Algorithms


Presentation Transcript


  1. A New Approach to Parallelising Tracing Algorithms
     Cosmin E. Oancea, Alan Mycroft & Stephen M. Watt
     Computer Laboratory, University of Cambridge
     Computer Science Department, University of Western Ontario

  2. I. Motivation & High Level Goal
     • We study more scalable algorithms for parallel tracing:
       • memory management is the primary motivation, but
       • we do not claim immediate improvements over state-of-the-art GC.
     • Tracing is important to computing:
       • sequential & flat memory model – well understood;
       • parallel & multi-level memory – less clear:
         inter-processor communication cost keeps growing relative to raw instruction speed x P x ILP.
     • We give a memory-centric algorithm for copy collection (a general form of tracing) that is free of locks on the mainline path.

  3. I. Abstract Tracing Algorithm
     1. Mark and process any unmarked child of a marked node;
     2. repeat until no further marking is possible.
     • Assume an initialisation phase has already marked and processed some root nodes.
     • Implementing the implicit fixed point via worklists yields:
       1. pick a node from a worklist;
       2. if it is unmarked, then mark it, process it, and add its unmarked children to worklists;
       3. repeat until all worklists are empty.
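A minimal sketch of the worklist-driven fixed point above, written in Java; the Node interface and its method names are illustrative assumptions, not the paper's API, and the worklist is assumed to be seeded during initialisation.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Hypothetical node abstraction, for illustration only.
    interface Node {
        boolean isMarked();
        void mark();
        void process();            // e.g. copy or scan the object
        List<Node> children();     // outgoing references
    }

    final class Tracer {
        static void trace(Deque<Node> worklist) {
            while (!worklist.isEmpty()) {
                Node n = worklist.pop();                   // 1. pick a node from a worklist
                if (n.isMarked()) continue;                // 2. only unmarked nodes are traced
                n.mark();
                n.process();
                for (Node c : n.children())
                    if (!c.isMarked()) worklist.push(c);   //    add its unmarked children
            }                                              // 3. repeat until the worklist is empty
        }
    }

For example, trace(new ArrayDeque<>(roots)) would run the sequential version; the rest of the talk is about how to split this single worklist across processors.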

  4. I. Worklist Semantics: Classical
     • What should worklists model?
     • Classical approach: processing semantics.
     • (Figure: Worklist 1, Worklist 2, Worklist 3, Worklist 4 – one worklist per processor.)
     • Worklist i stores nodes to be processed by processor i.

  5. I. Classic Algorithm

        while (!worklist.isEmpty()) {
            int ind = 0;
            Object from_child, to_child, to_obj = worklist.deqRand();
            foreach (from_child in to_obj.fields()) {
                ind++;
                atomic {
                    if (from_child.isForwarded()) continue;
                    to_child = copy(from_child);
                    setForwardingPtr(from_child, to_child);
                }
                to_obj.setField(to_child, ind-1);
                worklist.enqueue(to_child);
            }
        }

     • Two layers of synchronisation:
       • Worklist level – small overhead via deques (Arora et al.) or work stealing (Michael et al.);
       • the frustrating atomic block – it makes the copy idempotent, and thus enables the low-overhead worklist-access solutions above.

  6. I. Related Work
     • Halstead (MultiLisp) – the first parallel semi-space collector, but it may lead to load imbalance. Solutions:
       • object stealing: Arora et al., Flood et al., Endo et al., ...
       • block-based approaches: Imai and Tick, Attanasio et al., Marlow et al., ...
     • Solutions free of locks that exploit immutable data: Doligez and Leroy, Huelsbergen and Larus.
     • Memory-centric solutions – studied only in the sequential case: Shuf et al., Demers et al., Chicha and Watt.

  7. II. Memory-Centric Tracing (High Level)
     • L == memory partition (local) size; it gives the trade-off between locality of reference and load balancing.
     • Worklist j stores slots: to-space addresses pointing to a from-space field f of the currently copied/scanned object o, where j = (o.f quo L) rem N.
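A minimal sketch of this mapping in Java, assuming the from-space address of the field o.f is available as a long; the constants and the name worklistIndex are illustrative.

    // Illustrative dispatch computation: which worklist class a slot belongs to.
    final class SlotDispatch {
        static final long L = 64 * 1024;   // partition size (64K is the value used for the small-data runs)
        static final int  N = 8;           // number of worklists

        // fieldAddr is the from-space address of the field o.f that the slot refers to.
        static int worklistIndex(long fieldAddr) {
            return (int) ((fieldAddr / L) % N);   // j = (o.f quo L) rem N
        }
    }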

  8. II. Memory-Centric Tracing (High Level)
     • Arrow semantics in the figure: double-ended – copy to to-space; dashed – insert in queue; solid – slots pointing to fields.
     1. Each worklist w is owned by at most one collector c (its owner).
     2. Forwarded slots of c: those slots that belong to a partition owned by c but were discovered by another collector.
     3. Eager strategy for acquiring worklist ownership: initially all roots are placed in worklists, and non-empty worklists are owned.
     • (Figure: Dispatching Slots to Worklists or Forwarding Queues; see the sketch below.)
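A minimal sketch of the dispatch step just described; the Collector fields, the ownership array and the use of plain deques in place of the real forwarding queues (slide 10) are all illustrative assumptions, and eager ownership acquisition is elided.

    import java.util.Deque;

    // Illustrative per-collector dispatch of a discovered slot.
    final class Collector {
        final int id;                        // this collector's index, 0..P-1
        final int[] ownerOf;                 // ownerOf[j] = id of the collector owning worklist j
        final Deque<Long>[] worklists;       // worklists[j] holds slots of partition class j
        final Deque<Long>[][] forwardQ;      // forwardQ[i][j]: stands in for the single-producer /
                                             // single-consumer forwarding queue (enqueued by i, drained by j)

        Collector(int id, int[] ownerOf, Deque<Long>[] wl, Deque<Long>[][] fq) {
            this.id = id; this.ownerOf = ownerOf; this.worklists = wl; this.forwardQ = fq;
        }

        // slotAddr: to-space address of a slot whose from-space target field is fieldAddr.
        void dispatch(long slotAddr, long fieldAddr) {
            int j = SlotDispatch.worklistIndex(fieldAddr);   // j = (o.f quo L) rem N, as above
            if (ownerOf[j] == id) {
                worklists[j].push(slotAddr);                 // local work: no synchronisation needed
            } else {
                forwardQ[id][ownerOf[j]].push(slotAddr);     // forwarded slot, handed to the owner
            }
        }
    }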

  9. II. Memory-Centric Tracing: Implementation
     • Each collector processes its forwarding queues (each of size F).
     • Empty worklists are released (their ownership is given up).
     • Each collector then processes F*P*4 items from its owned worklists (the factor 4 was chosen empirically, in relation to the forwarding ratio).
     • No locking when accessing worklists or when copying.
     • L (the local partition size) sets the level of locality of reference.
     • Repeat until the collector owns no worklists && all forwarding queues are empty && all worklists are empty.
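A minimal sketch of one collector's top-level loop implied by the steps above; the Phase hooks and the constants are illustrative assumptions standing in for the real collector code.

    // Illustrative top-level loop of one collector.
    final class CollectorLoop {
        static final int F = 1024;   // forwarding-queue capacity (assumed value)
        static final int P = 8;      // number of collectors

        interface Phase {                          // hypothetical hooks
            void drainForwardingQueues();          // process slots forwarded to this collector
            void releaseEmptyWorklists();          // give up ownership of empty worklists
            void processOwnedItems(int budget);    // trace up to `budget` slots from owned worklists
            boolean terminationReached();          // owns no worklists && all queues and worklists empty
        }

        static void run(Phase c) {
            do {
                c.drainForwardingQueues();
                c.releaseEmptyWorklists();
                c.processOwnedItems(F * P * 4);    // per-iteration budget from the slide
            } while (!c.terminationReached());
        }
    }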

  10. II. Forwarding Queues on Intel IA-32
     • They implement inter-processor communication:
       • with P collectors there is a PxP matrix of queues; entry (i,j) holds items enqueued by collector i and dequeued by collector j;
       • wait-free, lock-free and mfence-free IA-32 implementation:

        volatile int tail = 0, head = 0;
        volatile Address buff[F];
        // next : k -> (k+1) % F

        bool enq(Address slot) {               // called only by the producer, collector i
            int new_tl = next(tail);
            if (new_tl == head) return false;  // queue full
            buff[tail] = slot;
            tail = new_tl;
            return true;
        }

        bool is_empty() { return head == tail; }

        Address deq() {                        // called only by the consumer, collector j
            Address slot = buff[head];
            head = next(head);
            return slot;
        }
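A minimal sketch, in the same Java-flavoured style as the earlier sketches, of how collector j might drain its column of the PxP matrix; the ForwardingQueue interface merely mirrors the enq/is_empty/deq operations above and is an assumption.

    import java.util.function.LongConsumer;

    // Illustrative drain of the forwarding queues addressed to collector j.
    final class ForwardingMatrix {
        interface ForwardingQueue {   // assumed interface mirroring the IA-32 queue above
            boolean isEmpty();
            long deq();               // returns a forwarded slot address
        }

        // fq[i][j] is enqueued only by collector i and dequeued only by collector j,
        // so every entry has a single producer and a single consumer.
        static void drainColumn(ForwardingQueue[][] fq, int j, LongConsumer processSlot) {
            for (int i = 0; i < fq.length; i++) {
                ForwardingQueue q = fq[i][j];
                while (!q.isEmpty()) {
                    processSlot.accept(q.deq());   // trace the forwarded slot locally
                }
            }
        }
    }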

  11. II. Forwarding Queues on Intel IA-32
     • The sequentially inconsistent pattern below can occur, but the algorithm remains safe:
       • the head & tail interaction reduces to a collector failing to deq from a non-empty list (and to enq into a non-full list);
       • the buff[tail_prev] & head==tail_prev interaction is safe because IA-32 does not re-order writes with other writes.

        // Store-buffering pattern (initially a = b = 0):
        //   Proc 1           Proc 2
        //   a = 1;           b = 1;
        //   // mfence;       // mfence;
        //   x = b;           y = a;
        // Without the fences, x == 0 && y == 0 is possible!
        //
        // The analogous queue interaction, (two enq) || (two is_empty; deq):
        //   Proc i                     Proc j
        //   buff[tail] = ...;          head = next(head);
        //   tail = ...;                if (head != tail)
        //   if (new_tl == head) ...        .. = buff[head];

  12. II. Dynamic Load Balancing
     • Small partitions (64K) are OK under static ownership:
       • grey objects are randomly distributed among the N partitions,
       • yet this still gives some locality of reference (otherwise forwarding would be too expensive).
     • Larger partitions may need dynamic load balancing, i.e. partition ownership must be transferred:
       • a starving collector c signals nearby collectors; these may release ownership of an owned worklist w while placing an item of w on collector c's forwarding queue (a sketch follows below).
     • Partition stealing, in contrast, requires locking on the mainline path, since the copy operation is not idempotent without it (Michael et al.)!
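A minimal sketch of the hand-over just described, assuming the starvation signal is a per-collector flag; every name and the exact release policy here are assumptions, not the paper's protocol.

    import java.util.Deque;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Illustrative release of a worklist towards a starving collector.
    final class LoadBalancer {
        static final int NO_OWNER = -1;

        final AtomicBoolean[] starving;    // starving[c] is set by collector c when it runs out of work
        final AtomicIntegerArray ownerOf;  // ownerOf[j] = id of the collector currently owning worklist j

        LoadBalancer(AtomicBoolean[] starving, AtomicIntegerArray ownerOf) {
            this.starving = starving; this.ownerOf = ownerOf;
        }

        // Called occasionally by a busy owner of worklist j when neighbour c signals starvation.
        void maybeHandOver(int c, int j, Deque<Long> worklistJ, Deque<Long> forwardQtoC) {
            if (!starving[c].get() || worklistJ.isEmpty()) return;
            ownerOf.set(j, NO_OWNER);            // release ownership of worklist j ...
            forwardQtoC.push(worklistJ.pop());   // ... and seed c through its forwarding queue;
                                                 // c acquires j eagerly when it traces this slot
        }
    }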

  13. II. Optimisation; Run-Time Adaptation
     • Inter-collector producer-consumer relations are detected when forwarding queues are found full (with F*P*4 items processed per iteration):
       • ownership is transferred to the producing collector, to optimise away the forwarding.
     • Run-time adaptation: monitor the forwarding ratio (FR) and the load balancing (LB):
       • start with a large L; while LB is poor, decrease L;
       • if FR > FR_MAX or L < L_MIN, switch to the classical algorithm (a sketch follows below).
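A minimal sketch of this adaptation policy; the thresholds, the Monitor hooks and the halving step are assumptions rather than the paper's exact heuristics.

    // Illustrative run-time adaptation of the partition size L.
    final class Adaptation {
        static final double FR_MAX = 8.0;        // assumed maximum tolerable forwarding ratio
        static final long   L_MIN  = 16 * 1024;  // assumed minimum useful partition size

        interface Monitor {                      // hypothetical per-cycle GC statistics
            double  forwardingRatio();           // FR: how much work is forwarded between collectors
            boolean loadBalancePoor();           // LB: e.g. large variance in per-collector work
        }

        // Returns the L to use for the next cycle, or -1 to switch to the classical collector.
        static long nextL(long currentL, Monitor m) {
            long l = m.loadBalancePoor() ? currentL / 2 : currentL;   // poor LB: shrink partitions
            if (m.forwardingRatio() > FR_MAX || l < L_MIN) return -1; // too costly: fall back to classical
            return l;
        }
    }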

  14. III. Empirical Results – Small Data
     • A machine with two quad-core AMD Opterons, on applications with small live data sets, against MMTk:
       • time averaged over Antlr, Bloat, Pmd, Xalan, Fop, Jython, HsqldbS;
       • heap size = 120-200M, IFR average = 4.2, L = 64K.

  15. III. Empirical Results – Large Data
     • A machine with two quad-core AMD Opterons, on applications with large live data sets, against MMTk:
       • time averaged over Hsqldb, GCbench, Voronoi, TreeAdd, MST, TSP, Perimet, BH;
       • heap size > 500M, IFR average = 6.3, L = 128K.

  16. III. Empirical Results – Eclipse
     • A quad-core Intel machine, on Eclipse (large live data set):
       • heap size = 500M, IFR average = (only) 2.6 for L = 512K, otherwise 2.1!

  17. III. Empirical Results – Jython
     • A machine with two quad-core AMD processors, on Jython:
       • heap size = 200M, IFR average = (only) 3.0!

  18. III. Conclusions
     • Memory-centric algorithms may be an important alternative to processing-centric algorithms, especially on non-homogeneous hardware.
     • We showed how to explicitly represent and optimise two abstractions: locality of reference (L) and inter-processor communication (FR); L trades off locality against load balancing.
     • Robust behaviour: the collector scales well with both data size and number of processors.
