
Physics in Parallel: Simulation on 7th Generation Hardware


Presentation Transcript


  1. Physics in Parallel: Simulation on 7th Generation Hardware • David Wu, Pseudo Interactive

  2. Why are we here? • The 7th generation is approaching. • We are no longer next gen. • We are all scrambling to adapt to the new stuff, so that we can stay on the bleeding edge • and push the envelope • and take things to the next level.

  3. What’s Next Gen? • Multiple Processors • not entirely new, but more than before. • Parallelism • not entirely new, but more than before. • Physics • not entirely new, but more than before.

  4. Take-Away • So much to cover • General Principles • Useful Concepts • Techniques • Tips • Bad Jokes • Goal is to save you time during the transition to.. • Next Gen

  5. Format for presentation Every year we discover new ways to communicate information.

  6. Patterns • A description of a recurrent problem and of the core of possible solutions • Difficult to write • Too pretentious • Inviting criticism

  7. Gems • Valuable bits of information • Too 6th Gen

  8. Blog • Free Form • Continuity not required • Subjective/opinionated is okay • Arbitrary Tangents are okay • Catchy Title need not match article • No quality bar • This sounds 7th Gen to me.

  9. Disclaimer • My information sources range from: • press releases • patents • other blogs on the net • random probabilistic guesses. • Much of the information is probably wrong.

  10. 1-Mar-05 Multi-threaded programming • I participated in some in-depth discussions on this topic; after weeks of debate, the conclusion was: • “Multi-threaded programming is hard”

  11. 2-Mar-05 What is 7th Gen Hardware? • Fast • Many parallel processors • Very high peak FLOPS • In-order execution

  12. 2-Mar-05 What is 7th Gen Hardware? • High memory latency • Not enough Bandwidth • Moderate clock speed improvements • Not enough Memory • CPU-GPU convergence

  13. 3-Mar-05 Hardware usually sucks • Is Multi-Processor Revolutionary? • It is kind of here already • Hyper Threading • Dual Processor • Sega Saturn • not entirely new, but more than before.

  14. 3-Mar-05 Hardware usually sucks • Hardware advances require years of preparatory hype: • 3D Accelerators • Online • SIMD • “Not with a bang but with a whimper”

  15. 3-Mar-05 Hardware usually sucks • The big problem with hardware advances is software. • We don’t like to do things that are hard. • If there is a big enough payoff, we do it. • This time there is a big enough payoff.

  16. 4-Mar-05 Types of Parallelism • Task Parallelism • Render+physics • Data Parallelism • collision detection on two objects at a time • Instruction Parallelism • multiple elements in a vector • Use all three
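
To make the third kind concrete, here is a minimal sketch of instruction parallelism using SSE intrinsics (assuming an x86 target; the consoles' vector units expose the same idea through their own intrinsics):

    #include <xmmintrin.h>  // SSE intrinsics (x86)

    // Add two arrays four floats at a time: one instruction operates
    // on four data elements in parallel.
    // Assumes count is a multiple of 4 and pointers are 16-byte aligned.
    void AddArrays(const float* a, const float* b, float* out, int count)
    {
        for (int i = 0; i < count; i += 4)
        {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }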

  17. 5-Mar-05 Techniques • Pipeline • Work Crew • Forking

  18. 5-Mar-05 Pipeline – Task Parallelism • Subdivide problem into discrete tasks • Solve tasks in parallel, spreading them across multiple processors.

  19. 5-Mar-05 Pipeline – Task Parallelism • Thread 0: collision detection (Frame 3, then Frame 4) • Thread 1: Logic/AI (Frame 2, then Frame 3) • Thread 2: Integration (Frame 1, then Frame 2)

  20. 5-Mar-05 Pipeline • Similar to CPU/GPU parallelism • CPU: Frame 3, then Frame 4 • GPU: Frame 2, then Frame 3

  21. 5-Mar-05 Pipeline: notes • Dependencies explicit • Communication explicit • e.g. through a FIFO • Avoids deadlock issues • Avoids most race conditions • Load balancing is not great • Does not reduce latency vs. the single-threaded case
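
A minimal sketch of that explicit FIFO communication between two pipeline stages, assuming C++11 threads and illustrative stage names:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <utility>

    // A simple blocking FIFO: the only point of contact between stages.
    template <typename T>
    class Fifo {
    public:
        void Push(T item) {
            { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(item)); }
            cv_.notify_one();
        }
        T Pop() {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            T item = std::move(q_.front());
            q_.pop();
            return item;
        }
    private:
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    };

    struct Frame { int index; /* contacts, state, ... */ };

    Fifo<Frame> collisionToIntegration;

    void CollisionThread() {        // stage N works on frame F...
        for (int f = 0; ; ++f)
            collisionToIntegration.Push(Frame{f});
    }

    void IntegrationThread() {      // ...while stage N+1 works on frame F-1
        for (;;) {
            Frame frame = collisionToIntegration.Pop();
            // integrate frame.index ...
        }
    }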

  22. 5-Mar-05 Pipeline: notes • Feedback between tasks is difficult • Best for open-loop tasks • Secondary dynamics, e.g. a ponytail • Effects • Suitable for specialized hardware, because task requirements are cleanly divided.

  23. 5-Mar-05 Pipeline: notes • Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design. • Adds bandwidth overhead and memory use overhead to SMP systems that would otherwise communicate via the cache.

  24. 5-Mar-05 Work Crew • Component-wise division of the system: • Collision Detection • Integration • Particle System • Fluid Simulation • Audio • Rendering • AI/Logic • IO

  25. 5-Mar-05 Work Crew – Task Parallelism • Similar to pipeline but without explicit ordering. • Dependencies are handled on a case-by-case basis. • e.g. particles that do not affect gameplay might not need to be deterministic, so they can run without explicit synchronization. • Components without interdependencies can run asynchronously, e.g. kinematics and AI.

  26. 5-Mar-05 Work Crew • Suitable for some external processes such as IO, gamepad, sound, sockets. • Suitable for decoupled systems: • particle simulations that do not affect gameplay • Fluid dynamics • Visual damage simulation • Cloth simulation

  27. 5-Mar-05 Work Crew • Scalability is limited by the number of discrete tasks • Load balancing is limited by the asymmetric nature of the components and their requirements. • Higher risk of deadlocks • Higher risk of race conditions

  28. 5-Mar-05 Work Crew • May require double buffering of some data to avoid race conditions. • Poor data coherency • Good code coherency
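
A minimal sketch of that double buffering, with illustrative names: the reader consumes one copy while the writer fills the other, so neither needs a lock during the frame.

    // Two copies of shared state; writer and reader use different copies
    // within a frame, and the roles swap at a known synchronization point.
    struct ParticleState { /* positions, velocities, ... */ };

    ParticleState buffers[2];
    int writeIndex = 0;  // particle thread writes buffers[writeIndex],
                         // render thread reads buffers[1 - writeIndex]

    // Called once per frame at a point where both threads are known
    // to be idle (e.g. the frame boundary).
    void SwapBuffers() { writeIndex = 1 - writeIndex; }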

  29. 5-Mar-05 Forking – Data Parallelism • Perform the same task on multiple objects in parallel. • Thread “forks” into multiple threads across multiple processors • All threads repeatedly grab pending objects indiscriminately and execute the task on them • When finished, threads combine back into the original thread.

  30. 5-Mar-05 Forking Fork Object A Thread 2 Object B Thread 0 Object C Thread 1 combine

  31. 5-Mar-05 Forking • Task assignment can often be done using simple interlocked primitives: • e.g. int i = InterlockedIncrement(&nextTodo); • OpenMP adds compiler support for this via pragmas (see the sketch below)
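
A sketch of both flavors. The std::atomic counter is a portable stand-in for the Win32 InterlockedIncrement named above, and ProcessObject is a placeholder for the per-object task:

    #include <atomic>

    void ProcessObject(int index);  // placeholder for the per-object task

    std::atomic<int> nextTodo(0);
    const int kNumObjects = 1024;

    // Each forked thread runs this loop: grab the next unclaimed object
    // with a single atomic instruction, no locks.
    void Worker() {
        for (;;) {
            int i = nextTodo.fetch_add(1);  // interlocked increment
            if (i >= kNumObjects) break;
            ProcessObject(i);
        }
    }

    // The OpenMP version: the compiler generates the fork, the task
    // assignment, and the combine.
    void ProcessAll() {
        #pragma omp parallel for
        for (int i = 0; i < kNumObjects; ++i)
            ProcessObject(i);
    }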

  32. 5-Mar-05 Forking • Externally synchronous • external callers don’t have to worry about being thread-safe • thread-safety requirements are limited to the scope of the code within the forked section. • This is a big deal. • good for isolated engine components and middleware

  33. 5-Mar-05 Forking – Example • AI running in thread 0 • AI calls RayQuery() for a line-of-sight check • RayQuery forks into 6 threads, computes the ray query, and then returns the results through thread 0 • AI, running in thread 0, uses the result.
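
A hedged sketch of what an externally synchronous RayQuery could look like (the internals here are illustrative, not the engine's actual code). The caller stays ordinary single-threaded code because the fork and combine are hidden inside the function:

    struct Ray { /* origin, direction */ };
    struct Hit { float distance; /* prim, normal, ... */ };

    const int kNumRootNodes = 8;                         // illustrative
    void TestNode(int node, const Ray& ray, Hit* best);  // hypothetical traversal

    // Externally synchronous: forks internally, combines before returning,
    // so the AI code that calls it never sees a second thread.
    Hit RayQuery(const Ray& ray)
    {
        Hit best{1e30f};
        #pragma omp parallel
        {
            Hit localBest{1e30f};
            // Each thread traverses its share of the collision tree...
            #pragma omp for nowait
            for (int node = 0; node < kNumRootNodes; ++node)
                TestNode(node, ray, &localBest);
            // ...and the partial results are combined before returning.
            #pragma omp critical
            if (localBest.distance < best.distance) best = localBest;
        }
        return best;
    }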

  34. 5-Mar-05 Forking • Minimizes Latency for a given task • Good data and code coherency • Potentially high synchronization overhead, depending on the coupling. • Highly scalable if you have many tasks with few dependencies • Ideal for Collision detection.

  35. 5-Mar-05 Forking – Batches • Reduces inter-thread communication • Reduces potential for load balancing • Improves instruction-level parallelism • Fork: Objects 0..10 on Thread 0, Objects 11..20 on Thread 1, Objects 21..30 on Thread 2 • Combine
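
The batched variant changes only the task-assignment step: one atomic operation claims a contiguous range instead of a single object (kBatchSize is illustrative):

    #include <algorithm>
    #include <atomic>

    void ProcessObject(int index);  // placeholder for the per-object task

    std::atomic<int> nextTodo(0);
    const int kNumObjects = 1024;
    const int kBatchSize  = 16;     // illustrative; tune per task cost

    // One atomic op claims a whole batch: less inter-thread traffic,
    // coarser load balancing, and a contiguous range the compiler and
    // prefetcher can exploit.
    void BatchWorker() {
        for (;;) {
            int first = nextTodo.fetch_add(kBatchSize);
            if (first >= kNumObjects) break;
            int last = std::min(first + kBatchSize, kNumObjects);
            for (int i = first; i < last; ++i)
                ProcessObject(i);
        }
    }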

  36. 6-Mar-05 Our Approach • 1) Collision Detection: Forked • 2) AI/Logic: Single-threaded • 2a) engine calls: Forked • 2b) Damage Effects: Contractor Queue (all extra threads) • 3) Integration: Forked • 4) Rendering: Forked/Pipeline • Audio: Whatever

  37. 7-Mar-05 Multithreaded programming is Hard • Solutions that directly expose multiple threads to leaf code are a bad idea. • Sequential, single-threaded, synchronous code is the fastest to write and debug. • In order to meet schedules, most leaf code will stay this way.

  38. 7-Mar-05 Notes on Collision detection • All collision prims are stored in a global search tree. • Bounding k-DOP tree with 8 children per node. • The most common case is when 0 or 1 children need to be traversed. • 8 children results in fewer branches • 8 children allows better prefetching

  39. 7-Mar-05 Collision detection • Each moving object is a “task” • Each object is independently queried vs. all other objects in the tree. • Results are output to a global list of contacts and collisions. • To avoid duplicates, moving object vs. moving object collisions are only processed if the active moving object’s memory address is <= the other moving object’s.

  40. 7-Mar-05 Collision detection • Threads pop objects off of the todo list one by one using interlocked access until they are all processed. • Each query takes O(lg N) time. • Very little data contention • output operations are rare and quick • task allocation uses InterlockedIncrement • On 2 CPUs with many objects I see an 80% performance increase. • Hopefully scalable to many CPUs
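
A sketch of that worker loop, including the address comparison that filters duplicate moving-vs-moving pairs. QueryTree, IsMoving, and EmitContact are hypothetical stand-ins for the tree query and the locked output list:

    #include <atomic>
    #include <vector>

    struct MovingObject { /* k-DOP, velocity, ... */ };

    // Hypothetical helpers standing in for the global tree query:
    std::vector<MovingObject*> QueryTree(MovingObject* obj); // overlapping prims
    bool IsMoving(const MovingObject* obj);
    void EmitContact(MovingObject* a, MovingObject* b);      // rare, locked output

    std::vector<MovingObject*> g_todo;   // built before the fork
    std::atomic<int> g_next(0);

    void CollisionWorker()
    {
        for (;;) {
            int i = g_next.fetch_add(1);             // interlocked task allocation
            if (i >= (int)g_todo.size()) break;
            MovingObject* active = g_todo[i];
            for (MovingObject* other : QueryTree(active)) {
                // Moving-vs-moving pairs are found twice (once per object);
                // keep the pair only when active's address is <= other's.
                if (IsMoving(other) && active > other) continue;
                EmitContact(active, other);
            }
        }
    }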

  41. 7-Mar-05 Collision detection • We try to keep collision code and data in the cache as much as possible • We try to finish Collision detection as soon as possible because there are dependencies on it • All threads attack the problem at once

  42. 8-Mar-05 Notes on Integration • The process that steps objects forward in time, in a manner consistent with all contacts and constraints.

  43. 8-Mar-05 Integration • Each batch of coupled objects is a task. • Each Batch is solved independently • Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.
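
One way to realize "pop batches with no dependencies" is a per-batch dependency count that is decremented as prerequisite batches finish; this is a sketch of the pattern, not necessarily the engine's exact scheme:

    #include <atomic>
    #include <vector>

    struct Batch {
        std::atomic<int> pendingDeps;    // 0 => ready to solve
        std::vector<Batch*> dependents;  // batches waiting on this one
    };

    void SolveBatch(Batch* b);           // hypothetical solver entry point
    void PushReady(Batch* b);            // interlocked/locked ready list
    Batch* PopReady();                   // returns nullptr when drained
                                         // (a real version must also wait
                                         // on batches still in flight)

    void IntegrationWorker()
    {
        while (Batch* b = PopReady()) {
            SolveBatch(b);
            // Completing this batch may unlock its dependents.
            for (Batch* d : b->dependents)
                if (d->pendingDeps.fetch_sub(1) == 1)  // removed the last dep
                    PushReady(d);
        }
    }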

  44. 8-Mar-05 Integration • When a dynamic object does not interact with other dynamic objects, its batch contains only that object. • When dynamic objects interact, they are coupled: their solutions are dependent on each other and they must be solved together.

  45. 8-Mar-05 Integration • In some cases, objects can be artificially decoupled. • e.g. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A. • In this case, A can first be solved independently, and the resulting dynamics can be fed into the solution for B. • This creates an ordering dependency: • A must be solved before B.

  46. 8-Mar-05 Integration • When objects are moved they must be updated in the global collision tree. • Transactions need to be atomic; this is accomplished with locks/critical sections. • Ditto for the VSD tree. • Task allocation is slightly more complex due to dependencies. • Despite all this, we see a 75% performance increase on 2 CPUs with many objects.

  47. 8-Mar-05 Integration • We use a discrete Newton solver, which works okay with our task discretization • i.e. one thread per batch • If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations.

  48. 9-Mar-05 Transactions • With fine-grained data parallelism, we require many lightweight atomic transactions. • For this we use either: • Interlocked primitives • Critical sections • Spin locks

  49. 9-Mar-05 Transactions • Whenever possible, interlocked primitives are used. • Interlocked primitives are simple atomic transactions on single words. • If the transaction is short, a spin lock is used. • Otherwise a critical section is used. • A spin lock is like a critical section, except that it spins rather than sleeps when blocking.
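
A minimal spin lock sketch using C++11 atomics (the original would have been platform intrinsics, but the shape is the same):

    #include <atomic>

    class SpinLock {
    public:
        void Lock() {
            // Spin rather than sleep: cheaper than a critical section
            // when the hold time is a handful of instructions.
            while (lock_.test_and_set(std::memory_order_acquire))
                ; // optionally: a pause/yield hint here
        }
        void Unlock() {
            lock_.clear(std::memory_order_release);
        }
    private:
        std::atomic_flag lock_ = ATOMIC_FLAG_INIT;
    };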

  50. 9-Mar-05 CPUs are difficult • There are some processor-specific nuances to consider when writing your own locks: • Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. • Otherwise the processor might preload old data that changes right before the lock is released.
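
In current terminology this is acquire ordering. A sketch with the fence written out explicitly; on PowerPC-era hardware the fence would compile to isync or lwsync:

    #include <atomic>

    std::atomic<int> lockWord(0);
    int sharedData = 0;

    void AcquireLock() {
        while (lockWord.exchange(1, std::memory_order_relaxed) != 0)
            ; // spin until the lock is observed free
        // The load fence: without it, an out-of-order core may have
        // already loaded sharedData before the lock was observed free.
        std::atomic_thread_fence(std::memory_order_acquire);
    }

    void ReleaseLock() {
        std::atomic_thread_fence(std::memory_order_release);
        lockWord.store(0, std::memory_order_relaxed);
    }

    void CriticalWork() {
        AcquireLock();
        ++sharedData;   // reads here now see the latest values
        ReleaseLock();
    }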
