
Why multi-threading/multi-core?




  1. Why multi-threading/multi-core? • Clock rates are stagnant • Future CPUs will be predominantly multi-thread/multi-core • Xbox 360 has 3 cores • PS3 will be multi-core • >70% of PC sales will be multi-core by end of 2006 • Most Windows Vista systems will be multi-core • Two performance possibilities: • Single-threaded? Minimal performance growth • Multi-threaded? Exponential performance growth

  2. Design for Multithreading • Good design is critical • Bad multithreading can be worse than no multithreading • Deadlocks, synchronization bugs, poor performance, etc.

  3. Bad Multithreading • [Diagram: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5]

  4. Good Multithreading • [Diagram labels: Main Thread, Game Thread, Rendering Thread, Physics, Particle Systems, Animation/Skinning, Networking, File I/O]

  5. Another Paradigm: Cascades • [Diagram: Thread 1 Input, Thread 2 Physics, Thread 3 AI, Thread 4 Rendering, Thread 5 Present, shown across Frames 1 to 4] • Advantages: • Synchronization points are few and well-defined • Disadvantages: • Increases latency (for constant frame rate) • Needs simple (one-way) data flow

  6. Typical Threaded Tasks • File Decompression • Rendering • Graphics Fluff • Physics

  7. File Decompression • Most common CPU heavy thread on the Xbox 360 • Easy to multithread • Allows use of aggressive compression to improve load times • Don’t throw a thread at a problem better solved by offline processing • Texture compression, file packing, etc.

  8. Rendering • Separate update and render threads • Rendering on multiple threads (D3DCREATE_MULTITHREADED) works poorly • Exception: Xbox 360 command buffers • Special case of cascades paradigm • Pass render state from update to render • With constant workload gives same latency, better frame rate • With increased workload gives same frame rate, worse latency

  9. Graphics Fluff • Extra graphics that don't affect play • Procedurally generated animating cloud textures • Cloth simulations • Dynamic ambient occlusion • Procedurally generated vegetation, etc. • Extra particles, better particle physics, etc. • Easy to synchronize • Potentially expensive, but if the core is otherwise idle...?

  10. Physics? • Could cascade from update to physics to rendering • Makes use of three threads • May be too much latency • Could run physics on many threads • Uses many threads while doing physics • May leave threads mostly idle elsewhere

  11. Overcommitted Multithreading? • [Diagram labels: Game Thread, Rendering Thread, Physics, Particle Systems, Animation/Skinning]

  12. How Many Threads? • No more than one CPU-intensive software thread per core • 3-6 on Xbox 360 • 1-? on PC (1-4 for now, need to query) • Too many busy threads add complexity and lower performance • Context switches are not free • Can have many non-CPU-intensive threads • I/O threads that block, or intermittent tasks
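The "need to query" note for PC can be sketched with the C++ standard library. This is a minimal sketch, not the presenter's code: `ChooseWorkerCount` and its `reservedThreads` parameter are hypothetical names, and `std::thread::hardware_concurrency()` reports hardware threads (including SMT siblings), not physical cores, so treat the result as an upper bound.

```cpp
#include <thread>

// Hypothetical helper: pick how many CPU-intensive worker threads to start.
// reservedThreads is the number of hardware threads left for existing busy
// threads (for example the main/update thread).
unsigned ChooseWorkerCount(unsigned reservedThreads = 1) {
    // May return 0 when the platform cannot report a value.
    const unsigned hardwareThreads = std::thread::hardware_concurrency();
    if (hardwareThreads <= reservedThreads) {
        return 1;  // unknown or single-core machine: still run one worker
    }
    return hardwareThreads - reservedThreads;
}
```

Threads that mostly block (file I/O, networking) do not need to count against this budget.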

  13. Case Study: Kameo (Xbox 360) • Started single threaded • Rendering was taking half of time—put on separate thread • Two render-description buffers created to communicate from update to render • Linear read/write access for best cache usage • Doesn't copy const data • File I/O and decompress on other threads

  14. Separate Rendering Thread • [Diagram: Update Thread writes to Buffer 0 / Buffer 1, Render Thread reads from them]
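The two-buffer hand-off between update and render threads might look roughly like the following. This is a sketch under assumptions, not Kameo's implementation: `RenderCommand` and `RenderBuffers` are hypothetical, and a standard mutex and condition variable stand in for whatever synchronization the title actually used. The lock is held only while a buffer changes hands, never while it is being filled or drawn.

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

struct RenderCommand { int meshId; float x, y, z; };  // hypothetical payload

class RenderBuffers {
public:
    // Update thread: get the next buffer to fill (blocks while the render
    // thread still owns it).
    std::vector<RenderCommand>& BeginWrite() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return !full_[write_]; });
        buffers_[write_].clear();
        return buffers_[write_];
    }
    // Update thread: publish the filled buffer and move to the other one.
    void EndWrite() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            full_[write_] = true;
            write_ ^= 1;
        }
        changed_.notify_all();
    }
    // Render thread: wait for the next published buffer.
    const std::vector<RenderCommand>& BeginRead() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return full_[read_]; });
        return buffers_[read_];
    }
    // Render thread: hand the buffer back to the update thread.
    void EndRead() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            full_[read_] = false;
            read_ ^= 1;
        }
        changed_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    std::vector<RenderCommand> buffers_[2];
    bool full_[2] = {false, false};
    int write_ = 0;  // index used by the update thread
    int read_ = 0;   // index used by the render thread
};
```

With this arrangement the update thread can run at most one frame ahead of the render thread, which matches the latency trade-off described on the Rendering slide.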

  15. Case Study: Kameo (Xbox 360) • Total usage was ~2.2-2.5 cores

  16. Case Study: Project Gotham Racing • Total usage was ~2.0-3.0 cores

  17. Available Synchronization Objects • Events • Semaphores • Mutexes • Critical Sections • Don't use SuspendThread() • Some titles have used this for synchronization • Can easily lead to deadlocks • Interacts badly with Visual Studio debugger

  18. Exclusive Access: Mutex

          // Initialize
          HANDLE mutex = CreateMutex(0, FALSE, 0);

          // Use
          void ManipulateSharedData()
          {
              WaitForSingleObject(mutex, INFINITE);
              // Manipulate stuff...
              ReleaseMutex(mutex);
          }

          // Destroy
          CloseHandle(mutex);

  19. Exclusive Access: CRITICAL_SECTION

          // Initialize
          CRITICAL_SECTION cs;
          InitializeCriticalSection(&cs);

          // Use
          void ManipulateSharedData()
          {
              EnterCriticalSection(&cs);
              // Manipulate stuff...
              LeaveCriticalSection(&cs);
          }

          // Destroy
          DeleteCriticalSection(&cs);
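Slides 18 and 19 pair every acquire with a manual release, which is easy to break with an early return or an exception. As a hedged illustration, the same exclusive-access pattern can be written with a scoped lock. The sketch below uses standard C++ (`std::mutex`, `std::lock_guard`) as a portable stand-in for the Win32 objects; `g_sharedMutex` and `g_sharedData` are hypothetical names.

```cpp
#include <mutex>
#include <vector>

std::mutex g_sharedMutex;       // hypothetical lock guarding g_sharedData
std::vector<int> g_sharedData;  // hypothetical shared state

void ManipulateSharedData(int value) {
    // Acquired here; released automatically when 'lock' goes out of scope,
    // including on early return or exception.
    std::lock_guard<std::mutex> lock(g_sharedMutex);
    g_sharedData.push_back(value);
}
```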

  20. Lockless programming • Trendy technique to use clever programming to share resources without locking • Includes InterlockedXXX(), lockless message passing, Double Checked Locking, etc. • Very hard to get right: • Compiler can reorder instructions • CPU can reorder instructions • CPU can reorder reads and writes • Not as fast as avoiding synchronization entirely

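  21. Lockless Messages: Buggy

          void SendMessage(void* input)
          {
              // Wait for the message to be 'empty'.
              // BUG: nothing forces g_msg.filled to be re-read on each iteration.
              while (g_msg.filled)
                  ;
              memcpy(g_msg.data, input, MESSAGESIZE);
              // BUG: the compiler or CPU may make this write visible
              // before the memcpy above has completed.
              g_msg.filled = true;
          }

          void GetMessage()
          {
              // Wait for the message to be 'filled'.
              while (!g_msg.filled)
                  ;
              // BUG: this read of the data may be reordered ahead of the flag check.
              memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
              g_msg.filled = false;
          }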
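One way to repair the message slot from slide 21 is to make the flag atomic with release/acquire ordering. This is a sketch under assumptions, not the presenter's fix: it uses C++11 `std::atomic` rather than the Win32 `InterlockedXXX()` functions, it is only valid for exactly one sender thread and one receiver thread, `MESSAGESIZE` is given an arbitrary value, and the function names carry a `Fixed` suffix to mark them as hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

constexpr std::size_t MESSAGESIZE = 64;  // assumed size for this sketch

struct Message {
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};

Message g_msg;

// Single sender thread only.
void SendMessageFixed(const void* input) {
    // Acquire pairs with the receiver's release store of 'false', so the
    // receiver has finished reading data before we overwrite it.
    while (g_msg.filled.load(std::memory_order_acquire)) {
    }
    std::memcpy(g_msg.data, input, MESSAGESIZE);
    // Release: the data written above is visible before the flag flips.
    g_msg.filled.store(true, std::memory_order_release);
}

// Single receiver thread only.
void GetMessageFixed(void* output) {
    // Acquire pairs with the sender's release store of 'true'.
    while (!g_msg.filled.load(std::memory_order_acquire)) {
    }
    std::memcpy(output, g_msg.data, MESSAGESIZE);
    g_msg.filled.store(false, std::memory_order_release);
}
```

Both sides still spin while waiting, so this removes the ordering bugs but not the cost of busy-waiting.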

  22. Synchronization tips/costs: • Synchronization is moderately expensive when there is no contention • Hundreds to thousands of cycles • Synchronization can be arbitrarily expensive when there is contention! • Goals: • Synchronize rarely • Hold locks briefly • Minimize shared data

  23. Threading File I/O & Decompression • First: use large reads and asynchronous I/O • Then: consider compression to accelerate loading • Don't do format conversions etc. that are better done at build time! • Have resource proxies to allow rendering to continue

  24. File I/O Implementation Details • vector<Resource*> g_resources; • Worst design: decompressor locks g_resources while decompressing • Better design: decompressor adds resources to vector after decompressing • Still requires renderer to sync on every resource access • Best design: two Resource* vectors • Renderer has private vector, no locking required • Decompressor uses shared vector, syncs when adding new Resource* • Renderer moves Resource* from shared to private vector once per frame
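The "best design" on slide 24 could be sketched as follows. The class and member names (`ResourceQueue`, `Publish`, `CollectNewResources`) are hypothetical, `Resource` is reduced to a placeholder, and `std::mutex` stands in for a critical section.

```cpp
#include <mutex>
#include <vector>

struct Resource { int id; };  // placeholder for a real resource type

class ResourceQueue {
public:
    // Decompressor thread: call only after decompression has finished, so
    // the lock is held just long enough to append one pointer.
    void Publish(Resource* resource) {
        std::lock_guard<std::mutex> lock(mutex_);
        shared_.push_back(resource);
    }

    // Render thread, once per frame: move everything from the shared vector
    // into the renderer's private vector.
    void CollectNewResources() {
        std::vector<Resource*> incoming;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            incoming.swap(shared_);  // O(1) while the lock is held
        }
        renderOnly_.insert(renderOnly_.end(), incoming.begin(), incoming.end());
    }

    // Render thread only: no locking needed, nobody else touches this vector.
    const std::vector<Resource*>& Ready() const { return renderOnly_; }

private:
    std::mutex mutex_;
    std::vector<Resource*> shared_;      // guarded by mutex_
    std::vector<Resource*> renderOnly_;  // owned by the render thread
};
```

The renderer takes the lock once per frame instead of once per resource access, which follows the "synchronize rarely, hold locks briefly" goals on slide 22.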

  25. Profiling multi-threaded apps • Need thread-aware profilers • Profiling may hide many synchronization stalls • Home-grown spin locks make profiling harder • Consider instrumenting calls to synchronization functions • Don't use locks in instrumentation • Windows: Intel VTune, AMD CodeAnalyst, and the Visual Studio Team System Profiler • Xbox 360: PIX, XbPerfView, etc.
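Instrumenting synchronization calls without taking locks in the instrumentation itself can be sketched with atomic counters. `InstrumentedMutex` and `LockStats` are hypothetical names, and the wrapper uses `std::mutex` rather than a Win32 object. Note that `try_lock` is allowed to fail spuriously, so the contended count is an estimate.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>

struct LockStats {
    std::atomic<std::uint64_t> acquisitions{0};
    std::atomic<std::uint64_t> contended{0};
    std::atomic<std::uint64_t> waitNanoseconds{0};
};

class InstrumentedMutex {
public:
    void lock() {
        stats_.acquisitions.fetch_add(1, std::memory_order_relaxed);
        if (mutex_.try_lock()) {
            return;  // fast path: nobody else held the lock
        }
        // Slow path: measure how long this thread stalls.
        const auto start = std::chrono::steady_clock::now();
        mutex_.lock();
        const auto waited = std::chrono::steady_clock::now() - start;
        stats_.contended.fetch_add(1, std::memory_order_relaxed);
        stats_.waitNanoseconds.fetch_add(
            static_cast<std::uint64_t>(
                std::chrono::duration_cast<std::chrono::nanoseconds>(waited).count()),
            std::memory_order_relaxed);
    }

    void unlock() { mutex_.unlock(); }

    // Counters are updated with atomics only; reading them needs no lock.
    const LockStats& stats() const { return stats_; }

private:
    std::mutex mutex_;
    LockStats stats_;
};
```

Because the class provides `lock()` and `unlock()`, it can be used with `std::lock_guard<InstrumentedMutex>` in place of a plain mutex.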

  26. PIX timing capture

  27. Naming Threads

          typedef struct tagTHREADNAME_INFO
          {
              DWORD dwType;     // must be 0x1000
              LPCSTR szName;    // pointer to name (in user addr space)
              DWORD dwThreadID; // thread ID (-1=caller thread)
              DWORD dwFlags;    // reserved for future use, must be zero
          } THREADNAME_INFO;

          void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName)
          {
              THREADNAME_INFO info;
              info.dwType = 0x1000;
              info.szName = szThreadName;
              info.dwThreadID = dwThreadID;
              info.dwFlags = 0;

              __try
              {
                  RaiseException(0x406D1388, 0, sizeof(info)/sizeof(DWORD), (DWORD*)&info);
              }
              __except(EXCEPTION_CONTINUE_EXECUTION)
              {
              }
          }

          SetThreadName(-1, "Main thread");

  28. Windows tips • Avoid using wglMakeCurrent or this.Invoke() • Best to do all rendering calls from a single thread • Test on multiple machines and configurations • Single-core, SMT (i.e. Hyper-Threading), Dual-core, Intel and AMD chips, Multi-socket multicore (4+ cores)
