This guide explores the significance of multi-core software development, emphasizing why it matters in today's technology landscape. With multi-core systems becoming the standard, optimizing your software can yield substantial performance gains. We delve into the differences between multi-threaded and multi-core architectures, offering practical insights on designing, implementing, and maintaining applications. Learn about effective planning, error handling, and communication strategies to maximize your system's efficiency. Utilize libraries and tools available for C++ to streamline your development process.
Multi-core Software Development with examples in C++ • By Jon Nosacek
Why should you care? • Multi-core systems are becoming the standard for all devices • Less heat • One core at full speed can be matched by two cores at half the frequency using roughly ¼ the power (P = C × V² × F, and a lower frequency allows a lower voltage; worked through below) • Designing a new system around a multi-core architecture can be quite difficult.
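To see where the ¼ figure comes from, here is a rough back-of-the-envelope calculation. It assumes voltage can be scaled down in proportion to frequency, which real chips only approximate:

P_{\text{one core}} = C V^2 F

P_{\text{per half-speed core}} = C \left(\tfrac{V}{2}\right)^2 \tfrac{F}{2} = \tfrac{1}{8} C V^2 F

P_{\text{two half-speed cores}} = 2 \cdot \tfrac{1}{8} C V^2 F = \tfrac{1}{4} P_{\text{one core}}

So two half-speed cores can deliver comparable aggregate throughput at about a quarter of the dynamic power.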
Why should you care? (cont.) • Processor technology isn’t evolving the way it used to • Performance gains are no longer automatic • We want fast! • Our users deserve the same
Multi-threaded VS Multi-core • Same basic principle, but can yield very different results • Multi-threaded assumes no knowledge of the release environment and can make the program slower on a single-core platform • Multi-core means specifically designing your system for a platform that you know has two or more cores. Can yield significant performance boosts if done correctly
Hardware • To understand how the software works, you must first understand how the hardware works • The move to multi-core was very much a hardware-driven evolution (hardware could not keep up with our increasing demands on a single core)
Why transition to multi-core? • Higher processor frequencies demanded better cooling • There is a limit set by materials and cooling methods • Computers are taking over more of the work we do ourselves • The human brain is not sequential; it works in parallel
Why multi-core? (cont.) • [Diagram: traditional single-core architecture vs. multi-core architecture]
Intel Core 2 Extreme Quad: http://www.techspot.com/articles-info/23/images/img2.jpg • Intel Core i7 965 quad core (8 threads): http://tinyurl.com/3tgfygn
Terminology • Thread • The smallest unit of execution that a program can be broken down into • Contains all the information it needs in order to run • Atomic statement • A single operation by the processor; it cannot be interrupted partway through execution
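To illustrate what "atomic" means in practice, here is a minimal sketch using C++11's std::atomic (an added example, not from the slide; the variable names are made up). A plain counter++ compiles to a load, an add, and a store that another thread can interleave with, while fetch_add on an atomic counter is one indivisible operation.

#include <atomic>
#include <iostream>

int plain_counter = 0;              // "plain_counter++" is three steps: load, add, store
std::atomic<int> atomic_counter{0}; // fetch_add is one indivisible (atomic) operation

int main()
{
    plain_counter++;                // could be interleaved if another thread ran this too
    atomic_counter.fetch_add(1);    // cannot be sliced apart mid-execution
    std::cout << plain_counter << " " << atomic_counter.load() << std::endl;
    return 0;
}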
Terminology (cont.) • Hyper-threading (SMT) • Intel’s approach of running two threads per core to simulate extra cores and reduce CPU waste • Virtual (logical) processors are not necessarily tied to physical ones • An example of hardware helping software
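One way to see hyper-threading from software is this small C++11 sketch (an addition, not from the slide): std::thread::hardware_concurrency() reports the number of logical processors the OS exposes, which on an SMT chip is typically twice the physical core count, such as the 8 threads on the quad-core i7 965 pictured earlier.

#include <iostream>
#include <thread>

int main()
{
    // Logical processors visible to the program; with hyper-threading this is
    // usually 2x the physical core count. May return 0 if it cannot be determined.
    unsigned int logical = std::thread::hardware_concurrency();
    std::cout << "Logical processors: " << logical << std::endl;
    return 0;
}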
How to design a multi-core system • Planning • Implementation • Testing • Deployment • Maintenance
Planning • A “code-and-fix”, laissez-faire mentality WILL NOT WORK • Too many things can go wrong, and problems are hard to pinpoint after the fact • Planning is the single most important step • Problems here cascade into later steps and get worse • A clear vision is a must • How deep into threading do you want to go?
Planning (cont.) • The opportunity comes during the decomposition phase • You need to model • The state of the threads and which combinations affect each other • Thread interaction • The number of threads • More threads => more problems • Balance performance against understandability, maintainability, and time • Fairness and priority • More threads => more communication
Planning (cont.) • Error handling becomes more important • Who handles an error? Other threads may take a while to respond, and what happens if they all respond? • Synchronization and semaphores should be used sparingly • Threads should be as independent as possible • You need to set rules for memory access (see the sketch below) • Dataflow diagrams!
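A minimal sketch of one such memory-access rule, using C++11 primitives (the names here are illustrative, not from the slides): every touch of the shared structure goes through a single mutex, and the lock is held only for the shortest possible critical section so threads stay independent the rest of the time.

#include <mutex>
#include <thread>
#include <vector>

std::vector<int> shared_results;  // shared between threads
std::mutex results_mutex;         // the rule: lock this before touching shared_results

void record_result(int value)
{
    std::lock_guard<std::mutex> lock(results_mutex); // held only for the push_back
    shared_results.push_back(value);
}   // lock released here

int main()
{
    std::thread a(record_result, 1), b(record_result, 2);
    a.join();
    b.join();
    return 0;
}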
Concurrent vs. Parallel Design • Which do you think is better? http://blog.rednael.com/content/binary/parallel%20vs%20concurrent.jpg
Concurrent • Easy to design and implement • Works well for IO • But the CPU has to keep track of threads and time-slice more (swap time) • Parallel • Minimal interaction to plan and synchronize • Less CPU waste • But even more difficult to track
Implementation • Languages are becoming more and more open to multi-core programming • There are libraries for C++ that help ease the workload • A lot of threading is tied to the OS, and Microsoft knows theirs better than anyone • Support usually arrives for Linux and Windows first, then Macs • Watch for CPU-specific instructions that can improve performance
Implementation (cont.) • Make sure resources are being managed • Update the models as the system changes • The IDE you choose during this phase matters a lot and affects what you can see your system doing • Using existing libraries usually reduces the workload, and they are often more efficient • Make sure all basic/shared initialization is done before the threads are created
Implementation (cont.) • Watch for evolving trends • If a lot of communication is going on between two threads, see if things can be merged/swapped • See which threads take up the most resources and what will increase program responsiveness • Keep the future in mind • More cores will always be added. • Think about the simplest case and expand into the complex • Also realize that more features are being added to C++ to help abstract multithreading
// Basic example: two POSIX threads doing independent work
#include <iostream>
#include <pthread.h>
using namespace std;

void *task1(void *X) // task executed by ThreadA
{
    cout << "Thread A complete" << endl;
    return NULL;
}

void *task2(void *X) // task executed by ThreadB
{
    cout << "Thread B complete" << endl;
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t ThreadA, ThreadB;                   // declare threads
    pthread_create(&ThreadA, NULL, task1, NULL);  // create threads
    pthread_create(&ThreadB, NULL, task2, NULL);
    pthread_join(ThreadA, NULL);                  // wait for threads to “join up”
    pthread_join(ThreadB, NULL);
    return 0;
}
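For comparison, and as a nod to the newer C++ features mentioned above that abstract multithreading, here is the same pair of tasks written with C++11's std::thread (an added sketch, not from the original slides), which hides the pthread bookkeeping:

#include <iostream>
#include <thread>

void task1() { std::cout << "Thread A complete" << std::endl; }
void task2() { std::cout << "Thread B complete" << std::endl; }

int main()
{
    std::thread threadA(task1);  // create and start the threads
    std::thread threadB(task2);
    threadA.join();              // wait for both to finish
    threadB.join();
    return 0;
}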
// Doing little things can make a big difference too.
// Fragment using Microsoft's Parallel Patterns Library (<ppl.h>, <concurrent_vector.h>,
// namespace concurrency); fibonacci(), time_call(), and elapsed are helpers defined
// elsewhere in the surrounding example.
array<int, 4> a = { 24, 26, 41, 42 };
vector<tuple<int, int>> results1;             // plain vector for the sequential run
concurrent_vector<tuple<int, int>> results2;  // safe for concurrent push_back

// Sequential: one Fibonacci number at a time
elapsed = time_call([&] {
    for_each(a.begin(), a.end(), [&](int n) {
        results1.push_back(make_tuple(n, fibonacci(n)));
    });
});

// Parallel: the same work spread across the available cores
elapsed = time_call([&] {
    parallel_for_each(a.begin(), a.end(), [&](int n) {
        results2.push_back(make_tuple(n, fibonacci(n)));
    });
});

// A 4-core system outputs: 9250 ms, 5726 ms
Testing • Race conditions are the most prevalent problem • Identify the critical paths • Balance threads and tweak for performance • Non-determinism: from the same initial state, different runs can produce different final states
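A minimal sketch of the kind of race condition and non-determinism meant here (illustrative code, not from the slides): two threads increment an unprotected counter, and because counter++ is not atomic, runs starting from the same initial state can print different final values.

#include <iostream>
#include <thread>

int counter = 0;  // shared and unprotected: this is the race

void hammer()
{
    for (int i = 0; i < 100000; ++i)
        counter++;  // load/add/store can interleave with the other thread
}

int main()
{
    std::thread a(hammer), b(hammer);
    a.join();
    b.join();
    std::cout << counter << std::endl;  // often less than 200000, and varies per run
    return 0;
}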
Deployment • Mostly the same as for single-core software • See which platforms are actually running your program and tune as necessary
Maintenance • You need to keep up with the changing technology (it’s still pretty new) • Adding new functionality is more difficult, especially when it is very different from what already exists • Much more testing is needed • Going back to the original plan to see how new features fit in and what is affected becomes much more important
Maintenance (cont.) • What about adding parallelism to an existing system? • Very difficult • Focus on the largest time consumers (IO, disk, complex algorithms) • Applications with low coupling are the best candidates for adding parallel aspects
Challenges • Lots of planning needed • Requires a thorough understanding of the environment • Very hard to debug • Built-in support is hit-and-miss (language & IDE) • Security concerns (from other programs as well as your own) • A lot of life-critical embedded systems are sticking with single-core platforms
What apps can help me out? • Intel’s Threading Building Blocks • OpenMP • Microsoft Visual Studio • MULTI (Green Hills) • TotalView (Rogue Wave)
Intel’s Threading Building Blocks • Template Library • Algorithms, containers, mutex, atomic statements, timing, scheduling • Implements “Task Stealing” • If one core is idle, it will take a scheduled task from another to reduce CPU waste • Automatically creates the threads for you to maximize performance • Much like parallel_for • Tries to be like the STL • ease of use, generality, but more aggressive
Intel’s Threading Building Blocks (cont.) • A bit more memory/cache oriented than the STL • Intel knows their own cores and how to schedule on them • Adds many more concurrency-oriented data types (concurrent_queue, concurrent_vector, concurrent_hash_map) • Also geared for easy scalability • More atomic operations (again from knowing their own cores) • Follows a pipeline architecture, much like graphics • See the sketch below
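A small sketch of what combining TBB's parallel_for with a concurrent_vector can look like (an added example; the header paths and computation are illustrative, so check them against the TBB version you have):

#include <tbb/parallel_for.h>
#include <tbb/concurrent_vector.h>
#include <cstddef>
#include <iostream>

int main()
{
    tbb::concurrent_vector<int> squares;  // safe for concurrent push_back

    // TBB splits the range into chunks, creates the worker threads itself,
    // and steals work onto idle cores to reduce CPU waste.
    tbb::parallel_for(std::size_t(0), std::size_t(100), [&](std::size_t i) {
        squares.push_back(static_cast<int>(i * i));
    });

    std::cout << "computed " << squares.size() << " results" << std::endl;
    return 0;
}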
OpenMP
// Each thread prints a greeting; the master thread reports the thread count.
// With GCC or Clang, compile with the -fopenmp flag.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    int th_id, nthreads;
    #pragma omp parallel private(th_id) shared(nthreads)
    {
        th_id = omp_get_thread_num();
        #pragma omp critical
        {
            cout << "Hello World from thread " << th_id << '\n';
        }
        #pragma omp barrier
        #pragma omp master
        {
            nthreads = omp_get_num_threads();
            cout << "There are " << nthreads << " threads" << '\n';
        }
    }
    return 0;
}
Microsoft Visual Studio • Thread View
MULTI IDE – Green Hills • Cool debugging/recording features http://www.ghs.com/products/MULTI_IDE.html
TotalView (Rogue Wave) • Thread viewer
Sources: • Buttari, Alfredo, Jack Dongarra, Jakub Kurzak, et al. The Impact of Multicore on Math Software. • Hughes, Cameron, and Tracey Hughes. Professional Multicore Programming: Design and Implementation for C++ Developers. Indianapolis, IN: Wiley Pub., 2008. • http://msdn.microsoft.com/en-us/concurrency/default.aspx • http://channel9.msdn.com/search?term=concurrency • http://www.cs.kent.edu/~farrell/amc09/lectures/
Any Questions? • This all sounds like a lot of work. Why should we bother when something easier might come along? • It’s very much a game of figuring out how much effort gets the largest returns • True progress will take both EEs and SEs (and CSs too, if any showed up today) • It might be a long time before we see change