
Optimizing Sharing Patterns and Locality via Thread Migration



  1. Optimizing Sharing Patterns and Locality via Thread Migration Vadim Gleizer Supervisor: Prof. Assaf Schuster

  2. Contributions of this research • Internal Distributed Shared Memory (DSM) Mechanisms • Thread Migration (TM) in DSM Systems • Load Balancing in DSM Systems

  3. Internal DSM Mechanisms • An internal DSM mechanism, or a DSM handler, is responsible for guaranteeing a consistent memory view on each workstation, as follows: • When a DSM region becomes invalid, it is protected • Each access to the protected area causes an exception • The internal DSM mechanism catches and handles these exceptions
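To make the protect-on-invalidate step concrete, here is a minimal Win32 sketch (not Millipede's actual code), assuming a page-granular DSM region:

    #include <windows.h>

    /* Revoke all access rights on an invalidated DSM region: any subsequent
       read or write raises an access-violation exception that the internal
       DSM mechanism can catch and service. */
    void dsm_invalidate(void *region, SIZE_T size)
    {
        DWORD old_protection;
        VirtualProtect(region, size, PAGE_NOACCESS, &old_protection);
    }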

  4. Implementation of DSM Handlers • An exception handling service provided by the operating system significantly simplifies this task • Consider the Win32 Structured Exception Handling (SEH) service of Windows NT: • A block of code that is allowed to use DSM is wrapped in an exception block using the Win32 __try/__except keywords, similarly to try/catch blocks in C++: __try { user_main(); } __except (DSM_handler()) { } • Let us see how such services work and the drawbacks of using them
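As an illustration, the filter expression of the __except clause might look as follows; is_dsm_page and dsm_fetch_page are hypothetical helpers, not part of Millipede's published API:

    #include <windows.h>

    BOOL is_dsm_page(void *addr);      /* hypothetical: is addr in a DSM region? */
    void dsm_fetch_page(void *addr);   /* hypothetical: bring page, fix protection */

    int DSM_filter_expr(EXCEPTION_POINTERS *ep)
    {
        EXCEPTION_RECORD *rec = ep->ExceptionRecord;
        /* For access violations, ExceptionInformation[1] is the faulting address. */
        void *addr = (void *)rec->ExceptionInformation[1];

        if (rec->ExceptionCode == EXCEPTION_ACCESS_VIOLATION && is_dsm_page(addr)) {
            dsm_fetch_page(addr);
            return EXCEPTION_CONTINUE_EXECUTION;  /* retry the faulting access */
        }
        return EXCEPTION_CONTINUE_SEARCH;         /* not ours: let others see it */
    }

    /* usage: __try { user_main(); }
              __except (DSM_filter_expr(GetExceptionInformation())) { } */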

  5. Inside the SEH Service • For each type of exception the CPU generates a code, e.g., division by zero has code 0, a page fault has code E, and a GPF (General Protection Fault) exception has code D • In the case of a page fault exception, _KiTrap0E is called

  6. Inside the SEH Service (cont.) • The following sequence of calls occurs before control is passed to the DSM_handler: • _KiTrap0E • KiUserExceptionDispatcher • RtlDispatchException • RtlpExecuteHandlerForException • ExecuteHandler • __except_handler3 • DSM_handler

  7. Drawbacks of using SEH in DSM Systems • Performance • The SEH service is highly time-consuming, while most of its functionality is unnecessary for the DSM handler • User’s exception handlers are called before the DSM handler • The programmer may accidentally intercept a DSM exception • The internal DSM handler should work transparently to the programmer • Thus, if the programmer does not know that the DSM handler uses SEH, he/she may accidentally intercept a DSM exception

  8. User-Mode First-Chance Exception Handling • UMFC-EH: • Only the kernel-level part of SEH is used, i.e., the DSM_handler is called directly by _KiTrap0E • Thus, exceptions are intercepted before any of the SEH user-mode functions is called: • _KiTrap0E, DSM_handler • instead of _KiTrap0E, KiUserExceptionDispatcher, RtlDispatchException, RtlpExecuteHandlerForException, ExecuteHandler, __except_handler3, DSM_handler • To implement this scheme, the Detours library may be used, as sketched below
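A minimal sketch of how the Detours library could be used here, assuming the hook is installed on ntdll!KiUserExceptionDispatcher (the first user-mode function in the SEH chain); is_dsm_fault and DSM_handler are hypothetical names, and the thesis's actual UMFC-EH implementation may differ in detail:

    #include <windows.h>
    #include <detours.h>
    #pragma comment(lib, "ntdll")

    /* ntdll export that resumes a thread with the given context. */
    NTSYSAPI LONG NTAPI NtContinue(PCONTEXT ctx, BOOLEAN raise_alert);

    typedef VOID (NTAPI *Dispatcher_t)(PEXCEPTION_RECORD, PCONTEXT);
    static Dispatcher_t RealDispatcher;        /* Detours trampoline */

    BOOL is_dsm_fault(PEXCEPTION_RECORD rec);  /* hypothetical */
    void DSM_handler(PEXCEPTION_RECORD rec);   /* hypothetical */

    static VOID NTAPI HookedDispatcher(PEXCEPTION_RECORD rec, PCONTEXT ctx)
    {
        if (is_dsm_fault(rec)) {
            DSM_handler(rec);        /* service the DSM page fault */
            NtContinue(ctx, FALSE);  /* restart the faulting instruction */
        }
        RealDispatcher(rec, ctx);    /* anything else: run the normal SEH chain */
    }

    void install_umfc_eh(void)
    {
        RealDispatcher = (Dispatcher_t)GetProcAddress(
            GetModuleHandleW(L"ntdll.dll"), "KiUserExceptionDispatcher");
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach((PVOID *)&RealDispatcher, HookedDispatcher);
        DetourTransactionCommit();
    }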

  9. UMFC-EH (cont.) • Advantages: • Solves both drawbacks of the SEH service • No __try/__except blocks are needed • Drawbacks: • The kernel-level part of SEH is still used • All exceptions are intercepted, e.g., division by zero

  10. Kernel-Mode First-Chance Exception Handling • KMFC-EH: • Exceptions are intercepted in kernel mode by a special supervisor-level device driver, which we call DSM_filter • The DSM_filter informs the DSM_handler about DSM exceptions • Thus, the SEH service is not used
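The thesis does not spell out the driver/handler interface here, and (as slide 28 shows) the actual mechanism redirects the faulting thread itself; purely as an illustration of one possible user-mode side, a service thread could wait on the driver with a blocking DeviceIoControl. The device name, IOCTL code, and notification layout below are all hypothetical:

    #include <windows.h>
    #include <winioctl.h>

    #define IOCTL_DSM_WAIT_FAULT \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

    typedef struct {
        DWORD thread_id;   /* faulting thread */
        PVOID address;     /* faulting DSM address */
    } DSM_FAULT_INFO;

    void notify_DSM_handler(const DSM_FAULT_INFO *info);  /* hypothetical */

    DWORD WINAPI dsm_filter_listener(LPVOID unused)
    {
        HANDLE dev = CreateFileW(L"\\\\.\\DSM_filter",
                                 GENERIC_READ | GENERIC_WRITE,
                                 0, NULL, OPEN_EXISTING, 0, NULL);
        for (;;) {
            DSM_FAULT_INFO info;
            DWORD returned;
            /* Blocks until the driver intercepts a DSM page fault. */
            if (!DeviceIoControl(dev, IOCTL_DSM_WAIT_FAULT, NULL, 0,
                                 &info, sizeof info, &returned, NULL))
                break;
            notify_DSM_handler(&info);
        }
        CloseHandle(dev);
        return 0;
    }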

  11. KMFC-EH (cont.) • Advantages: • preserves all the advantages of the UMFC-EH scheme • SEH is not used, i.e., the CPU directly informs the DSM_filter about page fault exceptions • only page fault exceptions are intercepted • Drawbacks: • all page fault exceptions are intercepted by the DSM_filter, including those of other processes • fortunately, the overhead of this drawback is low

  12. Performance Evaluation • Our experimental environment consists of the Millipede 4.0 DSM system: • a cluster of eight uniprocessor workstations interconnected by a switched Myrinet LAN • each workstation is equipped with: • Pentium-II 300MHz • 128MB of RAM • 512KB of L2 cache • the Windows NT 4.0 SP6 operating system • We have tested our DSM handlers on several benchmarks commonly used for DSM, as well as on microbenchmarks

  13. Performance Evaluation (cont.) • Microbenchmarks (100,000 page faults): • Related results (Brazos):

  14. Performance Evaluation (cont.)

  15. Performance Evaluation (cont.)

  16. Thread Migration (TM) in DSM Systems Introduction: • A thread can be stopped at almost any moment of its execution and launched on another machine from the same point where it was stopped • Applications of this facility: • load balancing • communication reduction • fault tolerance • cluster management • a powerful programming primitive

  17. Designing a TM Mechanism • Restrictions on TM – there are situations in which migration makes no sense: • the thread owns local operating system resources, e.g., a synchronization object • the thread executes a location-dependent operation, e.g., prints a message • Therefore the programmer should be aware of thread migration and explicitly mark the situations in which a thread cannot migrate (see the sketch below)
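For illustration, such marking could be exposed to the programmer as a pair of calls around non-migratable code; these names are hypothetical, not the actual Millipede 4.0 primitives:

    #include <stdio.h>

    /* Per-thread counter; while it is non-zero, the TM mechanism must not
       migrate the calling thread (checked at migration time). */
    static __declspec(thread) int migration_disabled;

    void migration_disable(void) { migration_disabled++; }
    void migration_enable(void)  { migration_disabled--; }

    void log_progress(int step)
    {
        migration_disable();   /* printing is a location-dependent operation */
        printf("step %d done on this host\n", step);
        migration_enable();
    }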

  18. Designing a TM Mechanism (cont.) • The state of a thread consists of: • code • global data • heap data • stack data • the processor’s register set • other thread-specific data

  19. Designing a TM Mechanism (cont.) [Diagram: the same stack variable A resides at addresses 1000–1004 on Host 1 but at 2000–2004 on Host 2, so stack addresses differ between hosts]

  20. Designing a TM Mechanism (cont.) • Stack address translation • Drawbacks: • register values and stack values have to be inspected and possibly updated (very inefficient for large stacks) • identification of pointers (a correctness problem: a non-pointer value may resemble a pointer, as shown below); possible solutions: • special compiler or hardware support – more complex compiled code, often prevents compiler optimizations • special programming primitives that register all pointers – harm efficiency and simplicity of programming, limit free usage of pointers • the whole stack has to be copied at migration time
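A small example (not from the thesis) of why a naive stack copy breaks, echoing the diagram on slide 19:

    void example(void)
    {
        int  x = 42;   /* lives on the stack, say at address 1000 on Host 1 */
        int *p = &x;   /* holds 1000; after migration x may live at 2000    */

        /* If the stack is copied to a different base address on the target
           host, p still holds the old-host address and must be found and
           rewritten by (new_base - old_base). Worse, a plain integer whose
           value happens to equal 1000 looks exactly like such a pointer and
           must NOT be rewritten - the two cases cannot be told apart
           reliably without compiler or programmer support. */
        *p = 7;
    }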

  21. Designing a TM Mechanism (cont.) • Creating all mobile threads at DSM initialization time • Advantages: • no pointer investigation and modification • Drawbacks: • lack of scalability – the maximum number of threads is created on each host • lack of portability – may not work in future versions of the same operating system and cannot be used for heterogeneous systems • the whole stack has to be copied at migration time

  22. Designing a TM Mechanism (cont.) • Placement of stacks in a predefined memory region • Advantages: • no pointer investigation and modification • scalability – threads are created on application demand or at migration time • portability • Drawbacks: • the whole stack has to be copied at migration time

  23. Designing a TM Mechanism (cont.) • Placement of stacks in a DSM region • Advantages: • preserves all the advantages of the previous approach • the stack does not have to be copied at migration time

  24. Implementation of TM • Placement of stacks in a predefined memory region, or the default stack approach: • the same address region is reserved at DSM initialization time on each host • at creation, each thread receives a slot for its stack according to its id (see the sketch below) • UNIX-like operating systems provide an option in their thread creation API to control the stack location • this approach is difficult to implement in Windows NT, since there is no conventional way to control the stack location
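A minimal sketch of the reserved-region idea, with a made-up base address and sizes: the same virtual range is reserved on every host, and thread i's stack occupies the i-th fixed-size slot, so stack addresses are identical cluster-wide:

    #include <windows.h>

    #define STACKS_BASE  ((LPVOID)0x30000000)   /* same on every host (illustrative) */
    #define STACK_SIZE   (256 * 1024)           /* one slot per thread */
    #define MAX_THREADS  512

    void reserve_stack_region(void)
    {
        /* Reserve only; pages are committed on demand as stacks grow. */
        VirtualAlloc(STACKS_BASE, (SIZE_T)MAX_THREADS * STACK_SIZE,
                     MEM_RESERVE, PAGE_READWRITE);
    }

    /* Top of thread id's slot (x86 stacks grow downward). */
    void *stack_top_for(int id)
    {
        return (char *)STACKS_BASE + ((SIZE_T)id + 1) * STACK_SIZE;
    }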

  25. Implementation of TM (cont.) • Stack location control in Windows NT: • an application asks the DSM system to create a thread • the thread is created in a suspended state (the initial stack is empty) • the address of the initial stack is obtained through its ESP register, and the initial stack is freed • the value of the ESP register is changed to the new stack location • a pointer to the Win32 data structure – the Thread Information Block (TIB) – is obtained through the FS register • two fields inside the TIB are modified accordingly: pvStackUserTop and pvStackUserBase • the thread is resumed (see the sketch below)
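A hedged sketch of this sequence for x86 Windows NT. The slide's pvStackUserTop/pvStackUserBase correspond to the StackBase/StackLimit fields of the public NT_TIB structure; the thread-basic-information layout is undocumented and reproduced below only for illustration, and error handling plus freeing of the original stack are omitted:

    #include <windows.h>
    #include <winternl.h>
    #pragma comment(lib, "ntdll")

    typedef struct {              /* pads to the expected 0x1C bytes on x86 */
        LONG      ExitStatus;
        PVOID     TebBaseAddress; /* the thread's TIB/TEB address */
        ULONG_PTR Reserved[5];
    } TBI;

    void relocate_stack(HANDLE thread, void *slot_base, void *slot_top)
    {
        CONTEXT ctx;
        TBI tbi;
        NT_TIB *tib;

        /* The thread was created with CREATE_SUSPENDED and has not run yet. */
        ctx.ContextFlags = CONTEXT_CONTROL;
        GetThreadContext(thread, &ctx);
        ctx.Esp = (DWORD)slot_top;              /* point ESP into the new slot */
        SetThreadContext(thread, &ctx);

        /* Locate the TIB and make it agree with the relocated stack. */
        NtQueryInformationThread(thread, (THREADINFOCLASS)0 /* basic info */,
                                 &tbi, sizeof tbi, NULL);
        tib = (NT_TIB *)tbi.TebBaseAddress;     /* same address space */
        tib->StackBase  = slot_top;             /* pvStackUserTop */
        tib->StackLimit = slot_base;            /* pvStackUserBase */

        ResumeThread(thread);
    }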

  26. Implementation of TM (cont.) • Placement of stacks in a DSM region: • a separate region is added to the DSM • the stack location of a thread is changed to a slot inside the new DSM region, similarly to the previous approach • however, the stack cannot be handled as a regular DSM region

  27. Implementation of TM (cont.) • Why can a thread’s stack not be handled as a regular DSM region? Let us see an example: • thread A migrates from host 1 to host 2 • the stack of thread A remains on host 1, since it is placed in DSM; therefore the first access to the stack will cause a page fault exception • DSM_handler should be called in order to bring the missing part of the stack • however, the stack is protected, so DSM_handler cannot be called in the regular way ...

  28. Implementation of TM (cont.) • The auxiliary stack approach: • this approach is based on the KMFC-EH technique • a memory region, called the auxiliary stacks region, is allocated at DSM initialization time on each host • page fault exceptions are intercepted by the DSM_filter (driver) at kernel level • when an exception occurs on a stack, the DSM_filter changes the stack location of the thread to a slot inside the auxiliary stacks region and calls DSM_handler • DSM_handler brings the page for the original stack, sets the appropriate protection, switches the stack back, and transfers control to the thread

  29. TM in the Millipede 4.0 DSM System • In sum, our TM mechanism has the following powerful features: • two TM approaches • kernel-level threads are migrated • SEH support • the FastMessages service is used to transfer migrating threads efficiently • thread suspension and resumption are location-independent and may be recursive • safety of all API functions provided by Millipede 4.0 is supported • a statistics tool

  30. Performance Evaluation

  31. Performance Evaluation (cont.) • The cost of Win32 calls used in TM (averaged over 1,000,000 instances of each call): • Performance of TM in Millipede 4.0 (averaged over 1,000,000 TMs with a stack size of 176B):

  32. Performance Evaluation (cont.) • Migration time on various systems as a function of stack size (sec):

  33. Load Balancing (LB) in DSM Systems Introduction: • Definition of load in DSM systems: • the CPU time that a computational thread consumes • the amount of communication that the thread causes during its work • Dynamic load sharing: • computes a less precise placement scheme for threads, but due to its relaxed requirements can often be as efficient as dynamic load balancing

  34. Introduction (cont.) [Diagram: threads numbered 1–15 distributed among the hosts before and after load sharing]

  35. Designing an LS Mechanism • The Goals of Load Sharing: • A uniform distribution of threads among the stations • Minimization of communication overheads • Improving the locality of accesses • Avoiding page ping-pong situations, in which a page is transferred frequently among several hosts

  36. Designing an LS Mechanism (cont.) • We propose a load sharing mechanism that works as a separate module, called the Load Sharing Module (LS-Module). • The LS-Module performs the following tasks: • load imbalance detection • load imbalance treatment • ping-pong detection • ping-pong treatment

  37. Designing an LS Mechanism (cont.) • The Load Imbalance Detection protocol has a centralized entity called the Load Sharing Server (LS-Server) that: • knows the power parameter of each host • is notified by an external module on each change in the load • for each change in the load, calculates two threshold values l and h for a host, thereby determining whether the host is normally loaded • begins the load imbalance treatment protocol when a load imbalance is detected
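The slide does not give the exact formulas for l and h; as one plausible reading, they form a band around each host's power-weighted share of the total load, sketched here with a hypothetical slack factor DELTA:

    #define DELTA 0.25   /* hypothetical slack around the fair share */

    typedef struct {
        double power;    /* relative power parameter of the host */
        double load;     /* current load reported for the host   */
    } host_t;

    /* Returns 1 if host i falls outside its [l, h] band. */
    int is_imbalanced(const host_t *hosts, int n, int i)
    {
        double total_load = 0.0, total_power = 0.0;
        for (int k = 0; k < n; k++) {
            total_load  += hosts[k].load;
            total_power += hosts[k].power;
        }
        /* Fair share of host i, scaled by its power parameter. */
        double share = total_load * hosts[i].power / total_power;
        double l = (1.0 - DELTA) * share;
        double h = (1.0 + DELTA) * share;
        return hosts[i].load < l || hosts[i].load > h;
    }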

  38. Designing an LS Mechanism (cont.) • The Load Imbalance Treatment protocol is performed by the LS-Server, which decides how many threads, say n, should be migrated from an overloaded host, say H1, to balance its load • An entity called the Load Sharing Client (LS-Client), which runs on each host, is responsible for selecting the n threads whose migration will best minimize future communication

  39. Designing an LS Mechanism (cont.) • The Ping-Pong Detection protocol is performed by the Ping-Pong Client (PP-Client) entity • Each time there is an access to a remote page, the PP-Client (one per host) is invoked • A ping-pong situation exists when the following two conditions are met: • local threads attempt to access a page a short time after it leaves the host • a page leaves the host a short time after it has arrived

  40. Designing an LS Mechanism (cont.) • The Ping-Pong Treatment protocol is performed by a centralized Ping-Pong Server (PP-Server) entity • The PP-Server determines which group of threads participates in a ping-pong, then chooses a destination host and migrates the threads to that host • If too many threads participate in a ping-pong, or a ping-pong is detected a short time after it has been resolved, the PP-Server decides to treat the ping-pong using Δ delays

  41. LS in the Millipede 4.0 DSM System • We have implemented the load sharing mechanism in the Millipede 4.0 DSM system • Millipede 4.0 architecture: • the Thread-Server module • the TM module • the LS module: one centralized LS-Server, LS-Clients (one per host), and PP-Clients (one per host)

  42. LS in the Millipede 4.0 DSM System (cont.) • Access History • In order to select threads for migration, we keep an access history for each thread • The access history contains at most one entry for each page that was referenced by the local threads in the last Tepoch time units • Obviously, the access history must be updated as time passes • The access history also keeps an old history, or prehistory, which summarizes the older access history of a thread
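One possible layout for such a history, following the description above and the structure figure on the next slide; field names and sizes are hypothetical:

    #define MAX_TIMES 8   /* recent access timestamps kept per page */

    typedef struct {
        void     *page;               /* page address, e.g. 0x0DCC */
        unsigned  n_times;
        unsigned  times[MAX_TIMES];   /* access times within the last Tepoch */
    } page_entry_t;

    typedef struct {                  /* one history per thread */
        page_entry_t *entries;        /* at most one entry per referenced page */
        int           n_entries;
        double        prehistory;     /* summary (e.g. a decayed access count)
                                         of accesses older than Tepoch */
    } access_history_t;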

  43. LS in the Millipede 4.0 DSM System (cont.) • Access History Structure [Diagram: e.g., Thread 0’s history holds an entry for page 0x0DCC with access times 0:12:00, 0:12:01, 0:12:13; Thread 7 holds an entry for page 0xACDC; each history ends with a prehistory summary]

  44. LS in the Millipede 4.0 DSM System (cont.) • Thread Selection Algorithm • A heuristic value h(j) is calculated for each thread j on the local host L; it takes into account the following characteristics: • maximal frequency of remote references to pages on the remote host R • minimal access frequency of the threads remaining on L to the pages used by the selected threads • minimal access frequency to local pages • maximal frequency of any remote references • Until enough threads are selected, the following procedure is performed (sketched below): • the thread j having the maximal value h(j) is chosen • the heuristic value of each thread i that has not yet been selected is revised, taking into account the migration of j
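A minimal sketch of this greedy loop; heuristic() returns the current value h(j) and revise() adjusts the stored values, both left abstract since the real h(j) combines the four access-frequency characteristics listed above:

    #include <string.h>

    double heuristic(int j);        /* current h(j): hypothetical */
    void   revise(int i, int j);    /* update h(i) assuming j migrates */

    /* Selects n of n_threads threads for migration; marks chosen[j] = 1. */
    void select_threads(int n_threads, int n, char *chosen)
    {
        memset(chosen, 0, n_threads);
        for (int picked = 0; picked < n; picked++) {
            int best = -1;
            for (int j = 0; j < n_threads; j++)
                if (!chosen[j] && (best < 0 || heuristic(j) > heuristic(best)))
                    best = j;            /* thread with the maximal h(j) */
            chosen[best] = 1;
            for (int i = 0; i < n_threads; i++)
                if (!chosen[i])
                    revise(i, best);     /* account for best's migration */
        }
    }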

  45. LS in the Millipede 4.0 DSM System (cont.) • Ping-Pong Detection [Timeline: page P is sent away to Hi; a local thread then accesses P and waits Twaiting until P is brought back from Hj; P is then in use for Tuseful, sits unused for Tunused, and is finally sent to Hk] • The page ping-pong condition is: PPRatio = (Tunused + Tuseful) / Twaiting < S (S is called the sensitivity of the ping-pong detection)
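The detection test follows directly from the formula; a small sketch (times in arbitrary units, S tunable):

    typedef struct {
        double t_waiting;   /* time spent waiting for the page to arrive */
        double t_useful;    /* time the page was actively used           */
        double t_unused;    /* time the page was present but not used    */
    } page_stats_t;

    /* Returns 1 when the page is ping-ponging: it barely stays on the host
       relative to the time threads wait for it. */
    int is_ping_pong(const page_stats_t *p, double S /* sensitivity */)
    {
        double pp_ratio = (p->t_unused + p->t_useful) / p->t_waiting;
        return pp_ratio < S;
    }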

  46. LS in the Millipede 4.0 DSM System (cont.) • Dynamic calculation of Δ for page P: Δ_P = (α · Nth_pp) / f(Nth) • The value of Δ depends on the number of threads that are using the page and on their behavior: • α is a constant • Nth_pp is the number of threads involved in the ping-pong that reside on the local host • Nth is the total number of threads residing on the local host • f(Nth) is a function of that number

  47. Performance Evaluation • We have tested the LS module on several benchmarks that are common in DSM systems, as well as on synthetic microbenchmarks specially designed for this purpose • We refer to the version of Millipede 4.0 with the LS module as the LS version and to the version without it as the no-LS version

  48. Performance Evaluation (cont.) • Microbenchmark applications were designed to simulate various load imbalance situations • Using the microbenchmark applications, we measured the individual performance of each part of the load sharing protocol: • load imbalance treatment • ping-pong treatment: • the locality optimization part • the stabilization part

  49. Performance Evaluation (cont.) • Locality optimization protocol
