Maximizing Desktop Application Performance on Dual-Core PC Platforms

Richard A. Brunner

AMD Fellow

Advanced Micro Devices

Session Outline
  • Thread-Level Parallelism
    • Introduction
    • Processor techniques for Thread-Level Parallelism
  • AMD Dual-Core Technology
    • Silicon Basics
    • How Software detects AMD Dual-Core Technology
  • Multi-Threaded Programming
    • Programming Observations & Amdahl’s Law
    • Relevant Microsoft Windows® API
    • Licensing
    • Demo
It’s About Threads (1) …
  • To understand why AMD Dual-Core technology matters, we have to review some notions about a “thread”
  • Thread = one sequential set of program steps
    • A typical program has just 1 sequential set of steps => so a typical program is “single-threaded”, e.g.:
    • Thread 0: for (i=0; i<3*N; ++i){a[i]=b[i]*c[i];}
  • Multi-threading = rewrite the program into a set of independent steps that can execute in parallel, e.g.:
    • Thread 0: for (i=0; i<N; ++i){a[i]=b[i]*c[i];}
    • Thread 1: for (i=N; i<2*N; ++i){a[i]=b[i]*c[i];}
    • Thread 2: for (i=2*N; i<3*N; ++i){a[i]=b[i]*c[i];}
  • A Process (main program) can have 1 or more threads:
    • Each thread is independent set of program steps
    • Has private program counter, stack, and storage
    • Shares the process address space & attributes with other threads
It’s About Threads (2) …

[Figure: MULTI-TASKING (Time-slicing) on Single-CPU: thread A's software code and thread B's software code alternate on one CPU over time; each thread is Not Executing while the other Executes]

  • When an operating system time-slices between programs, it’s actually time-slicing between threads of many programs at once
    • It doesn’t really matter whether the time-sliced threads are from the same program or not
Improving Thread Performance (1)

[Figure: AMD 64-bit processor core with L1 instruction and data caches, L2 cache, DDR memory controller, and HyperTransport™; performance levers tried include frequency, pipeline length, cache size and hierarchies, more execution units, memory bandwidth, and I/O bandwidth]

  • To improve performance of programs, improve performance of threads
    • Lots of tricks have been tried …
    • Each trick eventually hit a brick wall due to a combination of physics, pricing, and perplexity
    • Especially want to improve performance on single-processor systems
  • What’s next? Run more threads in parallel …
Improving Thread Performance (2)
  • Improve performance of the system and application by running more threads in parallel
    • i.e., increase “Thread-Level Parallelism” (TLP)
    • Across a system, this improves throughput (number of jobs per unit time)
    • Across an application, this reduces application time
  • Improving TLP requires Software co-operation:
    • Windows already has lots of threads to run across programs to use these hardware tricks
    • To benefit even more from TLP, applications need to be re-written to be multi-threaded
    • We’ll review the issues, later …
Improving Thread Performance (3)
  • Improving TLP requires CPU (hardware) innovation. Some Techniques we’ll review:
    • SMP: Symmetric Multi-Processing
    • SMT: Simultaneous Multi-Threading (Hyperthreading)
    • CMP: Core Multi-Processing (Dual and Multi Core)
  • AMD is introducing Dual-Core technology as a hardware method to improve TLP
  • Allows AMD to:
    • Keep offering logical, evolutionary performance improvements
    • Meet system architecture demands of our customers.
Symmetric Multi-Processing

[Figure: Single-Core Multi-Processor: thread A's software code executes on Processor 0 while thread B's software code executes on Processor 1]

  • Run threads on traditional, single-core multiprocessor system
  • Each thread can use full CPU resources when executing.
  • Well-known technique; has been used for decades.
  • Well supported by Windows
  • Example: 2-way AMD Opteron™ processor
  • Great for Servers and Workstations.
Simultaneous Multi-Threading

[Figure: one Physical Processor partitioned into Logical CPU 0 and Logical CPU 1, each with dedicated software registers, half of the L1 I-cache, and its own ITLB; the execution units, D-cache, DTLB, and L2 cache are shared]

  • Divides hardware resources of one physical processor core into “N” number of mostly independent logical processor “cores”
  • Operating system views logical cores as “real” cores; schedules threads on each
    • Uses a slightly-modified SMP model for OS scheduling
    • So N threads can run in parallel
  • A “logical” core is implemented as a combination of dedicated and shared “real” hardware in a physical core.
    • Tries to capitalize on the fact that for some applications, a single thread under-utilizes available processor resources.
    • Under-utilization is made worse by long memory latency
    • Non-optimal for many real-world applications
Intel HyperThreading

[Figure: same partitioning as the previous slide: two logical CPUs with dedicated software registers, half the L1 I-cache, and private ITLBs, sharing the execution units, D-cache, DTLB, and L2 cache]

  • Intel HyperThreading on Pentium® 4 is SMT: One Physical Processor core partitioned into two logical cores
  • Each logical core has its own copy of software-visible registers & Instruction TLBs. Each logical CPU gets half of L1 I-Cache
  • Logical CPUs share DCache, L2, DTLBs, integer & FP units
  • Sharing can lead to cache-thrashing between threads
  • Sharing can lead to resource contention for arithmetic units
Core Multi-Processing

[Figure: a Single Processor with two cores: thread A's software code executes on Core0 while thread B's software code executes on Core1]

  • Put multiple “real” cores in one processor package
    • Each thread can use full CPU resources when executing
  • Operating System schedules threads on each like SMP
    • Uses a slightly-modified SMP model for OS scheduling
  • Great For Desktop!
    • Brings the benefits of SMP at the cost and form-factor of uni-processor desktop
  • Great for Multiprocessor Servers and Workstations
    • Brings the benefit of an N-way system at the cost and form factor of an (N/2)-way system
  • Example: AMD Dual-Core processor
Introducing AMD64 Dual Core Processor

[Figure: die layout: Core 0 and Core 1, each with a 1-MB L2 cache, sharing the Northbridge]

  • Two AMD Opteron™ CPU cores on one single die, each with 1MB L2 cache
  • 90nm, ~205 million transistors*
    • Approximately same die size as 130nm single-core AMD Opteron processor*
  • 95 watt power envelope fits into 90nm power infrastructure
  • 940 Socket compatible
    • AMD expects to be first to introduce dual-core for the one- to eight-processor server and workstation market in mid-2005
  • Dual-core processors for client market are expected to follow

*Based on current revisions of the design

Designed From The Start To Add Second Core
  • Shared Northbridge
    • 3 HyperTransport™ technology links
    • Dual-channel (128 bit) DDR i/f
  • AMD Opteron CPU with Direct Connect Architecture was designed as CMP from the start
    • 2nd port on SRI, request management, 2 APICs
  • Two complete CPU cores
    • SMP model
    • Simpler, less restrictive programming model than ‘logical core’ approach
    • No need to “pause” one core to give the other exclusive use of shared resources

[Figure: existing AMD64 processor design: Core 0 and Core 1, each with a 1MB L2 cache, connect through the SRI and crossbar (X-bar) to the DDR1 DRAM interface and HyperTransport™ links 0, 1, 2]

Processor versus Core

Processor: the physical packaged die that plugs into a socket on the motherboard; contains 1 or more cores.

Core: 1 complete private set of registers, execution units, and retirement queues needed to execute x86 programs; managed and scheduled as a single x86 processing resource by the OS.

  • CPU Numbering scheme uses LSBs of Initial APIC ID to distinguish cores in one processor package.
    • High-order bits distinguish packages
    • Initial APIC ID provided by CPUID (eax=1)
  • Example: 2-Processors/4-Core system means 2 processors populate 2 sockets with 2 cores per processor as above

[Figure: 2-processor/4-core example: one package contains CORE 000 and CORE 001, the other contains CORE 010 and CORE 011]

AMD Direct Connect Architecture + Dual-Core

[Figure: four-socket system of dual-core AMD Opteron™ 800-series processors connected by 16x16 coherent HyperTransport™ links; DDR1 memory, PCI-E bridges, and the southbridge attach directly to the processors; each processor contains CORE 0 and CORE 1]
  • AMD Direct Connect Architecture
    • Everything connected directly to processor
    • Reduces system architecture bottlenecks
    • Further reduces latency by directly connecting two cores on same die
  • Demo of AMD Opteron™ dual-core processor-based systems on August 31, 2004
    • World’s first demonstration of x86-class dual-core processor
    • 4 processor/8 core systems running Windows®


Traditional FSB System Architecture

[Figure: traditional front-side-bus server: the server processor connects through a northbridge to DDR memory, a PCI-X bridge, and a southbridge (IDE, FDC, USB, PCI). Annotations: memory access is delayed by passing through the northbridge; I/O and memory compete for the CPU's FSB bandwidth; bandwidth bottlenecks where link B/W < I/O device B/W; more chips needed for a basic server]

AMD64 Processor with Direct Connect Architecture

[Figure: AMD Opteron™ processor with Direct Connect Architecture: DDR attaches directly to the processor, with HyperTransport™ technology for glueless I/O or CPU expansion to a PCI-X/PCIe bridge and the AMD-8111™ I/O hub (IDE, FDC, USB, PCI). Annotations: the HyperTransport bus has ample bandwidth for I/O devices; separate memory and I/O paths eliminate most bus contention; fewer chips needed for a basic server]

Traditional FSB System Architecture
  • System scalability limited by Northbridge
  • Max of 4 processors
  • Processors compete for FSB bandwidth
  • Memory size and bandwidth are limited
  • Max of 3 PCI-X bridges
  • Many more chips required

[Figure: four-processor FSB system: the processors share one northbridge, which fans out to DDR memory expanders, up to three PCI-X bridges, and a southbridge (IDE, FDC, USB, PCI)]

800-Series AMD Opteron™ Processor-based Server

[Figure: four-socket 800-series AMD Opteron™ server: four processors fully connected by coherent HyperTransport links [1], each with a 144-bit registered DDR interface; HyperTransport links [2], [3], [4] attach AMD-8131™ PCI-X bridges (64-bit @ 133MHz and 66MHz slots, Gbit Ethernet, SCSI) and the AMD-8111™ hub (PCI-33, VGA, LPC, flash, USB, Ethernet, AC97, IDE, BMC, SIO)]

  • Idle Latencies to First Data*
    • 1P System: <59ns
    • 0-Hop in 4P System: ~85ns
    • 1-Hop in 4P System: <95ns
    • 2-Hop in 4P System: <127ns

[1] = 16x16 Coherent HyperTransport™ @ 2000MT/s
[2] = 16x16 HyperTransport @ 2000MT/s
[3] = 8x8 HyperTransport @ 400MT/s
[4] = 8x8 HyperTransport @ 1600MT/s

*2.8GHz CPU, 200MHz PC3200 DRAM (closed page), 1000MHz HT

SSE3 Support

  • AMD dual-core processors are designed to support SSE3
    • SSE3 instructions are reported by the CPUID.SSE3 feature flag
    • 10 new SSE instructions and 1 new x87 instruction (13 total opcodes)
    • No Monitor or Mwait for Hyperthreading, which have a separate CPUID flag anyway
  • ADDSUB[PD,PS] xmm1, xmm2/m128: interleaved packed add and subtract
  • FISTTP m16int/m32int/m64int: like FISTP but with forced truncation
  • HADD[PD,PS] xmm1, xmm2/m128: horizontal adds
  • HSUB[PD,PS] xmm1, xmm2/m128: horizontal subtracts
  • LDDQU xmm, m128: special 128-bit unaligned load
  • MOV[DD,SHD,SLD]UP xmm1, xmm2/m64: move and duplicate some elements

How Can Software Detect AMD Dual-Core?
  • Same steps as detecting SMP or SMT on x86/AMD64
    • OS Kernel uses information from BIOS or reads special hardware registers to get number of CPUs (cores) in system
    • Each core has unique APIC ID assigned by BIOS
    • BIOS records CPU info in ACPI-MADT and MPS tables
    • BIOS records MP topology info in ACPI-SRAT and ACPI-SLIT
  • OS and App code also need to determine the number of physical (or logical) cores per processor
    • Information is key for efficient thread scheduling and memory allocation
  • Existing (legacy) software only expects logical cores. It uses the x86 “CPUID” instruction to get that number
    • So, AMD reports physical cores as logical cores for this form of CPUID. Lets legacy software exploit physical cores w/o change.
Legacy CMP/CPUID Support
  • Legacy software uses CPUID (eax=1) to get number of logical cores. AMD’s CPUID reports physical cores in same way:
    • CPUID.HTT=1 (edx[28])
    • CPUID.logical_number_of_processors = 2 (ebx[23:16])
    • Legacy software support for 2-logical cores, while more restrictive, appears to work equally well for 2-physical cores
    • Hyperthreading scheduling rules work fine for multi-core
    • AMD has tested this model heavily with legacy software and expects no major problems
    • Migrating from hyperthreading rules to less restrictive multi-core rules becomes an optimization, not a requirement
  • New extended CPUID Feature bit, LEGACY_CMP, tells new software if the HTT fields above report Hyperthreading
    • LEGACY_CMP will be ‘1’ on AMD dual core indicating no HTT support
Operating System Support for Dual-Core
  • Windows XP Home, Windows XP Pro, Windows 2003 (32-bit and 64-bit) support AMD Dual-Core using CMP Legacy Mode
    • First AMD dual-core silicon using this model booted Windows® within hours
    • Recent OS distributions that support Hyperthreading are expected to work well
  • New extended CPUID function (eax=8000_0008) returns on any core the number of physical cores per processor
    • Correct way for future OS and application software
To Thread or Not To Thread?
  • Good Dual-Core processor technology needs good software to exploit it to the fullest
  • Modern OS software already understands SMP and will run more programs in parallel on Dual-Core
    • Most desktop OS software derives from SMP-capable server OS software.
    • Leads to higher throughput across multiple programs on the desktop
  • SMP (multi-threading) programming model is well understood in server/workstation markets
    • Lots of “embarrassingly parallel” problems and plenty of programmer experience
    • So Dual-Core will be exploited naturally here
Multi-threading Challenges for Desktop (1)
  • SMP/multi-threading programming models now become relevant to consumer desktop software
    • Higher performance for a single-program that “decomposes”
    • But SMP is a new realm for desktop software
  • Desktop apps were once not suited to multi-threading; now threading-friendly environments are proliferating on the desktop:
    • Much of the complexity of multi-threading is hidden by Microsoft CLR, C#, VB.NET, and Java
Multi-threading Challenges for Desktop (2)
  • Desktop benefits from multi-cores by being able to run more applications/tasks in parallel
  • A single desktop application can also use multi-threading on a multi-core to do:
    • Multimedia CODECs
    • Games (through double-buffering strategies)
    • Productivity apps (background threads do complex processing while waiting for user input)
    • Speech and handwriting recognition
    • Prosumer digital content creation apps
    • Anti-virus software
    • GUI to give appearance of responsiveness
Multi-threading Challenges for Desktop (3)
  • The inhibitor: traditional desktop apps are complex, non-threaded legacy code; code needs to be totally re-written
    • Desktop Software Developers have little experience yet writing multi-threaded code
  • Problem domain for scientific applications is often regular:
    • Compilers and tools can often find cases to generate parallel threads automatically
    • These tools also support a rich set of directives for the programmer to explicitly create threads in a language-specific way
Multi-threading Challenges for Desktop (4)
  • Desktop developers often have to manually decompose their complex application into threads
    • The mechanics of threading is being made easier by “new” language environments
    • The decomposition analysis is the hard part
  • Traditional “Data-Parallelism” approach doesn’t fit desktop apps as well
  • Task-Level Parallelism likely will fit better
    • E.g., in a game, have threads for physics, audio, graphics, and strategy
Amdahl’s Law (1)

[Figure: the Original program runs serial Code0 (S) then Code1; the Threaded version runs Code0 (S) then splits Code1 into parallel pieces Code1p0 and Code1p1]

Speed Up = ( S + (1-S)/P )^-1

  • Decomposing non-threaded program produces “serial” piece & some “parallel” pieces
    • Parallel pieces can be threaded
  • Use Amdahl’s Law to estimate benefit of threading
    • Let “S” be percentage of execution time of serial parts of serial-version of program
    • Let “P” be number of threads issued on same number of CPU cores
  • Example:
    • If a program spends 25% of its execution time in the serial portion of the serial version, then the speedup on a dual-core could be 1.6x
Amdahl’s Law (2)

[Figure: same decomposition and formula as the previous slide: Speed Up = ( S + (1-S)/P )^-1]

  • Example:
    • If a program spends 75% of its execution time in the serial portion of the serial version, then the speedup on a dual-core could be 1.14x
  • Know Thy Program …
  • Amdahl’s law assumes the overall structure of the program doesn’t change going from serial to serial+parallel
    • If structure of serial+parallel is drastically different (and more efficient) you may do better than Amdahl’s law
General Programming Observations (1)
  • Experiment, but for a K-core system schedule N threads, where K-2 <= N <= K+2
    • Use OS APIs (or CPUID functions) to test for number of cores
  • Using CPUID Hyperthreading fields to determine presence and number of cores works well
    • Works well for logical and physical cores
    • AMD follows generally accepted CPUID standard
    • Therefore, no need to test for “CPU vendor” before using CPUID Hyperthreading fields
  • Avoid heavy threading on single-core system
    • Threading has some small OS/Program overhead
    • May lead to threaded program running slower than serial version
    • Application should test at install time or run-time to determine if the number of available processors allow threading
General Programming Observations (2)
  • Don’t bother trying to do explicit binding of threads to cores; Windows does a fine job automatically
    • If you are in High-Performance-Computing, feel free to ignore the above advice …
  • Windows memory allocation mechanism tries to optimize memory affinity even in large NUMA multi-core systems
VirtualAlloc( Address, Size, AllocationType, Protect )
  • Standard Win32 API to allocate process virtual memory
    • Reserves (Allocates) or Reserves-and-Commits Virtual Memory
    • Does the right thing for NUMA and multi-core on Microsoft Windows 2003 SP1 despite the MSDN documentation …
  • Per MSDN: AllocationType = MEM_RESERVE
    • Just inserts a node into process’ VirtualAddressDescriptor tree
    • Does not map the virtual page to any physical page; this requires process to follow up with a commit later to the same region
  • Per MSDN: AllocationType = MEM_COMMIT
    • According to MSDN, reserves as above and commits
    • Commitment implies mapping Virtual Pages to zeroed-out physical pages
VirtualAlloc( … ) Continued
  • What really happens for AllocationType = MEM_COMMIT
    • Just inserts a node (entry) into the process VirtualAddressDescriptor tree
    • Commits page(s) only in that no explicit MEM_COMMIT is required afterwards
    • Later when (if) the thread accesses the page(s), page fault handler finds the node (entry) in VAD tree and tries to allocate physical memory
    • Attempts to allocate physical memory from the NUMA node the thread is currently on at the time of the page fault
    • If not available on that NUMA node, grabs it from another
  • Means a parent can allocate memory for child threads and still maintain desired thread-to-memory affinity
  • Affinitizing your threads increases chances that at page fault time, thread will be on the desired processor
Global Heap
  • The global heap operates under the same rules (calls VirtualAlloc too)
    • The only difference may be when they access the virtual addresses (preload, or wait until later)
  • Avoid Global Heap by using Process-Private heaps: standard recommendation for threading
    • HeapAlloc() and friends also use VirtualAlloc()
  • Use Local, Low Fragmentation Heap
    • Global/Local Heap Algorithms are too complex to predict how various heap memory is affinitized for default Heap
Low Fragmentation Heap (LFH)
  • Added to Windows XP and Windows Server 2003
  • An application can use LFH for private heaps
    • Alleviates a lot of global heap unknown affinity issues on a NUMA Multi-core system
    • Use HeapAlloc() followed by HeapSetInformation() to spec LFH
    • LFH Allocations are grouped in buckets and are local to a virtual slot. Assignment of one thread to a particular slot is done automatically when contention is detected.
  • Note that the C runtime heap is LFH by default on 64-bit platforms starting with Windows 2003 SP1
    • malloc and free should show significant improvement on 64-bit NUMA, multi-core systems
  • LFH has to be explicitly set for 32-bit Windows
CodeAnalyst Thread Analysis
  • Identifies threads in the target application
  • Shows thread creation and termination
  • Monitors CPU affinity of each thread
  • Identifies non-local memory access
  • Graphs thread activity on each CPU
Miscellaneous Windows NUMA APIs
  • GetNumaHighestNodeNumber() = retrieves highest numbered node in system, but doesn’t guarantee that nodes are sequentially numbered.
  • GetProcessAffinityMask() = retrieves the list of processors on the system
  • GetNumaProcessorNode() = returns NUMA node for specified processor.
  • GetNumaNodeProcessorMask() = retrieves list of all processors for a node.
  • SetProcessAffinityMask() = sets affinity for all threads in process.
  • SetThreadAffinityMask() = sets affinity for individual thread.
  • GetNumaAvailableMemoryNode() = retrieves the amount of memory available to a node.
Dual-Core & Software Licensing Trends

  • Multi-core processors: do you count cores?
  • Virtualization: license by virtual machine?
  • Hosted computing: calculate time and resources?

The software industry is facing a shift in pricing & licensing models that rely on a traditional view of hardware technology.

Customers are demanding software licensing to reflect the amount of work done, not the characteristics of the processor.

AMD recommends that ISVs that license by processor continue to do so, rather than switching to licensing by processor core.

  • This helps ensure seamless software compatibility with existing x86 and AMD64 operating systems and apps -- whether they are single-threaded or multi-threaded.
Software Licensing Trends
  • Recently, Microsoft has clarified its software licensing position with respect to processor and cores.
    • Microsoft server software that is currently licensed by the number of processors on the server will continue to be licensed in that model for server hardware that contains dual-core and multi-core processors.
  • This policy helps ensure that customers will not incur additional software licensing requirements or fees when they choose to adopt multi-core processor technology.

AMD applauds this decision which will help make this new enterprise computing technology affordable to customers, including mid-size and small businesses.

Some References for Multi-Threading
  • Programming Windows, Fifth Edition , by Charles Petzold
    • Microsoft Press, ISBN: 157231995X
  • Multithreading Applications in Win32: The Complete Guide to Threads, by Beveridge & Wiener,
    • Addison-Wesley Professional; ISBN: 0201442345
  • Multithreaded Programming with Win32, by Pham & Garg
    • Prentice Hall PTR; ISBN: 0130109126
AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Pentium is a registered trademark of Intel Corporation in the U.S. and/or other jurisdictions. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other licensed names are for informational purposes only and may be trademarks of their respective owners.