Constructing and Characterizing Covert Channels on GPGPUs

Hoda Naghibijouybari, Khaled N. Khasawneh, and Nael Abu-Ghazaleh



Presentation Transcript


  1. Constructing and Characterizing Covert Channels on GPGPUs Hoda Naghibijouybari, Khaled N. Khasawneh, and Nael Abu-Ghazaleh

  2. Covert Channel • Malicious indirect communication of sensitive data. • Why? There is no legitimate communication channel, or the communication channel is monitored. • A covert channel is undetectable by monitoring systems on conventional communication channels. (Figure: a Trojan, e.g. a gallery app, transmits to a Spy, e.g. a weather app, over a covert channel.)

  3. Covert channels are a substantial threat on GPGPUs • Trends toward improved multiprogramming on GPGPUs. • GPU-accelerated computing is available on major cloud platforms. • No protection is offered by an operating system. • High quality (low noise) and high bandwidth.

  4. Overview • Threat: Using GPGPUs for Covert Channels. • To demonstrate the threat: We construct error-free and high bandwidth covert channels on GPGPUs. • Reverse engineer scheduling at different levels on GPU • Exploit scheduling to force colocation of two applications • Create contention on shared resources • Remove noise • Key Results: Error-free covert channels with bandwidth of over 4 Mbps.

  5. GPU Architecture • Intra-SM Channels: L1 constant cache, functional units and warp schedulers • Inter-SM Channels: L2 constant cache, global memory

  6. Attack Flow

  7. Colocation (Reverse Engineering the Scheduling) • Step 1: Thread block scheduling to the SMs. (Figure: thread blocks TB0…TBn of Kernel 1 and Kernel 2 are assigned by the thread block scheduler to SM0…SMm under the leftover policy; the SMs connect through the interconnection network to the L2 cache and memory channels.)
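The leftover policy on this slide can be illustrated with a minimal Python model. This is an illustrative sketch, not the authors' code: the SM count, the per-SM block capacity `cap`, and the round-robin placement order are all simplifying assumptions. The point it demonstrates is the one on the slide: a later kernel's blocks can only occupy the slots earlier kernels left unused, so two kernels sized to half-fill the SMs end up co-located on every SM.

```python
# Minimal model of "leftover" thread-block scheduling (illustrative only;
# cap, SM count, and round-robin order are assumptions, not measured values).
def schedule(num_sms, cap, kernel_blocks):
    """kernel_blocks: list of (kernel_id, num_blocks).
    Returns a list mapping each SM to its resident (kernel, block) pairs."""
    sms = [[] for _ in range(num_sms)]
    for kid, nblocks in kernel_blocks:
        sm = 0
        for b in range(nblocks):
            placed = False
            # Leftover policy: a later kernel only gets slots the earlier
            # kernel left unused on each SM.
            for _ in range(num_sms):
                if len(sms[sm]) < cap:
                    sms[sm].append((kid, b))
                    placed = True
                    sm = (sm + 1) % num_sms
                    break
                sm = (sm + 1) % num_sms
            if not placed:
                return sms  # GPU full; remaining blocks wait
    return sms

# Trojan launches 4 blocks on a 4-SM GPU (capacity 2 blocks per SM), then
# the spy launches 4 blocks: every SM hosts one block of each kernel.
sms = schedule(num_sms=4, cap=2, kernel_blocks=[("trojan", 4), ("spy", 4)])
colocated = all({k for k, _ in sm} == {"trojan", "spy"} for sm in sms)
```

Under this toy model `colocated` is `True`: the spy can deliberately force co-residency with the trojan on every SM, which is what the intra-SM channels in the following slides rely on.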

  8. Step 2: Warp to warp scheduler mapping. (Figure: inside SMk, warps W0…Wk of thread blocks TBi and TBj are mapped to warp schedulers; each warp scheduler has a dispatch unit issuing into the register file and the SP, DP, SFU, and load/store units, backed by the shared memory / L1 cache.)

  9. Attack Flow

  10. Cache Channel (intra-SM and inter-SM) • Extract the cache parameters (cache size, number of sets, number of ways, and line size) using a latency plot. • Communicate through one constant cache set x: to send “1”, the Trojan accesses its data array (TD), evicting the Spy’s data array (SD) from the set, so the Spy observes cache misses and higher latency; to send “0”, the Trojan makes no access, so the Spy observes low-latency cache hits.
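The one-set cache channel on this slide can be sketched as a small CPU-side simulation (not GPU code): the spy primes the set with its own lines, the trojan evicts them to send “1” or stays idle to send “0”, and the spy's probe hit/miss count decodes the bit. The associativity `WAYS`, the LRU policy, and the addresses are illustrative assumptions.

```python
# Illustrative simulation of the one-cache-set covert channel (assumed
# 4-way set with LRU replacement; addresses are arbitrary stand-ins).
from collections import OrderedDict

WAYS = 4  # assumed associativity of the targeted cache set

class CacheSet:
    def __init__(self, ways=WAYS):
        self.ways, self.lines = ways, OrderedDict()

    def access(self, addr):
        """Return True on a hit, False on a miss; LRU replacement."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)  # evict the LRU line
            self.lines[addr] = True
        return hit

def send_bit(cache, bit):
    if bit == 1:  # trojan's data array TD maps to the same set
        for addr in range(100, 100 + WAYS):
            cache.access(addr)
    # bit == 0: the trojan makes no access at all

def receive_bit(cache):
    # Spy probes its data array SD; misses mean the trojan evicted it.
    misses = sum(not cache.access(a) for a in range(WAYS))
    return 1 if misses > WAYS // 2 else 0

def transmit(bits):
    cache, out = CacheSet(), []
    for a in range(WAYS):  # prime: fill set x with the spy's lines
        cache.access(a)
    for bit in bits:
        send_bit(cache, bit)
        out.append(receive_bit(cache))  # the probe also re-primes the set
    return out
```

In this model `transmit([1, 0, 1, 1, 0])` returns the same bit string, mirroring the error-free behavior reported in the talk; on real hardware the hit/miss distinction comes from measured access latency rather than a boolean.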

  11. Synchronization: L1 constant cache. (Figure: the Trojan and Spy handshake through Wait(ReadyToSend) and Wait(ReadyToReceive) before each round; threads 0–5 transfer bits in parallel, so the Spy receives 6 bits, e.g. …011001, per round.)
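The ReadyToSend/ReadyToReceive handshake on this slide can be sketched with CPU threads standing in for the GPU kernels (an illustrative analogy only; on the GPU the flags are themselves communicated through the contended cache). The trojan waits until the spy is ready to receive before encoding, and the spy waits until the trojan has sent before decoding, so neither side samples the channel mid-update.

```python
# Toy handshake: threading.Event flags stand in for the ReadyToSend /
# ReadyToReceive signals; `channel` stands in for the contended cache sets.
import threading

ready_to_send = threading.Event()     # set by trojan: bits are encoded
ready_to_receive = threading.Event()  # set by spy: previous bits consumed
channel = []
received = []

def trojan(chunks):
    for chunk in chunks:              # e.g. 6 bits per round (threads 0-5)
        ready_to_receive.wait()
        ready_to_receive.clear()
        channel[:] = chunk            # "encode" the bits
        ready_to_send.set()

def spy(rounds):
    ready_to_receive.set()            # spy starts out ready
    for _ in range(rounds):
        ready_to_send.wait()
        ready_to_send.clear()
        received.append(list(channel))  # "decode" the bits
        ready_to_receive.set()

message = [[0, 1, 1, 0, 0, 1], [1, 0, 1, 0, 1, 0]]
t = threading.Thread(target=trojan, args=(message,))
s = threading.Thread(target=spy, args=(len(message),))
t.start(); s.start(); t.join(); s.join()
```

After both threads finish, `received` equals `message`: the strict alternation of the two flags is what removes synchronization noise from the channel.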

  12. Synchronization and Parallelization. (Figure: the channel is parallelized across SM 0, SM 1, …, SM n of the GPU.)

  13. SFU and Warp Scheduler Channel (intra-SM) • The number of operations issued in each cycle is limited by: the type and number of functional units, and the issue bandwidth of the warp schedulers. • Contention is isolated to warps assigned to the same warp scheduler. (Figure: Kepler SM.)

  14. SFU and Warp Scheduler Channel (intra-SM) • Base channel: the Trojan issues operations to the target functional unit to create contention to send “1”, and issues no operations to send “0”; the Spy issues operations to the same functional unit and measures the time: low latency means “0”, high latency means “1”. • Improved bandwidth: communicate different bits through warps assigned to different warp schedulers, exploiting parallelism at the SM level and at the warp scheduler level.
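The base SFU channel reduces to thresholding a timing measurement, which can be sketched in plain Python (the latency values and jitter range below are made-up numbers, not measurements; on the GPU the spy times special-function instructions with the clock register).

```python
# Sketch of the contention-based timing channel: the trojan's SFU activity
# inflates the spy's measured latency, and a midpoint threshold decodes it.
import random

random.seed(1)  # deterministic "measurement noise" for the sketch

BASE_LATENCY = 20        # cycles for an uncontended op (assumed value)
CONTENTION_PENALTY = 25  # extra cycles under trojan contention (assumed)

def spy_measure(trojan_bit):
    """Time the spy's ops; contention from the trojan encodes a '1'."""
    jitter = random.randint(-3, 3)  # other activity on the SM
    return BASE_LATENCY + (CONTENTION_PENALTY if trojan_bit else 0) + jitter

# Threshold halfway between the two noiseless latency levels.
THRESHOLD = BASE_LATENCY + CONTENTION_PENALTY / 2

def decode(latencies):
    return [1 if t > THRESHOLD else 0 for t in latencies]

observed = [spy_measure(b) for b in [1, 0, 0, 1, 1, 0]]
bits = decode(observed)
```

Because the contention penalty is much larger than the jitter, every bit decodes correctly; the slide's bandwidth improvement comes from running many such channels in parallel, one per warp scheduler and per SM.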

  15. Attack Flow

  16. What about other concurrent applications co-located with the Spy and Trojan? (Figure: Rodinia workloads such as Back Propagation, Kmeans, Heart Wall, and K-Nearest Neighbor sharing the GPU’s SMs with the Spy and Trojan.)

  17. Exclusive Colocation of Spy and Trojan • Exploit concurrency limitations of the GPU hardware (leftover policy): shared memory, registers, and number of thread blocks. • The Spy and Trojan thread blocks TB0…TBn together exhaust these per-SM resources, so no resources are left for other kernels. • This prevented interference from Rodinia benchmark workloads (Back Propagation, Kmeans, Heart Wall, K-Nearest Neighbor, …) on the covert communication and achieved error-free communication in all cases.
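The resource-exhaustion idea on this slide amounts to an occupancy check, which can be sketched as follows. The per-SM limits and per-block requests below are hypothetical round numbers chosen for illustration, not the values measured in the talk.

```python
# Back-of-the-envelope occupancy check: the spy and trojan size their
# blocks so a third kernel's block cannot fit on the SM (assumed limits).
SM = {"shared_mem": 48 * 1024, "registers": 65536, "max_blocks": 16}

def fits(used, block):
    """Can one more `block` be scheduled on an SM with `used` resources?"""
    return (used["shared_mem"] + block["shared_mem"] <= SM["shared_mem"]
            and used["registers"] + block["registers"] <= SM["registers"]
            and used["blocks"] + 1 <= SM["max_blocks"])

def occupy(blocks):
    used = {"shared_mem": 0, "registers": 0, "blocks": 0}
    for b in blocks:
        if not fits(used, b):
            raise ValueError("spy/trojan blocks themselves do not fit")
        used["shared_mem"] += b["shared_mem"]
        used["registers"] += b["registers"]
        used["blocks"] += 1
    return used

# Spy and trojan each request half the SM's shared memory:
spy = {"shared_mem": 24 * 1024, "registers": 8192}
trojan = {"shared_mem": 24 * 1024, "registers": 8192}
used = occupy([spy, trojan])

# A benchmark block now finds no shared memory left on the SM:
benchmark = {"shared_mem": 4 * 1024, "registers": 4096}
excluded = not fits(used, benchmark)
```

Here `excluded` is `True`: because the leftover policy only schedules a block where all of its resource requests fit, deliberately exhausting any one resource (shared memory in this sketch) keeps other workloads off the SM and off the channel.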

  18. Results: L1 Cache covert channel bandwidth on three generations of real NVIDIA GPUs. (Chart: bandwidth improvements of 1.7x, 3.8x, and 12.9x across configurations.) Error-free bandwidth of over 4 Mbps: the fastest known micro-architectural covert channel under realistic conditions.

  19. Results: SFU covert channel bandwidth on three generations of real NVIDIA GPUs. (Chart: bandwidth improvements of 3.5x and 13x across configurations.)

  20. Conclusion • GPUs’ improved multiprogramming support makes covert channels a substantial threat. • Colocation is achieved at different levels by leveraging thread block scheduling and the warp to warp scheduler mapping. • The GPU’s inherent parallelism and specific architectural features provide very high quality and high bandwidth channels: an error-free channel of over 4 Mbps.

  21. Thank You!
