
Memory Architectures for Protein Folding: MD on million PIM processors

Fort Lauderdale, May 03,


Overview

  • EIA-0081307: “ITR: Intelligent Memory Architectures and Algorithms to Crack the Protein Folding Problem”

  • PIs:

    • Josep Torrellas and Laxmikant Kale (University of Illinois)

    • Mark Tuckerman (New York University)

    • Michael Klein (University of Pennsylvania)

    • Also associated: Glenn Martyna (IBM)

  • Period: 8/00 - 7/03


Project Description

  • Multidisciplinary project in computer architecture and software, and computational biology

  • Goals:

    • Design improved algorithms to help solve the protein folding problem

    • Design the architecture and software of general-purpose parallel machines that speed up the solution of the problem


Some Recent Progress: Ideas

  • Developed REPSWA

    • (Reference Potential Spatial Warping Algorithm)

    • Novel algorithm for accelerating conformational sampling in molecular dynamics, a key element in protein folding

    • Based on a “spatial warping” variable transformation.

      • This transformation is designed to shrink barrier regions on the energy landscape and grow attractive basins without altering the equilibrium properties of the system

    • Result: large gains in sampling efficiency

    • Reference: Z. Zhu, M. E. Tuckerman, S. O. Samuelson and G. J. Martyna, “Using novel variable transformations to enhance conformational sampling in molecular dynamics,” Phys. Rev. Lett. 88, 100201 (2002).


Some Recent Progress: Tools

  • Developed LeanMD, a parallel molecular dynamics program that targets very large-scale parallel machines

    • Research-quality program based on the Charm++ parallel object-oriented language

    • Descended from NAMD (another parallel molecular dynamics application), which achieved unprecedented speedup on thousands of processors

    • LeanMD is intended to run on next-generation parallel machines with tens of thousands or even millions of processors, such as Blue Gene/L or Blue Gene/C

    • Requires a new parallelization strategy that breaks up the simulation problem in a finer-grained manner, generating enough parallelism to distribute work effectively across a million processors.


Some Recent Progress: Tools

  • Developed a high-performance communication library

    • For collective communication operations

      • AlltoAll personalized communication, AlltoAll multicast, and AllReduce

      • These operations can be complex and time consuming in large parallel machines

      • Especially costly for applications that involve all-to-all patterns

        • such as 3-D FFT and sorting

    • Library optimizes collective communication operations

      • by combining messages over an imposed virtual topology

    • The overhead of AlltoAll communication for 76-byte message exchanges between 2058 processors is in the low tens of milliseconds
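To illustrate why message combining over a virtual topology helps, here is a minimal sketch that simulates an all-to-all personalized exchange over a virtual 2-D mesh. The mesh is an assumption chosen for illustration; the slides do not say which topologies the library uses, and the function names are hypothetical. Each processor first sends one combined message to every peer in its row, then one to every peer in its column, so it sends 2(√P − 1) messages instead of P − 1.

```python
import math


def direct_alltoall_message_count(p):
    """Messages each processor sends when it contacts every peer directly."""
    return p - 1


def mesh_alltoall(p):
    """Simulate all-to-all personalized exchange over a virtual 2-D mesh.

    Returns (received, max_sent), where received[dst][src] is the payload
    that src addressed to dst and max_sent is the largest number of
    combined messages any single processor had to send.
    Assumes p is a perfect square (a simplification for this sketch).
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch assumes p is a perfect square")

    held = {r: [] for r in range(p)}   # items parked at intermediates
    sent = {r: 0 for r in range(p)}    # combined messages sent per processor

    # Phase 1: bundle items by destination column, send along own row.
    for src in range(p):
        row, col = divmod(src, side)
        bundles = {c: [] for c in range(side)}
        for dst in range(p):
            bundles[dst % side].append((src, dst, f"{src}->{dst}"))
        for c, bundle in bundles.items():
            held[row * side + c].extend(bundle)
            if c != col:               # the local bundle needs no message
                sent[src] += 1

    # Phase 2: bundle items by destination row, send along own column.
    received = {r: {} for r in range(p)}
    for mid in range(p):
        row, col = divmod(mid, side)
        bundles = {r: [] for r in range(side)}
        for item in held[mid]:
            bundles[item[1] // side].append(item)
        for r, bundle in bundles.items():
            for src, dst, payload in bundle:
                assert dst == r * side + col   # column already matches
                received[dst][src] = payload
            if r != row and bundle:
                sent[mid] += 1

    return received, max(sent.values())


received, per_processor = mesh_alltoall(16)
print(per_processor, "combined messages vs", direct_alltoall_message_count(16))
```

The trade-off is that each payload travels two hops and the combined messages are larger, which tends to pay off when individual messages are small and per-message overhead dominates.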


Some Recent Progress: People

  • The following graduate student researchers have been supported:

    • Sameer Kumar (University of Illinois)

    • Gengbin Zheng (University of Illinois)

    • Jun Nakano (University of Illinois)

    • Zhongwei Zhu (New York University)


Overview

  • Rest of the talk:

    • Objective: Develop a Molecular Dynamics program that will run effectively on a million processors

      • Each with a low memory-to-processor ratio

    • Method:

      • Use parallel objects methodology

      • Develop an emulator/simulator that allows one to run full-fledged programs on a simulated architecture

    • Presenting Today:

      • Simulator details

      • LeanMD Simulation on BG/L and BG/C


Performance Prediction on Large Machines

  • Problem:

    • How to predict performance of applications on future machines?

    • How to do performance tuning without continuous access to a large machine?

  • Solution:

    • Leverage virtualization

    • Develop a machine emulator

    • Simulator: accurate time modeling

    • Run a program on “100,000 processors” using only hundreds of processors
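As a rough illustration of the virtualization idea, the sketch below assigns emulated processors to physical ones in contiguous blocks. The block mapping and the function name `block_map` are assumptions made for this example; the slides do not describe how the emulator actually places emulated processors.

```python
def block_map(num_emulated, num_physical):
    """Assign emulated processor ids to physical processors in contiguous
    blocks whose sizes differ by at most one (hypothetical placement)."""
    base, extra = divmod(num_emulated, num_physical)
    mapping = []
    start = 0
    for phys in range(num_physical):
        size = base + (1 if phys < extra else 0)
        mapping.append(range(start, start + size))
        start += size
    return mapping


# 100,000 emulated processors hosted on 256 physical processors.
mapping = block_map(100_000, 256)
sizes = [len(block) for block in mapping]
print(min(sizes), max(sizes))
```

Each physical processor then runs a few hundred emulated processors as separate threads or objects, which is what makes the per-thread timers discussed later necessary.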


Blue Gene Emulator: functional view

[Figure: functional view of the emulator. Labels: communication threads, worker threads, inBuff, CorrectionQ, non-affinity message queues, affinity message queues, Converse scheduler, Converse Q.]


Emulator to Simulator

  • Emulator:

    • Study programming model and application development

  • Simulator:

    • Performance prediction capability

    • Models communication latency based on network model

    • Doesn’t model memory access on chip, or network contention

  • Parallel performance is hard to model

    • Communication subsystem

    • Out of order messages

    • Communication/computation overlap

    • Event dependencies

  • Parallel Discrete Event Simulation

    • Emulation program executes in parallel with event time stamp correction

    • Exploit inherent determinacy of application


How to simulate?

  • Time stamping events

    • Per thread timer (sharing one physical timer)

    • Time stamp messages

      • Calculate communication latency based on network model

  • Parallel event simulation

    • When a message is sent out, calculate the predicted arrival time for the destination bluegene-processor

    • When a message is received, update current time as:

      • currTime = max(currTime,recvTime)

    • Time stamp correction
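The time stamping rules above can be sketched as follows. This is a minimal illustration, not the simulator's code: the class and function names are hypothetical, times are integer nanoseconds, and the latency model (a fixed cost per hop plus a cost per byte on a 1-D line of processors) is an assumption standing in for whatever network model the simulator uses.

```python
from dataclasses import dataclass


@dataclass
class Message:
    src: int
    dst: int
    recv_time: int  # predicted arrival time at dst, in ns


def latency_ns(src, dst, nbytes, per_hop_ns=100, per_byte_ns=1):
    """Hypothetical network model: processors sit on a 1-D line, so the
    hop count is the distance between their ids."""
    return abs(src - dst) * per_hop_ns + nbytes * per_byte_ns


class EmulatedProcessor:
    """One emulated processor with its own virtual timer."""

    def __init__(self, pid):
        self.pid = pid
        self.curr_time = 0

    def compute(self, duration_ns):
        # Local work advances only this processor's virtual clock.
        self.curr_time += duration_ns

    def send(self, dst, nbytes):
        # Stamp the message with its predicted arrival time at dst.
        arrival = self.curr_time + latency_ns(self.pid, dst, nbytes)
        return Message(self.pid, dst, arrival)

    def receive(self, msg):
        # currTime = max(currTime, recvTime)
        self.curr_time = max(self.curr_time, msg.recv_time)


p0, p3 = EmulatedProcessor(0), EmulatedProcessor(3)
p0.compute(500)
msg = p0.send(dst=3, nbytes=76)
p3.receive(msg)
print(msg.recv_time, p3.curr_time)
```

Taking the maximum means an idle receiver jumps forward to the arrival time, while a receiver that is already past it keeps its own clock.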


Parallel correction algorithm

  • Sort message execution by receive time

  • Adjust time stamps when needed

  • Use correction messages to announce a change in an event's startTime.

  • Send correction messages along the path the original message took

  • The events already in the timeline may have to move.
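A minimal sketch of the correction step on a single emulated processor, under simplifying assumptions: events run one at a time in receive-time order, each has a fixed duration, and times are plain integers. The `Event` and `Timeline` names are hypothetical. Inserting a late message re-sorts the timeline and reports which existing events moved; in the full algorithm each reported change would be forwarded as a correction message to the destinations of the messages that event sent.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    recv_time: int
    duration: int
    start_time: int = 0


class Timeline:
    """Execution timeline of one emulated processor."""

    def __init__(self):
        self.events = []

    def insert(self, event):
        """Add an event, re-sort by receive time, recompute start times.

        Returns {name: (old_start, new_start)} for events that moved.
        """
        old = {e.name: e.start_time for e in self.events}
        self.events.append(event)
        self.events.sort(key=lambda e: e.recv_time)
        clock = 0
        for e in self.events:
            # An event starts when it has arrived and the processor is free.
            e.start_time = max(clock, e.recv_time)
            clock = e.start_time + e.duration
        return {
            e.name: (old[e.name], e.start_time)
            for e in self.events
            if e.name in old and old[e.name] != e.start_time
        }


timeline = Timeline()
for name, recv, dur in [("M1", 0, 10), ("M2", 5, 10), ("M3", 12, 10),
                        ("M4", 25, 5), ("M5", 60, 5)]:
    timeline.insert(Event(name, recv, dur))

# M8 arrives late but carries a receive time between M3 and M4.
corrections = timeline.insert(Event("M8", 20, 10))
print([e.name for e in timeline.events], corrections)
```

In this run only M4 is pushed back; M5 arrives late enough that its start time is unaffected, so no correction is needed for it.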


Timestamps Correction

[Figure sequence, four slides: each panel shows a RecvTime row and an Execution TimeLine row for messages M1 to M8.]

  • Slide 1: messages M1 to M8 in order.

  • Slide 2: M8 placed between M3 and M4 (M1, M2, M3, M8, M4, M5, M6, M7).

  • Slide 3: the two orderings above shown together, with a correction message.

  • Slide 4: a correction message for M4; panels show M4 at different positions in the order (M1, M2, M4, M3, M5, M6, M7 and M1, M2, M3, M5, M6, M4, M7).



LeanMD

  • LeanMD is a molecular dynamics simulation application written in Charm++

  • Next generation of NAMD

    • NAMD was a Gordon Bell Award winner at SC2002.

  • Requires a new parallelization strategy

    • Break up the problem in a finer-grained manner to distribute work effectively across an extremely large number of processors.


LeanMD Performance Analysis

Need readable graphs:

1 to a page is fine, but with larger fonts, thicker lines