Dataflow execution of sequential imperative programs on multicore architectures

Dataflow Execution of Sequential Imperative Programs on Multicore Architectures

陳品杰

Department of Electrical Engineering

National Cheng Kung University

Tainan, Taiwan, R.O.C


Introduction

  • Proposes a novel execution model

    • that enables statically-sequential programs to execute in parallel

  • Programs written for this model are easy to develop, yet expose parallelism

  • Using an ordinary imperative programming language (e.g., C++),

    • dataflow parallel execution of statically-sequential programs can be achieved

  • Sequential programs are dynamically parallelized on multiple processing cores, executing in a dataflow fashion


Dataflow Execution of Sequential Imperative Programs (1/6)

  • Under the dataflow model, programs execute in a data-driven manner

    • In contrast to control flow, which executes instructions sequentially,

      • an instruction executes as soon as its operands become available

    • Data-dependent instructions automatically execute in order, while independent instructions can execute in parallel

    • To successfully adopt a similar model in a multicore environment,

      • several known and new challenges must be addressed:

        • a practical programming paradigm

        • dependences

        • resource management

        • principles for applying the model in a multicore environment


Dataflow Execution of Sequential Imperative Programs (2/6)

  • Dataflow on Multicores

    • Traditional dataflow machines target instruction-level parallelism (ILP)

      • Here the granularity is raised to functions, which serve as the unit of task-level computation

      • This promotes code reuse

    • Functions are better suited to the scale of multicores

      • Executing functions on individual cores in a dataflow fashion

        • achieves Function-Level Parallelism (FLP)


Dataflow Execution of Sequential Imperative Programs (3/6)

  • Dataflow on Multicores

    • At run time, the sequential program is processed one function at a time (rather than one instruction at a time)

    • A function's operands must be identified before it executes

    • Operands are generalized from individual registers or memory locations to objects

    • A function's input operands form its read set

      • its output operands form its write set

      • together they are called its data set

      • the sets are built on the C++ STL
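The read-set/write-set/data-set idea can be sketched with STL sets. This is an illustrative sketch only, not the library's actual classes; `ObjectId`, `DataSet`, and `conflicts` are assumed names:

```cpp
#include <cassert>
#include <set>
#include <string>

// A "data set" is the union of a function's read set and write set,
// held here as STL sets of object identities.
using ObjectId = std::string;

struct DataSet {
    std::set<ObjectId> read_set;   // objects the function only reads
    std::set<ObjectId> write_set;  // objects the function may modify
};

// Two functions conflict (must serialize) if one writes an object
// that the other reads or writes.
bool conflicts(const DataSet& a, const DataSet& b) {
    for (const auto& w : a.write_set)
        if (b.read_set.count(w) || b.write_set.count(w)) return true;
    for (const auto& w : b.write_set)
        if (a.read_set.count(w) || a.write_set.count(w)) return true;
    return false;
}
```

Note that two functions that only read the same object (e.g., T1 and T2 both reading A in the example later) do not conflict, so they may run in parallel.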


Dataflow Execution of Sequential Imperative Programs (4/6)

  • Dataflow on Multicores

    • Every object in a data set has an identity

      • used to establish the data dependences between functions

    • The runtime determines whether the function about to execute depends on any currently-executing function(s)

      • If not, the function is submitted (or delegated) to a core for execution

      • If so, the function is shelved until it has no outstanding dependences

      • In either case, the runtime moves on to the next function of the sequential program


Dataflow Execution of Sequential Imperative Programs (5/6)

  • Handling Dependences

    • Tokens are used to handle dependences in the dataflow mechanism

    • The technique resembles that of classic dataflow machines, but with two key improvements

        • Instead of associating tokens with individual memory locations, tokens are associated with objects, achieving data abstraction

        • Each object is given multiple read tokens

        • but only one write token

        • which together govern how data is produced and consumed


Dataflow Execution of Sequential Imperative Programs (6/6)

  • Handling Dependences

    • When a function is to run under dataflow execution:

      • The function requests read (write) tokens for the objects in its read (write) set

      • Once the function has acquired all the tokens it needs, it is ready and may execute

    • When a function finishes executing, it relinquishes all of its tokens to shelved functions that need them

    • Once a shelved function has acquired all the tokens it needs, it is unshelved and executed

    • The model can also execute a given function sequentially

      • i.e., the function executes only after all operations before it in program order have completed

      • and subsequent operations must in turn wait for the function to complete
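The multiple-read-token / single-write-token rule can be sketched as a small per-object state machine. `ObjectTokens` and its methods are illustrative names under an assumed counter-based design, not the prototype's API:

```cpp
#include <cassert>

// Per-object token state: many read tokens may be out at once, but
// the single write token is exclusive with respect to both readers
// and another writer (reader/writer-lock-like semantics).
struct ObjectTokens {
    int readers = 0;       // read tokens currently granted
    bool writer = false;   // is the write token granted?

    bool try_acquire_read() {
        if (writer) return false;       // a writer holds the object
        ++readers;
        return true;
    }
    bool try_acquire_write() {
        if (writer || readers > 0) return false;
        writer = true;
        return true;
    }
    void release_read()  { --readers; }
    void release_write() { writer = false; }
};
```

This captures why T1 and T2 in the example may both hold read tokens on A, while T3, which wants to write A, must wait until both are released.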


Dataflow Execution of Sequential Imperative Programs (1/7)

  • Model Overview - Example

Figure 1. (a) Example pseudocode that invokes functions T and T'. T: {write set} {read set} modifies (reads) objects in its write set (read set). The data set of T' is unknown.

Figure 1. (b) Dynamic invocations of the functions T and T', in program order, and the data set of each invocation.


Dataflow Execution of Sequential Imperative Programs (2/7)

  • Model Overview – Data dependence between the functions

[Figure 1(c): dataflow graph of the dynamic function stream over time; edges among T1–T6 through objects A, B, and D are labeled RAW, WAR, or WAW.]

(c) Dataflow graph of the dynamic function stream.


Dataflow Execution of Sequential Imperative Programs (3/7)

  • Model Overview – Execution of the code as per model – t1

T1 successfully acquires a read token for object A and write tokens for B and C
> submitted for execution

T2 successfully acquires a read token for object A and a write token for D
> submitted for execution

T6 successfully acquires a read token for object H and a write token for G
> submitted for execution

[Figure 1(d): at t1, T1, T2, and T6 run in parallel on separate cores; T3, T4, T5, and T' are not yet eligible; a barrier precedes T'.]

Figure 1. (d) Dataflow execution schedule of the function stream


Dataflow Execution of Sequential Imperative Programs (4/7)

  • Model Overview – Execution of the code as per model – t2

T1 finishes execution
> releases the write tokens for B and C
> T4 acquires the write token for B but still lacks the read token for D

[Figure 1(d): execution schedule at t2.]


Dataflow Execution of Sequential Imperative Programs (5/7)

  • Model Overview – Execution of the code as per model – t3

T2 finishes execution
> releases the write token for D and the read token for A
> T3 acquires the write token for A and starts execution
> T4 acquires the read token for D and starts execution

[Figure 1(d): execution schedule at t3.]


Dataflow Execution of Sequential Imperative Programs (6/7)

  • Model Overview – Execution of the code as per model – t4

T4 finishes execution
> releases the write token for B and the read token for D
> T5 acquires the write token for B and starts execution

[Figure 1(d): execution schedule at t4.]


Dataflow Execution of Sequential Imperative Programs (7/7)

  • Model Overview – Execution of the code as per model – after t4

[Figure 1(d): execution schedule after t4.]

T' will be submitted for execution after all previous functions complete.


Dataflow Execution of Sequential Imperative Programs (1/1)

  • Deadlock Avoidance

    • If two or more functions form a circular dependence on tokens, the token mechanism of a conventional dataflow model can deadlock

    • For example, the invocations of T4 and T5 could create the request order

      • T4: acquires B → T5: waits for B → T5: acquires D → T4: waits for D → deadlock

    • To avoid token deadlocks, the runtime ensures that:

      • (i) a token can be requested by only one function at a time

      • (ii) tokens for an object are granted in the order the functions request them (first ask, first grant)

    • Hence T5's tokens can be requested only after T4's

[Diagram: the T4/T5 request cycle through objects B and D, with edges labeled RAW, WAR, and WAW.]
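Rules (i) and (ii) can be sketched as a FIFO wait list per token: an available token is granted only to the oldest pending requester. `Token` and `request` are assumed names for this sketch, not the prototype's API:

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <string>

// One token with a FIFO wait list. An available token is granted
// only if no earlier function is already waiting for it ("first
// ask, first grant"), which breaks the circular waits above.
struct Token {
    bool held = false;
    std::deque<std::string> waiters;  // pending requesters, oldest first

    // Returns true if `who` is granted the token now.
    bool request(const std::string& who) {
        if (!held && (waiters.empty() || waiters.front() == who)) {
            if (!waiters.empty()) waiters.pop_front();
            held = true;
            return true;
        }
        // Enqueue once, preserving request order.
        if (std::find(waiters.begin(), waiters.end(), who) == waiters.end())
            waiters.push_back(who);
        return false;
    }
    void release() { held = false; }
};
```

In the T4/T5 example, T5's request for B queues behind T4's, so T5 cannot grab B (or, symmetrically, D) out of order and the cycle never forms.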


Prototype Implementation (1/4)

  • A software prototype of the execution model was developed as a C++ runtime library

  • Static Sequential Program

    • To fully express imperative programs, the model allows a program to switch between dataflow and sequential execution

    • Programs for the model are currently written in C++, like conventional sequential programs

    • In particular, the user must identify

      • the potential parallelism between functions

      • the objects shared between functions

      • each function's read set and write set


Prototype Implementation (2/4)

  • Static Sequential Program

    • Dataflow Functions

      • The library provides the df_execute interface, through which functions in the program can execute in parallel

        • df_execute is a runtime function implemented with C++ templates

        • Instructions outside df_execute calls execute in their usual program order

      • Shared data,

        • in the form of global or passed-by-reference objects (or pointers to them), that are accessed by a function, are passed to it as arguments.

        • Users group them into two sets: one that may be modified (the write set) and another that is only read (the read set).

        • The C++ STL-based set data structure of the token base class is used to create them.

Figure 2. Example program in the proposed model.
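Figure 2 itself is not reproduced in this transcript, so the following is a hedged stand-in showing only the calling shape: a df_execute-like template taking a function plus its write and read sets. The real df_execute's signature is not shown here, and this stand-in simply invokes the function inline instead of engaging a token-managing runtime:

```cpp
#include <cassert>
#include <set>
#include <string>

// Assumed: object identities name the shared data in each set.
using ObjectSet = std::set<std::string>;

// Hypothetical stand-in for the library's df_execute template. A
// real runtime would acquire tokens for write_set/read_set here and
// possibly shelve or delegate the call; we just run it inline.
template <typename Fn>
void df_execute(Fn fn, const ObjectSet& write_set, const ObjectSet& read_set) {
    (void)write_set; (void)read_set;  // token acquisition would use these
    fn();
}

// Example shared objects and a "function" in the model's sense:
// T1 writes B and C, and reads A.
int B = 0, C = 0;
const int A = 5;

void T1() { B = A + 1; C = A + 2; }
```

User code would then look like `df_execute(T1, /*write_set=*/{"B", "C"}, /*read_set=*/{"A"});`, with the remainder of the program free to continue past the call.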


Prototype Implementation (3/4)

  • Static Sequential Program

    • Serial Segments/Functions

      • The user can return to sequential execution through the df_end interface

      • df_end acts like a barrier, taking the program out of dataflow execution

      • Once all code preceding df_end has completed, the code that follows executes sequentially

        • T' in our example.

Figure 2. Example program in the proposed model.


Prototype Implementation (4/4)

  • Static Sequential Program

    • Serial Segments/Functions

      • To operate on a shared object in program order within the main program, the model provides the df_seq interface

        • df_seq accepts the object instance, the function (object method) pointer and any arguments to it.

        • df_seq executes only after all earlier functions that use the object have completed,

        • and code after df_seq must wait until df_seq completes

        • df_seq causes the runtime to suspend the main program context until the associated function finishes operating on the specified object.

        • Execution will proceed from line 6 only after print finishes,

          • but potentially in parallel with other (prior) functions (that are not accessing G).

Figure 2. Example program in the proposed model.


Runtime Mechanics (1/10)

  • The mechanism is implemented with multithreading; parallel execution is realized with the Pthread API

    • Thread management is abstracted away, so users never deal with threads directly

    • Users need not understand the mechanism's underlying architecture

  • Executing Functions on Processing Cores

    • At the start of a program, the runtime creates threads, usually one per hardware context available to it.

      • A double-ended work queue (deque) is then assigned to each thread in the system.

    • Computations are scheduled for execution by a thread by enqueuing them in the corresponding work deque.


Runtime Mechanics (2/10)

  • Discovering Functions for Parallel Execution

    • At the start of execution, only one processor waits for work to arrive in its deque; all other processors are idle

    • The early phase of execution thus resembles sequential execution

    • When a df_execute call is encountered, the runtime is engaged


Runtime Mechanics (3/10)

  • Discovering Functions for Parallel Execution

    • The runtime processes a dataflow function in three decoupled phases:

      • prelude

      • execute

      • postlude

Figure 3. Logical view of runtime operations to process a dataflow function.


Runtime Mechanics (4/10)

  • Discovering Functions for Parallel Execution

    • In the prelude phase

      • Dereferences pointers to objects in the read/write sets, if need be, and attempts to acquire the tokens

    • In the execute phase

      • Successful acquisition of the tokens leads to the execute phase (Figure 3: 2), in which the function is delegated for (potentially parallel) execution

      • Specifically, the runtime pushes the program continuation (the remainder of the program past the df_execute call) onto the thread's work deque, and executes the function on the same thread.


Runtime Mechanics (5/10)

  • Discovering Functions for Parallel Execution

    • A task-stealing scheduler

      • running on each hardware context causes an idle processor to steal the program continuation and continue its execution until it encounters the next df_execute, repeating the process of delegating the function and pushing the program continuation onto its own work deque.

      • Thus the execution of the program unravels in parallel with the executing functions, possibly across different hardware contexts rather than on a single one.
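The per-thread deque discipline can be sketched as follows. This is a simple mutex-guarded deque for illustration; the class name and locking are assumptions, and real task-stealing deques typically use lock-free algorithms. The owner works at one end while thieves take from the other, so a thief gets the oldest queued continuation:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Minimal per-thread work deque: the owning thread pushes and pops
// at the back, while idle threads steal from the front.
class WorkDeque {
    std::deque<std::function<void()>> q;
    std::mutex m;
public:
    void push(std::function<void()> task) {          // owner end
        std::lock_guard<std::mutex> g(m);
        q.push_back(std::move(task));
    }
    std::optional<std::function<void()>> pop() {     // owner end (newest)
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.back());
        q.pop_back();
        return t;
    }
    std::optional<std::function<void()>> steal() {   // thief end (oldest)
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.front());
        q.pop_front();
        return t;
    }
};
```

Stealing from the opposite end to the owner keeps contention low and hands thieves the earliest (and typically largest) pending continuation.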


Runtime Mechanics (6/10)

  • Tokens and Dependency Tracking

    • During program execution,

      • each allocated object is given

        • one write token

        • a practically unlimited number of read tokens (limited only by the number of bits used to represent tokens),

        • and a wait list

      • Tokens are acquired for the objects that a dataflow function operates on

      • and released when the function completes


Runtime Mechanics (7/10)

  • Tokens and Dependency Tracking

    • A token may be granted only if it is available.

    • Figure 4a gives the definition of availability of read and write tokens,

      • and Figure 4b shows the token acquisition protocol.

    • The wait list is used to track functions to which the token could not be granted at the time of their requests.

      • A non-empty wait list signifies pending requests, in the enlisted order.

    • An available token is not granted if an earlier function enqueued in the wait list is waiting to acquire it (Figure 4b: 1).


Runtime Mechanics (8/10)

  • Tokens and Dependency Tracking

Figure 4. The token protocol: (a) Definition of availability

Figure 4. The token protocol: (b) Read/Write token acquisition


Runtime Mechanics (9/10)

  • Shelving Functions/Program Continuations

    • If a function cannot acquire its tokens

      • the function is enqueued in the wait lists of all the objects whose tokens could not be acquired (Figure 4b: 4 or 5),

      • and subsequently shelved (Figure 3: 1.2).

    • While the shelved function waits for the dependences to resolve,

      • the runtime looks for other independent work from the program continuation to perform

Figure 3. Logical view of runtime operations to process a dataflow function.

Figure 4. The token protocol: (b) Read/Write token acquisition
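The prelude/execute/postlude flow with shelving can be sketched single-threaded as follows. `Task`, `Object`, `Runtime`, and the one-exclusive-token-per-object simplification are all assumptions made for brevity; the real runtime distinguishes read and write tokens and runs multithreaded:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

struct Task;

struct Object {
    bool token_held = false;
    std::queue<Task*> waiters;  // shelved tasks, in request order
};

struct Task {
    std::function<void()> body;
    std::vector<Object*> data_set;  // objects this task reads or writes
};

struct Runtime {
    std::queue<Task*> ready;

    // Prelude phase: acquire all tokens, or shelve on the first busy object.
    void submit(Task* t) {
        for (Object* o : t->data_set)
            if (o->token_held) { o->waiters.push(t); return; }  // shelved
        for (Object* o : t->data_set) o->token_held = true;
        ready.push(t);  // delegated for execution
    }

    // Execute + postlude phases.
    void run() {
        while (!ready.empty()) {
            Task* t = ready.front(); ready.pop();
            t->body();                         // execute phase
            for (Object* o : t->data_set) {    // postlude: release, retry waiters
                o->token_held = false;
                std::queue<Task*> pending;
                std::swap(pending, o->waiters);
                while (!pending.empty()) { submit(pending.front()); pending.pop(); }
            }
        }
    }
};
```

Submitting two tasks that touch the same object shelves the second; when the first finishes, its postlude releases the token and the shelved task is resubmitted and runs.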


Runtime Mechanics (10/10)

  • Completion of Function Execution

Figure 3. Logical view of runtime operations to process a dataflow function.

Figure 4. The token protocol:(c) Token release


Example Execution (1/15)

[Figure 5: initial state; objects H and A–G each show R = 0 (no read tokens out) and CPU0–CPU2 are idle.]

Figure 5. Example execution


Example Execution (2/15)

[Figure 5: T1 runs on CPU0; A now has one read token out (R = 1) and T1 holds two write tokens.]

Figure 5. Example execution


Example Execution (3/15)

[Figure 5: an idle processor steals the continuation; T2 runs on CPU1 alongside T1, A has two read tokens out (R = 2), and T2 holds a write token.]

Figure 5. Example execution


Example Execution (4/15)

[Figure 5: the continuation is stolen again; T3 requests its tokens (F: R = 1) but cannot yet execute.]

Figure 5. Example execution


Example Execution (5/15)

[Figure 5: T4 cannot acquire its tokens and is shelved in the wait lists of the objects it needs.]

Figure 5. Example execution


Example Execution (6/15)

[Figure 5: T5 likewise cannot acquire its tokens and is shelved behind T4.]

Figure 5. Example execution


Example Execution (7/15)

[Figure 5: T6 acquires a read token on H (R = 1) and a write token, and runs on CPU2.]

Figure 5. Example execution


Example Execution (8/15)

[Figure 5: T1 completes and releases its tokens (A: R = 1); T4 acquires a write token but still cannot execute, leaving CPU0 without runnable work.]

Figure 5. Example execution


Example Execution (9/15)

[Figure 5: the main program reaches df_seq.]

df_seq causes the runtime to shelve the program continuation beyond df_seq in G's wait list and await completion of all functions accessing G

Figure 5. Example execution


Example Execution (10/15)

[Figure 5: T2 and T6 complete and release their tokens; only T3's and T4's outstanding tokens remain.]

Figure 5. Example execution


Example Execution (11/15)

[Figure 5: T3 is unshelved and starts executing on CPU0; D now has two read tokens out (R = 2).]

Figure 5. Example execution


Example Execution (12/15)

[Figure 5: T4 is unshelved and executes; T5 remains shelved.]

T5 will be scheduled for execution once T4 completes

Figure 5. Example execution


Example Execution (13/15)

[Figure 5: G.print() executes as the df_seq function.]

After print completes, the runtime schedules the program continuation for execution

Figure 5. Example execution


Example Execution (14/15)

[Figure 5: T5 executes, holding a write token (D: R = 1).]

The continuation is shelved again, thus preventing further processing of the program, until all in-flight functions finish

Figure 5. Example execution


Example Execution (15/15)

[Figure 5: all functions have completed and all tokens have been released; T' executes on CPU0.]

Figure 5. Example execution


Conclusion

  • Presented a novel execution model that achieves function-level parallel execution of statically-sequential imperative programs on multicore processors.

    • Parallel tasks (program functions) are dynamically extracted from a sequential program

      • and executed in a dataflow fashion on multiple processing cores using tokens associated with shared data objects,

      • and employing a token protocol to manage the dependences between tasks.

    • The model thus combines the benefits of sequential programming and dataflow execution.

