
Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks



  1. Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Authors: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos Department of Computer Engineering and Informatics, University of Patras, Greece, and Research Academic Computer Technology Institute, Patras, Greece Conference: CCGRID 2008

  2. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  3. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  4. Introduction • Many applications benefit from Grid computing: • Computation-intensive applications: involve computationally intensive problems on small datasets. • Data-intensive applications: perform computations on large datasets stored at geographically distributed resources. (Such a Grid is usually referred to as a Data Grid.)

  5. Introduction • We evaluate a task scheduling and data migration problem called Data Consolidation (DC).

  6. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  7. Previous Work • Most related works assume that each task needs only one large piece of data for its execution. As a result, the common scenario in which a task needs several distributed datasets is ignored in most related works.

  8. Previous Work • In “Intelligent Scheduling and Replication in Datagrids: a Synergistic Approach”: • Each task needs one or more pieces of data for its execution. • A tabu-search scheduler is used. • The goals are to optimize execution time and system utilization.

  9. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  10. Problem Formulation • A Grid Network consists of a set R of N sites; each r∈R contains at least one of the following entities: • computation resource • storage resource • network resource • Each computation resource has a local scheduler and a queue. • There is a central scheduler responsible for task scheduling and data management. (This scheduler has complete knowledge of the static and dynamic characteristics of the sites.)
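A minimal Python sketch of these entities (our illustration; the class and field names are assumptions, not the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    cpu_capacity: float = 0.0       # computation resource (0 means absent)
    storage_capacity: float = 0.0   # storage resource, in MB (0 means absent)
    datasets: set = field(default_factory=set)   # ids of locally held datasets
    queue: list = field(default_factory=list)    # local task queue

@dataclass
class CentralScheduler:
    # Per the slide, the central scheduler sees the static and dynamic
    # characteristics of every site.
    sites: list

    def data_holders(self, dataset_id):
        # All sites holding a replica of the given dataset.
        return [s for s in self.sites if dataset_id in s.datasets]
```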

  11. Problem Formulation

  12. Problem Formulation • On receiving the user’s request, the central scheduler examines the computation- and data-related characteristics of the task. • Based on the DC algorithm in use, the central scheduler selects: • The sites that hold replicas of the datasets the task needs. • The site where these datasets will be consolidated and the task will be executed. (This site is called the DC site.) NOTE: the DC site must have enough free storage to hold all the consolidated datasets; this storage inequality must be satisfied.

  13. Problem Formulation • The scheduler orders the data-holding sites to transfer their datasets to the DC site, • and instructs the user to send the task to the DC site. • After the task finishes execution, the results are returned to the originating user.
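The scheduling flow of slides 12-13 can be sketched as follows (illustrative only; select_replicas_and_site stands in for whichever DC algorithm is configured, and transfer is a stub):

```python
def transfer(holder, dc_site, dataset_id):
    # Stub: in a real system this would stage the dataset over the network.
    dc_site.datasets.add(dataset_id)

def schedule_task(scheduler, task, select_replicas_and_site):
    # 1. Choose one replica-holding site per dataset, plus the DC site.
    holders, dc_site = select_replicas_and_site(scheduler, task)
    # 2. Order each holder to send its dataset, skipping datasets the
    #    DC site already stores.
    for dataset_id, holder in holders.items():
        if dataset_id not in dc_site.datasets:
            transfer(holder, dc_site, dataset_id)
    # 3. The task moves to the DC site and queues for execution there;
    #    results are later returned to the originating user.
    dc_site.queue.append(task)
    return dc_site
```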

  14. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  15. Theoretical Analysis • Assume that the scheduler has selected the data-holding sites rk∈R for all datasets Ik, k=1,2,…,L, and the DC site. • The DC site may already hold some of the datasets, in which case no transfer is required for them.

  16. Theoretical Analysis • In general, a data-intensive task experiences: • communication delay (Dcomm) • processing delay (Dproc)

  17. Theoretical Analysis • communication delay (Dcomm): Dcomm = Dcons + Doutput = max_{k=1,…,L} D(rk, rDC, Ik) + D(rDC, ru, Iout), where D(ri, rj, I) is the delay of transferring dataset I from site ri to site rj, rDC is the DC site, ru is the originating user’s site, and Iout is the task’s output data (the L consolidation transfers are taken to proceed in parallel, so Dcons is the delay of the slowest one).

  18. Theoretical Analysis • processing delay (Dproc): Dproc = Dqueue + W/CDC, i.e., the queueing delay at the DC site plus the task’s execution time, where W is the task’s workload and CDC is the computation capacity of the DC site.

  19. Theoretical Analysis • The total delay suffered by a task is DDC=Dcomm+Dproc.
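Putting slides 16-19 together, a hedged numeric sketch of the delay model (assuming a transfer's delay is size / path bandwidth and that the L consolidation transfers run in parallel; the paper's exact expressions are not shown in the transcript):

```python
def transfer_delay(size_mb, bandwidth_mb_per_s):
    return size_mb / bandwidth_mb_per_s

def total_task_delay(dataset_sizes, bandwidths, output_size, output_bw,
                     queue_delay, workload, cpu_capacity):
    d_cons = max(transfer_delay(s, b) for s, b in zip(dataset_sizes, bandwidths))
    d_output = transfer_delay(output_size, output_bw)
    d_comm = d_cons + d_output                       # Dcomm = Dcons + Doutput
    d_proc = queue_delay + workload / cpu_capacity   # Dproc = queueing + execution
    return d_comm + d_proc                           # DDC = Dcomm + Dproc

# Example: two 7500 MB datasets over 1 Gbps (~125 MB/s) paths.
print(total_task_delay([7500, 7500], [125, 125],
                       output_size=100, output_bw=125,
                       queue_delay=5.0, workload=3000, cpu_capacity=100))
```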

  20. Proposed Techniques • We propose three categories of DC algorithms: • Time: ConsCost, ExecCost, TotalCost • Traffic: SmallTrans • Random: Rand, RandOrig

  21. Time • Consolidation-Cost (ConsCost) algorithm: we select the replicas and the DC site that minimize the data consolidation time (Dcons). • Given a candidate DC site rj, for each dataset Ik we search for the holding site ri for which the transfer delay D(ri, rj, Ik) is minimum, and hence the data consolidation time of rj is Dcons(rj) = max_{k=1,…,L} min_{ri holding Ik} D(ri, rj, Ik). • Finally, we determine the DC site as the candidate with the smallest consolidation time: rDC = argmin_{rj} Dcons(rj).
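A sketch of ConsCost's selection rule (holders_of and delay are hypothetical helpers for replica lookup and per-transfer delay estimation):

```python
def cons_cost_site(candidate_sites, datasets, holders_of, delay):
    def d_cons(dc):
        # Fastest replica per dataset; the slowest such fetch bounds Dcons.
        return max(min(delay(src, dc, k) for src in holders_of(k))
                   for k in datasets)
    return min(candidate_sites, key=d_cons)   # argmin over candidate DC sites
```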

  22. Time • Execution-Cost (ExecCost) algorithm: we select the DC site that minimizes the task’s execution delay at that site, while the data replicas are randomly chosen. NOTE: this delay is difficult to calculate exactly, but we can estimate it for a site ri based on: • the tasks already assigned to it (ri), and • the average delay that tasks executed on it have experienced.
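A sketch of ExecCost's estimate, following the slide's NOTE (the site record layout is our assumption):

```python
from collections import namedtuple

# Illustrative site record: workloads already queued plus computation capacity.
Site = namedtuple("Site", ["name", "cpu_capacity", "queue"])

def exec_cost_site(candidate_sites, workload):
    # Estimate the task's finish time at each site from the work already
    # assigned there, then pick the minimum.
    def estimated_delay(site):
        return (sum(site.queue) + workload) / site.cpu_capacity
    return min(candidate_sites, key=estimated_delay)
```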

  23. Time • Total-Cost (TotalCost) algorithm: we select the replicas and the DC site that minimize the total task delay; that is, the algorithm combines the two algorithms above (ConsCost and ExecCost).
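A one-line sketch of TotalCost, reusing the previous two estimates (d_cons and d_exec are assumed callables):

```python
def total_cost_site(candidate_sites, d_cons, d_exec):
    # Score each candidate DC site by estimated consolidation time plus
    # estimated execution delay, and pick the argmin.
    return min(candidate_sites, key=lambda s: d_cons(s) + d_exec(s))
```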

  24. Traffic • Smallest-Data Transfer (SmallTrans) algorithm: we select the DC site for which the fewest datasets (or the datasets with the smallest total size) need to be consolidated for the task’s execution.
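A sketch of SmallTrans (needed_sizes and the per-site datasets set are our assumptions):

```python
def small_trans_site(candidate_sites, needed_sizes):
    # Prefer the site that already holds most of the task's data, i.e.
    # minimize the total size (or, alternatively, the count) of the
    # datasets still missing. needed_sizes maps dataset id -> size in MB.
    def missing_size(site):
        return sum(size for ds, size in needed_sizes.items()
                   if ds not in site.datasets)
    return min(candidate_sites, key=missing_size)
```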

  25. Random • Random-Random (Rand) algorithm: the data replicas used by the task and the DC site are randomly chosen. • Random-Origin (RandOrig) algorithm: the data replicas used by the task are randomly chosen, and the DC site is the site where the task originated.
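Both baselines are straightforward; a sketch (holders_of is the same hypothetical replica-lookup helper as before):

```python
import random

def random_replicas(datasets, holders_of):
    # Rand and RandOrig both draw one replica holder per dataset at random.
    return {k: random.choice(holders_of(k)) for k in datasets}

def rand_site(candidate_sites):
    # Rand: the DC site is also drawn uniformly at random.
    return random.choice(candidate_sites)

def rand_orig_site(origin_site):
    # RandOrig: the DC site is always the task's originating site.
    return origin_site
```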

  26. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  27. Simulation • We use the NSFNET topology, which contains: • 14 nodes (only 5 nodes are equipped with a computation and a storage resource; such nodes are called sites) • each site has equal storage and computation capacity • one additional node acts as a Tier 0 site and holds all the datasets • 21 links (all link capacities are equal to 1 Gbps)

  28. NSFNET topology

  29. Assumptions • Only one transmission is possible at a time over a link. • Propagation delay is not taken into account. • 50 datasets exist in the network initially, with two copies of each dataset (one copy is distributed among the 5 sites, the other is placed at the Tier 0 site). • In each experiment, users generate a total of 50,000 tasks. • We keep the average total data size constant at S = L·I = 15000 MB (L: number of datasets a task requests; I: the average size of each dataset), and we examine the following (L,I) pairs: (2,7500), (3,5000), (4,3750), (6,2500), (8,1875), (10,1500). • The workload of a task correlates with the total data size, W = a·S, where a is a parameter such that tasks are more data-intensive as a decreases.
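A small sketch checking the slide's arithmetic: each (L, I) pair keeps S = L·I at 15000 MB, and we assume the workload takes the form W = a·S (the slide states only the correlation, not the exact formula):

```python
PAIRS = [(2, 7500), (3, 5000), (4, 3750), (6, 2500), (8, 1875), (10, 1500)]

for L, I in PAIRS:
    assert L * I == 15000   # total data size stays constant across pairs

def workload(total_data_size_mb, a):
    # Smaller a -> less computation per MB of data -> more data-intensive task.
    return a * total_data_size_mb
```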

  30. Simulations • DC probability: the probability that the DC site does not already hold all the required datasets.

  31. Simulations • Task delay: the time between a task’s creation and its completion.

  32. Simulations Network load depends on: 1. the size of datasets transferred. 2. the number of hops these datasets traverse.
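A sketch of this metric (our formulation of size × hops, summed over all dataset transfers):

```python
def network_load(transfers):
    # transfers: iterable of (size_mb, hop_count) pairs, one per transfer.
    return sum(size * hops for size, hops in transfers)

# Example: 7500 MB over 2 hops plus 7500 MB over 3 hops -> 37500 MB*hops.
print(network_load([(7500, 2), (7500, 3)]))
```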

  33. Simulations

  34. Simulations

  35. Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion

  36. Conclusion • If DC is performed efficiently, important benefits can be obtained in terms of task delay, network load and other performance parameters of interest.
