Course: Data Grids. Instructors: Jean-Marc Pierson, Lionel Brunie

Presentation date: 01/02/2006
Student: Sammarco Aniello
Paper: An Adaptive Distributed Query Processing Grid Service
Authors: F. Porto, V. F. V. da Silva, M. L. Dutra, B. Schulze
Venue: Proc. VLDB Workshop on Data Management in Grids, VLDB, LNCS 3836, Trondheim, Norway, 2-3 September 2005

OUTLINE
1- INTRODUCTION
2- ABSTRACT DB
3- ARCHITECTURE
4- QUERY PROCESSING
5- GridGreedyNode (G2N) algorithm
6- Query Execution Engine Framework
7- INITIAL RESULT
8- CONCLUSION
Slide N.:2

PROJECT CoDIMS (Configurable Data Integration Middleware)
CoDIMS-G is a distributed grid service for the evaluation of scientific queries. Its design focuses on efficient and adaptable query evaluation strategies for the grid environment.
TESTBED: it supports the pre-processing stage of a scientific visualization application (SVA) at the National Laboratory of Scientific Computing (LNCC), Brazil.
Sections: OBJECTIVES, SOLUTIONS, RESULT, FOCUS ON ADAPTIVE PROBLEM
Slide N.:3

PROJECT CoDIMS-G: OBJECTIVES
(1) Dynamic scheduling and allocation of query execution engine modules onto grid nodes
(2) Adaptability of query execution to variations in environment conditions
(3) Support for special scientific operations
Slide N.:4

PROJECT CoDIMS-G: SOLUTIONS
Using the processing power available in a grid environment may substantially reduce the time needed to pre-process virtual particle trajectories.
(1) A new node scheduling algorithm that "selects grid nodes for parallel evaluation"
(2) An extension of the Eddy operator
Slide N.:5

PROJECT CoDIMS-G: RESULT
Reduction of the scheduling time
Slide N.:6

PROJECT CoDIMS-G: FOCUS ON ADAPTIVE PROBLEM
The goal is to adapt the execution of an application to the changing conditions of the selected grid nodes. The problem in this context is to identify the points where execution may be interrupted on one node and restarted on other nodes.
Slide N.:7

ABSTRACT DB
The Geometry relation stores data associated with a polyhedron's geometry:
  Geometry (id, time-instant, polyhedron<point>, velocity<point-velocity>)
Velocity corresponds to the velocity vectors for each time instant.
The Particle relation holds the initial particle position:
  Particle (part-id, time-instant, point)
The Resulting-vector user program computes a resulting speed vector at a specific position of the flow path:
  Resulting-vector (position, polyhedron<point>, velocity<point-velocity>): velocity
The Trajectory Computing Program (TCP) computes a virtual particle's (VP's) subsequent position:
  TCP (particle-id, position, velocity): new-position
Slide N.:8
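
The relations and program signatures on this slide can be sketched as Python types. This is only an illustrative sketch, not the CoDIMS-G schema: the field types, the 3-D tuple representation, and the explicit Euler step inside `tcp` (including its `dt` parameter) are assumptions; only the relation, attribute, and program names come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]    # assumed: 3-D coordinates
Vector = Tuple[float, float, float]   # assumed: 3-D velocity vector


@dataclass
class Geometry:
    id: int
    time_instant: int
    polyhedron: List[Point]    # the polyhedron's vertices
    velocity: List[Vector]     # one velocity vector per vertex


@dataclass
class Particle:
    part_id: int
    time_instant: int
    point: Point               # initial particle position


def tcp(particle_id: int, position: Point, velocity: Vector,
        dt: float = 1.0) -> Point:
    """Return the particle's next position.

    Hypothetical body: a single explicit Euler step. The slide only gives
    the signature TCP(particle-id, position, velocity): new-position.
    """
    return (position[0] + velocity[0] * dt,
            position[1] + velocity[1] * dt,
            position[2] + velocity[2] * dt)
```

For instance, `tcp(1, (0.0, 0.0, 0.0), (1.0, 2.0, 3.0), dt=0.5)` advances the particle half a time unit along its velocity.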

ARCHITECTURE OF CoDIMS-G
Components: Client Interface, Control Component, Metadata Manager (MM), Parser Component, Query Optimizer (QO), Scheduler Component (SC), Query Execution Manager (QEM), Query Engines (QE) 1 to n.
Client Interface: user requests are forwarded to the Control Component, which sends them to the query processing system.
Control Component: the core of the CoDIMS environment; it stores, manages, validates, and verifies an instance configuration.
Parser: transforms user requests into a query graph (QG) representation.
Query Optimizer (QO): receives the graph and generates a physical distributed query execution plan (DQEP) using a cost model based on data and program statistics stored in the Metadata Manager (MM).
Scheduler (SC): called by the optimizer, it indicates the set of interesting nodes to be allocated for the parallelized operator. The scheduler and the optimizer cooperate to generate an initial distributed parallel query execution plan (DQEP).
Query Execution Manager (QEM): deploys the query execution engine (QE) services at the nodes specified in the DQEP and manages their life cycle during query execution. The QEM also manages the QEs' real-time performance.
Query Engine (QE): the component where actual query execution takes place. QE instances are created on the scheduled grid nodes. Each QE receives a fragment of the DQEP and is responsible for controlling its execution.
Slide N.:9

DISTRIBUTED QUERY PROCESSING
We express a query as a query graph QG, defined as a partially ordered set of operators $QG = \{\Omega, \prec\}$, where $\Omega$ is a set of algebraic operators and $\prec$ is a set of dependency relations: if $w_1 \prec w_2$, with $w_1, w_2 \in \Omega$ and $w_1 \neq w_2$, then $w_2$ succeeds $w_1$ in a bottom-up navigation of the DQEP, and not $(w_2 \prec w_1)$.
The optimization algorithm explores the search space of valid plans, in accordance with the data dependency restrictions. It considers all valid execution orders of the expensive operators permitted by the QG edges.
Slide N.:10
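
To make the notion of "valid execution orders" concrete, the sketch below enumerates every ordering of a set of operators that respects a set of dependency pairs. It is a brute-force illustration of the partial-order constraint, not the optimizer's actual search algorithm; the function name `valid_orders` and the operator names in the usage note are hypothetical.

```python
from itertools import permutations


def valid_orders(operators, precedes):
    """List every execution order consistent with the dependencies.

    operators: iterable of operator names.
    precedes:  set of (w1, w2) pairs meaning w1 must be evaluated before w2.
    """
    orders = []
    for perm in permutations(operators):
        position = {op: i for i, op in enumerate(perm)}
        # Keep the ordering only if every dependency is respected.
        if all(position[w1] < position[w2] for w1, w2 in precedes):
            orders.append(perm)
    return orders
```

With operators `["scan", "join", "tcp"]` and the single dependency `("scan", "tcp")`, three of the six permutations remain valid: those in which `scan` comes before `tcp`.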

DISTRIBUTED QUERY PROCESSING: ALTERNATIVES
(a) no parallelization
(b) scheduling according to the G2N (GridGreedyNode) algorithm
(c) adoption of the same parallelization strategy used by the previous operator in the query execution plan
A cost is associated with each computed query execution plan, using a parallel pipeline cost function. The DQEP with the lowest cost is selected for execution.
Slide N.:11

DISTRIBUTED QUERY PROCESSING: WHY
This strategy guarantees that costly programs are invoked only after all predicates have been evaluated, which can reduce the number of tuples they have to process.
Slide N.:12
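
The effect of evaluating predicates before costly programs can be shown with a small sketch. It assumes tuples are plain Python values, predicates are boolean functions, and the costly program is any callable; the name `apply_after_predicates` is hypothetical and does not come from CoDIMS-G.

```python
def apply_after_predicates(tuples, predicates, program):
    """Run the costly program only on tuples that pass every predicate."""
    survivors = [t for t in tuples if all(p(t) for p in predicates)]
    return [program(t) for t in survivors]


# Usage: count how often the costly program is actually invoked.
calls = []


def costly_program(t):
    calls.append(t)      # record each invocation
    return t * t


results = apply_after_predicates(
    range(10),
    [lambda t: t % 2 == 0, lambda t: t > 4],
    costly_program,
)
```

Here the program runs on two of the ten input tuples, because the other eight are discarded by the predicates first.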

IMPLEMENTATION: GridGreedyNode (G2N) algorithm
Input: the G2N algorithm receives a set of available nodes with their corresponding average throughput (tp1, tp2, …, tpn), measured in tuples per second, and the total estimated number of tasks (T) to be evaluated.
Initialization: the algorithm sorts the list of available grid nodes in decreasing order of their average throughput values. It then allocates all T tuples to the fastest node.
Loop: the loop moves tuples from the already allocated nodes to a new grid node. It continues as long as this produces a new evaluation estimate that reduces the query elapsed time. When the estimated elapsed time becomes higher than the last one computed, the algorithm stops and outputs the grid nodes accepted so far.
Output: the result is passed to the Query Optimizer, both for the initial query execution plan and for the re-scheduling of allocated nodes when the estimated values vary.

G2N (throughput(tp1, tp2, …, tpn), number-tasks): result
  nodelist := descending order(throughput);
  result := result ∪ {nodelist(1)};
  cost(1) := number-tasks * nodelist(1);
  current-cost := cost(1);
  While (nodes in the list and add-new-node)
    total-cost := current-cost;
    new-node := next-node in nodelist;
    While (current-cost <= total-cost)
      move tuples from lowest node in result to new-node;
      Update costs of nodes and total-cost;
      If current-cost > total-cost
        If we could move at least 1 tuple to the new-node
          result := result ∪ {new-node}
        else
          add-new-node := false;
        Stop loop;
    endwhile
  endwhile
  output result;
Slide N.:13
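
The greedy idea behind G2N can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a node's cost is taken to be its estimated elapsed time (allocated tuples divided by throughput), tuples are moved one at a time from whichever accepted node currently finishes last, and communication or deployment costs are ignored. The function name `g2n` and its return format are hypothetical.

```python
def g2n(throughputs, number_tasks):
    """Greedy node selection sketch.

    throughputs:  tuples per second for each available node (all > 0).
    number_tasks: total number of tuples to evaluate.
    Returns {node_index: allocated_tuples} for the accepted nodes.
    """
    # Assumption: cost of a node = estimated elapsed time in seconds.
    def elapsed(node, tuples):
        return tuples / throughputs[node]

    # Sort node indices by decreasing throughput.
    nodelist = sorted(range(len(throughputs)),
                      key=lambda i: throughputs[i], reverse=True)

    # Start with every tuple on the fastest node.
    result = {nodelist[0]: number_tasks}

    for new_node in nodelist[1:]:
        trial = dict(result)
        trial[new_node] = 0
        moved = 0
        while True:
            # The accepted node that finishes last bounds the elapsed time.
            donor = max(result, key=lambda n: elapsed(n, trial[n]))
            # Stop once one more tuple would make the new node the bottleneck.
            if elapsed(new_node, trial[new_node] + 1) >= elapsed(donor, trial[donor]):
                break
            trial[donor] -= 1
            trial[new_node] += 1
            moved += 1
        if moved == 0:
            break            # not even one tuple could be moved: stop
        result = trial       # accept the new node and its allocation
    return result
```

For example, `g2n([10, 5], 30)` spreads the 30 tuples over both nodes so that they finish at the same time, whereas `g2n([10, 1], 5)` keeps everything on the fast node because handing even one tuple to the slow node would lengthen the estimated elapsed time.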

ADAPTIVE QUERY EXECUTION - QEEF
Query Execution Engines (QEE) support the execution of traditional queries.
QEEF (Query Execution Engine Framework) is an extensible QEE that can be adapted to new execution models; each execution model is implemented as a combination of execution modules.
Slide N.:14

ADAPTIVE QUERY EXECUTION - QEEF: SIMULATION
[Figure: simulation diagram with Eddy, MERGE, SPLIT, and SEND/RECEIVE modules]
Slide N.:15

ADAPTIVE QUERY EXECUTION - QEEF: ANALYSIS ON BLOCK SIZE
Block size is an important tool for building adaptivity into the system. The Eddy modifies a remote node's block size in the following scenarios:
1- Timeout (based on the estimated time)
2- The Eddy performs a local adaptation (checking current throughput values)
3- Variations in the scheduled nodes
4- When 2/3 of the tuples have been evaluated:
   - the dataflow is reduced
   - the Eddy recomputes the number of scheduled nodes
   - the number of tuples in each node is increased
Slide N.:16
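
The four scenarios can be read as a simple decision function. The sketch below is only an illustration of that reading: the function name, its parameters, the order in which the conditions are checked, and the returned labels are all assumptions; only the scenarios themselves and the 2/3 threshold come from the slide.

```python
def block_resize_reason(elapsed_s, timeout_s, throughput_changed,
                        nodes_changed, evaluated, total):
    """Return why the block size should be revised, or None if it should not."""
    if elapsed_s > timeout_s:
        return "timeout"                     # scenario 1
    if throughput_changed:
        return "local adaptation"            # scenario 2
    if nodes_changed:
        return "scheduled nodes changed"     # scenario 3
    # Scenario 4: at least 2/3 of the tuples evaluated (integer arithmetic).
    if total > 0 and evaluated * 3 >= total * 2:
        return "two thirds evaluated"
    return None
```

With 700 of 1000 tuples evaluated and no other trigger, the function reports the two-thirds scenario; with 600 of 1000 it reports nothing.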

SCIENTIFIC APPLICATIONS
The QEEF framework has been extended with:
- execution of user programs (Apply operator strategy)
- spatial and temporal hash-joins (implementing the iterator interface)
- loop control over a query execution plan fragment (evaluated repeatedly)
Slide N.:17

SCIENTIFIC APPLICATIONS: INITIAL RESULT
Project configuration:
- Java 1.4.2 and Globus 3.2.1
- 20 Pentium IV 1.7 GHz processors with 256 MB of RAM, running Linux 2.4.20-31.9
Workload: an instance with 1000 particles, executing 25 iterations for each particle.
Runs: the number of nodes was increased from 1 to 25.
Results: a gain of up to 11 times with 20 machines with respect to a centralized execution (with 2.7 tuples per second).
Block size update strategy: found to be very useful.
Slide N.:18
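
As a quick check of what these figures imply, the snippet below derives the workload size and the parallel efficiency from the numbers on the slide. The variable names are illustrative, and "efficiency" here means the usual speedup divided by the number of machines, a measure the slide itself does not report.

```python
particles = 1000
iterations_per_particle = 25
machines = 20
speedup = 11.0            # reported gain over centralized execution

# Total number of particle iterations in the experiment.
total_iterations = particles * iterations_per_particle

# Parallel efficiency: fraction of the ideal 20x speedup actually achieved.
efficiency = speedup / machines

print(total_iterations, efficiency)
```

Under this reading, the 20 machines deliver a little over half of the ideal linear speedup.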

CONCLUSION
CoDIMS-G is an adaptive distributed query processing grid service. The proposed query execution strategy extends the eddy adaptive query execution model to the grid environment, taking into account the variations in the run-time conditions of grid nodes.
Slide N.:19

"The information-technology industry loves to prove the impossible possible" (Paul Horn, senior vice president, IBM Research)
Thank you!
Slide N.:20
