1 / 45

Adaptive Data Structures

Adaptive Data Structures. Towards Declarative Queries on. Simon Zeltser. Based on the article by Nicolas Bruno and Pablo Castro. Contents. 1. Introduction. 2. LINQ on Rich Data Structures. 3. LINQ Query Optimization. 4. Conclusions and Discussion. Introduction. THE PROBLEM

aaron
Download Presentation

Adaptive Data Structures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Adaptive Data Structures Towards Declarative Queries on Simon Zeltser Based on the article by Nicolas Bruno and Pablo Castro

  2. Contents 1 Introduction 2 LINQ on Rich Data Structures 3 LINQ Query Optimization 4 Conclusions and Discussion Technion

  3. Introduction • THE PROBLEM • There is an increasing number of applications that need to manage data outside the DBMS • Need for a solution to simplify the interaction between objects and data sources • Current solutions lack rich declarative query mechanism • THE NEED • Unified way to query various data sources • THE SOLUTION • LINQ (Language Integrated Query) Technion

  4. Introduction • LINQ : Microsoft.NET 3.5 Solution • Accessing multiple data sources via the same API • Technology integrated into the programming language • Supports operations: • Traversal – grouping, joins • Filter – which rows • Projection –which columns vargraduates =fromstudent in students wherestudent.Degree = “Graduate” orderbystudent.Name, student.Gender, student.Age selectstudent; • BUT… The default implementation is simplistic • Appropriate for small ad-hoc structures in memory Technion

  5. Introduction • THE GOAL OF THIS SESSION • Introduce LINQ key principles • Show model of customization of LINQ’s Execution Model on Rich Data Structures • Evaluate the results Technion

  6. LINQ – High Level Architecture C# 3.0 Visual Basic Other Languages… .NET Language Integrated Query (LINQ) LINQ – Enabled Data Sources LINQ To Objects LINQ To Entities LINQ To XML LINQ To Datasets LINQ To SQL <xml > Objects Databases XML Technion

  7. Compare two approaches • Iteration List<String>matches = new List<String>(); // Find the matches foreach(stringitemindata) { if (item.StartsWith("Eric")){ matches.Add(item); } } // Sort the matches matches.Sort(); // Print out the matches foreach(stringiteminmatches) } Console.WriteLine(item); { • LINQ // Find and sort matches varmatches = fromn in data wheren.StartsWith("Eric") orderbyn selectn; // Print out the matches foreach(varmatchinmatches) } Console.WriteLine(match); { Technion

  8. Language Integration Lambda Expressions Function intStringLength(String s) { returns.Length(); { Lambda Expression s => s.Length(); Query Syntax var matches = from n in data wheren.StartsWith("Eric") orderby n select n; var matches = data .Where(n => n.StartsWith("Eric")) .OrderBy(n => n) .Select(n => n) Extension Methods publicstaticIEnumerable<TSource> Where<TSource> (thisIEnumerable<TSource> source, Func<TSource, bool> predicate) var name = "Eric"; var age = 43; var person = new { Name = "Eric", Age = 43 }; var names = new [] {"Eric", "Ryan", "Paul" }; foreach (var item in names) Anonymous Types Technion

  9. LINQ - Example Lambda Expressions // Retrieve all CS students with more // than 105 points var query = from stud in students where ( stud.Faculty == “CS” && stud.Points > 105) orderbystud.Pointsdescending selectnew { Details = stud.Name + “:” + stud.Phone }; // Iterate over results foreach(var student in query) { Console.WriteLine(student.Details); } Query Syntax Extension Methods Anonymous Types Technion

  10. Customizing LINQ Execution Model EXPRESSION TREES • LINQ represents queries as in-memory abstract syntax tree • Query description and implementation are not tied together THE PROBLEM • The default implementation of the operations uses fixed, general purpose algorithms SUGGESTED SOLUTION • Change how the query is executed without changing how it’s expressed • Analyze alternative implementations of a given query and dynamically choose the most appropriate version depending on the context. * + 1 5 7 Technion

  11. Customizing LINQ Execution Model (2) PROBLEM EXAMPLE WHERE operator is implemented by performing a sequential scan over the input and evaluating the selection predicateon each tuple! 1 int[] A = {1, 2, 3, 10, 20, 30}; var q = from x in A where x < 5 select 2*x; foreach(intiin q) Console.WriteLine(i); var q = A.Where(x=>x<5) .Select(x=>2*x); 2 IEnumerable<int> q = Enumerable.Project( Enumerable.Where(A, AF1), AF2); bool AF1(int x) { return x<5; } int AF2(int x) { return 2*x; } 3 Query Implementation: IEnumerable<int> res = new List<int>(); foreach(int a in A) if (AF1(a)) res.Add(AF2(a)); return res; Technion

  12. ForeignKeyConstraint Rich Data Structures - DataSet • In-memory cache of data • Typically populated from a database • Supports indexing of DataColumns via DataViews DataSet object DataTable object DataTable object DataRow Unique Constraint Data Column Unique Constraint • We will use LINQ on DataSet for demonstrating query optimization techniques Technion

  13. LINQ on Rich Data Structures • Enable LINQ to work over DataSets. EXAMPLE • Given R and S – two DataTables from r in R.AsEnumerable() join s in S.AsEnumerable() on r.Field<int>(“x”) equals s.Field<int>(“y”) selectnew { a = r.Field<int>(“a”), b = s.Field<int>(“b”) }; Compile and run-time phases on an implementation of our prototype LINQ on DataSet Standard C# Code Interm. Language Expression Tree Optimized Expression Tree Interm. Language Compile Time Run Time Self-tuning State DataSet Technion

  14. Expression Tree Optimizer Our solution will be built according to the following architecture Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion

  15. Query Cost Estimator Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion

  16. Query Estimation - Cost Model • Follow traditional database approach: • COST: {execution plans} -> [expected execution time] • Relies on: • a set of statistics maintained in DataTablesfor some of its columns • formulas to estimate selectivity of predicates and cardinality of sub-plans • formulas to estimate the expected costs of query execution for every operator Technion

  17. Cardinality Estimation • Returns an approximate number of rows that each operator in a query plan would output • To reduce the overhead, we will use only these statistical estimators: • maxVal – maximum number of distinct values • minVal– minimum number of distinct values • dVal – number of distinct values in a column • If statistics are unavailable, rely on “magic numbers” until automatically creation of statistics Technion

  18. Predicate Selectivity Estimation • Let: σp(T ) be an arbitrary expression. • The cardinality of T is defined: Card(σp(T )) =sel(p)· • Under this definition we define: • COSTT(Execution Plan) = Σ (COST(p)) • EXAMPLE:Consider full table scan of table T): COST(T) = Card(T) * MEM_ACCESS_COST For each p in {operators of T} Average Cost Of Memory Access Technion

  19. Predicate Selectivity Estimation Intuition: We model sel(co<=c<=c1) as the probability to get a “c” value in interval [c0, c1] among all possible “c” values c1 maxVal(c) c • Let: σp(T ) be an arbitrary expression. • The cardinality of T is defined: Card(σp(T )) =sel(p)· • Under this definition we define: • COSTT(Execution Plan) = Σ (COST(p)) • EXAMPLE:Consider full table scan of table T: COST(T) = Card(T) * MEM_ACCESS_COST c0 minVal(c) For each p in {operators of T} Average Cost Of Memory Access Technion

  20. Predicate Selectivity Estimation Consider now a join predicate: T1 c1=c2T2 • Card(T1 c1=c2 T2)= • Let: σp(T ) be an arbitrary expression. • The cardinality of T is defined: Card(σp(T )) =sel(p)· • Under this definition we define: • COSTT(Execution Plan) = Σ (COST(p)) • EXAMPLE:Consider full table scan of table T): COST(T) = Card(T) * MEM_ACCESS_COST For each p in {operators of T} Average Cost Of Memory Access Technion

  21. Query Analyzer Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion

  22. Execution Alternatives • Rely on indexes on DataColumns when possible • Example: σa=7∧(b+c)<20 5 Index on “a” column 3 7 Alternative 1: Alternative 2: 2 4 Full Table Scan a=7 b+c < 20 Technion

  23. Analyzing Execution Plans • Global vs. Local Execution Plan – EXAMPLE: • Local Execution Plan • HashJoin? • IndexJoin? • MergeJoin? Global Execution Plan Technion

  24. Enumeration Architecture • Two phases: • First phase: Join reordering based on estimated cardinalities • Second phase: Choose the best physical implementation for each operator • EXAMPLE:Suppose we analyze JOIN operator. • We evaluate the following JOIN implementations: • Hash Join • Merge Join (inputs must be sorted in the join columns) • Index Join (index on the inner join column must be available) • Other possible calculation options • Choose the alternative with the smallest cost Technion

  25. Query Analysis Self Tuning Organizer Query Analyzer Index Reorganizer Oscillation Manager Cost Model Query Cost Estimator Statistics Manager Technion

  26. Self Tuning Organization • We want to reach the smallest query execution time. • Indexes can be used to speedup query execution • PROBLEM: • It might become problematic to forecast in advance what indexes to build for optimum performance • SOLUTION: • Continuous monitoring/tuning component that addresses the challenge of choosing and building adequate indexes and statistics automatically Technion

  27. Self Tuning Organization - Example • Consider the following execution plan: • The selection predicate Name=“Pam” over CustomersDataTable can be improved if an index on Customers(Name) is built • Both hash joins can be improved if indexes I2 and I3 are available, since we can transform hash join into index join * The three sub-plans enclosed in dotted lines might be improved if suitable indexes were present Technion

  28. Algorithm for automatic index tuning Technion

  29. Index Tuning • High-Level Description: • Identify a good set of candidate indexesthat would improve performance if they were available. • Later, when the optimized queries are evaluated, we aggregate the relative benefits of both candidate and existing indexes. • Based on this information, we periodically trigger index creations or deletions, taking into account storage constraints, overall utility of the resulting indexes, and the cost to creating and maintaining them. Technion

  30. Algorithm for automatic index tuning Technion

  31. Index tuning algorithm • Notation: • H – a set of candidate indexes to materialize • T – task set for query qi • Ii – either a candidate or an existing index • δIi – amount that I would speed up query q H (initially empty) Task Set I1, δI1 I2, δI2 . . … In, δIn Technion

  32. Algorithm for automatic index tuning Technion

  33. Index tuning algorithm • Notation: • ΔI – value maintained for each index I • Materialized index – already created one • SELECT query: ΔI = ΔI + δI • UPDATE query: ΔI = ΔI – δI H I1 Task Set I1, δI1 I1, δI1 I2, δI2 . . … In, δIn Technion

  34. Index Tuning Algorithm • The purpose of ΔI: • We maintain ΔI on every query evaluation • If the potential aggregated benefit of materializing a candidate index exceeds its creation cost, we should create it, since we gathered enough evidence that the index is useful Technion

  35. Algorithm for automatic index tuning Technion

  36. Index tuning algorithm • Remove “bad” indexes phase • Notation: • Δmin– minimum Δ value for index I • Δmax– maximum Δ value for index I • BI – the cost of creating index I • Residual(I)= BI – (Δmax – Δ) (the “slack” an index has before being deemed “droppable”) IF (Residual(I)) <= 0) THEN Drop(I) • Net-Benefit(I) = (Δ-Δmin)-BI (the benefit from creating the index) IF (Net-Benefit(I) >= 0) THEN Add(I) Technion

  37. Algorithm for automatic index tuning Technion

  38. Index tuning algorithm • Notation: • ITM – all the indexes from H which creation is cost effective • ITD – subset of existing indexes such that: • ITD fits in existing memory • It’s still cost effective to create new index I after possibly dropping members from ITD • If creating index I is more effective than maintaining existing indexes in ITD, DROP(ITD) && CREATE(I) • Remove I from H (set of candidate indexes to materialize) Technion

  39. Experimental Evaluation Consider the following schema: checkCarts($1) = from p inProducts.AsEnumerable() join cart inCarts.AsEnumerable() onp.Field<int>("id") equalscart.Field<int>("p_id") join c inCustomers.AsEnumerable() oncart.Field<int>("cu_id") equalsc.Field<int>("id") where c.name = $1 selectnew { cart, p } browseProducts($1) = from p inProducts.AsEnumerable() join c inCategories.AsEnumerable() onp.Field<int>("ca_id") equals c.Field<int>("id") where c.par id = $1 select p • Generated: • 200,000 products • 50,000 customers • 1,000 categories • 5,000 items in the shopping carts Possible Indexes I1 Categories(par_id) I2 Products(c_id) I3 Carts(cu_id) I4 Products(ca_id) I5 Customers(name) Technion

  40. Execution plans for evaluation queries Technion

  41. Experimental Evaluation – Cont. Generated schedule when tuning was disabled Technion

  42. Experimental Evaluation – Cont. Generated schedule when tuning was enabled Technion

  43. Summary • We’ve discussed: • LINQ – for declarative query formulation • DataSet - a uniform way of representing in-memory data. • A lightweight optimizer for automatically adjusting query execution strategies • Article’s main contribution: • NOT a new query processing technique • BUT: careful engineering of traditional database concepts in a new context Technion

  44. Thank You ! Simon Zeltser

  45. LINQ Execution Model • Compiler merges LINQ extension methods • Query syntax is converted to function calls and lambda expressions • Lambda expressions are converted to expression trees • Adds query operations to IEnumerable<T> • Expressions are evaluated at run-time • Parsed and type checked at compile-time • At compile time • Compiler finds a query pattern • Query is executed lazily • Compiler infers types produced by queries • Datasets are strongly typed • Operations on • data sets are strongly typed • Specialized or base • Can optimize and • re-write query • Expressions and operations • can execute remotely • At run-time, when results are used • We can force evaluations (ToArray()) Technion

More Related