query evaluation techniques for larger databases n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Query Evaluation Techniques for Larger Databases** PowerPoint Presentation
Download Presentation
Query Evaluation Techniques for Larger Databases**

Loading in 2 Seconds...

play fullscreen
1 / 65

Query Evaluation Techniques for Larger Databases** - PowerPoint PPT Presentation


  • 124 Views
  • Uploaded on

Query Evaluation Techniques for Larger Databases**. By Goetz Graefe Elaborado por: Edwin Andrés Bernal López Claudia Jeanneth Becerra Cortés Curso: Tópicos Avanzados de Bases de Datos Bogotá, Marzo 23 del 2006.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Query Evaluation Techniques for Larger Databases**' - nalanie


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
query evaluation techniques for larger databases

Query Evaluation Techniques for Larger Databases**

By Goetz Graefe

Elaborado por: Edwin Andrés Bernal López

Claudia Jeanneth Becerra Cortés

Curso: Tópicos Avanzados de Bases de Datos

Bogotá, Marzo 23 del 2006

**Portland State University, Computer Science Department, P. O. Box751, Portland, Oregon 97207-0751, Received January 1992, final revision accepted February 1993, Published ACM Computing Surveys, Vol. 25, No 2, June 1993.

slide2
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide3
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide4
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide5
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide6
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide7
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide8
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
slide9
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
estado del arte en query processing 93
Estado del Arte en Query Processing/93

Bulletin of the Technical Committee on

Data Engineering

December, 1993 Vol. 16 No. 4 IEEE Computer Society

Special Issue on Query Processing in Commercial Database Systems

Letter from the Special Issue Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Goetz Graefe

Query Optimization in the IBM DB2 Family . . . . . . . . . . . . . . . . . . Peter Gassner,and Guy Lohman

Query Processing in the IBM Application System 400. . . . . . . . .Richard L. Cole, Mark J. Anderson

Query Processing in NonStop SQL . . A. Chen, Y-F Kao, M. Pong, D. Shak, S. Sharma, J. Vaishnav

Query Processing in DEC Rdb: Major Issues and Future Challenges . . . . . Gennady Antoshenkov

Letter from the Editor-in-Chief

“… Goetz Graefe, our issue editor, has succeeded in overcoming these difficulties. He has collected four papers from prominent database vendors. These papers introduce us to the inside world of ”real” query processing”

Letter from the Special Issue Editor

“…Second, in some aspects of query processing, the industrial reality has bypassed academic research. By asking leaders in the industrial field to summarize their work, I hope that this issue is a snapshot of the current state of the art. Undoubtedly, some researchers will find inspirations for new, relevant work of their own in these articles.”

slide14
Lista de Publicaciones de Goetz Graefehttp://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
tabla de contenido del paper 1a pte
Tabla de Contenido del Paper (1a. Pte)
  • INTRODUCTION
  • ARCHITECTURE OF QUERY EXECUTION ENGINES
  • SORTING AND HASHING
  • 2.1 Sorting
  • 2.2.Hashing
  • 3. DISK ACCESS
  • 3.1 File Scans
  • 3.2 Associative Access Using Indices
  • 3.3. Buffer Management
  • 4. AGGREGATION AND DUPLICATE REMOVAL
  • 4.1 Aggregation Algorithm Based on Nested Loops
  • 4.2 Aggregation Algorithms Based on Sortlng
  • 4.3. Aggregation Algorithms Based on Hashing
  • 4.4. A Rough Performance Comparison
  • 4.5. Additional Remarks on Aggregation
  • 5. BINARY MATCHING OPERATIONS
  • 5.1. Nested-Loops Join Algorithms
  • 5.2. Merge-Join Algorithms
  • 5.3. Hash Join Algorithms
  • 5.4. Pointer-Based Joins
  • 5.5. Rough Performance Comparison
  • 6. UNIVERSAL QUANTIFICATION
  • 7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
tabla de contenido del paper 2a pte
Tabla de Contenido del Paper (2a. Pte)
  • 8. EXECUTION OF COMPLEX QUERY PLANS
  • 9. MECHANISMS FOR PARALLEL QUERY EXECUTION
  • 9.1. Parallel versus Distributed Database Systems
  • 9.2 Forms of Parallelism
  • 9.3. Implementation Strategies
  • 9.4. Load Balancing and Skew
  • 9.5. Architectures and Architecture Independence
  • PARALLEL ALGORITHMS
  • 10.1 Parallel Selections and Updates
  • 10.2. Parallel Sorting
  • 10.3. Parallel Aggregation and Duplicate Removal
  • 10.4. Parallel Joins and Other Binary Matching Operations
  • 10.5. Parallel Universal Quantification
  • 11. NON STANDARD QUERY PROCESSING ALGORITHMS
  • 11.1. Nested Relations
  • 11.2. Temporal and Scientific Database Management
  • 11.3. Object-oriented Database Systems
  • 11.4. More Control Operators
  • 12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
  • 12.1 . Precomputatlon and Derived Data
  • 12.2. Data Compression
  • 12.3. Surrogate Processing
  • 12.4. Bit Vector Filtering
  • 12.5. Specialized Hardware
  • SUMMARY AND OUTLOOK
slide25

Access Path

Algorithm + data structure used to locate rows satisfying some condition

  • File scan: can be used for any condition
  • Hash: equality search; all search key attributes of hash index are specified in condition
  • B+ tree: equality or range search; a prefix of the search key attributes are specified in condition
  • Binary search: Relation sorted on a sequence of attributes and some prefix of sequence is specified in condition
general external merge sort
General External Merge Sort
  • To sort a file with N pages using B buffer pages:
    • Pass 0: use B buffer pages. Produce sorted runs of B pages each.
    • Pass 2, …, etc.: merge B-1 runs.
cost of external merge sort
Cost of External Merge Sort
  • Number of passes:
  • Cost = 2N * (# of passes)
  • E.g., with 5 buffer pages, to sort 108 page file:
    • Pass 0: = 22 sorted runs of 5 pages each (last run is only 3 pages)
    • Pass 1: = 6 sorted runs of 20 pages each (last run is only 8 pages)
    • Pass 2: 2 sorted runs, 80 pages and 28 pages
    • Pass 3: Sorted file of 108 pages
double buffering

INPUT 1

INPUT 1'

INPUT 2

OUTPUT

INPUT 2'

OUTPUT'

b

block size

Disk

INPUT k

Disk

INPUT k'

B main memory buffers, k-way merge

Double Buffering

To reduce wait time for I/O request to complete, can prefetch into `shadow block’.

  • Potentially, more passes; in practice, most files still sorted in 2-3 passes.
sorting records
Sorting Records!
  • Sorting has become a blood sport!
    • Parallel sorting is the name of the game ...
  • Datamation: Sort 1M records of size 100 bytes
    • Typical DBMS: 15 minutes
    • World record: 3.5 seconds
      • 12-CPU SGI machine, 96 disks, 2GB of RAM
  • New benchmarks proposed:
    • Minute Sort: How many can you sort in 1 minute?
    • Dollar Sort: How many can you sort for $1.00?
using b trees for sorting
Using B+ Trees for Sorting
  • Scenario: Table to be sorted has B+ tree index on sorting column(s).
  • Idea: Can retrieve records in order by traversing leaf pages.
  • Is this a good idea?
  • Cases to consider:
    • B+ tree is clusteredGood idea!
    • B+ tree is not clusteredCould be a very bad idea!
external sorting vs unclustered index
External Sorting vs. Unclustered Index
  • p: # of records per page
  • B=1,000 and block size=32 for sorting
  • p=100 is the more realistic value.
relational operations
Relational Operations
  • We will consider how to implement:
    • Selection ( ) Selects a subset of rows from relation.
    • Projection ( ) Deletes unwanted columns from relation.
    • Join ( ) Allows us to combine two relations.
    • Set-difference ( ) Tuples in reln. 1, but not in reln. 2.
    • Union ( ) Tuples in reln. 1 and in reln. 2.
    • Aggregation (SUM, MIN, etc.) and GROUP BY
  • Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.
access path
Access Path

Algorithm + data structure used to locate rows satisfying some condition

  • File scan: can be used for any condition
  • Hash: equality search; all search key attributes of hash index are specified in condition
  • B+ tree: equality or range search; a prefix of the search key attributes are specified in condition
  • Binary search: Relation sorted on a sequence of attributes and some prefix of sequence is specified in condition
access paths
Access Paths
  • A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.
    • E.g., Tree index on <a, b, c> matches the selectiona=5 AND b=3, and a=5 AND b>6, but notb=3.
  • A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index.
    • E.g., Hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not matchb=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.
access paths supported by b tree
Access Paths Supported by B+ tree

Example: Given a B+ tree whose search key is the sequence of attributes a2, a1, a3, a4

  • Access path for search a1>5  a2=3.0  a3=‘x’ (R): find first entry having a2=3.0  a1>5  a3=‘x’ and scan leaves from there until entry having a2>3.0 . Select satisfying entries
  • Access path for search  a2=3.0  a3 >‘x’ (R): locate first entry having a2=3.0 and scan leaves until entry having a2>3.0 . Select satisfying entries
  • No access path for search  a1>5  a3 =‘x’ (R)
choosing an access path
Choosing an Access Path
  • Selectivity of an access path refers to its cost
    • Higher selectivity means lower cost (#pages)
  • If several access paths cover a query, DBMS should choose the one with greatest selectivity
  • Size of domain of attribute is a measure of the selectivity of domain
  • Example:  CrsCode=‘CS305’  Grade=‘B’ - a B+ tree with search key CrsCode is more selective than a B+ tree with search key Grade
computing selection
Computing Selection

condition: (attr op value)

  • No index on attr:
    • If rows unsorted, cost = F
      • Scan all data pages to find rows satisfying the condition
    • If rows sorted on attr, cost = log2F + (cost of scan)
      • Use binary search to locate first data page containing row in which (attr = value)
      • Scan further to get all rows satisfying (attr op value)
computing selection1
Computing Selection

condition: (attr op value)

  • B+ tree index on attr (for equality or range search):
    • Locate first index entry corresponding to a row in which (attr = value); cost = depth of tree
    • Clustered index - rows satisfying condition packed in sequence in successive data pages; scan those pages; cost depends on number of qualifying rows
    • Unclustered index - index entries with pointers to rows satisfying condition packed in sequence in successive index pages; scan entries and sort pointers to identify table data pages with qualifying rows, each page (with at least one such row) fetched once
unclustered b tree index
Unclustered B+ Tree Index

Index entries

satisfying

condition

data page

Data File

B+ Tree

computing selection2
Computing Selection
  • Hash index on attr (for equality search only):
    • Hash on value; cost  1.2 (to account for possible overflow chain) to search the (unique) bucket containing all index entries or rows satisfying condition
      • Unclustered index - sort pointers in index entries to identify data pages with qualifying rows, each page (containing at least one such row) fetched once

condition: (attr = value)

complex selections
Complex Selections
  • Conjunctions: a1 =x a2 <y  a3=z (R)
    • Use most selective access path
    • Use multiple access paths
  • Disjunction:  (a1 =x or a2 <y) and (a3=z) (R)
    • DNS (disjunctive normal form)
    • (a1 =x  a3 =z) or (a2 < y  a3=z)
    • Use file scan if one disjunct requires file scan
    • If better access path exist, and combined selectivity is better than file scan, use the better access paths, else use a file scan
two approaches to general selections
Two Approaches to General Selections
  • First approach:Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index:
    • Most selective access path: An index or file scan that we estimate will require the fewest page I/Os.
    • Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.
    • Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.
intersection of rids
Intersection of Rids
  • Second approach(if we have 2 or more matching indexes that use Alternatives (2) or (3) for data entries):
    • Get sets of rids of data records using each matching index.
    • Then intersect these sets of rids
    • Retrieve the records and apply any remaining terms.
    • Consider day<8/9/94 AND bid=5 AND sid=3. If we have a B+ tree index on day and an index on sid, both using Alternative (2), we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying sid=3 using the second, intersect, retrieve records and check bid=5.
agregaci n remoci n de duplicados
Agregación, Remoción De Duplicados

La idea de la agregación es representar un grupo de items mediante un solo valor o clasificar items en grupos y determinar un valor por cada grupo.

  • Agregación Escalar
  • Agregación por funciones
agregaci n remoci n de duplicados 3
Agregación, Remoción De Duplicados (3)
  • Algoritmo de Agregación basado en Ordenamiento
  • El ordenamiento permite agrupar items con características similares así se hace mucho más sencillo hacer la remoción de datos duplicados.
agregaci n remoci n de duplicados 4
Agregación, Remoción De Duplicados (4)
  • Algoritmo de Agregación basado en Ordenamiento
    • La cantidad de datos de entada y salida calculada para éste algoritmo es la siguiente:
    • Donde 2 es el factor para considerar lectura y escritura, R es el tamaño de la entrada, L1 es el número de niveles que no se han visto afectados, O es el tamaño de la salida Y W es el número estimado de ejecuciones del algoritmo
agregaci n remoci n de duplicados 5
Agregación, Remoción De Duplicados (5)
  • Algoritmo Basado en Hashing
    • La idea general es realizar particiones de los datos que se están analizando
    • Se genera una tabla que contiene esencialmente items de salida.
    • La cantidad de entradas o salidas para la agregación depende del número de niveles necesarios.
agregaci n remoci n de duplicados 6
Agregación, Remoción De Duplicados (6)
  • Algoritmo Basado en Hashing(2)

2 X (R (L + 1) – FL X (M – [(R’ /G – M)/(M – C)] X C X G )

Dónde L es el nivel de recursividad, R es el tamaño de entradas de archivos, K es el número de archivos de partición, F es el Fan out, M es el tamaño de archivos para llegar al desbordamiento de memoria.

agregaci n remoci n de duplicados 7
Agregación, Remoción De Duplicados (7)
  • Gráfica comparación de los algoritmos
operaciones binarias para matching
Operaciones Binarias Para “Matching”

De la misma manera en que los procesos de eliminación y agregación son importantes en grupos de datos de tamaño considerable, es deseable también poder cotejar la información, ésta es la función principal del “matching” establecer estas relaciones existentes. Para tal fin se hace uso principalmente del join de las siguientes maneras.

operaciones binarias para matching2
Operaciones Binarias Para “Matching”
  • Algoritmos “Join” Basados en “Loops” anidados
    • Es el algoritmo más simple.
    • Para cada entrada seleccionada hace una búsqueda completa en el resto de los datos, para de ésta manera encontrar los “matches”.
    • Se requiere un archivo temporal de la entrada que esta siendo escaneada.
    • Obviamente es un algoritmo con poco rendimiento para grupos de datos muy grandes
operaciones binarias para matching3
Operaciones Binarias Para “Matching”
  • Algoritmos “Merge-Join”
    • Este algoritmo requiere que las entradas estén previamente ordenadas para obtener los resultados; el procedimiento es similar al que previamente se revisó.
    • Al estar las entradas ordenadas el algoritmo no requiere de memoria adicional excepto cuando el valor total de los paquetes es mayor que el tamaño de la memoria
operaciones binarias para matching4
Operaciones Binarias Para “Matching”
  • Algoritmos “Merge-Join
  • Se puede realizar una combinación entre los el anterior algoritmo y el presente para optimizar los resultados.
  • Dado que los algoritmos anteriores necesitan de cierta cantidad de memoria es conveniente realizar una asignación así
  • W = R / (2 X M) +1 , dónde R es el tamaño de la entrada, M es la cantidad de memoria necesaria y las otras dos son constantes de lectura y escritura.
operaciones binarias para matching5
Operaciones Binarias Para “Matching”
  • Algoritmos “Hash Join”
  • Estos algoritmos se desarrollan partiendo de la idea básica de realizar la tabla “hash” y de probar ésta tabla usando los items de otra entrada.
  • Este algoritmo presenta características en contra como el constante desbordamiento de memoria pero se han realizado varias investigaciones en éste entorno para mejorar las soluciones dadas.
operaciones binarias para matching6
Operaciones Binarias Para “Matching”
  • La partición realizada por el algoritmo se podría interpretar de la siguiente manera