File processing query processing
This presentation is the property of its rightful owner.
Sponsored Links
1 / 30

File Processing : Query Processing PowerPoint PPT Presentation


  • 95 Views
  • Uploaded on
  • Presentation posted in: General

File Processing : Query Processing. 2014, Spring Pusan National University Ki-Joune Li. Basic Concepts of Query. Query Retrieve records satisfying predicates Types of Query Operators Aggregate Query Sorting. Predicate. Select. Relational Operators : Select.

Download Presentation

File Processing : Query Processing

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


File processing query processing

File Processing : Query Processing

2014, Spring

Pusan National University

Ki-Joune Li


Basic concepts of query

Basic Concepts of Query

  • Query

    • Retrieve records satisfying predicates

  • Types of Query

    • Operators

    • Aggregate Query

    • Sorting


Relational operators select

Predicate

Select

Relational Operators : Select

  • Selection (condition)

    • Retrieve records satisfying predicates

    • Example

      • Find Student where Student.Score > 3.5

      • score>3.5(Student)

    • Index or Hash


Relational operators project

Interesting attributes to get

Extract

Relational Operators : Project

  • Project (attributes)

    • Extract interesting attributes

    • Example

      • Find Student.name where score > 3.5

      • name(acore>3.5(Student))

    • Full Scan


Cartesian product

r11

r21

r11

r22

r11

r2n

R1

R2

r12

r21

r11

r21

r12

r22

r12

r22

r12

r2n

r1m

r2n

r1m

r21

r1m

r22

r1m

r2n

Cartesian Product

  • Cartesian Product ()

    • Two Tables : R1 R2

    • Produce all cross products

  • Join ( )

=


File processing query processing

Join

  • Join ( )

    • Select combined records of cartesian product with same value of a common attribute (Natural Join)

    • Example

      Student (StudentName, AdvisorProfessorID, Department, Score)

      Professor(ProfessorName, ProfessorID, Department)

      Student AdivsorProfessorID=ProfessorID Professor

      =  AdivsorProfessorID=ProfessorID(Student Professor)

    • Double Scan : Expensive Operation


Relational algebra

Relational Algebra

  • Relational Algebra

    • Operand : Table (Relation)

    • Operator : Relational Operator (, , , etc)

    • Example: SQL and relational algebra

      • find Student.Name from Student, Professorwhere Student.Score > 3.5 and Student.AdvisorProfessorID=Professor.ID and Professor.Department=‘CSE’

        student.name(score>3.5(Student) Department=‘CSE’ (Professor) )

    • Relational Algebra Specifies the sequence of operations


Query processing mechanism

Query Processing Mechanism

  • Query Processing Steps

    1. Parsing and translation

    2. Optimization

    3. Evaluation


Parsing and translation

Parsing and Translation

  • Parsing Query Statement (e.g. in SQL)

  • Translation into relational algebra

  • Equivalent Expression

    • For a same query statement

      several relation algebraic expressions are possible

    • Example

      • name(balance  2500(account )) name (balance 2500(name, balance(account)))

    • Different execution schedules

  • Query Execution Plan (QEP)

    • Determined by relational algebra

    • Several QEPs may be produced by Parsing and Translation


Query optimization

Query Optimization

  • Choose ONE QEP among QEPs based on

    • Execution Cost of each QEP, where cost means execution time

  • How to find cost of each QEP ?

    • Real Execution

      • Exact but Not Feasible

    • Cost Estimation

      • Types of Operations

      • Number of Records

      • Selectivity

      • Distribution of data


Cost model basic concepts

Cost Model : Basic Concepts

  • Cost Model : Number of Block Accesses

  • Cost

    C = Cindex + Cdata

    where Cindex : Cost for Index Access

    Cdata : Cost for Data Block Retrieval

    • Cindex vs. Cdata ?

      • Cindex: depends on index

      • Cdata

        • depends on selectivity

        • Random Access or Sequential Access

  • Selectivity

    • Number (or Ratio) of Objects Selected by Query


Cost model type of operations

Cost Model : Type of Operations

  • Cost model for each type of operations

    • Select

    • Project

    • Join

    • Aggregate Query

  • Query Processing Method for each type of operations

  • Index/Hash or Not


Cost model number of records

Cost Model : Number of Records

  • Number of Records

    • Nrecord Nblocks

  • Number of Scans

    • Single Scan

      • O(N) : Linear Scan

      • O(logN ) : Index

    • Multiple Scans

      • O(NM ) : Multiple Linear Scans

      • O(N logM ) : Multiple Scans with Index


Selectivity

Selectivity

  • Selectivity

    • Affects on Cdata

    • Random Access

      • Scattered on several blocks

      • Nblock Nselected

    • Sequential Access

      • Contiguously stored on blocks

      • Nblock= Nselected / Bf


Selectivity estimation

Frequency

30

50

60

70

80

90

100

40

Selectivity Estimation

  • Selectivity Estimation

    • Depends on Data Distribution

    • Example

      • Q1 : Find students where 60 < weight < 70

      • Q2 : Find students where 80 < weight < 90

  • How to find the distribution

    • Parametric Method

      • e.g. Gaussian Distribution

      • No a priori knowledge

    • Non-Parametric Method

      • e.g. Histogram

      • Smoothing is necessary

        • Wavelet, Discrete Cosine


Select linear search

Select : Linear Search

  • Algorithm : linear search

    • Scan each file block and test all records to see whether they satisfy the selection condition.

    • Cost estimate (number of disk blocks scanned) = br

      • br denotes number of blocks containing records from relation r

    • If selection is on a key attribute (sorted), cost = (br /2)

      • stop on finding record

    • Linear search can be applied regardless of

      • selection condition or

      • ordering of records in the file, or

      • availability of indices


Select range search

Select : Range Search

  • Algorithm : primary index, comparison

    • Relation is sorted on A

    • For A  V (r)

      • Step 1: use index to find first tuple  v and

      • Step 2: scan relation sequentially

    • For AV (r)

      • just scan relation sequentially till first tuple > v; do not use index

  • Algorithm : secondary index, comparison

    • For A  V (r)

      • Step 1: use index to find first index entry  v and

      • Step 2: scan index sequentially to find pointers to records.

    • For AV (r)

      • scan leaf nodes of index finding pointers to records, till first entry > v


Select range search1

Select : Range Search

  • Comparison between

    • Searching with Index and

    • Linear Search

  • Secondary Index

    • retrieval of records that are pointed to

    • requires an I/O for each record

  • Linear file scan may be cheaper

    • if records are scattered on many blocks

    • clustering is important for this reason


Select complex query

Select : Complex Query

  • Conjunction : 1  2 . . . n(r)

    • Algorithm : selection using one index

      • Step 1: Select a condition of i (i (r) )

      • Step 2: Test other conditions on tuple after fetching it into memory buffer.

    • Algorithm :selection using multiple-key index

      • Use appropriate multiple-attribute index if available.

    • Algorithm :selection by intersection of identifiers

      • Step 1: Requires indices with record pointers.

      • Step 2: Intersection of all the obtained sets of record pointers.

      • Step 3: Then fetch records from file

  • Disjunction : 1 2 . . . n (r)

    • Algorithm :Disjunctive selection by union of identifiers


Join operation

Join Operation

  • Several different algorithms to implement joins

    • Nested-loop join

    • Block nested-loop join

    • Indexed nested-loop join

    • Merge-join

    • Hash-join

  • Choice based on cost estimate

  • Examples use the following information

    • Number of records of (S)customer: 10,000 (R)depositor: 5000

    • Number of blocks of customer: 400 depositor: 100

    • Blocking factors of customer : 250 depositor: 50


Nested loop join

Nested-Loop Join

  • Algorithm NLJ the theta join rsFor each tuple tr in r do beginFor each tupletsin s do begintest pair (tr,ts) to see if they satisfy the join condition if they do, add tr• ts to the result.EndEnd

  • r : outerrelation, s : inner relation.

  • No indices, any kind of join condition.

  • Expensive


Example nested loop join

Example: Nested-Loop Join

s1

r1

s2

r2

B2

B1

B100

B1

r50

s250

r51

s251

r52

B2

s500

r100

s9751

r4951

s9752

r4952

B400

s10000

r5000


Nested loop join performance

Nested-Loop Join : Performance

  • Worst case

    • the estimated cost is nr bs + brdisk accesses, if not enough memory only to hold one block of each relation,

  • Example

    • 5000  400 + 100 = 2,000,100 disk accesses with depositor as outer relation, and

    • 1000  100 + 400 = 1,000,400 disk accesses with customer as the outer relation.

  • If the smaller relation fits entirely in memory,

    • use that as the inner relation.

    • Reduces cost to br + bsdisk accesses.

    • If smaller relation (depositor) fits entirely in memory, cost estimate will be 500 disk accesses.


Block nested loop join

No disk access required

Disk access happens here

Block Nested-Loop Join

  • Algoritm BNLJ

    For each block Br ofr do

    Get Block BrFor each block Bs of s do

    Get Block BsFor each tuple trin Br doFor each tuple tsin Bsdo Check if (tr, ts) satisfy the join condition if they do, add tr• tsto the result.EndEndEndEnd

No disk access required


Example block oriented nested loop join

Example: Block-Oriented Nested-Loop Join

s1

r1

s2

r2

B2

B1

B100

B1

r50

s250

r51

s251

r52

B2

s500

r100

s9751

r4951

s9752

r4952

B400

s10000

r5000


Block nested loop join performance

Block Nested-Loop Join : Performance

  • Worst case

    • Estimate: br bs + br block accesses.

    • Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation)

  • Improvements : If M blocks can be buffered

    • use (M-2) disk blocks as blocking unit for outer relations,

    • use remaining two blocks to buffer inner relation and output

    • Then the cost becomes br / (M-2) bs + br


Indexed nested loop join

Indexed Nested-Loop Join

  • Index lookups can replace file scans if

    • join is an equi-join or natural join and

    • an index is available on the inner relation’s join attribute

      • Can construct an index just to compute a join.

  • Algorithm INLJ

    For each block Br ofr do

    Get Block BrFor each tuple trin Br do Search Index (IDXr , tr.key)

    if found, add tr• tsto the result.End

    End


Indexed nested loop join performance

Indexed Nested-Loop Join : Performance

  • Worst case

    • buffer has space for only one page of r,

    • Cost of the join: br + nr c

      • Where c is the cost of traversing index and fetching matching tuple

      • Number of matching tuples may be greater than one.

  • If indices are available on join attributes of both r and s,

    • use the relation with fewer tuples as the outer relation


Example of nested loop join costs

Example of Nested-Loop Join Costs

  • Assume depositor customer,

    • with depositor as the outer relation.

    • customer have a primary B+-tree index on the join attribute customer-name, which contains 20 entries in each index node.

    • customer has 10,000 tuples,

      • the height of the tree is 4, and

      • one more access is needed to find the actual data

    • Depositor has 5000 tuples

  • Cost of block nested loops join

    • 400*100 + 100 = 40,100 disk accesses assuming worst case memory

  • Cost of indexed nested loops join

    • 100 + 5000 * 5 = 25,100 disk accesses.


Hash join

Hash-Join

  • Applicable for equi-joins and natural joins.

  • A hash function h is used to partition tuples of both relations

    • h : A→ { 0, 1, ..., n }

    • r0, r1, . . ., rn : partitions of r tuples

    • s0, s1. . ., sn : partitions of s tuples

    • r tuples in rineed only to be compared with s tuples in si .


  • Login