improving hash join performance by exploiting intrinsic data skew
Download
Skip this Video
Download Presentation
Improving Hash Join Performance By Exploiting Intrinsic Data Skew

Loading in 2 Seconds...

play fullscreen
1 / 47

Improving Hash Join Performance By Exploiting Intrinsic Data Skew - PowerPoint PPT Presentation


  • 60 Views
  • Uploaded on

Improving Hash Join Performance By Exploiting Intrinsic Data Skew. by Bryce Cutt supervised by Dr. Ramon Lawrence. Introduction. Databases are part of our lives Hash Join is a core database algorithm Very I/O intensive for large databases Queries may take hours

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Improving Hash Join Performance By Exploiting Intrinsic Data Skew' - wylie-holcomb


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
improving hash join performance by exploiting intrinsic data skew

Improving Hash JoinPerformance By ExploitingIntrinsic Data Skew

by

Bryce Cutt

supervised by

Dr. Ramon Lawrence

introduction
Introduction
  • Databases are part of our lives
  • Hash Join is a core database algorithm
    • Very I/O intensive for large databases
      • Queries may take hours
    • Any performance improvement is significant
  • Real datasets contain skew
    • Skew is when some values occur more frequently
    • Skew can greatly reduce hash join performance
  • Skew traditionally considered a bad thing for join algorithms
    • Try to mitigate negative effects of skew
  • Adapt hash join
    • No longer just mitigate
    • Use foreknowledge of skew
      • Improve performance
example relations
Example Relations

Probe Relation

Build Relation

Part

Purchase

dhj algorithm build phase
DHJ Algorithm Build Phase

Hash Function: modulo 5

statistics and hash joins
Statistics and Hash Joins
  • Modern database systems maintain statistics such as histograms for query optimization
  • What if hash join could use the statistics to choose the best build tuples to keep in memory?
    • Does not have to generate own statistics
histojoin algorithm general idea
Histojoin Algorithm General Idea
  • Same basic form as DHJ
  • Determines best build tuples from histogram
    • In this case the tuples with partid 2 and 3
  • Create partitions for the best build tuples
    • In addition to regular partitions
    • Freeze regular partitions first
  • Perform a highly optimized multi-stage check
    • To determine the partition tuples belong in
implementation details
Implementation Details
  • Avoided in algorithm description
    • General enough to fit any database system
  • But ultimately important
    •  Core of algorithm implementation specific
  • Implemented in
    • Stand alone Java app
      • Optimistic implementation
    • PostgreSQL
      • HHJ
      • Conservative implementation
inaccurate statistics
Inaccurate Statistics
  • Selections
  • Multi-join plans
    • Sampling
    • SITs
  • Handling dependent on implementation
    •  PostgreSQL conservative memory usage
experimental results
Experimental Results
  • TPC-H
    • Database commonly used to test database system performance
    • Skewed versions
    • 1GB dataset used in Java tests
    • 10GB dataset used in PostgreSQL tests
experimental results cont
Experimental Results, cont.

Java, Lineitem/Part, skewed, 1GB

Approx. 20% faster

experimental results cont1
Experimental Results, cont.

Java, Lineitem/Part,high skew, 1GB

Approx. 60% faster

experimental results cont2
Experimental Results, cont.

Java, Various Joins, Percent Improvement, 1GB

Approx. 20% for skewed and 60% for high skew

experimental results cont3
Experimental Results, cont.

Java, Lineitem/Part, Inaccurate Histogram, 1GB

experimental results cont4
Experimental Results, cont.

Java, Lineitem/Part/Supplier,high skew, 1GB

Approx. 75% faster

experimental results cont5
Experimental Results, cont.

PostgreSQL, Lineitem/Part,skewed, 10GB

Approx. 10% faster

experimental results cont6
Experimental Results, cont.

PostgreSQL, Lineitem/Part, high skew, 10GB

Approx. 60% faster

experimental results cont7
Experimental Results, cont.

PostgreSQL, Various Joins, Percent Improvement, 10GB

5-10% for skewed and 50-60% for high skew

conclusion
Conclusion
  • Histojoin
    • significantly outperforms standard hash joins in the presence of skew
  • Smart implementation mitigates pitfalls
  • Two papers have been published from this work
  • PostgreSQL patch currently in review
    • Will be used by millions of users
thank you

Thank you

Thank you Dr. Lawrence

ad