Sorting by the Numbers

Sorting Part Four. Suppose you are given the task of writing an application to sort a big data file. What do you need to know to pick a good solution? Given a file size of 1 GB, a record size of 250 bytes, and ¼ GB of available memory: how many runs are there, and how big is each run?

leora

Presentation Transcript


  1. Sorting by the Numbers: Sorting Part Four

  2. Question
     • Suppose you are given the task of writing an application to sort a big data file. What do you need to know to pick a good solution?
     • File Size = 1 GB
     • Record Size = 250 Bytes
     • Available Memory = ¼ GB

  3. How many Runs? How big is each Run?
     • Total Records to Process
       • 1 billion bytes in the file
       • 250 bytes for each record
       • = 4 million records in the file
     • Run Size
       • 1 GB file
       • ¼ GB memory
       • = 4 Runs of 1 million records each
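
The arithmetic on this slide can be checked with a short sketch. The function name `plan_runs` and the use of decimal units (1 GB = 10**9 bytes) are assumptions made for illustration; the slide only gives the three sizes.

```python
# Sizes as stated on the slide; decimal units (1 GB = 10**9 bytes) are an assumption.
FILE_SIZE_BYTES = 1_000_000_000
RECORD_SIZE_BYTES = 250
MEMORY_BYTES = FILE_SIZE_BYTES // 4  # 1/4 GB


def plan_runs(file_size, record_size, memory):
    """Return (total_records, run_count, records_per_run) for memory-sized runs."""
    total_records = file_size // record_size
    records_per_run = memory // record_size
    run_count = -(-total_records // records_per_run)  # ceiling division
    return total_records, run_count, records_per_run


total_records, run_count, records_per_run = plan_runs(
    FILE_SIZE_BYTES, RECORD_SIZE_BYTES, MEMORY_BYTES
)
print(total_records, run_count, records_per_run)  # 4000000 4 1000000
```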

  4. Time to Create the Runs
     • Sorting One Run
       • Using either Quicksort or Ordered Binary Tree
       • N log2 N
       • 1 million * 20 (log2 of 1 million is about 20)
       • approximately 20 million comparisons of internal memory locations
     • Sorting Four Runs
       • 80 million internal memory comparisons
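
As a rough check of the N log2 N estimate, the sketch below evaluates it with the exact logarithm instead of the rounded value 20. The helper name `sort_compares` is hypothetical, and the formula is the slide's estimate, not a measured comparison count for any particular sort.

```python
import math


def sort_compares(n):
    """Estimate the comparisons needed to sort n records as n * log2(n)."""
    return n * math.log2(n)


records_per_run = 1_000_000
per_run = sort_compares(records_per_run)
print(round(math.log2(records_per_run), 2))  # 19.93, which the slide rounds to 20
print(round(per_run / 1e6, 1))               # 19.9 (million compares for one run)
print(round(4 * per_run / 1e6, 1))           # 79.7 (million compares for four runs)
```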

  5. Refresher on Merging Files
     Runs built from random records:
       File One: 1 3 5 7 9
       File Two: 2 4 6 8 10
     Runs built from an already sorted file:
       File One: 1 2 3 4 5
       File Two: 6 7 8 9 10
     • So, merging 2 files of N random records each requires about 2N compares
     • And merging 2 files whose runs were built from a sorted file requires N compares
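
The two cases above can be reproduced by counting comparisons in a plain two-way merge. This is a minimal sketch using in-memory lists in place of files, and `merge_counting` is a hypothetical helper name. For the interleaved inputs the exact count is 2N - 1, which the slide rounds to 2N.

```python
def merge_counting(a, b):
    """Merge two sorted lists; return (merged_list, number_of_key_comparisons)."""
    merged, compares = [], 0
    i = j = 0
    while i < len(a) and j < len(b):
        compares += 1
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One input is exhausted; the rest of the other is copied without comparisons.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged, compares


print(merge_counting([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])[1])  # 9
print(merge_counting([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])[1])  # 5
```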

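  6. Merging the Four Files
     Merging one run at a time:
       R1 + R2 -> T1 (2 million compares)
       T1 + R3 -> T2 (3 million compares)
       T2 + R4 -> result (4 million compares)
     Merging in pairs:
       R1 + R2 -> T1 (2 million compares)
       R3 + R4 -> T2 (2 million compares)
       T1 + T2 -> result (4 million compares)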
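
The two merge orders on this slide can be compared with a small cost sketch. It assumes, as the slide's figures do, that merging two files costs roughly one compare per record written; the function names are hypothetical.

```python
RUN_SIZE = 1_000_000  # records per run, from the earlier slide


def merge_cost(left, right):
    """Slide's approximation: about one compare per output record."""
    return left + right


def one_at_a_time(run_sizes):
    """Merge the first two runs, then fold each remaining run into the result."""
    total, merged = 0, run_sizes[0]
    for size in run_sizes[1:]:
        total += merge_cost(merged, size)
        merged += size
    return total


def in_pairs(run_sizes):
    """Merge neighbouring runs pairwise, pass after pass, until one file remains."""
    total, sizes = 0, list(run_sizes)
    while len(sizes) > 1:
        next_sizes = []
        for k in range(0, len(sizes) - 1, 2):
            total += merge_cost(sizes[k], sizes[k + 1])
            next_sizes.append(sizes[k] + sizes[k + 1])
        if len(sizes) % 2:  # an odd run out is carried to the next pass
            next_sizes.append(sizes[-1])
        sizes = next_sizes
    return total


runs = [RUN_SIZE] * 4
print(one_at_a_time(runs))  # 9000000
print(in_pairs(runs))       # 8000000
```

Under this approximation the pairwise order is cheaper because no record passes through more than two merges.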

  7. Total Processing Time
     • Time to Create the 4 Runs
       • 80 million comparisons
     • Time to Merge the 4 Runs
       • 8 million comparisons
     • Assuming a File Read takes just 100 times longer than a Memory Read
     • Total Time = 80 million + (8 million * 100) = 880 million time units
     • Note: we have omitted the time to read the runs into memory and to write the runs to temp files
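
The total on this slide is a weighted sum: memory comparisons count as 1 time unit each and file comparisons as 100. A minimal sketch, with `total_time_units` as a hypothetical helper name:

```python
FILE_READ_COST = 100  # time units per file compare, relative to 1 for a memory compare


def total_time_units(memory_compares, file_compares, file_cost=FILE_READ_COST):
    """Weighted sum of in-memory and file comparisons."""
    return memory_compares + file_compares * file_cost


print(total_time_units(80_000_000, 8_000_000))  # 880000000
```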

  8. Second Example
     • 2 Runs of 2 Million Records each
     • Internal Sorting
       • N log2 N = 2 million * 21 = 42 million compares
       • 84 million to create both runs
     • File Merging
       • 4 million compares
     • Total Time
       • 84 million + (4 million * 100) = 484 million time units
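
To compare run layouts side by side, the whole estimate can be wrapped in one function. This is a sketch under stated assumptions, not the course's own method: it uses exact log2 values rather than rounded ones, so its totals will not match the slide figures to the digit, and it assumes pairwise merge passes in which every record is compared once per pass. The name `estimate_time_units` is hypothetical.

```python
import math


def estimate_time_units(run_count, records_per_run, file_cost=100):
    """Estimate total time units for sorting the runs and merging them in pairs.

    Assumptions (not from the slides): exact log2 values, and pairwise merge
    passes in which every record is compared once per pass.
    """
    total_records = run_count * records_per_run
    sort_compares = total_records * math.log2(records_per_run)
    merge_passes = math.ceil(math.log2(run_count)) if run_count > 1 else 0
    merge_compares = total_records * merge_passes
    return sort_compares + merge_compares * file_cost


four_runs = estimate_time_units(4, 1_000_000)
two_runs = estimate_time_units(2, 2_000_000)
print(f"4 runs of 1 million: {four_runs / 1e6:.0f} million time units")
print(f"2 runs of 2 million: {two_runs / 1e6:.0f} million time units")
```

The two-run layout comes out lower because it halves the number of file comparisons, which dominate the total when a file read costs 100 time units.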

  9. Next in this course
     • So how much time does it take to access the disk?
