
# 8. External Sorting



## PowerPoint Slideshow about '8. External Sorting' - lamond

8. External Sorting

Suppose that a file is so large that it cannot fit in the internal memory of a computer all at once.

What shall we do?

We need to use an EXTERNAL STORAGE DEVICE !!!

External Sorting

- Disk Sort

- Tape Sort

What is the major difference between the two external sorts?

k-way merging (“mergesort”)

[Figure: an internal sort produces the initial runs, which are then merged repeatedly until a single sorted file remains.]

Example: 4500 records, 250 records/block, available memory = 3 blocks

Def’n : A segment of a file is said to be a run if all the records in the segment are sorted.

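[Figure: the initial runs 1, 2, 3, 4, 5, 6, ... are written alternately to two disks, with runs 1, 3, 5, ... on D1 and runs 2, 4, 6, ... on D2. Runs from D1 and D2 are then merged pairwise onto D3 and D4. The figure also marks the size of a run.]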

| Runs | Run size |
|---|---|
| 1 3 5 7 and 2 4 6 8 | 3 |
| 12 34 56 78 | 6 |
| 1256 3478 | 12 |
| 12345678 | 24 |

How many passes?

$1 + \log_2 r$ (r = # of initial runs)

With r = 8 runs this gives 1 + 3 = 4 passes: one to create the runs and three to merge them.

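k-way merging

[Figure: merge tree in which every merge pass combines k runs into one, so the r initial runs shrink to a single run after $\log_k r$ merge passes.]

# of passes

$1 + \log_k r$

# of I/O operations?

$O(n \log_k r)$

better than 2-way merging !!!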
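The pass count can be checked with a small sketch. The function name `merge_passes` is hypothetical, and it assumes each merge pass replaces every group of k runs by one run (so it counts merge passes only, not the initial run-creation pass).

```python
def merge_passes(r, k):
    """Count the merge passes needed to reduce r runs to one with k-way merging."""
    passes = 0
    while r > 1:
        r = -(-r // k)  # ceiling division: every group of k runs becomes one run
        passes += 1
    return passes


for k in (2, 3, 6):
    print(k, merge_passes(6, k))
```

Adding 1 for the pass that creates the initial runs gives the totals quoted on the slide.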

Is k-way merging always better than 2-way merging?

[Figure: the k-way merge tree again.]

# of passes

#(P) = $1 + \log_k r$

#(P) goes down as k goes up or as r goes down.

r goes down as the run size goes up.

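# of comparisons (k-way merge)

[Figure: selection tree for an 8-way merge. The eight runs, one per column and read from the bottom row up, are

16 38 30 25 50 16 110 20

15 20 20 25 15 11 120 18

10 9 20 15 8 9 90 17

The first keys 10, 9, 20, 15, 8, 9, 90, 17 sit in leaf nodes 8 to 15. Nodes 4 to 7 hold the winners 9, 15, 8, 17; nodes 2 and 3 hold 9 and 8; the root (node 1) holds the overall winner 8.]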

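$n \log_2 k$ comparisons per pass. Why? (Each of the n records written in a pass costs about $\log_2 k$ comparisons to update the selection tree.)

Total # of comparisons?

(# of passes) × (# of comparisons in a pass)

$= (\log_k r)(n \log_2 k)$

$= n \log_2 r$, independent of k !!! (because $\log_k r \cdot \log_2 k = \log_2 r$)

#(c) goes down as r goes down.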
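A quick numeric check of the identity, as a sketch only: `total_comparisons` is a hypothetical helper that multiplies the number of merge passes by the comparisons per pass, using the idealized (non-rounded) logarithms from the slide.

```python
import math


def total_comparisons(n, r, k):
    """Idealized comparison count: (log_k r passes) * (n * log2 k per pass)."""
    passes = math.log(r, k)
    per_pass = n * math.log2(k)
    return passes * per_pass


# Same n and r, different merge orders k.
for k in (2, 4, 8):
    print(k, round(total_comparisons(4500, 64, k)))
```

Every value of k yields the same product, which is why the choice of k matters for I/O rather than for comparisons.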

$x_1, x_2, x_3, \ldots, x_m \mid x_{m+1}, x_{m+2}, x_{m+3}, \ldots, x_{2m} \mid x_{2m+1}, x_{2m+2}, x_{2m+3}, \ldots$

(m keys in each group)

r = # of runs = $\lceil n/m \rceil$. Any improvement?
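As a sketch of this baseline, the helper below (the name `make_runs` is hypothetical) cuts the input into groups of m keys and sorts each group in memory. The figures 4500 and 750 come from the earlier example, assuming all 3 blocks of 250 records are used for the internal sort.

```python
import random


def make_runs(keys, m):
    """Split keys into consecutive groups of m and sort each group internally."""
    return [sorted(keys[i:i + m]) for i in range(0, len(keys), m)]


rng = random.Random(0)
keys = [rng.randrange(100000) for _ in range(4500)]
runs = make_runs(keys, m=750)
print(len(runs))  # number of initial runs r
```

Each run holds at most m keys, so r cannot drop below n/m with this method; that is the limit the question "Any improvement?" points at.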

Observation

See p.94 in textbook !!!

[Figure: tree of losers built over the keys 4, 2, 32, 12, 18, 24, 91, 11. Each node carries a parent pointer and a loser pointer, and the current winner is tracked separately.]

(record size >> the size of pointer)

why do we need this?

Updating pointers

    ptr := winner.parent;
    while ptr ≠ nil do
        if (ptr.loser.key < winner.key) then
            interchange(ptr.loser, winner);
        end {if}
        ptr := ptr.parent;
    end {while}
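The pointer-update loop can be tried out in a small array-based sketch. Everything here is an assumption made for illustration: the function name `loser_tree_merge`, the heap-style node numbering (internal nodes 1 to k-1, leaf for run i at k+i), and the use of infinity as the key of an exhausted run.

```python
def loser_tree_merge(runs):
    """Merge sorted lists using a tree of losers stored in an array."""
    k = len(runs)
    if k == 0:
        return []
    INF = float("inf")
    pos = [0] * k  # next unread position in each run

    def key(i):
        # Current key of run i, or infinity once the run is exhausted.
        return runs[i][pos[i]] if pos[i] < len(runs[i]) else INF

    loser = [None] * k  # loser[1..k-1]: run index that lost at each internal node

    def build(node):
        # Play the initial tournament; return the winner of this subtree.
        if node >= k:
            return node - k  # leaf: index of its run
        left, right = build(2 * node), build(2 * node + 1)
        if key(right) < key(left):
            left, right = right, left
        loser[node] = right
        return left

    winner = build(1)
    merged = []
    while key(winner) != INF:
        merged.append(key(winner))
        pos[winner] += 1
        ptr = (winner + k) // 2  # parent of the winner's leaf
        while ptr >= 1:  # same loop as "while ptr != nil"
            if key(loser[ptr]) < key(winner):
                loser[ptr], winner = winner, loser[ptr]
            ptr //= 2
    return merged


print(loser_tree_merge([[10, 15, 16], [9, 20, 38], [8, 15, 50]]))
```

Only the nodes on the path from the winner's leaf to the root are touched after each output, which is where the $\log_2 k$ comparisons per record come from.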

Exercise: In a complete 2-tree (T) with n leaf nodes, show that the total # of nodes in T = 2n - 1.

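(Average size of runs)

$m_0$ = # of records in (real) memory.

H. Seward (M.S. Thesis, MIT, 1954) gave a good reason to believe that a run contains more than $1.5\,m_0$ records (no proof).

E. Friend (JACM, 3, (1956)): experiment ⇒ about $2m_0$.

E. Moore (1961) proved that $2m_0$ is the expected run length.

[Figure: snowplow analogy. Snow falls uniformly on a circular track while a snowplow circles it; the snow lying on the track at any moment corresponds to $m_0$, and one full lap of the plow removes $2m_0$.]

uniform distribution ⇒ $2m_0$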
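The $2m_0$ figure can be observed empirically with replacement selection. The sketch below is an illustration under assumptions: it uses Python's `heapq` in place of a tree of losers, tags each key with a run number, and feeds it uniformly random keys; the name `replacement_selection` is hypothetical.

```python
import heapq
import itertools
import random


def replacement_selection(records, m0):
    """Split records into sorted runs while holding at most m0 keys in memory."""
    it = iter(records)
    # Heap entries are (run number, key); keys tagged with the next run wait in memory.
    heap = [(0, key) for key in itertools.islice(it, m0)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):
            runs.append([])  # first key of a new run
        runs[run_no].append(key)
        for nxt in itertools.islice(it, 1):  # read one more record, if any is left
            # A key smaller than the one just written cannot join the current run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs


rng = random.Random(1)
data = [rng.random() for _ in range(20000)]
runs = replacement_selection(data, m0=100)
print(len(data) / len(runs))  # average run length, to compare with 2 * m0
```

The first and last runs are shorter than the rest, so the printed average sits slightly below the steady-state value.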

• Balanced k-way merging (similar to disk sorting)

• Polyphase merging

Example:

• Records (R1, R2, …, R5000)

• length(Ri): 20 bytes

• Only 1000 records fit in the internal memory at one time (about 20k bytes).

• 4 tapes available

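Balanced 2-way merge

| Step | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| initial runs | $R_{1,1000}$ $R_{2001,3000}$ $R_{4001,5000}$ | $R_{1001,2000}$ $R_{3001,4000}$ | ∅ | ∅ |
| after merge 1 | ∅ | ∅ | $R_{1,2000}$ $R_{4001,5000}$ | $R_{2001,4000}$ |
| after merge 2 | $R_{1,4000}$ | $R_{4001,5000}$ | ∅ | ∅ |
| after merge 3 | ∅ | ∅ | $R_{1,5000}$ | ∅ |

Total # of I/O operations = 3 × 5000 = 15000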

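| Step | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| initial runs | $R_{1,1000}$ $R_{3001,4000}$ | $R_{1001,2000}$ $R_{4001,5000}$ | $R_{2001,3000}$ | ∅ |
| after merge 1 (rewind) | $R_{3001,4000}$ | $R_{4001,5000}$ | ∅ | $R_{1,3000}$ |
| after merge 2 | ∅ | ∅ | $R_{1,5000}$ | ∅ |

• Total # of I/O operations

3000 + 5000 = 8000

Balanced Merge is not always best !!!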

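With only three tapes (Tape 1, Tape 2, Tape 3), the merged runs have to be redistributed before each further merge:

1. Initial runs: $R_{1,1000}$ $R_{2001,3000}$ $R_{4001,5000}$ on Tape 1, $R_{1001,2000}$ $R_{3001,4000}$ on Tape 2, Tape 3 empty.
2. Merge: $R_{1,2000}$, $R_{2001,4000}$, $R_{4001,5000}$.
3. Redistribute the runs.
4. Merge: $R_{1,4000}$, $R_{4001,5000}$.
5. Redistribute the runs.
6. Merge: $R_{1,5000}$.

Total # of I/O Operations

5000 + 2000 + 5000 + 4000 + 5000 = 21,000 !!!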

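| Step | Tape 1 | Tape 2 | Tape 3 |
|---|---|---|---|
| initial runs | $R_{1,1000}$ $R_{2001,3000}$ $R_{4001,5000}$ | $R_{1001,2000}$ $R_{3001,4000}$ | ∅ |
| after merge 1 (rewind) | $R_{4001,5000}$ | ∅ | $R_{1,2000}$ $R_{2001,4000}$ |
| after merge 2 (rewind) | ∅ | $R_{1,2000;\,4001,5000}$ | $R_{2001,4000}$ |
| after merge 3 | $R_{1,5000}$ | ∅ | ∅ |

Total # of I/O Operations

4000 + 3000 + 5000 = 12,000 !!!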

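($s^c$ denotes c runs of size s; ∅ marks an empty tape.)

| T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|
| $1^{31}$ | $1^{30}$ | $1^{28}$ | $1^{24}$ | $1^{16}$ | ∅ |
| $1^{15}$ | $1^{14}$ | $1^{12}$ | $1^{8}$ | ∅ | $5^{16}$ |
| $1^{7}$ | $1^{6}$ | $1^{4}$ | ∅ | $9^{8}$ | $5^{8}$ |
| $1^{3}$ | $1^{2}$ | ∅ | $17^{4}$ | $9^{4}$ | $5^{4}$ |
| $1^{1}$ | ∅ | $33^{2}$ | $17^{2}$ | $9^{2}$ | $5^{2}$ |
| ∅ | $65^{1}$ | $33^{1}$ | $17^{1}$ | $9^{1}$ | $5^{1}$ |
| $129^{1}$ | ∅ | ∅ | ∅ | ∅ | ∅ |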

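($s^c$ denotes c runs of size s; ∅ marks an empty tape; rows in parentheses show the tapes after the copy step that ends a pass.)

| Pass | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| initial | $1^{55}$ | $1^{50}$ | $1^{41}$ | $1^{29}$ | $1^{15}$ | ∅ |
| Pass 1 | $1^{40}$ | $1^{35}$ | $1^{26}$ | $1^{14}$ | ∅ | $5^{15}$ |
| | $1^{26}$ | $1^{21}$ | $1^{12}$ | ∅ | $4^{14}$ | $5^{15}$ |
| | $1^{14}$ | $1^{9}$ | ∅ | $3^{12}$ | $4^{14}$ | $5^{15}$ |
| | $1^{5}$ | ∅ | $2^{9}$ | $3^{12}$ | $4^{14}$ | $5^{15}$ |
| | (∅ | $1^{5}$ | $2^{9}$ | $3^{12}$ | $4^{14}$ | $5^{15}$) |
| Pass 2 | $15^{5}$ | ∅ | $2^{4}$ | $3^{7}$ | $4^{9}$ | $5^{10}$ |
| | $15^{5}$ | $14^{4}$ | ∅ | $3^{3}$ | $4^{5}$ | $5^{6}$ |
| | $15^{5}$ | $14^{4}$ | $12^{3}$ | ∅ | $4^{2}$ | $5^{3}$ |
| | $15^{5}$ | $14^{4}$ | $12^{3}$ | $9^{2}$ | ∅ | $5^{1}$ |
| | ($15^{5}$ | $14^{4}$ | $12^{3}$ | $9^{2}$ | $5^{1}$ | ∅) |
| Pass 3 | $15^{4}$ | $14^{3}$ | $12^{2}$ | $9^{1}$ | ∅ | $55^{1}$ |
| | $15^{3}$ | $14^{2}$ | $12^{1}$ | ∅ | $50^{1}$ | $55^{1}$ |
| | $15^{2}$ | $14^{1}$ | ∅ | $41^{1}$ | $50^{1}$ | $55^{1}$ |
| | $15^{1}$ | ∅ | $29^{1}$ | $41^{1}$ | $50^{1}$ | $55^{1}$ |
| | (∅ | $15^{1}$ | $29^{1}$ | $41^{1}$ | $50^{1}$ | $55^{1}$) |
| Pass 4 | $190^{1}$ | ∅ | ∅ | ∅ | ∅ | ∅ |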

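Gilstad (1960)

| Phase | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| 1 | $1^{31}$ | $1^{30}$ | $1^{28}$ | $1^{24}$ | $1^{16}$ | ∅ |
| 2 | $1^{15}$ | $1^{14}$ | $1^{12}$ | $1^{8}$ | ∅ | $5^{16}$ |
| 3 | $1^{7}$ | $1^{6}$ | $1^{4}$ | ∅ | $9^{8}$ | $5^{8}$ |
| 4 | $1^{3}$ | $1^{2}$ | ∅ | $17^{4}$ | $9^{4}$ | $5^{4}$ |
| 5 | $1^{1}$ | ∅ | $33^{2}$ | $17^{2}$ | $9^{2}$ | $5^{2}$ |
| 6 | ∅ | $65^{1}$ | $33^{1}$ | $17^{1}$ | $9^{1}$ | $5^{1}$ |
| 7 | $129^{1}$ | ∅ | ∅ | ∅ | ∅ | ∅ |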

{{1,0,0,0,0},{1,1,1,1,1},{2,2,2,2,1},{4,4,4,3,2},{8,8,7,6,4},

{16,15,14,12,8},{31,30,28,24,16}}

Perfect Fibonacci Distribution !!!

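What is the underlying rule?

| i | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 1 |
| 3 | 4 | 4 | 4 | 3 | 2 |
| 4 | 8 | 8 | 7 | 6 | 4 |
| 5 | 16 | 15 | 14 | 12 | 8 |
| 6 | 31 | 30 | 28 | 24 | 16 |

Row 1 is $(a_0 + b_0,\ a_0 + c_0,\ a_0 + d_0,\ a_0 + e_0,\ a_0)$, row 2 is $(a_1 + b_1,\ a_1 + c_1,\ a_1 + d_1,\ a_1 + e_1,\ a_1)$, row 3 is $(a_2 + b_2,\ a_2 + c_2,\ a_2 + d_2,\ a_2 + e_2,\ a_2)$, and in general:

| Level | a | b | c | d | e |
|---|---|---|---|---|---|
| n | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |
| n+1 | $a_n + b_n$ | $a_n + c_n$ | $a_n + d_n$ | $a_n + e_n$ | $a_n$ |

$a_n \ge b_n \ge c_n \ge d_n \ge e_n$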
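The rule is easy to run forward. In this sketch the function names `next_level` and `perfect_distributions` are hypothetical; the starting level (1, 0, 0, 0, 0) and the update rule are the ones stated above.

```python
def next_level(level):
    """Apply (a, b, c, d, e) -> (a+b, a+c, a+d, a+e, a), for any number of tapes."""
    a = level[0]
    return [a + x for x in level[1:]] + [a]


def perfect_distributions(tapes, levels):
    """Return levels 0..levels of the perfect distribution for `tapes` input tapes."""
    level = [1] + [0] * (tapes - 1)
    out = [level]
    for _ in range(levels):
        level = next_level(level)
        out.append(level)
    return out


for i, level in enumerate(perfect_distributions(5, 6)):
    print(i, level, sum(level))
```

The printed sums (1, 5, 9, 17, 33, 65, 129) are the run totals for which a perfect distribution exists on five input tapes.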

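| i | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ | output |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | T6 |
| 1 | 1 | 1 | 1 | 1 | 1 | T1 |
| 2 | 2 | 2 | 2 | 2 | 1 | T2 |
| 3 | 4 | 4 | 4 | 3 | 2 | T3 |
| 4 | 8 | 8 | 7 | 6 | 4 | T4 |
| 5 | 16 | 15 | 14 | 12 | 8 | T5 |
| 6 | 31 | 30 | 28 | 24 | 16 | T6 |
| 7 | 61 | 59 | 55 | 47 | 31 | |

Starting from level 3, the run counts on the six tapes after the next two phases are 2 2 2 1 0 2 and then 1 1 1 0 1 1.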

| Level | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| n-1 | $a_{n-1}$ | $b_{n-1}$ | $c_{n-1}$ | $d_{n-1}$ | $e_{n-1}$ |
| n | $a_{n-1} + b_{n-1}$ | $a_{n-1} + c_{n-1}$ | $a_{n-1} + d_{n-1}$ | $a_{n-1} + e_{n-1}$ | $a_{n-1}$ |
| | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |

⇒

$$
\begin{aligned}
e_n &= a_{n-1} \\
d_n &= a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2} \\
c_n &= a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3} \\
b_n &= a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} \\
a_n &= a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}
\end{aligned}
$$

($a_0 = 1$; $a_i = 0$ for $i = -1, -2, -3, -4$)

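| i | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $a_i$ | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 4 | 8 | 16 | 31 | 61 |
| $b_i$ | | | | | 0 | 1 | 2 | 4 | 8 | 15 | 30 | 59 |
| $c_i$ | | | | | 0 | 1 | 2 | 4 | 7 | 14 | 28 | 55 |
| $d_i$ | | | | | 0 | 1 | 2 | 3 | 6 | 12 | 24 | 47 |
| $e_i$ | | | | | 0 | 1 | 1 | 2 | 4 | 8 | 16 | 31 |

$a_i = \langle 0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61, \ldots \rangle$, $i = -4, -3, -2, -1, 0, 1, 2, \ldots$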

“The kth order Fibonacci number”

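$$F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \cdots + F_{n-k}^{(k)} \quad (n \ge k)$$

$$F_n^{(k)} = \begin{cases} 0, & 0 \le n \le k-2 \\ 1, & n = k-1 \end{cases}$$

e.g.)

The second order Fibonacci number

0 1 1 2 3 5 ……

$$F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)}, \qquad F_n^{(2)} = \begin{cases} 0, & \text{if } n = 0 \\ 1, & \text{if } n = 1 \end{cases}$$

Fibonacci number !!!

$a_n = F_{n+k-1}^{(k)}$ if k tapes (input) are used

why?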
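The definition translates directly into code. This is a minimal sketch; the name `kth_order_fibonacci` is hypothetical, and the check at the end compares $F^{(5)}_{n+4}$ with the $a_n$ values tabulated for five input tapes.

```python
def kth_order_fibonacci(k, count):
    """Return F_0 .. F_{count-1} of the kth-order Fibonacci sequence."""
    f = [0] * (k - 1) + [1]  # F_0 .. F_{k-2} are 0 and F_{k-1} is 1
    while len(f) < count:
        f.append(sum(f[-k:]))  # each later term is the sum of the previous k terms
    return f[:count]


print(kth_order_fibonacci(2, 7))   # ordinary Fibonacci numbers
fib5 = kth_order_fibonacci(5, 12)
print([fib5[n + 4] for n in range(8)])  # a_0 .. a_7 for k = 5
```

The shift by k-1 simply skips the leading zeros, so that $a_0$ lines up with the first 1 in the sequence.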

Use dummy runs !!!

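5 input tapes and 53 initial runs.

| Level | T1 | T2 | T3 | T4 | T5 | Total |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 5 |
| added to reach level 2 | 1 | 1 | 1 | 1 | 0 | |
| 2 | 2 | 2 | 2 | 2 | 1 | 9 |
| added to reach level 3 | 2 | 2 | 2 | 1 | 1 | |
| 3 | 4 | 4 | 4 | 3 | 2 | 17 |
| added to reach level 4 | 4 | 4 | 3 | 3 | 2 | |
| 4 | 8 | 8 | 7 | 6 | 4 | 33 |
| (added to reach level 5) | (8 | 7 | 7 | 6 | 4) | |
| 5 | 16 | 15 | 14 | 12 | 8 | 65 > 53 |

Level 4 holds only 33 runs, so the remaining 53 - 33 = 20 runs go toward level 5, and the 65 - 53 = 12 missing runs are made up with dummy runs.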

………………………………

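Runs (34) to (53) are added row by row; ∅ marks a slot left for a dummy run.

| T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|
| (34) | | | | |
| (35) | (36) | (37) | | |
| (38) | (39) | (40) | (41) | |
| (42) | (43) | (44) | (45) | |
| (46) | (47) | (48) | (49) | (50) |
| (51) | (52) | (53) | ∅ | ∅ |
| ∅ | ∅ | ∅ | ∅ | ∅ |
| ∅ | ∅ | ∅ | ∅ | ∅ |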

T1 T2 T3 T4 T5 T6

(2) (2) (2) (3) (3)

18 17 16 14 58

(2) (2) (2) (3) 55

53

not best

but simple and good !!!

For better one, see Knuth !!!

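($(s)^c$ denotes c runs of size s; ∅ marks an empty tape.)

| T1 | T2 | T3 |
|---|---|---|
| $(k)^8$ | $(k)^5$ | ∅ |
| $(k)^3$ | ∅ | $(2k)^5$ |
| ∅ | $(3k)^3$ | $(2k)^2$ |
| $(5k)^2$ | $(3k)^1$ | ∅ |
| $(5k)^1$ | ∅ | $(8k)^1$ |
| ∅ | $(13k)^1$ | ∅ |

The run counts follow the Fibonacci numbers 0, 1, 1, 2, 3, 5, 8.

Runs on two input tapes (run sizes in units of k)

| # of runs | run size (k) | # of pairs | # of I/O’s |
|---|---|---|---|
| 8, 5 | 1, 1 | 5 | 10 |
| 5, 3 | 2, 1 | 3 | 9 |
| 3, 2 | 3, 2 | 2 | 10 |
| 2, 1 | 5, 3 | 1 | 8 |
| 1, 1 | 8, 5 | 1 | 13 |
| 1 | 13 | | |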
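The I/O column can be reproduced by simulating the phases. This is a sketch under assumptions: the function name `polyphase_io` is hypothetical, each tape is modeled as a (run count, run size) pair with sizes in units of k, and a phase merges as many pairs as the shorter input tape allows.

```python
def polyphase_io(n1, n2):
    """Per-phase I/O counts for a 3-tape polyphase merge starting with n1 and n2 unit runs."""
    big, small = (n1, 1), (n2, 1)  # (number of runs, run size) on the two input tapes
    io = []
    while True:
        if big[0] < small[0]:
            big, small = small, big
        pairs = small[0]
        if pairs == 0:
            if big[0] == 1:
                return io  # a single run is left: sorting is finished
            raise ValueError("run counts are not a perfect Fibonacci distribution")
        out = (pairs, big[1] + small[1])  # merged runs written to the empty tape
        io.append(pairs * out[1])
        big, small = (big[0] - pairs, big[1]), out


phases = polyphase_io(8, 5)
print(phases, sum(phases))
```

Dividing the total by the 13 units of data gives the number of passes over the data, which is the quantity the next slides analyze.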

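How many passes over the data?

Total number of initial runs = $F_s$ for some s (the sth Fibonacci number), where $F_s = F_{s-1} + F_{s-2}$.

| T1 | T2 | T3 |
|---|---|---|
| $F_{s-1}$ | $F_{s-2}$ | ∅ |
| $F_{s-3}$ | ∅ | $F_{s-2}$ |
| ∅ | $F_{s-3}$ | $F_{s-4}$ |
| … | … | … |

See Fig. p.107, textbook !!!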

Total # of I/O operations =

 # of passes =

[proof] (By induction on S)

(s=2) LHS =

RHS =

(s=3) LHS =

RHS =

(s=k) Suppose that

(s=k+1)

Exercise !!!

See page 106-107 in textbook !!!

# of passes =

Fs = r

(1)

why?

. Golden Ratio !!!

From (1) ,

Fs-1 Fs-2

Polyphase merge with 3 tapes

$F_s = r$ = # of initial runs

# of passes ≈ $1.04 \log_2 r$

| Tapes | Phases | Passes | Pass/phase (percent) | Growth ratio |
|---|---|---|---|---|
| 3 | 2.078 ln S + 0.672 | 1.504 ln S + 0.992 | 72 | 1.6180340 |
| 4 | 1.641 ln S + 0.364 | 1.015 ln S + 0.965 | 62 | 1.8392868 |
| 5 | 1.524 ln S + 0.078 | 0.863 ln S + 0.921 | 57 | 1.9275620 |
| 6 | 1.479 ln S + 0.185 | 0.795 ln S + 0.864 | 54 | 1.9659482 |
| 7 | 1.460 ln S + 0.424 | 0.762 ln S + 0.797 | 52 | 1.9835828 |
| 8 | 1.451 ln S + 0.642 | 0.744 ln S + 0.723 | 51 | 1.9919642 |
| 9 | 1.447 ln S + 0.838 | 0.734 ln S + 0.646 | 51 | 1.9960312 |
| 10 | 1.445 ln S + 1.017 | 0.728 ln S + 0.568 | 50 | 1.9980295 |
| 20 | 1.443 ln S + 2.170 | 0.721 ln S - 0.030 | 50 | 1.9999981 |

APPROXIMATE BEHAVIOR OF CASCADE MERGE SORTING

| Tapes | Phases | Passes | Growth ratio |
|---|---|---|---|
| 3 | 2.078 ln S + 0.672 | 1.504 ln S + 0.992 | 1.6180340 |
| 4 | 1.235 ln S + 0.754 | 1.012 ln S + 0.820 | 2.2469796 |
| 5 | 0.946 ln S + 0.796 | 0.897 ln S + 0.800 | 2.8793852 |
| 6 | 0.796 ln S + 0.821 | 0.773 ln S + 0.808 | 3.5133371 |
| 7 | 0.703 ln S + 0.839 | 0.691 ln S + 0.822 | 4.1481149 |
| 8 | 0.639 ln S + 0.852 | 0.632 ln S + 0.834 | 4.7833861 |
| 9 | 0.592 ln S + 0.861 | 0.587 ln S + 0.845 | 5.4189757 |
| 10 | 0.555 ln S + 0.869 | 0.552 ln S + 0.854 | 6.0547828 |
| 20 | 0.397 ln S + 0.905 | 0.397 ln S + 0.901 | 12.4174426 |

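| Level | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 5 | 4 | 3 | 2 | 1 |
| 3 | 15 | 14 | 12 | 9 | 5 |
| 4 | 55 | 50 | 41 | 29 | 15 |
| n | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |
| n+1 | $a_n + b_n + c_n + d_n + e_n$ | $a_{n+1} - e_n$ | $b_{n+1} - d_n$ | $c_{n+1} - c_n$ | $d_{n+1} - b_n$ |

The last entry simplifies to $e_{n+1} = a_n$.

Perfect dist’n

for detail see Knuth Vol III !!!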
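The cascade rule amounts to taking prefix sums of the previous level, longest prefix first. The sketch below uses a hypothetical name, `next_cascade_level`, and regenerates the table from level 0.

```python
def next_cascade_level(level):
    """(a, b, c, d, e) -> (a+b+c+d+e, a+b+c+d, a+b+c, a+b, a)."""
    return [sum(level[:m]) for m in range(len(level), 0, -1)]


level = [1, 0, 0, 0, 0]
for n in range(5):
    print(n, level, sum(level))
    level = next_cascade_level(level)
```

Level 4 sums to 190, the number of initial runs used in the six-tape cascade example.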