File Structures by Folk, Zoellick, and Riccardi

Chap 8. Cosequential Processing and the Sorting of Large Files

School of Computer Science and Engineering, Seoul National University

Object-Oriented Systems Laboratory

SNU-OOPSLA-LAB

Prof. Hyoung-Joo Kim (김형주)

SNU-OOPSLA Lab.

Chapter Objectives(1)
  • Describe a class of frequently used processing activities known as cosequential processes
  • Provide a general object-oriented model for implementing varieties of cosequential processes
  • Illustrate the use of the model to solve a number of different kinds of cosequential processing problems, including problems other than simple merges and matches
  • Introduce heapsort as an approach to overlapping I/O with sorting in RAM

Chapter Objectives(2)
  • Show how merging provides the basis for sorting very large files
  • Examine the costs of K-way merges on disk and find ways to reduce those costs
  • Introduce the notion of replacement selection
  • Examine some of the fundamental concerns associated with sorting large files using tapes rather than disks
  • Introduce UNIX utilities for sorting, merging, and cosequential processing

Contents

8.1 Cosequential operations

8.2 Application of the OO Model to a General Ledger Program

8.3 Extension of the OO Model to Include Multiway Merging

8.4 A Second Look at Sorting in Memory

8.5 Merging as a Way of Sorting Large Files on Disk

8.6 Sorting Files on Tape

8.7 Sort-Merge Packages

8.8 Sorting and Cosequential Processing in Unix


8.1 An Object-Oriented Model for Implementing Cosequential Processes

Cosequential operations
  • Coordinated processing of two or more sequential lists to produce a single list
  • Kinds of operations
    • merging, or union
    • matching, or intersection
    • combination of above


Matching Names in Two Lists(1)
  • The so-called “intersection operation”
  • Output the names common to two lists
  • Things that must be dealt with to make the match procedure work reasonably
    • initializing: arranging things so that the procedure can get going
    • getting and accessing the next list item
    • synchronizing between the two lists
    • handling EOF conditions
    • recognizing errors
      • e.g. duplicate names or names out of sequence


Matching Names in Two Lists(2)
  • In comparing two names
    • if Item(1) is less than Item(2), read the next name from List 1
    • if Item(1) is greater than Item(2), read the next name from List 2
    • if the names are the same, output the name and read the next names from the two lists


Cosequential match procedure(1)

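[Figure: PROGRAM match. Item(1) from List 1 and Item(2) from List 2 are compared. If Item(1) < Item(2), the next item is read from List 1; if Item(1) > Item(2), the next item is read from List 2; if the names are the same, the name is output. The input() and initialize() procedures are used.]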


Cosequential match procedure(2)

int Match(char * List1, char * List2, char * OutputList)
{
    int MoreItems; // true if items remain in both of the lists

    // initialize input and output lists
    InitializeList(1, List1); InitializeList(2, List2);
    InitializeOutput(OutputList);

    // get first item from both lists
    MoreItems = NextItemInList(1) && NextItemInList(2);

    while (MoreItems) { // loop until no items in one of the lists
        if (Item(1) < Item(2)) MoreItems = NextItemInList(1);
        else if (Item(1) == Item(2)) {
            ProcessItem(1); // match found
            MoreItems = NextItemInList(1) && NextItemInList(2);
        }
        else MoreItems = NextItemInList(2); // Item(1) > Item(2)
    }
    FinishUp(); return 1;
}


General Class for Cosequential Processing(1)

template <class ItemType> class CosequentialProcess
// base class for cosequential processing
{ public:
    // the following methods provide basic list processing
    // these must be defined in subclasses
    virtual int InitializeList (int ListNumber, char * ListName) = 0;
    virtual int InitializeOutput (char * OutputListName) = 0;
    virtual int NextItemInList (int ListNumber) = 0;
        // advance to next item in this list
    virtual ItemType Item (int ListNumber) = 0;
        // return current item from this list
    virtual int ProcessItem (int ListNumber) = 0;
        // process the item in this list
    virtual int FinishUp() = 0; // complete the processing

    // 2-way cosequential match method
    virtual int Match2Lists (char * List1, char * List2, char * OutputList);
};


General Class for Cosequential Processing(2)
  • A Subclass to support lists that are files of strings, one per line

class StringListProcess : public CosequentialProcess<String &>
{ public:
    StringListProcess (int NumberOfLists); // constructor

    // Basic list processing methods
    int InitializeList (int ListNumber, char * List1);
    int InitializeOutput (char * OutputList);
    int NextItemInList (int ListNumber); // get next
    String & Item (int ListNumber); // return current
    int ProcessItem (int ListNumber); // process the item
    int FinishUp(); // complete the processing

  protected:
    ifstream * List; // array of list files
    String * Items; // array of current Item from each list
    ofstream OutputList;
    // LowValue is used so that NextItemInList() doesn't have to
    // get the first item in a special way
    static const char * LowValue;
    static const char * HighValue;
};


General Class for Cosequential Processing(3)
  • Appendix H: full implementation
  • An example of main

#include "coseq.h"

int main()
{
    StringListProcess ListProcess(2); // process with 2 lists
    ListProcess.Match2Lists("list1.txt", "list2.txt", "match.txt");
    return 0;
}


Merging Two Lists(1)
  • Based on matching operation
  • Difference
    • must read each of the lists completely
    • must change the MoreItems behavior
      • keep this flag set to true as long as there are records in either list
  • HighValue
    • a special value (we use “\xFF”)
    • comes after all legal input values in the files, to ensure both input files are read to completion


Merging Two Lists(2)
  • Cosequential merge procedure based on a single loop
    • This method has been added to class CosequentialProcess
    • No modifications are required to class StringListProcess

template <class ItemType>

int CosequentialProcess<ItemType> :: Merge2Lists

(char * List1Name, char * List2Name, char * OutputList)

{

int MoreItems1, MoreItems2; // true if more items in list


Merging Two Lists(3)

    InitializeList (1, List1Name);
    InitializeList (2, List2Name);
    InitializeOutput (OutputList);

    MoreItems1 = NextItemInList(1);
    MoreItems2 = NextItemInList(2);

    while (MoreItems1 || MoreItems2) { // if either file has more
        if (Item(1) < Item(2)) { // list 1 has next item to be processed
            ProcessItem(1);
            MoreItems1 = NextItemInList(1);
        }
        else if (Item(1) == Item(2)) { // same item in both lists: output it once
            ProcessItem(1);
            MoreItems1 = NextItemInList(1);
            MoreItems2 = NextItemInList(2);
        }
        else { // Item(1) > Item(2): list 2 has next item to be processed
            ProcessItem(2);
            MoreItems2 = NextItemInList(2);
        }
    }
    FinishUp(); return 1;
}


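Cosequential merge procedure(1)

[Figure: PROGRAM merge. NAME_1 from List 1 and NAME_2 from List 2 are compared. When Item(1) < Item(2) or the items match, the item from List 1 goes to OutputList; when Item(1) > Item(2), the item from List 2 goes to OutputList.]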


Summary of the Cosequential Processing Model(1)
  • Assumptions
    • two or more input files are processed in a parallel fashion
    • each file is sorted
    • in some cases, there must exist a high key value or a low key value
    • records are processed in a logical sorted order
    • for each file, there is only one current record
    • records should be manipulated only in internal memory


Summary of the Cosequential Processing Model(2)
  • Essential Components
    • initialization: reads in the first logical records
    • one main synchronization loop, which continues as long as relevant records remain
    • selection in the main synchronization loop, of the form

if (Item(1) > Item(2)) then ..........

else if (Item(1) < Item(2)) then .........

else ........... /* current keys equal */

endif

    • input files and output files are sequence checked by comparing the previous item value with the new one


Summary of the Cosequential Processing Model(3)
  • Essential components (cont’d)
    • substitute high values for the actual key at EOF
      • the main loop terminates when high values have occurred for all relevant input files
      • no special code is needed to deal with EOF
    • I/O and error detection are relegated to supporting methods, so the details of these activities do not obscure the principal processing logic
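
The high-value idea above can be sketched in a few lines. This is a minimal illustration, not the Appendix H implementation: it works on in-memory vectors instead of files, and the names `kHighValue`, `CurrentKey`, and `MergeWithSentinel` are hypothetical. It assumes no legal key sorts at or after "\xFF".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel, assumed to sort after every legal key (the slides use "\xFF").
const std::string kHighValue = "\xFF";

// Returns the current key of a list, or kHighValue once the list is exhausted.
static const std::string& CurrentKey(const std::vector<std::string>& list, std::size_t pos) {
    return pos < list.size() ? list[pos] : kHighValue;
}

// Merges two sorted lists; a key present in both lists is output once.
std::vector<std::string> MergeWithSentinel(const std::vector<std::string>& list1,
                                           const std::vector<std::string>& list2) {
    // Precondition: the largest key of each list is below the sentinel.
    assert(list1.empty() || list1.back() < kHighValue);
    assert(list2.empty() || list2.back() < kHighValue);

    std::vector<std::string> out;
    std::size_t i = 0, j = 0;
    // The loop ends only when both lists report the high value,
    // so there is no separate end-of-list branch inside it.
    while (CurrentKey(list1, i) != kHighValue || CurrentKey(list2, j) != kHighValue) {
        const std::string& a = CurrentKey(list1, i);
        const std::string& b = CurrentKey(list2, j);
        if (a < b)      { out.push_back(a); ++i; }
        else if (b < a) { out.push_back(b); ++j; }
        else            { out.push_back(a); ++i; ++j; }  // same key in both lists
    }
    return out;
}
```

Once one list is exhausted its current key compares greater than everything left in the other list, so the three-way comparison drains the remaining list without extra code.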

8.2 The General Ledger Program (1)
  • Account table (Fig 8.6)

Acct-No  Acct-Title  Jan  Feb  Mar  Apr
101      check #1    100  200  170
102      check #2    500  270  320
505      advertize   300  129  230

  • Journal entry table (Fig 8.7)

Acct-No  Check-No  Date      Description  Debit/Credit
101      112       04/02/86  auto-repair  -30
505      213       05/13/86  newspaper    -39
540      670       04/13/86  printer      +60

  • Ledger Printout (Fig 8.8)

101  check #1
     1271  04/02/86  auto-expense  -78
     1272  04/03/86  advertise     -30

8.2 The General Ledger Program(2)
  • Ledger List and Journal List (Fig 8.10)

Ledger List     Journal List
101 check#1     101 1271 Auto-expense
                101 1272 Rent
                101 1273 Advertising
102 check#2     102 670 Office-expense

  • The ledger (master) account number
  • The journal (transaction) account number
  • Class MasterTransactionProcess (Fig 8.12)
  • Subclass LedgerProcess (Fig 8.14)

8.2 The General Ledger Program (3)

template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{ public:
    MasterTransactionProcess(); // constructor
    virtual int ProcessNewMaster() = 0; // processing when new master read
    virtual int ProcessCurrentMaster() = 0;
    virtual int ProcessEndMaster() = 0;
    virtual int ProcessTransactionError() = 0;

    // cosequential processing of master and transaction records
    int PostTransactions (char * MasterFileName, char * TransactionFileName, char * OutputListName);
};


8.3 Extension of the Model to Include Multiway Merging

A K-way Merge Algorithm
  • A very general form of cosequential file processing
  • Merge K input lists to create a single, sequentially ordered output list
  • Algorithm
    • begin loop
    • determine which list has the key with the lowest value
    • output that key
    • move ahead one key in that list
      • if the same key appears in several lists, move ahead in each of those lists
    • loop again
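
The loop above can be sketched as follows. This is an illustration under assumptions, not the book's class: the lists are in-memory vectors of integers, the function name `KWayMerge` is hypothetical, and the lowest key is found by a plain scan over the K current keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merges K sorted lists; a key that is current in several lists is output once.
std::vector<int> KWayMerge(const std::vector<std::vector<int>>& lists) {
    for (const std::vector<int>& list : lists) {
        assert(std::is_sorted(list.begin(), list.end()));  // precondition
    }
    std::vector<std::size_t> pos(lists.size(), 0);  // current position in each list
    std::vector<int> out;
    for (;;) {
        // Determine which list has the key with the lowest value.
        bool found = false;
        int lowest = 0;
        for (std::size_t k = 0; k < lists.size(); ++k) {
            if (pos[k] < lists[k].size() && (!found || lists[k][pos[k]] < lowest)) {
                lowest = lists[k][pos[k]];
                found = true;
            }
        }
        if (!found) break;  // every list is exhausted

        out.push_back(lowest);  // output that key

        // Move ahead one key in every list whose current key was just output.
        for (std::size_t k = 0; k < lists.size(); ++k) {
            if (pos[k] < lists[k].size() && lists[k][pos[k]] == lowest) ++pos[k];
        }
    }
    return out;
}
```

Each output key costs a scan over all K lists, which is why this form is only attractive for small K.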


Selection Tree for Merging Large Number of Lists

  • K-way merge
    • works nicely if K is no larger than 8 or so
    • if K > 8, the set of comparisons needed to find the minimum key becomes expensive
      • every output key costs a loop of comparisons (computing time)
  • Selection Tree (if K > 8)
    • time vs. space trade-off
    • a kind of “tournament” tree
    • the minimum value is at the root node
    • the depth of the tree is log2 K (rounded up)
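
As a rough sketch of the same trade-off, the merge below keeps the K current keys in a min-heap so that each output key costs about log2 K comparisons instead of a scan over K lists. Note the substitution: it uses `std::priority_queue` from the C++ standard library as a stand-in for the selection tree, not a tournament tree as drawn in the book, and the function name `MergeWithPriorityQueue` is hypothetical. Unlike the algorithm on the earlier slide, it does not collapse duplicate keys.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merges K sorted lists, keeping one (key, list index) entry per list in a min-heap.
std::vector<int> MergeWithPriorityQueue(const std::vector<std::vector<int>>& lists) {
    using Entry = std::pair<int, std::size_t>;  // (current key, index of its list)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> winners;
    std::vector<std::size_t> pos(lists.size(), 0);

    // Enter the first key of every non-empty list into the "tournament".
    for (std::size_t k = 0; k < lists.size(); ++k) {
        if (!lists[k].empty()) winners.push({lists[k][0], k});
    }

    std::vector<int> out;
    while (!winners.empty()) {
        Entry best = winners.top();  // the overall minimum sits at the root
        winners.pop();
        assert(out.empty() || out.back() <= best.first);  // holds if the inputs are sorted
        out.push_back(best.first);
        std::size_t k = best.second;
        // Replace the winner with the next key from the same list, if any.
        if (++pos[k] < lists[k].size()) winners.push({lists[k][pos[k]], k});
    }
    return out;
}
```

Run on the eight lists shown on the Selection Tree slide, the first key output is 5, the value at the root of that tree.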


Selection Tree

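Input lists:

List 0: 7, 10, 17, ...
List 1: 9, 19, 23, ...
List 2: 11, 13, 32, ...
List 3: 18, 22, 24, ...
List 4: 12, 14, 21, ...
List 5: 5, 6, 25, ...
List 6: 15, 20, 30, ...
List 7: 8, 16, 29, ...

First-level winners: 7 (Lists 0 and 1), 11 (Lists 2 and 3), 5 (Lists 4 and 5), 8 (Lists 6 and 7)

Second-level winners: 7 (Lists 0 to 3), 5 (Lists 4 to 7)

Root: 5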


8.4 A Second Look at Sorting in Memory
  • Read the whole file from disk into memory, sort it, and write the whole file back to disk
  • Can we improve on the time that this RAM sort takes?
    • perform some of the steps in parallel
    • selection sort is good but cannot be used to sort the entire file
  • Use a heap!
    • processing and I/O can occur in parallel
    • keep all the keys in a heap
  • Heap building while reading a block
  • Heap rebuilding while writing a block


Overlapping processing and I/O : Heapsort
  • Heap
    • a complete binary tree
    • each node has a single key, and that key is greater than or equal to the key at its parent node
    • storage for the tree can be allocated sequentially, as an array
    • so there is no need for pointers or other dynamic overhead for maintaining the heap


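A heap in both its tree form and as it would be stored in an array

Tree form, level by level (array position in parentheses):

A (1)
B (2)  C (3)
E (4)  H (5)  I (6)  D (7)
G (8)  F (9)

* The node at position n has its children at positions 2n and 2n+1.

Array form:

Position:  1  2  3  4  5  6  7  8  9
Key:       A  B  C  E  H  I  D  G  F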


Class Heap and Method Insert(1)

class Heap
{ public:
    Heap(int maxElements);
    int Insert (char * newKey);
    char * Remove();

  protected:
    int MaxElements; int NumElements;
    char ** HeapArray;
    void Exchange (int i, int j); // exchange element i and j
    int Compare (int i, int j) // compare element i and j
        { return strcmp(HeapArray[i], HeapArray[j]); }
};


Class Heap and Method Insert(2)

int Heap::Insert(char * newKey)
{
    if (NumElements == MaxElements) return FALSE; // heap is full
    NumElements++; // add the new key at the last position
    HeapArray[NumElements] = newKey;

    // re-order the heap
    int k = NumElements; int parent;
    while (k > 1) { // k has a parent
        parent = k/2;
        if (Compare(k, parent) >= 0) break;
            // HeapArray[k] is in the right place
        // else exchange k and parent
        Exchange(k, parent);
        k = parent;
    }
    return TRUE;
}


Heap Building Algorithm(1)

input key order : F D C G H I B E A

New key to     Heap, after insertion of the new key
be inserted    (array positions 1 2 3 4 5 6 7 8 9)
F              F
D              D F
C              C F D
G              C F D G
H              C F D G H

Selected heaps in tree form: after inserting C, the root is C with children F and D.


Heap Building Algorithm(2)

input key order : F D C G H I B E A

New key to     Heap, after insertion of the new key
be inserted    (array positions 1 2 3 4 5 6 7 8 9)
I              C F D G H I
B              B F C G H I D
E              B E C F H I D G
A              A B C E H I D G F

Selected heaps in tree form, level by level:

After inserting I:  C / F D / G H I
After inserting E:  B / E C / F H I D / G


Heap Building Algorithm(3)

input key order : F D C G H I B E A

New key to     Heap, after insertion of the new key
be inserted    (array positions 1 2 3 4 5 6 7 8 9)
A              A B C E H I D G F

Final heap in tree form, level by level:

A
B  C
E  H  I  D
G  F


Illustration for overlapping input with heap building(1)

(Free ride of main memory processing: heap building is faster than IO!)

Total RAM area allocated for heap

First input buffer. First part of heap is built here. The

first record is added to the heap, then the second record

is added, and so forth

Second input buffer. This buffer is being filled

while heap is being built in first buffer.


Illustration for overlapping input with heap building(2)

(One Heap is growing during IO time!)

Second part of heap is built here. The first record is

added to the heap, then the second record, etc

Third input buffer. This buffer is filled while heap is being

built in second buffer

Third part of heap is built here

Fourth input buffer is filled while heap is being

built in third buffer


Sorting while Writing to the File
  • Heap rebuilding while writing a block

(Free ride of main memory processing)

  • Retrieving the keys in order (Fig 8.20)
    • while there are elements left in the heap
      • get the smallest value (the key at the root)
      • move the last element of the heap array into the root
      • decrease the # of elements
      • reorder the heap
  • Overlapping retrieve-in-order with I/O
    • retrieve-in-order a block of records
    • while writing this block, retrieve-in-order the next block
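
The retrieve-in-order steps can be sketched as a Remove operation on an array-based heap. This is a self-contained illustration rather than the book's Fig 8.20: it stores `std::string` keys in a `std::vector` instead of `char *` keys in a fixed array, and the names `MinHeap` and `RetrieveInOrder` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Min-heap in a 1-based array: the children of position n are at 2n and 2n+1.
class MinHeap {
public:
    MinHeap() : keys_(1) {}  // position 0 is unused
    bool Empty() const { return keys_.size() == 1; }

    void Insert(const std::string& key) {
        keys_.push_back(key);  // add the new key at the last position
        std::size_t k = keys_.size() - 1;
        while (k > 1 && keys_[k] < keys_[k / 2]) {  // exchange with parent while smaller
            std::swap(keys_[k], keys_[k / 2]);
            k /= 2;
        }
    }

    std::string Remove() {
        assert(!Empty());
        std::string smallest = keys_[1];  // get the smallest value
        keys_[1] = keys_.back();          // move the last element into the root
        keys_.pop_back();                 // decrease the number of elements
        std::size_t n = keys_.size() - 1; // elements now in the heap
        std::size_t k = 1;
        while (2 * k <= n) {              // reorder the heap
            std::size_t child = 2 * k;
            if (child + 1 <= n && keys_[child + 1] < keys_[child]) ++child;  // smaller child
            if (!(keys_[child] < keys_[k])) break;  // keys_[k] is in the right place
            std::swap(keys_[k], keys_[child]);
            k = child;
        }
        return smallest;
    }

private:
    std::vector<std::string> keys_;
};

// Builds a heap from the keys, then retrieves them in ascending order.
std::vector<std::string> RetrieveInOrder(const std::vector<std::string>& keys) {
    MinHeap heap;
    for (const std::string& key : keys) heap.Insert(key);
    std::vector<std::string> sorted;
    while (!heap.Empty()) sorted.push_back(heap.Remove());
    return sorted;
}
```

In the overlapped version described on the slide, the calls to Remove for one output block would run while the previous block is being written; that I/O scheduling is not modeled here.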


8.5 Merging as a Way of Sorting Large Files on Disk
  • Keysort: holding keys in memory
  • Two Shortcomings of Keysort
    • substantial cost of seeking may happen after keysort
    • cannot sort really large files
      • e.g. a file with 800,000 records of 100 bytes each and a 10-byte key: the keys alone take 800,000 × 10 bytes = 8 MB
      • may not even be able to hold and sort all the keys in RAM
  • Multiway merge algorithm
    • small overhead for maintaining pointers, temporary variables
    • run: sorted subfile
    • using heap sort for each run
    • split, read-in, heap sort, write-back


[Figure: Sorting through the creation of runs and subsequent merging of runs. 800,000 unsorted records -> 80 internal sorts -> 80 runs, each containing 10,000 sorted records -> merge -> 800,000 records in sorted order.]


Multiway merging (K-way merge-sort)
  • Can be extended to files of any size
  • Reading during run creation is sequential
    • no seeking due to sequential reading
  • Reading & writing is sequential
  • Sort each run: Overlapping I/O using heapsort
  • K-way merges with k runs
  • Since I/O is largely sequential, tapes can be used


How Much Time Does a Merge Sort Take?
  • Assumptions
    • only one seek is required for any sequential access
    • only one rotational delay is required per access
  • Four I/Os ( refer to page of 39 )
    • during the sort phase
      • reading all records into RAM for sorting, forming runs
      • writing sorted runs out to disk
    • during the merge phase
      • reading sorted runs into RAM for merging
      • writing sorted file out to disk


Four Steps(1)
  • Step1: Reading records into RAM for sorting and forming runs
    • assume: 10MB input buffer, 800MB file size
    • seek time --> 8msec, rotational delay --> 3msec
    • transmission rate --> 0.0145MB/msec
    • time for step1: access 80 blocks (80 × 11 msec) + transfer 80 blocks (800 / 0.0145 msec)
  • Step2: Writing sorted runs out to disk
    • writing is the reverse of reading
    • the time for step2 equals the time for step1
SNU-OOPSLA Lab.

four steps 2
Four Steps(2)
  • Step3: Reading sorted runs into RAM for merging
    • 10 MB of RAM is available for buffering the 80 runs
    • reallocate the 10 MB of RAM as 80 input buffers, one per run
    • each buffer holds 1/80 of a run (0.125 MB), so each run must be accessed 80 times to read all of it
    • total seek & rotational time --> 80 runs × 80 seeks = 6,400 seeks; 6,400 × 11 msec ≈ 70 seconds
    • transfer time --> 60 seconds
    • total time = total seek & rotation time + transfer time


Four Steps(3)
  • Step4: Writing sorted file out to disk
    • need to know how big output buffers are
    • with 20,000-byte output buffers: 80,000,000 bytes ÷ 20,000 bytes per seek = 4,000 seeks
    • total seek & rotation time = 4,000 × 11 msec
    • transfer time is still 60 seconds
  • Consider Table 8.1 (p. 323)
  • What if we use keysort for the 800 MB file? --> 24 hrs 26 mins 40 secs
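
The arithmetic behind the four steps can be checked with a small cost model. This is a sketch that simply plugs in the figures quoted on the slides (8 msec seek, 3 msec rotational delay, 0.0145 MB/msec, 800 MB file); the function and constant names are hypothetical, and the model ignores everything except seek, rotation, and transfer time.

```cpp
#include <cassert>

// Figures quoted on the slides (assumptions of the example, not measurements).
const double kSeekPlusRotationMs = 8.0 + 3.0;  // 8 msec seek + 3 msec rotational delay
const double kTransferMBPerMs    = 0.0145;     // transmission rate
const double kFileMB             = 800.0;      // file size

// Seconds needed for one pass over `megabytes` of data that takes `accesses`
// separate disk accesses (one seek and one rotational delay each).
double PassSeconds(double accesses, double megabytes = kFileMB) {
    assert(accesses >= 0 && megabytes >= 0);
    double seek_ms     = accesses * kSeekPlusRotationMs;
    double transfer_ms = megabytes / kTransferMBPerMs;
    return (seek_ms + transfer_ms) / 1000.0;
}
```

With these inputs, `PassSeconds(80)` models step 1 (and step 2), `PassSeconds(6400)` step 3, and `PassSeconds(4000)` step 4. The transfer term works out to about 55 seconds per pass, which the slides round to 60 seconds; the seek term is what differs between the steps.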


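Effect of buffering on the number of seeks required

[Figure: The 800 MB file is held as 80 runs of 10 MB each, and the 10 MB of memory is divided into 80 buffers. The 1st run, the 2nd run, ..., and the 80th run are each 80 buffers' worth of data, so each run takes 80 accesses to read. The output is the sorted records (labelled 800,000 sorted records on the slide).]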


Sorting a Very Large File
  • Two kinds of I/O
    • Sort phase
      • I/O is sequential if using heapsort
      • since sequential access already involves minimal seeking, we cannot algorithmically speed up I/O
    • Merge phase
      • RAM buffers for each run get loaded and reloaded at unpredictable times -> random access
      • for performance, look for ways to cut down on the number of random accesses that occur while reading runs
      • this is where there is room for improvement


The Cost of Increasing the File Size
  • K-way merge of K runs
  • Merge sort = O(K^2) (the merge operation requires K^2 seeks)
  • If K is a big number, you are in trouble!
  • Some ways to reduce time!! (8.5.4, 8.5.5, 8.5.6)
    • more hardware (disk drives, RAM, I/O channel)
    • reducing the order of merge (k), increasing buffer size of each run
    • increase the lengths of the initial sorted runs
    • find the ways to overlap I/O operations


Hardware-based Improvements
  • Increasing the amount of RAM
    • longer & fewer initial runs
    • fewer seeks
  • Increasing the number of disk drives
    • no delay due to seek time after generation of runs
    • assign input and output to separate drives
  • Increasing the number of I/O channels
    • separate I/O channels, I/O can overlap
    • Improve transmission time


Decreasing the Number of Seeks Using Multiple-step Merges
  • K-way merge characteristics
    • a selection tree is used
      • the number of comparisons is N*log K (K-way merge with N records)
    • K is proportional to N
      • O(N*log N) : reasonably efficient
  • The way to reduce seeks is to reduce the number of runs merged at one time
    • give each run a bigger buffer space
    • a multiple-step merge provides a way to do this without more RAM


Multiple-step merge(1)
  • Do not merge all runs at one time
  • Break the original set of runs into small groups and merge the runs in each group separately
  • Leads to fewer seeks, but extra transmission time in the second pass
  • Reads every record twice
    • to form the intermediate runs & the final sorted file
  • Similar to using a selection tree when merging n lists!!


[Figure: Two-step merge of 800 runs. The runs are divided into 25 sets of 32 runs each (25 sets × 32 runs = 800 runs); the runs in each set are merged first, and the resulting intermediate runs are then merged into the final sorted file.]


Multiple-step merge(2)
  • Essence of multiple-step merging
    • increase the available buffer space for each run
    • extra pass vs. random access decrease
  • Can we do even better with more than two steps?
    • trade-offs between the seek&rotation time and the transmission time
  • major cost in merge sort
    • seek, rotation time, transmission time, buffer size, number of runs
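
The trade-off can be made concrete by counting read seeks. The sketch below assumes that every initial run is exactly as large as the available memory, that memory is split evenly among the runs being merged, and that each buffer refill costs one seek. It counts only the seeks for reading runs during merging (not run creation or output), so its results are smaller than the totals in the comparison tables. The function names are hypothetical.

```cpp
#include <cassert>

// Read seeks for a one-step K-way merge of K runs: each run gets 1/K of memory,
// so it takes K buffer loads to read one run, and K * K loads in total.
long OneStepMergeSeeks(long runs) {
    assert(runs > 0);
    return runs * runs;
}

// Read seeks for a two-step merge: `groups` separate merges of `runs_per_group`
// runs each, followed by one merge of the `groups` intermediate runs.
long TwoStepMergeSeeks(long groups, long runs_per_group) {
    assert(groups > 0 && runs_per_group > 0);
    // First step: every group behaves like a small one-step merge.
    long first_step = groups * runs_per_group * runs_per_group;
    // Second step: each intermediate run is runs_per_group times the memory size
    // and gets 1/groups of memory, so it needs runs_per_group * groups loads.
    long second_step = groups * (runs_per_group * groups);
    return first_step + second_step;
}
```

Under these assumptions, `OneStepMergeSeeks(800)` gives 640,000 read seeks, while `TwoStepMergeSeeks(25, 32)` gives 25,600 + 20,000 = 45,600, at the price of reading every record a second time.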


Increasing Run Lengths Using Replacement Selection(1)
  • Facts of Life
    • Want to use the heap sort in memory
    • Want to allocate longer output runs
    • Can we pack the longer output runs using the heap sort in memory?
  • Replacement Selection
    • Idea
      • always select the key from memory that has the lowest value
      • output the key
      • replace it with a new key from the input list
      • use 2 heaps in the memory buffer


Increasing Run Lengths Using Replacement Selection(2)
  • Implementation
    • step1: read records and sort using heap sort
      • this heap is the primary heap
    • step2: write out only the record with the lowest value
    • step3: bring in new record and compare its key with that of record just output
    • step3-a: if the new key is higher, insert the new record into its proper place in the primary heap, along with the other records being selected for output
    • step3-b: if the new key is lower, place the record in a secondary heap, which holds records with key values lower than those already written out
    • step4: repeat step 2 and step 3 while there are records in the primary heap and records still to be read in. When the primary heap is empty, make the secondary heap into the primary heap and repeat step 2 and step 3 to form the next run
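
The steps above can be sketched with two heaps from the C++ standard library. This is an illustration under assumptions: keys are integers held in a vector rather than records read from disk, `std::priority_queue` stands in for the hand-built heap, and the function name `ReplacementSelection` is hypothetical. The parameter `p` is the number of records that fit in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using MinHeapOfInts = std::priority_queue<int, std::vector<int>, std::greater<int>>;

// Splits `input` (in arrival order) into sorted runs using replacement selection.
std::vector<std::vector<int>> ReplacementSelection(const std::vector<int>& input, std::size_t p) {
    assert(p > 0);
    std::vector<std::vector<int>> runs;
    MinHeapOfInts primary, secondary;
    std::size_t next = 0;

    // Step 1: fill memory with the first p records and build the primary heap.
    while (next < input.size() && primary.size() < p) primary.push(input[next++]);

    while (!primary.empty()) {
        runs.emplace_back();  // start a new run
        while (!primary.empty()) {
            // Step 2: write out the record with the lowest key.
            int lowest = primary.top();
            primary.pop();
            runs.back().push_back(lowest);
            // Step 3: bring in a new record and compare it with the key just output.
            if (next < input.size()) {
                int key = input[next++];
                if (key >= lowest) primary.push(key);  // step 3-a: still fits in this run
                else secondary.push(key);              // step 3-b: held back for the next run
            }
        }
        // Step 4: the secondary heap becomes the primary heap for the next run.
        std::swap(primary, secondary);
    }
    return runs;
}
```

Fed the 13 keys of the step-by-step slides in arrival order (16, 47, 5, 12, 67, 21, 7, 17, 14, 58, 24, 18, 33) with p = 3, it forms two runs, the first of length 6 and the second of length 7, both longer than the 3 records that memory holds.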


Example of the principle underlying replacement selection

Input: 21, 67, 12, 5, 47, 16 (the front of the input string is at the right; memory is organized as a heap)

Remaining input   Memory (P=3)   Output run
21, 67, 12        5  47 16       -
21, 67            12 47 16       5
21                67 47 16       12, 5
-                 67 47 21       16, 12, 5
-                 67 47 -        21, 16, 12, 5
-                 67 -  -        47, 21, 16, 12, 5
-                 -  -  -        67, 47, 21, 16, 12, 5


Replacement Selection(1)
  • What happens if a key arrives in memory too late to be output into its proper position relative to the other keys? (e.g. if the 4th key is 2 rather than 12)
    • use a second heap; the key is included in the next run
    • refer to page 335, Figure 8.25
  • Two questions
    • Given P locations in memory, how long a run can we expect replacement selection to produce, on the average?
      • On the average, we can expect a run length of 2P
      • Knuth provides an excellent description (pages 335-336)


Comparison of access times required to sort 8 million records using both RAM sort and replacement selection

Approach                                   Records per   Size of   Seeks to    Merge   Total     Total seek &
                                           seek to       runs      form runs   order   number    rotation
                                           form runs     formed                used    of seeks  delay time
800 RAM sorts followed by an               10,000        10,000    1,600       800     681,600   4 hr 58 min
800-way merge
Replacement selection followed by          2,500         15,000    6,400       534     521,134   3 hr 48 min
534-way merge (records in random order)
Replacement selection followed by          2,500         40,000    200         200     206,400   1 hr 30 min
200-way merge (records partially ordered)


Step-by-step operation of replacement selection with 2 heaps working to form two sorted runs(1)

Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16 (the front of the input string is at the right; memory is organized as a heap)

Keys in parentheses belong to the secondary heap.

Remaining input                          Memory (P=3)     Output run (A)
33, 18, 24, 58, 14, 17, 7, 21, 67, 12    5    47   16     -
33, 18, 24, 58, 14, 17, 7, 21, 67        12   47   16     5
33, 18, 24, 58, 14, 17, 7, 21            67   47   16     12, 5
33, 18, 24, 58, 14, 17, 7                67   47   21     16, 12, 5
33, 18, 24, 58, 14, 17                   67   47   (7)    21, 16, 12, 5
33, 18, 24, 58, 14                       67   (17) (7)    47, 21, 16, 12, 5
33, 18, 24, 58                           (14) (17) (7)    67, 47, 21, 16, 12, 5


Step-by-step operation of replacement selection working to form two sorted runs(2)

First run complete; start building the second

Remaining input   Memory (P=3)   Output run (B)
33, 18, 24, 58    14 17 7        -
33, 18, 24        14 17 58       7
33, 18            24 17 58       14, 7
33                24 18 58       17, 14, 7
-                 24 33 58       18, 17, 14, 7
-                 -  33 58       24, 18, 17, 14, 7
-                 -  -  58       33, 24, 18, 17, 14, 7
-                 -  -  -        58, 33, 24, 18, 17, 14, 7


Replacement Selection Plus Multiple Merging
  • Total number of seeks is less than for the one-step merges
  • The two-step merge requires transferring the data two more times than the one-step merge does
    • two-step merges & replacement selection are still better, but the results are less dramatic
    • refer to the tables on the next two slides


Comparison of merges, considering transmission times(1): 1-step merge

Approach                       Records per   Merge     Seeks for   Seek +       Total      Total          Total of seek,
                               seek to       pattern   sorts and   rotational   passes     transmission   rotation, and
                               form runs     used      merges      delay time   over the   time (min)     transmission
                                                                   (min)        file                      times (min)
RAM sorts                      10,000        800-way   681,700     298          4          43             341
Replacement selection          2,500         534-way   521,134     228          4          43             271
(records in random order)
Replacement selection          2,500         200-way   206,400     90           4          43             133
(records partially ordered)


8.5 Merging as a Way of Sorting Large Files on Disk

Comparison of merges, considering transmission times (2): two-step merge

Approach | Number of records per seek to form runs | Merge pattern used | Number of seeks for sorts and merges | Seek + rotational delay time (min) | Total passes over the file | Total transmission time (min) | Total of seek, rotation, and transmission times (min)
RAM sorts | 10,000 | 25 x 32-way (one 25-way) | 127,200 | 56 | 6 | 65 | 121
Replacement selection (records in random order) | 2,500 | 19 x 28-way (one 19-way) | 124,438 | 55 | 6 | 65 | 120
Replacement selection (records partially ordered) | 2,500 | 20 x 10-way (one 20-way) | 110,400 | 48 | 6 | 65 | 113
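
The last column of the comparison is the sum of the seek-plus-rotational-delay time and the transmission time. As a quick arithmetic check (a sketch only; the dictionary layout and the name `total_minutes` are assumptions introduced here), the following Python snippet recomputes that total for the one-step RAM-sort approach and for the three two-step approaches, using figures taken from the tables.

```python
# (seek + rotational delay, transmission) in minutes, copied from the tables.
approaches = {
    "RAM sorts, one-step 800-way merge": (298, 43),
    "RAM sorts, two-step 25 x 32-way merge": (56, 65),
    "replacement selection (random), two-step 19 x 28-way merge": (55, 65),
    "replacement selection (partially ordered), two-step 20 x 10-way merge": (48, 65),
}


def total_minutes(seek_min, transmission_min):
    """Total time is positioning time plus data transfer time."""
    return seek_min + transmission_min


totals = {name: total_minutes(*times) for name, times in approaches.items()}
for name, minutes in totals.items():
    print(f"{minutes:4d} min  {name}")
#  341 min  RAM sorts, one-step 800-way merge
#  121 min  RAM sorts, two-step 25 x 32-way merge
#  120 min  replacement selection (random), two-step 19 x 28-way merge
#  113 min  replacement selection (partially ordered), two-step 20 x 10-way merge
```

For the RAM-sort rows, the two extra passes add 22 minutes of transmission (65 versus 43) but remove 242 minutes of seeking (56 versus 298).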


8.5 Merging as a Way of Sorting Large Files on Disk

Using Two Disks with Replacement Selection
  • With two disk drives
    • input and output can overlap
      • this can reduce transmission time by as much as 50%
    • seeking is virtually eliminated
  • Sort phase
    • run selection and output can overlap
  • Merge phase
    • the output disk becomes the input disk, and vice versa
    • seeking still occurs on the input disk, but output is sequential
  • Result: merge and transmission times are substantially reduced


8.5 Merging as a Way of Sorting Large Files on Disk

Memory organization for replacement selection

disk1 → input buffers → heap → output buffers → disk2


8.5 Merging as a Way of Sorting Large Files on Disk

More Drives? More Processors?
  • More drives?
    • Adding drives helps until I/O becomes so fast that processing cannot keep up with it
  • More processors?
    • mainframes
    • vector and array processors
    • massively parallel machines
    • very fast local area networks


8.5 Merging as a Way of Sorting Large Files on Disk

Effects of Multiprogramming
  • Increases the efficiency of the overall system by overlapping processing and I/O
  • Its effects are very hard to predict


8.5 Merging as a Way of Sorting Large Files on Disk

A Concept Toolkit for External Sorting
  • For in-RAM sorting, use heapsort
  • Use as much RAM as possible
  • Use a multiple-step merge when
    • the number of initial runs is so large that seek and rotation time is much greater than transmission time
  • Use replacement selection when
    • the input may be partially ordered
  • Use more than one disk drive and I/O channel
    • so that reading and writing can overlap
  • Look for ways to take advantage of new architectures and systems
    • parallel processing or high-speed networks

Sorting Files on Tape
  • Balanced merge with several tape drives

Figure 8.28 (balanced 2-way merge on 4 tapes), Step 1

Tape | Contains runs
T1   | R1 R3 R5 R7 R9
T2   | R2 R4 R6 R8 R10
T3   | --
T4   | --

  • If P is the number of passes, N the number of runs, and k the number of input drives, then $P = \lceil \log_k N \rceil$
  • 4 tape drives (2 for input, 2 for output), 10 runs ==> 4 passes
  • 20 tape drives (10 for input, 10 for output), 200 runs ==> 3 passes
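
The pass count can be checked with a small Python sketch. The function name `merge_passes` is an assumption introduced here. Rather than calling a floating-point logarithm, it applies one k-way merge pass at a time (each pass leaves the ceiling of N/k runs), which yields the same value as the ceiling of the base-k logarithm of N.

```python
def merge_passes(num_runs, k):
    """Count the passes a k-way balanced merge needs to reduce num_runs to one run."""
    if k < 2:
        raise ValueError("k must be at least 2")
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // k)  # ceiling division: runs left after one pass
        passes += 1
    return passes


print(merge_passes(10, 2))    # 4  (4 tape drives: 2 input, 2 output)
print(merge_passes(200, 10))  # 3  (20 tape drives: 10 input, 10 output)
```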

Sorting Files on Tape
  • Other ways of Balanced Merge

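(Fig 8.30) Entries are run lengths; dots mark runs already read.

       | T1        | T2        | T3    | T4
Step 1 | 1 1 1 1 1 | 1 1 1 1 1 | --    | --
Step 2 | --        | --        | 2 2 2 | 2 2
Step 3 | 4         | 4         | .. 2  | --
Step 4 | --        | --        | --    | 10

(Fig 8.31)

       | T1        | T2    | T3  | T4
Step 1 | 1 1 1 1 1 | 1 1 1 | 1 1 | --
Step 2 | .. 1 1 1  | .. 1  | --  | 3 3
Step 3 | ... 1 1   | --    | 5   | . 3
Step 4 | .... 1    | 4     | 5   | --
Step 5 | --        | --    | --  | 10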

K-way Balanced Merge on Tapes
  • Some difficult questions
    • How does one choose an initial distribution that leads readily to an efficient merge pattern?
    • Are there algorithmic descriptions of the merge patterns, given an initial distribution?
    • Given N runs and J tape drives, is there some way to compute the optimal merging performance so we have a yardstick against which to compare the performance of any specific algorithm?

Unix: Sorting and Cosequential Processing
  • Sorting in Unix
    • The Unix sort command
    • The qsort library routine
  • Cosequential processing utilities in Unix
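    • cmp: compares two files byte by byte
    • diff: reports the lines that differ between two files
    • comm: reports the lines common to two sorted files, and the lines unique to each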
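
To connect these utilities back to the chapter's cosequential model, here is a hedged Python sketch of the kind of matching that `comm` performs on two sorted inputs. It is not the utility's actual implementation: the function name `comm`, the tuple it returns, and the use of in-memory lists instead of files are assumptions for illustration.

```python
def comm(left, right):
    """Cosequentially match two sorted lists.

    Returns (only in left, only in right, in both), mirroring the three
    columns that the Unix `comm` utility prints.
    """
    only_left, only_right, both = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            only_left.append(left[i])    # advance the list with the smaller key
            i += 1
        elif left[i] > right[j]:
            only_right.append(right[j])
            j += 1
        else:
            both.append(left[i])         # match: advance both lists
            i += 1
            j += 1
    only_left.extend(left[i:])           # whatever remains is unmatched
    only_right.extend(right[j:])
    return only_left, only_right, both


list1 = ["adams", "carter", "davis", "miller"]
list2 = ["adams", "anderson", "carter", "rosewald"]
print(comm(list1, list2))
# (['davis', 'miller'], ['anderson', 'rosewald'], ['adams', 'carter'])
```

Each list is read once, front to back, which is the defining property of a cosequential process.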

Let’s Review !!

8.1 Cosequential operations

8.2 Application of the Model to a General Ledger Program

8.3 Extension of the Model to Include Multiway Merging

8.4 A Second Look at Sorting in Memory

8.5 Merging as a Way of Sorting Large Files on Disk

8.6 Sorting Files on Tape

8.7 Sort-Merge Packages

8.8 Sorting and Cosequential Processing in Unix
