Module 2 Non Linear Data Structures and Applications

Module 2 Non Linear Data Structures and Applications Supratim Biswas Computer Science & Engineering Department Indian Institute of Technology, Bombay

Module 2 Schedule

Module 2 Contents • Prerequisite: material covered in Module 1 or equivalent. • Hash table data structure with basic operations; hashing function and collision resolution; implementation using array and linked list; comparison of hash table with other searching algorithms. • Doubly and multiply linked data structures with basic operations; use in implementing binary tree and graph data structures. Comparison of linked data structures with respect to space and time cost for basic operations. • Implementation of multiply linked lists in C++; template class implementation of a linked list.

Module 2 Contents • Applications of nonlinear data structures in problem solving: (a) search application using hash tables, (b) sparse polynomial manipulation: addition, subtraction, multiplication, (c) design of buffer cache for disk blocks.

Hash Table Data Structure • This data structure is very effective for applications that have the following characteristics. • The population of data items is enormously large. • An application typically uses a subset of the population that is much smaller. • The size of the subset and its contents are dynamic and may vary significantly between different runs. • An example: consider the population of variable names that can be chosen in a programming language like C++.

Hash Table Data Structure • The following simplifying assumptions are made, which are more restrictive than reality. • A variable name is at most 32 characters long. • The character set is {a, b, ..., z, A, B, ..., Z, 0, 1, ..., 9, _ (underscore)}, which has 63 symbols. • Q. What is the size of the population of all distinct variable names in C++ under the assumptions given above ? • Q. How many variables are typically used in user programs ? • Q. How many variables are used even in large applications, such as g++, the Linux OS, etc. ?
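As a rough aid for the first question, the count can be estimated with a short computation. The sketch below assumes that every string of length 1 to 32 over the 63-symbol alphabet counts as a name (the assumptions above do not exclude a leading digit, although real C++ does). It uses long double because the result is far beyond the range of 64-bit integers. The function name population_size is hypothetical.

```cpp
// Approximate number of strings of length 1..max_len over an alphabet of
// `symbols` characters: symbols^1 + symbols^2 + ... + symbols^max_len.
// Assumption: every such string is a legal name (leading digits allowed).
long double population_size(int symbols, int max_len) {
    long double total = 0.0L;
    long double power = 1.0L;
    for (int len = 1; len <= max_len; ++len) {
        power *= symbols;   // symbols^len
        total += power;
    }
    return total;
}
```

Under these assumptions, population_size(63, 32) is of the order of 10^57, which is what makes a structure sized to the subset actually used, rather than to the population, attractive.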

Hash Table Data Structure • The concepts that one needs to learn to appreciate the internals of this data structure are the following. • Hash functions • Hash table organization • hashpjw() function, an instance of a non-trivial hashing function • bit operations in C++ and their semantics

Hash Table Data Structure • A hash table is an organization of the form given below. • The central idea behind this organization is the existence of a hash function, f, defined over every element of the universe of values of interest, say set U. • The domain and range of such a function are defined by f : U -> I such that f(u) = n, where u belongs to U, n is a non-negative integer such that 0 <= n < p, and p is a suitably chosen prime number.

Hash Table Data Structure • The desirable properties of the hash function f are the following: • it should be easily computable, and • f(u) should distribute evenly over I. • The program for hash table construction to be discussed uses the method described above, with the hashpjw() function as the hash function. • Example of a hash function: • Consider a set S of 10 integers, S = {11, 29, 35, 41, 75, 100, 112, 283, 1000, 1551}. • Choose the hash function f() given by f(i) = i % 7. • Clearly f() is well defined for all non-negative integers and hence also for S.

Hash Table Data Structure The values of the hash function for the elements of S are given in the following table.

i    : 11  29  35  41  75  100  112  283  1000  1551
f(i) :  4   1   0   6   5    2    0    3     6     4

The function f(i) = i % 7 meets both the requirements of a hash function reasonably well.

Hash Table Data Structure • Among the 10 numbers that were hashed, only 3 hash values {0, 4, 6} occurred twice, while the remaining values {1, 2, 3, 5} occurred once each. • A hashing function such as f(i) = i % 2 is not good because it maps every even number to 0 and every odd number to 1, which is biased and does not give a good spread of hash function values. • In general, the fewer the repetitions, the better the quality of the hash function. • Two or more elements, say v1 and v2, are said to collide when they produce the same hash value, i.e., f(v1) = f(v2) = m (say). • A desirable feature of a good hashing function is that it causes few collisions.
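The spread of hash values and the number of collisions for the set S can be checked with a few lines of code. This is a minimal sketch; the helper names bucket_counts and collisions are hypothetical, and a collision is counted here as a key that lands on a hash value already taken by another key.

```cpp
#include <map>
#include <vector>

// f(i) = i % p for a non-negative integer i.
int f(int i, int p) { return i % p; }

// Count how many keys fall on each hash value.
std::map<int, int> bucket_counts(const std::vector<int>& keys, int p) {
    std::map<int, int> counts;
    for (int k : keys) ++counts[f(k, p)];
    return counts;
}

// Number of collisions: keys that land on an already used hash value.
int collisions(const std::vector<int>& keys, int p) {
    int total = 0;
    for (const auto& entry : bucket_counts(keys, p))
        total += entry.second - 1;
    return total;
}
```

For S, collisions(S, 7) is 3 (one extra key on each of the values 0, 4 and 6), whereas a modulus of 2 crowds all ten keys onto two values and gives 8 collisions.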

Hash Table • The hash function may be used to create a data structure called a hash table. • The hash table data structure is convenient for working with a dynamic population in which basic list operations (search, insert, delete, order, etc.) can be performed efficiently. • It is supposed to be more efficient in both space and time than a linked list, while retaining the dynamic nature of a linked list (as compared to an array). • Consider a hash table organization for the value set S given above. Let us create an array, named htable, of size 5 (any prime less than the size of the set is fine).

Hash Table • An element htable[i] has 2 data members, the first an int value (the length of the list) and the second a linked list. • Initially, htable is as shown below in Fig. 1.1. All the 5 lists are null, which is also indicated by the length of each list being 0.

[0] 0 null
[1] 0 null
[2] 0 null
[3] 0 null
[4] 0 null
Fig. 1.1 : htable at start

Hash Table • Element 11 is searched in htable. f(11) = 11 % 5 is 1 and hence the value 11 is looked up in the list at htable[1]. The list being empty, 11 is inserted at the head of the list. • For 29, since it is not present in the list at htable[4], it is inserted there. Similarly 35 is inserted in the list at htable[0]. The htable at this stage is shown in Fig. 1.2. • For 41, the list at htable[1] is examined. After traversing the single-element list (cost 1), 41 is inserted at the head of the list at htable[1].

Hash Table • Similarly 75 and 100 are inserted at htable[0], 112 at htable[2], 283 at htable[3], 1000 at htable[0] and 1551 at htable[1]. • The final configuration of htable[ ] is shown in Fig. 1.3. • Simple calculations may be performed to compare the cost of using a hash table against an array or a singly linked list. • It is assumed that any operation whose cost is independent of the size of the collection has cost 1, i.e., constant time, or O(1). • The numbers pertain to the example above with the set of values S.

Hash Table

[0] 4 1000 -> 100 -> 75 -> 35 -> null
[1] 3 1551 -> 41 -> 11 -> null
[2] 1 112 -> null
[3] 1 283 -> null
[4] 1 29 -> null
Fig. 1.3 : htable after processing all elements of S
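The construction walked through above can be sketched with the standard library. Here std::forward_list stands in for the linked list class of Module 1, and the names Bucket, HTable, search_or_insert and build are hypothetical. Values are assumed to be non-negative so that v % P is a valid index.

```cpp
#include <algorithm>
#include <array>
#include <forward_list>
#include <vector>

const int P = 5;  // table size used in the example

struct Bucket {
    int length = 0;                 // number of elements in the list
    std::forward_list<int> items;   // values that hash to this index
};

using HTable = std::array<Bucket, P>;

// Search for v; if it is absent, insert it at the head of its list.
// Returns true if v was already present. Assumes v >= 0.
bool search_or_insert(HTable& htable, int v) {
    Bucket& b = htable[v % P];
    if (std::find(b.items.begin(), b.items.end(), v) != b.items.end())
        return true;
    b.items.push_front(v);
    ++b.length;
    return false;
}

// Process the values in the given order, starting from an empty table.
HTable build(const std::vector<int>& values) {
    HTable htable{};
    for (int v : values) search_or_insert(htable, v);
    return htable;
}
```

Calling build with the elements of S in the order listed gives list lengths 4, 3, 1, 1, 1 for indices 0 to 4, with the most recently inserted value at the head of each list.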

Hash Table

Hash Table • For an array representation, access is direct, so a read from or write to an array has cost 1. • Assuming that the array is unsorted, the cost of finding an element depends on whether the element is present in the array or not. • Absence is detected after examining all the elements, which amounts to a cost of 10. If present, the cost is i, where 1 <= i <= 10 is the position at which the element being searched is located in the array. • For a linked list representation, the time to find has the same behaviour as that of an array. • The cost of a read / write of an element depends on first locating it, followed by the i/o operation, and hence is the same as the cost of find.

Hash Table Organization Insertion at the head requires a constant number of pointer manipulations, cost 1, but only after incurring the cost of a find, which explains the entries for insert (and also for delete, by a similar argument). For the hash table organization, it is assumed that all linked lists are roughly of the same size. In our example, the sizes of the linked lists vary from 1 to 4. The mean value in this part of the table assumes that all numbers are equally likely.
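One way to work out such mean values is sketched below. It assumes the cost model stated above (finding the element at position i of a list costs i, every stored element is equally likely to be searched) and ignores the cost of computing the hash value. The function names are hypothetical and the figures it produces are an illustration under these assumptions, not necessarily the entries of the comparison table.

```cpp
#include <vector>

// Mean cost of a successful search in one unsorted array or list of
// n elements, when finding the element at position k costs k.
double mean_hit_cost_single_list(int n) {
    return (n + 1) / 2.0;
}

// Mean cost of a successful search in a hash table whose lists have the
// given lengths. Assumes at least one stored element.
double mean_hit_cost_hash_table(const std::vector<int>& list_lengths) {
    int total_cost = 0, total_items = 0;
    for (int len : list_lengths) {
        total_cost += len * (len + 1) / 2;  // 1 + 2 + ... + len
        total_items += len;
    }
    return static_cast<double>(total_cost) / total_items;
}
```

With the list lengths 4, 3, 1, 1, 1 of the example, the hash table gives a mean of 19 / 10 = 1.9 comparisons per successful search, against 5.5 for a single array or list of 10 elements.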

Hash Table Organization Comparing the worst-case and mean values of the three representations, we find that the hash table organization meets its mandate quite satisfactorily. Hashing Functions The literature on hash functions is fairly rich, and many functions have been proposed on this topic, starting from the early 70s. The mathematical properties of various hash functions are also available. While the simple hash function f(i) = i % p, p a prime number, works satisfactorily, it has limited applicability, as the modulus operator % works only for integral values.

Hash Table Organization We introduce a special hash function, known as the hashpjw() function, which has been shown to work reasonably well in practice. The function is examined here. Efficiency often requires good use of special features of a language. The function hashpjw() uses bit operators of C/C++. A brief overview of the bit operators and their use in manipulation of values is included for completeness.

Bit Operators in C / C++ The bit operators in C++ with their attributes are given in the following table.

Bit Operators in C / C++ The remaining bit operators in C++ with their attributes are listed below.

Bit Operators in C / C++ • Every operator has 3 attributes. • Arity gives the number of operands an operator consumes. • Associativity specifies the evaluation order to be used for processing an expression with 2 or more contiguous instances of the same operator. • Precedence specifies the evaluation order to be used for processing an expression with 2 or more instances of different operators. • The semantics of a few bit operators { ~, <<, >>, &, | } in C++ are explained here.
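The effect of associativity and precedence can be observed directly. The sketch below uses two small checks with hypothetical function names; each function returns true. A compiler may warn about the deliberately unparenthesized expression, which is the point of the example.

```cpp
// Associativity: << groups left to right, so 1 << 2 << 3 is (1 << 2) << 3.
bool shift_is_left_associative() {
    return (1 << 2 << 3) == ((1 << 2) << 3)   // 4 << 3 = 32
        && (1 << 2 << 3) != (1 << (2 << 3));  // 1 << 16 = 65536
}

// Precedence: == binds tighter than &, so n & mask == mask is read as
// n & (mask == mask), which is n & 1.
bool equality_binds_tighter_than_and() {
    unsigned n = 6u, mask = 4u;               // bit patterns 0110 and 0100
    bool unparenthesized = n & mask == mask;  // 6 & 1 = 0, i.e. false
    bool parenthesized = (n & mask) == mask;  // 4 == 4, i.e. true
    return !unparenthesized && parenthesized;
}
```

The second check explains why tests of the form (n & mask) == mask are always written with the inner parentheses.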

Bit Operators in C / C++ The bit-wise OR (|) and AND (&) operators have the following semantics. Recall that there are only 2 distinct bits, 1 and 0. bitwise OR ( | )

a  b  a | b
0  0    0
0  1    1
1  0    1
1  1    1

Bit Operators in C / C++ • Semantics of bit-wise AND (&) and XOR (^)

a  b  a & b  a ^ b
0  0    0      0
0  1    0      1
1  0    0      1
1  1    1      0

Bit Operators in C / C++ • Semantics of bit-wise NOT (~): each bit is inverted, so ~0 = 1 and ~1 = 0. • Shift Operators on Bits : Right shift operator (>>) • Consider the bit string, of length 16, whose integral value is 106: 0000 0000 0110 1010

Bit Operators in C / C++ • Suppose we perform the operation 0000 0000 0110 1010 >> 2. • The bit string is shifted to the right by two units: the 2 least significant bits (10) are pushed out and 2 bits 00 are pushed in from the left end. • This yields the string 0000 0000 0001 1010, whose integral value is 26. • The effect is that of dividing 106 by 4, where division is the integer division operation.

Shift Operators in C / C++ • In general, for a non-negative x, the semantics of x >> n is the same as x / 2^n, where / is integer division. • Left shift operator (<<) • Consider again the bit string, of length 16, whose integral value is 106, and perform 0000 0000 0110 1010 << 2. • The bit string is shifted to the left by two units: the 2 most significant bits (00) are pushed out and 2 bits 00 are pushed in from the right.

Shift Operators in C / C++ • The resulting string is 0000 0001 1010 1000, whose integral value is 424. • The net effect is that of multiplying by 4, or 2^2. • In general, the effect of x << n is the same as x * 2^n, provided the bits lost from the left end are not '1's.
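Both shift rules can be checked on 16-bit values. This is a minimal sketch; the function names are hypothetical, n is assumed to be less than 16, and the casts are needed because C++ promotes the 16-bit operand to int before shifting.

```cpp
#include <cstdint>

// x >> n on an unsigned value equals integer division by 2^n.
// Assumes n < 16.
std::uint16_t shift_right(std::uint16_t x, unsigned n) {
    return static_cast<std::uint16_t>(x >> n);
}

// x << n equals multiplication by 2^n as long as no 1 bits are lost from
// the left end; the cast discards any bits beyond the 16th. Assumes n < 16.
std::uint16_t shift_left(std::uint16_t x, unsigned n) {
    return static_cast<std::uint16_t>(x << n);
}
```

shift_right(106, 2) gives 26 and shift_left(106, 2) gives 424, as in the worked example. shift_left(0xC000, 2) gives 0, because both 1 bits fall off the left end, so the result no longer equals x * 4.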

Programming with Bits • Operations on bits • There is no built-in bit data type in C++ and therefore the semantics given above cannot be applied to an individual bit directly. • However, the bit operators of C++ can be applied to data of size 1 byte, 2 bytes or 4 bytes, and they act on every bit of the operands. • The semantics of these operations are explained with the help of a few examples. For the purpose of illustration, we use 8-bit numbers.

Programming with Bits Operations on bits
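A few 8-bit operations can be tried out as follows. This is a sketch with hypothetical helper names; std::bitset is used only to print a value as a bit string. The sample operands 0110 1010 (106) and 0000 1111 (15) in the note after the code are chosen for illustration.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Render an 8-bit value as a string of 0s and 1s, most significant bit first.
std::string bits8(std::uint8_t v) {
    return std::bitset<8>(v).to_string();
}

// The operands are promoted to int before the operator is applied,
// so each result is cast back to 8 bits.
std::uint8_t not8(std::uint8_t a) { return static_cast<std::uint8_t>(~a); }
std::uint8_t and8(std::uint8_t a, std::uint8_t b) { return static_cast<std::uint8_t>(a & b); }
std::uint8_t or8(std::uint8_t a, std::uint8_t b) { return static_cast<std::uint8_t>(a | b); }
std::uint8_t xor8(std::uint8_t a, std::uint8_t b) { return static_cast<std::uint8_t>(a ^ b); }
```

With a = 0110 1010 and b = 0000 1111, the results are ~a = 1001 0101, a & b = 0000 1010, a | b = 0110 1111 and a ^ b = 0110 0101.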

Programming with Bits • One can always write small programs to test and improve one's understanding of the use of bit operators. • Consider the problem of finding the bit pattern stored in 4 bytes. We would like to see the bits as they are placed, without any manipulation of the value stored there. • Let us illustrate with an example. • Suppose the bit pattern stored in 4 consecutive bytes (32 bits) happens to be 0000 0000 0000 0000 0000 0111 1101 0000; the corresponding integer is 2000. • We would like a program that gives the internal bit string for the integer 2000.

Programming with Bits To write a program that shows the internal bits, we need to find the answer to the following question. Q. Given a bit pattern of 8 bits, named, say, n, what is the bit stored in position i, where 0 <= i <= 7 ? The trick is to create a special bit string, also of length 8, that has a '1' at position i and '0' elsewhere.

Programming with Bits We shall call this bit string a mask, because it focuses on a particular bit while ignoring or covering the others. For example, with positions numbered 7 down to 0 from left to right, the required mask for bit position i = 4 is 0001 0000. Now, if we perform the operation n & mask, where & is bitwise AND, what happens ?

Programming with Bits • Operation to be performed : n & mask • n & mask can differ from mask only in the i-th bit, as all other bits are zero in both bit strings by design. • Therefore, if the expression ((n & mask) == mask) evaluates to true, the i-th bit of n must be '1'; otherwise it is '0'. The inner parentheses are required because == has higher precedence than & in C++.

Programming with Bits • The key idea of the solution hinges on how easily we can create the 8 masks, given below, to find the bit at each position from 0 to 7. • 0000 0001, 0000 0010, 0000 0100, ...., 0100 0000, 1000 0000 • Each mask has just one bit set to 1 while all others are fixed at 0. Examine the following expressions, where 1 is the 8-bit bit pattern 0000 0001. • 1 << 0 gives 0000 0001, 1 << 1 gives 0000 0010, 1 << 2 gives 0000 0100, ...., 1 << 7 gives 1000 0000.

Programming with Bits

#include <iostream>
using namespace std;

int main() {
    int num, j;
    unsigned int mask;
    bool bits[32];  // to store the bits of the input
    for (j = 0; j < 32; j++) bits[j] = false;
    cin >> num;
    cout << endl;
    for (j = 0; j < 32; j++) {
        mask = 1u << j;  // create the mask with 1 at j; unsigned, so that j = 31 is well defined
        if ((num & mask) == mask) bits[j] = true;
    }
    // lay out the bit positions in order from 31 to 0
    cout << "The bit positions of " << num << " are : ";
    for (j = 31; j >= 0; j--) cout << bits[j];
    cout << endl;
}

• Q. Does the program given above do its intended task ? • Q. Is the program above capable of finding the bit string representation for a scalar value of any built-in type of C++ ?
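As one possible direction for the second question, the same mask technique can be written once for every integral type by taking the number of bits from sizeof. This is a sketch under stated assumptions: the name bit_string is hypothetical, bytes are taken to be 8 bits wide, and only integral types other than bool are handled. Floating-point values would first need their bytes copied into an integer of the same size.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Bit string of an integral value, most significant bit first.
// Assumption: T is an integral type other than bool, with 8-bit bytes.
template <class T>
std::string bit_string(T value) {
    static_assert(std::is_integral<T>::value, "integral types only");
    using U = typename std::make_unsigned<T>::type;
    const std::size_t nbits = sizeof(T) * 8;
    U u = static_cast<U>(value);               // same bits, unsigned view
    std::string out(nbits, '0');
    for (std::size_t j = 0; j < nbits; ++j) {
        U mask = static_cast<U>(U(1) << j);    // 1 at position j
        if ((u & mask) == mask) out[nbits - 1 - j] = '1';
    }
    return out;
}
```

On a platform with 32-bit int, bit_string(2000) yields the 32-character pattern of the earlier example, and bit_string(static_cast<short>(-1)) yields sixteen 1s.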

hashpjw() • This function uses bit-wise operators, &, << and ^ ( exclusive OR) to compute the hash value, and hence is computationally efficient. • The algorithm used is better understood by examining the for-loop. • The loop runs as many times as the length of the input key (viewed as a string). • Using value of an unsigned integer h at entry of the loop, and the character s[i] of the string s at position i, a new value of h is computed at the end of iteration i of the loop.

hashpjw() • The expression h = ( h << 4 ) + s[i] multiplies the current value of h by 16 (h << 4) and adds to it the int value of s[i] to produce a new value of h. • Next, the bit-wise AND (&) of h with 0xf0000000 is computed. • A byte stores two hexadecimal digits, as 4 bits are required to store a hex digit (0 through 15 or [0-9, a-f]). • The hex number 0xf0000000 expressed as a 32 bit number is 11110000 00000000 00000000 00000000. • The expression h & 0xf0000000 produces a 32 bit number whose 4 most significant bits (MSB) are the same as the 4 MSB of h, while the remaining 28 bits are set to 0.

hashpjw() • The if-condition assigns this value to another unsigned int g. • Unless h is large enough (at least 1 of the 4 MSB is 1), g is assigned zero and the then-part of the if is not executed. • When h is large, “h = h ^ ( g << 24 ); h = h ^ g;” randomizes the value of h; in particular, h ^ g resets the 4 MSB of h to 0. • To summarize, the statements in the for-loop use the characters of the input string to generate a scrambled integer h that hopefully spreads evenly over the table, whatever pattern the input keys follow. • The value of h after the completion of the for-loop is subjected to modulo 211 to generate an index of the hash table, which is the destination of the input key s.
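The steps described above can be put together as follows. This is a sketch reconstructed from the description, not the program used in the course: h is held in exactly 32 bits, the split into hashpjw_raw and hashpjw is a hypothetical convenience, and the statements inside the if are copied as quoted. Note that with a 32-bit g whose only non-zero bits are the 4 MSB, g << 24 evaluates to 0, so the first XOR leaves h unchanged. The widely published form of this function shifts right (g >> 24) at that step; the left shift is kept here because it is what the text quotes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

const std::uint32_t PRIME = 211;   // table size used in the text

// Value of h after the for-loop, before the final modulo.
std::uint32_t hashpjw_raw(const std::string& s) {
    std::uint32_t h = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        h = (h << 4) + static_cast<unsigned char>(s[i]);
        std::uint32_t g = h & 0xf0000000u;   // keep only the 4 MSB of h
        if (g != 0) {
            h = h ^ (g << 24);   // as quoted; on 32 bits g << 24 is 0
            h = h ^ g;           // resets the 4 MSB of h to 0
        }
    }
    return h;
}

// Index into a hash table of PRIME entries.
std::uint32_t hashpjw(const std::string& s) {
    return hashpjw_raw(s) % PRIME;
}
```

For the key "programming", this sketch gives h = 142884935 and the index 142884935 % 211 = 166.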

hashpjw() • The ASCII codes for lower-case letters are given below for reference.

 97 a    98 b    99 c   100 d   101 e   102 f   103 g
104 h   105 i   106 j   107 k   108 l   109 m   110 n
111 o   112 p   113 q   114 r   115 s   116 t   117 u
118 v   119 w   120 x   121 y   122 z

hashpjw() Execution of hashpjw() with the input string “programming” yields the following trace.

Hash Function hashpjw()

h = 142884935 = 677179 * 211 + 166
symbol is 'programming' hash value is 166

A program for creating and searching in a hash table is required to use this data structure for problem solving. A class for linked list is also required. The linked list template class developed in Module 1 is reused here.

// This design is for a template list class
// Can be used to create lists of built-in types
#include <iostream>
#include <string>
#include <fstream>
using namespace std;

Hash Table Program

template <class T> class list {
    class node {
    public:
        T val;
        node *next;
    };
public:
    node *head;
    list() { head = 0; }  // constructor
    bool isempty() { if (head == 0) return true; else return false; }
