Tables

Tables

Chapter 10 • Tables • Overview • Direct access to data via a key, through the Table ADT, implemented as a hash table.

Chapter Objectives • 1. The Table ADT provides direct access to data indexed by a key. • 2. Hashing techniques support very fast retrieval via keys. • 3. Two approaches to hash tables: linear probing and chained hashing.

Retrieval by key • A common programming task is to look up an item in a list • We have already seen some simple ways of doing this • Linear search • Binary search • They differ in terms of data requirements and search efficiency

The Table ADT • Tables are data structures that allow data to be retrieved directly. • They may be multidimensional (unlike lists) • They may be implemented in an array form or using arrays of linked lists (or even linked lists of linked lists).

The Table ADT • Implementation independent • Will be used later with array and linked implementations

Table ADT: characteristics • Characteristics • •A table ADT T stores data of some type (tableElementType) with an associated key (tableKeyType).

Table ADT: operations • Operations • bool T.lookup(tableKeyType lookupKey, • tableElementType & data) • Precondition: None • Postcondition: If lookupKey equals a key in the table, the value of data is set to the data associated with that key; otherwise, the value of data is undefined. • Returns: true if and only if lookupKey equals a key in the table

Table ADT: insert() • void T.insert(tableKeyType insertKey, tableElementType insertData) • Precondition: None • Postcondition: insertData and associated insertKey are stored in T, i.e., T.lookup(insertKey, data) == true, and the value of data will be set to insertData upon return. • Note: The Postcondition implies that if insertKey duplicates an existing key within the table, the data associated with that key is replaced by insertData.

Table ADT, deleteKey() • void T.deleteKey(tableKeyType deleteKey) • Precondition: None • Postcondition: T.lookup(deleteKey, data) will return false

Table ADT • Linear implementation • // for an array implementation, // need a max table size • const int MAX_TABLE = 100;

Table Class: public section • template < class tableKeyType, class tableDataType > • class Table • { • public: • Table(); // Table constructor • bool lookup(tableKeyType lookupKey, tableDataType & data); • void insert(tableKeyType insertKey, tableDataType insertData); • void deleteKey(tableKeyType deleteKey);

Table ADT, private section • private: • // implementation via an unordered array of structs • struct item { • tableKeyType key; • tableDataType data; • }; • item T[MAX_TABLE]; // stores the items in the table • int entries; // keep track of number of entries in table • int search(tableKeyType key); // an internal routine for searching table • };

The classic ‘lookup table’ 15796 foobars 0 Product: Widget Product code: 11234 The data key is the product code and the data itself is the item name 17556 dohinky 1 11234 widget 2 14322 Etc. 3 . . . We also keep track of the number of entries in the table. . . . entries 9998 4 9999

Hash Table, linear version • template < class tableKeyType, class tableDataType > • Int Table < tableKeyType, tableDataType > • ::search(tableKeyType key) • { // internal routine for implementation -- searches in table for the key -- if found, returns its position; • // else it returns the current value of "entries" -- which is the index 1 past the last item in the table • int pos; • for (pos = 0; pos < entries && T[pos].key != key; pos++) • ; • return pos; • }

Search() 15796 foobars 0 Product: Widget Product code: 11234 Returns the location of where The search item was found 17556 dohinky 1 11234 widget 2 14322 Etc. 3 . . . If the item is not found search returns a value equal to entries. . . . entries 9998 4 9999

Table constructor • template < class tableKeyType, class tableDataType > • Table < tableKeyType, tableDataType >::Table() • { • entries = 0; • }

Insert() • template < class tableKeyType, class tableDataType > • void Table < tableKeyType, tableDataType > • ::insert(tableKeyType key, tableDataType data) • { • assert(entries < MAX_TABLE); • int pos(search(key)); // set pos to search results • if (pos == entries) // new key • entries++; • T[pos].key = key; • T[pos].data = data; • }

Insert() 15796 foobars 0 Product: dealybob Product code: 18452 Search returns 4 17556 dohinky 1 11234 widget 2 14322 etc. 3 18452 dealybob 4 . . Insert item in array[4] And add 1 to entries . . . entries 9998 5 9999

lookup() • template < class tableKeyType, class tableDataType > • bool Table < tableKeyType, tableDataType > • ::lookup(tableKeyType key, tableDataType &data) • { int pos(search(key)); // set pos to search results • if (pos == entries) // not found • return false; • else { • data = T[pos].data; • return true; • } • }

deleteKey() • template < class tableKeyType, class tableDataType > • void Table < tableKeyType, tableDataType > • ::deleteKey(tableKeyType key) • { • int pos(search(key)); // set pos to search results • if (pos < entries) { // otherwise, not found, so do nothing • // copy last entry into this position • --entries; • T[pos] = T[entries]; • } • }

deleteKey() 15796 foobars 0 Product: widget Product code: 11234 Search returns 2 17556 dohinky 1 11234 widget 2 14322 Etc. 3 18452 dealybob 4 . . Item in array[2] is written over by last item in array. Subtract 1 from entries . . . entries 9998 4 9999

Problems • This lookup table has one big problem. • Every time we wish to find something in it we must perform a linear search. • This is a O(n) • For many problems it would be too slow.

Search Methods • Recall what we know about searching: • Method Time Drawbacks • Sequential O(n) Slow for large n • Binary O(log2n) Data must be contiguous Inserts and deletes are slow Data must be sorted first • Tree O(log2n) Tree must be balanced • Hashing O(1.1) Needs much unused memory • Direct access O(1) Keys must match array indices

The fastest search solutions • The only way we can get O(1) efficiency in searching for Key in an array A is if either • 1. Key is the position (index) of the data in A A[key] • 2. There is a key-to-address transformation of ‘key’ (hashing function) into the index: A[hash(key)]

Direct access • Alternative 1 (direct access) is often impractical because it requires keys to be in the range 0..array_size-1More often, key data is to be accessed by names, social security number, etc.Thus, a hashing function (a key-to-address transformation) is required for most data sets.

Perils of direct lookup • Image your company stores data for their employees by the key field social security number (SSN) • These numbers are 9 digits long. • You would need an array of size 1,000,000,000 to hold all the possibilities from 000-00-0000 to 999-99-9999

Storing SSNs 000000000 To look up the information for an employee with the SSN 467-89-1234 you go directly to a[467891234]! 000000001 . . . 467891234 . . . . 999999998 999999999

Pros and cons of direct lookup 000000000 Advantages: Direct lookup Every employee has their own unique spot in the array 000000001 . . . 467891234 Disadvantages: You only ever use a small portion of the array. The rest of the space is wasted (and there is a lot of it)! . . . . 999999998 999999999

A better solution • A better solution would map all the employees into a data structure without wasting a lot of space. • From the domain of all possible SSNs we want a structure that will store just the range of ones applying to our employees. • To do this we may have to convert the SSNs to something else.

Hash function map

An example of hash conversion • You wish to store product information by product number. The product numbers have 5 digits with the lowest one being 10000. • const int max_size = 10000; • Then we could come up with a simple hash function • hashKey = productCode-max_Size; • This gives us a number between 0 and 9999. • We can use this unique number to directly access the array element containing data for that product.

Product code hash example 0 Product: Widget Product code: 11234 ProductArray[11234 - 10000] 1 . . . widget 1234 . . . . 9998 9999

Another implementation 0 • Or, we could use mod (%) • like this: 1 . . . Product: Widget Product code: 11234 ProductArray[11234 % 10000] widget 1234 . . . . 9998 9999

An internet example • Internet Protocol (IP) uses 32-bit addresses to look up host names. • Example: 63.100.1.17 (4 bytes) • When you want to access a host machine that is not on your network the router takes the host name you give it and looks up the IP address. • It then forwards your request to that host computer.

Problem • Routers need to look up addresses as quickly as possible. • There are millions of IP addresses, so how will it do this? • Linear search? O(n) • Binary search? O(logn) base 2 • Neither of these are fast enough.

Answer • Convert the host name “joe@schmo.net” to it’s IP address using a hash function based on the characters. • Perhaps we could use ASCII codes to generate unique numbers within a certain range.

Issues in hashing • Each hash should generate a unique number. If two different items produce the same hash code we have a collision in the data structure. Then what? • Two issues must be addressed • 1. Hash functions must minimize collisions (there are strategies to do this) • 2. If (when) collisions do occur, we must know how to handle them.

A Collision

What if... 0 • We used mod (%) like • this: 1 . . . Product: Widget Product code: 11234 Product: Whatzit Product code: 12234 ProductArray[11234 % 1000] widget 234 . whatzit . . . 998 999

Two good rules to follow • A good hashing function must • 1. Always produce the same address for a given key, and • 2. For two different keys, the probability of producing the same address must be low.

Picking good hash functions • We want one that spreads values out evenly. (random distribution) • If all the values cluster together • We have more collisions • We waste space in the rest of the data structure

A bad hash function. Why? A SSN of 123-45-6789 is turned into the x value of 123456789 and then converted by this function to the value 1234. Thus, all SSN’s in the company could be mapped into an array of 10,000 elements (indexes 0000 - 9999). This is good, but could lead to what problems?

The problem is... • SSNs are similar for people of approximately the same age, living in the same area of the country. • Employees are likely to have the first four digits commonly the same. • This would cause lots of collisions • At best, it causes clustering

A rule for hash functions • “a good hash function depends on the entire key, rather than just a part.” • It is best if we use all the digits rather than throwing some of them away with integer division. • A better approach would divide by a prime number and take the remainder, to thoroughly mix the digits.

Prime number division • Divide SSN by 10,007 (the smallest prime number > 10,000). • The remainder is between 0 and 10,006 • So we dimension the array as a[10007] • Then compute the table index from the SSN key in our hash function as • Index = key % tableSize;

An example using primes Data keys: 321, 330, 415, 498, 791 Hash function = key % 11

Hash table after first five entries inserted 0 3 3 0 1 2 3 2 1 3 4 9 8 4 5 6 7 8 4 1 5 9 1 0 7 9 1

Linear probing • No matter how good a hash function is, collisions will occur • One method for handling them is ‘linear probing’ • When you hash into the table and find the spot already occupied, you go forward, in linear search fashion until you find a slot that has not been taken. Then insert the item there.

2. so probe forward, adding it at the next free slot Hash table after 365 added 0 3 3 0 1 1. can't add 365 at its home address 2 3 2 1 3 4 9 8 4 3 6 5 5 6 7 8 4 1 5 9 1 0 7 9 1

Tables

Tables

Presentation Transcript

Tables

Tables

Tables

Tables

Tables

Tables

Tables

TABLES

Tables

TABLES

Tables

Tables

Tables

Entity Tables, Relationship Tables

Tables

Tables

TABLES

Tables

Tables

Tables

TABLES

Tables