
Storage and Retrieval Structures



Presentation Transcript


  1. Storage and Retrieval Structures by Ron Peterson

  2. Overview • Storage & Retrieval as an ADT • Simple implementations • Arrays of records • Sorted arrays • Trees • Efficiency issues • Hash tables

  3. S & R ADT • A container with a bunch of records • Each record has a “key” field • Operations: • Add a record • Remove a record by key • Find a record by key, retrieve a copy
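The three operations on this slide can be sketched as a Python interface. This is only an illustration: the names `Record`, `payload`, `StorageRetrieval`, and `DictStore` are hypothetical, and the dict-backed class exists only to exercise the interface, not as one of the implementations discussed later.

```python
from abc import ABC, abstractmethod
from copy import deepcopy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: int        # the "key" field every record carries
    payload: str    # hypothetical stand-in for the rest of the record


class StorageRetrieval(ABC):
    """The ADT: what can be done, not how records are stored."""

    @abstractmethod
    def add(self, record: Record) -> None: ...

    @abstractmethod
    def remove(self, key: int) -> bool: ...

    @abstractmethod
    def find(self, key: int) -> Optional[Record]: ...


class DictStore(StorageRetrieval):
    """Trivial dict-backed implementation, used only to try out the interface."""

    def __init__(self) -> None:
        self._records = {}   # maps key -> Record

    def add(self, record: Record) -> None:
        self._records[record.key] = deepcopy(record)

    def remove(self, key: int) -> bool:
        # True if a record with this key existed and was removed.
        return self._records.pop(key, None) is not None

    def find(self, key: int) -> Optional[Record]:
        found = self._records.get(key)
        # Hand back a copy, so callers cannot modify the stored record.
        return deepcopy(found) if found is not None else None
```

Returning a copy from `find` mirrors the slide's "retrieve a copy": changing the returned record leaves the stored one untouched.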

  4. Simple Implementations • Array of records • Insert at end, • Find by linear search • Sorted array • Insert at the sorted position, • Find by binary search • Trees and balanced trees • We’ll study this later
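A minimal sketch of the two array-based implementations, assuming records are simple (key, value) pairs. The class names are hypothetical, and the sorted version leans on Python's standard `bisect` module for the binary search.

```python
import bisect


class UnsortedArrayStore:
    def __init__(self):
        self.records = []                     # list of (key, value) pairs

    def add(self, key, value):
        self.records.append((key, value))     # insert at end

    def find(self, key):
        for k, v in self.records:             # linear search
            if k == key:
                return v
        return None


class SortedArrayStore:
    def __init__(self):
        self.keys = []                        # kept in ascending order
        self.values = []                      # values[i] belongs to keys[i]

    def add(self, key, value):
        i = bisect.bisect_left(self.keys, key)   # binary search for the position
        self.keys.insert(i, key)                 # shifting later elements is the O(N) part
        self.values.insert(i, value)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)   # binary search
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```

The sorted version finds the insertion point quickly but still has to shift every later element, which is why its add remains linear.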

  5. Efficiency Issues • Regular arrays – O(N) retrieval • Sorted arrays – O(log N) retrieval, • but O(N) add (insertion) • Trees – O(log N) retrieval & add on average, • but backup issues, and a degenerate tree degrades to O(N) • Balanced trees – O(log N), • but complex & backup issues • Alternative: Hash table – O(1) (constant time) or close

  6. Hash Table Motivation • What if we used an array, • but every record had a unique location? • For example, we have an array of employee records, and the key is Employee-ID, which goes from 1 to 300 • Employee 17 gets put in location 17 • Add and retrieve are each O(1) (constant time) • Problem: what if SSN is the key?
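The employee example can be sketched in a few lines of Python. The names `MAX_EMPLOYEE_ID`, `table`, `add`, and `find` are made up for this illustration, and the record is just a string.

```python
MAX_EMPLOYEE_ID = 300

# Slot 0 is left unused so that employee 17 lands in location 17.
table = [None] * (MAX_EMPLOYEE_ID + 1)


def add(employee_id, record):
    table[employee_id] = record      # one array store, no searching


def find(employee_id):
    return table[employee_id]        # one array lookup; None if never added


add(17, "record for employee 17")
print(find(17))
```

No search is involved in either operation, which only works because the keys are small, dense integers.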

  7. The Hash Table Solution • For SSN as the Employee-ID • (as might be needed for Payroll) • One slot per 9-digit ID would require an array of one billion slots; not feasible! • Instead, let’s still have an array of 300 (or a few more) slots and then figure out: • An easy “mapping” function: • LocationIndex = Hash(SSN)
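As a rough illustration of such a mapping, the sketch below takes the SSN modulo the table size. The size 307 is an assumption (a prime slightly above 300), and `hash_ssn` is a hypothetical name; the slide does not say which function is used.

```python
TABLE_SIZE = 307   # assumed: "300 or a few more" slots; 307 happens to be prime


def hash_ssn(ssn: int) -> int:
    """Map a 9-digit SSN to a location index in range(TABLE_SIZE)."""
    return ssn % TABLE_SIZE


a = 123_456_789
b = a + TABLE_SIZE   # a different SSN, deliberately chosen to land on the same index
print(hash_ssn(a), hash_ssn(b))
```

Both calls print the same index. With a billion possible keys and only a few hundred slots, some keys must share a location.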

  8. Hash Table Issues • Coming up with a Hash function • Easy to calculate • Result in correct range • Minimize duplicate answers • Duplicates (“collisions”) inevitable • Many-to-one function (keys to location) • Need a plan for dealing with it • “collision handling”

  9. Collision Handling • A collision occurs when adding a record and a record with a different key already occupies the location given by the Hash function • It matters again when retrieving any record that collided when it was added • Add and retrieve must follow the same procedure for deciding which location to try next.

  10. Collision Handling Methods • Just increment the location until you find an empty slot (or the key sought) • Called “linear probing” • Prone to clustering: it tends to create long runs of filled slots, which make later probes slower • Jumps of increasing size (+ wrap-around) • Most common version is “quadratic probing” • Using an overflow area with links
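The two probing strategies can be sketched with one small class in which add and find share a single probe sequence. This is an illustration under assumptions: integer keys, `key % size` as the Hash function, and a hypothetical class name `ProbingTable`. Removal is left out because it needs extra bookkeeping (a removed slot cannot simply be marked empty without breaking later finds).

```python
class ProbingTable:
    def __init__(self, size=11, quadratic=False):
        self.size = size
        self.quadratic = quadratic
        self.slots = [None] * size      # each slot is None or a (key, value) pair

    def _probe(self, key):
        """Yield the locations to try for this key, wrapping around the array."""
        home = key % self.size
        for i in range(self.size):
            step = i * i if self.quadratic else i   # +0, +1, +4, +9 ... or +0, +1, +2 ...
            yield (home + step) % self.size

    def add(self, key, value):
        for loc in self._probe(key):
            if self.slots[loc] is None or self.slots[loc][0] == key:
                self.slots[loc] = (key, value)
                return loc
        raise RuntimeError("no free slot found; the table needs rebuilding")

    def find(self, key):
        for loc in self._probe(key):
            if self.slots[loc] is None:     # reached an empty slot: key was never stored
                return None
            if self.slots[loc][0] == key:
                return self.slots[loc][1]
        return None
```

With size 11, the keys 5, 16, and 27 all hash to location 5. Linear probing stores them in 5, 6, 7, while quadratic probing stores them in 5, 6, 9.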

  11. Hash function approaches • Numeric key: just use mod: • Hash(key): return key%Size • Non-numeric key: do a weighted sum of the ASCII codes of the characters: • Sum = Char[1] + 5*Char[2] + 17*Char[3] • Then return Sum%Size • Special care is usually taken so that keys spread uniformly across the locations.
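Both approaches fit in a few lines of Python. The function names are hypothetical, and the weights 1, 5, 17 are taken from the slide's example, applied to the first three characters only.

```python
def hash_numeric(key: int, size: int) -> int:
    """Numeric key: just use mod."""
    return key % size


WEIGHTS = (1, 5, 17)   # the slide's example weights for characters 1, 2, 3


def hash_string(key: str, size: int) -> int:
    """Weighted sum of the character codes of the first three characters, then mod."""
    # zip stops at the shorter input, so keys under three characters use fewer terms.
    total = sum(w * ord(ch) for w, ch in zip(WEIGHTS, key))
    return total % size


print(hash_numeric(1000, 307))
print(hash_string("abc", 307))
```

Because only the first three characters contribute, any keys sharing a three-character prefix collide, which is one example of the non-uniformity the slide warns about.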

  12. Design of a Hash Table • Choose a size that leaves room for growth and turn-over (employees leaving?) • Add, Remove, and Find all use the same • Hash function • Collision handling • So the steps are: • Write a Hash function • Choose & implement a collision handling method

  13. A Few Final Issues • If you run out of slots, you might need to rebuild the whole table with a bigger size. • The size is often chosen as a prime number so that regular patterns (cyclicity) in the keys have the least effect. • New approaches to collision handling are continually being studied. • Hashing to pointers to linked lists (often called “chaining”) can be very effective if the Hash function is good.
