
Sets of Digital Data

Sets of Digital Data. CSCI 2720 Fall 2005 Kraemer. Digital Data . In earlier work with BSTs and various balanced trees, we compared keys for order or equality Here, we take advantage of structure of key Use it as an index, or Decompose string key into characters, or



  1. Sets of Digital Data CSCI 2720 Fall 2005 Kraemer

  2. Digital Data • In earlier work with BSTs and various balanced trees, we compared keys for order or equality • Here, we take advantage of the structure of the key • Use it as an index, or • Decompose a string key into characters, or • Treat the key as a numerical quantity on which we can perform operations

  3. Assumptions • We will construct and manipulate sets that are drawn from a universe U of size N • U = {u_0, …, u_{N-1}} • A relatively simple procedure exists by which we can compute, for an element u ∈ U, the index i such that u = u_i • Easy if U is a set of integers • Also easy if U is a set of characters with character codes in a contiguous interval
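The index computation assumed on this slide can be sketched as follows. This is a minimal illustration, assuming a hypothetical universe of the uppercase letters 'A' to 'Z' (so N = 26); the names `index_of` and `element_at` are made up for the example and do not come from the slides.

```python
# Hypothetical universe: the uppercase letters 'A'..'Z', so N = 26.
N = 26


def index_of(u: str) -> int:
    """Return i such that u is the i-th element of the universe."""
    i = ord(u) - ord("A")  # character codes are contiguous, so this is O(1)
    if not 0 <= i < N:
        raise ValueError(f"{u!r} is not in the universe")
    return i


def element_at(i: int) -> str:
    """Inverse mapping: return the element u_i."""
    if not 0 <= i < N:
        raise IndexError(i)
    return chr(ord("A") + i)
```

Because the codes form a contiguous interval, both directions are a single subtraction or addition.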

  4. Bit Vector • Used to represent a subset S ⊆ U • A table of N bits, Bits[0..N-1] • Bits[i] == 1 if u_i ∈ S • Bits[i] == 0 if u_i ∉ S • Example: today’s attendance (1 = present, 0 = absent) • Student number: 0 1 2 3 4 5 6 • Bits: 1 1 0 1 0 1 1

  5. Bit Vectors • Assume: • determining an element's index takes constant time • accessing a position in the table takes constant time • These may actually take several operations, and depend somewhat on N (the size of the universe), but not on the size of the set represented • Then: • Insert, Delete, and Member are constant-time operations
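A minimal sketch of the three constant-time operations, assuming elements have already been mapped to indices 0..N-1. The class name `BitVector` and the choice of a `bytearray` with 8 bits per cell are assumptions made for this example, not something the slides prescribe.

```python
class BitVector:
    """Subset of {0, ..., n-1} stored as n bits packed into bytes."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.bits = bytearray((n + 7) // 8)  # bytearray starts out all zero

    def _check(self, i: int) -> None:
        if not 0 <= i < self.n:
            raise IndexError(i)

    def insert(self, i: int) -> None:
        self._check(i)
        self.bits[i >> 3] |= 1 << (i & 7)  # set bit (i mod 8) of byte (i div 8)

    def delete(self, i: int) -> None:
        self._check(i)
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF  # clear that bit

    def member(self, i: int) -> bool:
        self._check(i)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


# The attendance example: students 0, 1, 3, 5 and 6 are present.
attendance = BitVector(7)
for student in (0, 1, 3, 5, 6):
    attendance.insert(student)
```

Each operation touches one byte and does a fixed amount of shifting and masking, regardless of how many elements the set holds.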

  6. Bit Vectors • A subset of a set of size N always takes N bits to represent, independent of size of subset • Makes sense if: • N is not too large • need to represent sets of size comparable to N

  7. Storage Efficiency • Bit Vector vs. Binary Trees • Binary tree, set of size n • Requires n(2p + K) bits • K >= lg N, the size of the field that represents a key value • p = number of bits in a pointer • Bit vector takes N bits • If n is comparable to N, then the bit vector is more efficient • If p = K = 32, then the tree becomes more space efficient once n/N falls to about 1% • More precisely, the break-even point is n(2p + K) = N, which is when n/N = 1/96
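The break-even arithmetic on this slide can be checked with a few lines of code. The function names are hypothetical, and the universe size used in the example (N = 96,000) is an arbitrary value chosen to make the numbers round.

```python
from fractions import Fraction


def tree_bits(n: int, p: int = 32, K: int = 32) -> int:
    """Bits used by a binary tree holding n keys: two pointers plus a key per node."""
    return n * (2 * p + K)


def bit_vector_bits(N: int) -> int:
    """Bits used by a bit vector over a universe of size N."""
    return N


def break_even_fraction(p: int = 32, K: int = 32) -> Fraction:
    """Value of n/N at which n(2p + K) = N."""
    return Fraction(1, 2 * p + K)


# Example with an assumed universe of N = 96,000 elements:
# at n = 1,000 both representations need exactly 96,000 bits.
N = 96_000
n = 1_000
```

Below that fraction the tree is smaller; above it the bit vector is smaller.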

  8. When to use Bit Vectors? • When universe is relatively small • When sets are large in relation to size of universe

  9. Advantages of Bit Vectors • O(1) implementation of Insert, Delete, Member • Union and Intersection are easy • Implement via Boolean OR and AND operations, respectively • May actually take less than one operation per element, as operations are performed on a full machine word • If the machine word is 32 bits, then one machine operation handles 32 potential elements of the set
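A sketch of word-at-a-time union and intersection, assuming a 32-bit word as on the slide. Python integers are not fixed-width, so each list entry here merely stands in for one machine word; the helper names are made up for the example.

```python
WORD = 32  # assumed machine word size in bits


def make_set(elements, n):
    """Pack a collection of indices from 0..n-1 into a list of WORD-bit words."""
    words = [0] * ((n + WORD - 1) // WORD)
    for i in elements:
        words[i // WORD] |= 1 << (i % WORD)
    return words


def union(a, b):
    """One OR per word covers WORD potential elements at once."""
    return [x | y for x, y in zip(a, b)]


def intersection(a, b):
    """One AND per word covers WORD potential elements at once."""
    return [x & y for x, y in zip(a, b)]


def members(words):
    """List the indices whose bits are set, in increasing order."""
    return [w * WORD + b
            for w, word in enumerate(words)
            for b in range(WORD)
            if (word >> b) & 1]


a = make_set([1, 40, 63], 64)
b = make_set([1, 2, 63], 64)
```

For a universe of 64 elements, each set operation here is two word operations rather than 64 per-element ones.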

  10. Disadvantages of Bit Vectors • On some computers, access to individual bits can require shifting and masking operations (expensive) • The result is that Member may be much more expensive per element than Union • Initialization takes Θ(N) time: all the bits in the vector must be zeroed • A constant-time initialization algorithm can be used instead • But that raises the storage requirement to 2p + 1 bits per element • So, in practice, just use machine operations to set the vector to zero, which are efficient
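The slide does not say which constant-time initialization algorithm it has in mind. One well-known technique that matches the stated cost of 2p + 1 bits per element keeps, for each position, a pointer into a stack of "touched" positions and a pointer back from the stack. The sketch below assumes that is the intended scheme. Python cannot leave memory uninitialized, so random garbage is written to simulate it; that filling step is only part of the simulation, not of the technique.

```python
import random


class LazyBitVector:
    """Bit vector whose logical initialization is O(1): only `top` is reset."""

    def __init__(self, n: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Garbage stands in for uninitialized memory (simulation only).
        self.bits = [rng.randrange(2) for _ in range(n)]   # 1 bit per element
        self.where = [rng.randrange(n) for _ in range(n)]  # p bits per element
        self.stack = [rng.randrange(n) for _ in range(n)]  # p bits per element
        self.top = 0  # the only field that really must be initialized

    def _valid(self, i: int) -> bool:
        """True if position i has been written since initialization."""
        j = self.where[i]
        return j < self.top and self.stack[j] == i

    def _touch(self, i: int) -> None:
        if not self._valid(i):
            self.where[i] = self.top
            self.stack[self.top] = i
            self.top += 1
            self.bits[i] = 0  # zero lazily, on first use

    def insert(self, i: int) -> None:
        self._touch(i)
        self.bits[i] = 1

    def delete(self, i: int) -> None:
        self._touch(i)
        self.bits[i] = 0

    def member(self, i: int) -> bool:
        return self._valid(i) and self.bits[i] == 1
```

A garbage entry in `where` can only pass the validity test if it points below `top` at a stack slot that names the same position, and such slots are written only by `_touch`.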

  11. Tries and Digital Search Trees • If the key can be decomposed into characters, then the characters of the key can be used as indices • Tries are based on this idea • “trie” comes from the middle letters of retrieval; it is a pun on tree, but is pronounced “try”

  12. Tries • Assume k possible character values • A trie is a (k+1)-ary tree • Each node is a table of k+1 pointers • One pointer for each possible character • One for the end-of-string character

  13. Trie Example

  14. Tries • The path for a key of m characters has length m, with a pointer at the end-of-string entry of the last node • Don’t need to store the key itself: it is the path followed • An info field might be pointed to by the end-of-string element
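The trie described on slides 12 and 14 can be sketched as follows, assuming an alphabet of the uppercase letters 'A' to 'Z' (k = 26) with slot k reserved for the end-of-string entry. The names `TrieNode`, `insert` and `lookup` are hypothetical.

```python
K = 26   # assumed alphabet: 'A'..'Z'
END = K  # slot k holds the end-of-string entry


class TrieNode:
    """A node is a table of k+1 pointers; the key itself is never stored."""
    __slots__ = ("children",)

    def __init__(self) -> None:
        self.children = [None] * (K + 1)


def _slot(ch: str) -> int:
    i = ord(ch) - ord("A")
    if not 0 <= i < K:
        raise ValueError(f"{ch!r} is outside the alphabet")
    return i


def insert(root: TrieNode, key: str, info=True) -> None:
    """Follow or create one node per character, then hang info on END.

    info should not be None, since None marks an empty slot.
    """
    node = root
    for ch in key:
        i = _slot(ch)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.children[END] = info


def lookup(root: TrieNode, key: str):
    """Return the info stored for key, or None if key is absent."""
    node = root
    for ch in key:
        node = node.children[_slot(ch)]
        if node is None:
            return None
    return node.children[END]


root = TrieNode()
insert(root, "MENDEL", 1)
insert(root, "MENDELEEV", 2)
insert(root, "TURING", 3)
```

A lookup follows exactly one pointer per character of the key, which is why access time depends only on key length.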

  15. Tries: Analysis • Let: • n be the number of keys stored in a trie • l be the length (in characters) of the longest key • s be the number of nodes in the trie • k be the size of the alphabet • Pro: • Access time is O(l), independent of k, n and s • Con: • Size: requires (k+1) * s * p bits, where p is the number of bits in a pointer • Most pointers are null, so lots of wasted space

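  16. Strategies for reducing storage requirements of tries • Implement a k-ary trie with m nodes as a 2-D, m by k table • [Table on slide: one column per character (A B C D E … M … P … T …) plus one for the end-of-string character; one row per node, numbered 0 1 2 3 4 5]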

  17. Table approach • Number the nodes in the diagram of slide 13 from 1 to m • The table entry corresponding to the jth child of the ith node is the index of the child node • How does that save space? There are just as many nodes and elements as on slide 13 • But each entry is a node index rather than a pointer, so it needs only ceil(lg(m)) bits, which is smaller than a pointer
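A sketch of the table representation, assuming the alphabet 'A' to 'Z' plus an end-of-string column and keys made only of those letters. One deliberate difference from the slide: rows are numbered from 0 with the root in row 0, so that the value 0 can double as "no child" (the root is never anyone's child). The function names are made up for the example.

```python
K = 26   # assumed alphabet: 'A'..'Z'; keys must use only these letters
END = K  # extra column for the end-of-string entry


def new_table():
    """Table with a single row for the root; 0 in any cell means 'no child'."""
    return [[0] * (K + 1)]


def insert(table, key: str) -> None:
    row = 0
    for ch in key:
        col = ord(ch) - ord("A")
        if table[row][col] == 0:
            table.append([0] * (K + 1))       # allocate a new node (row)
            table[row][col] = len(table) - 1  # store its index, not a pointer
        row = table[row][col]
    table[row][END] = 1  # mark that a key ends at this node


def member(table, key: str) -> bool:
    row = 0
    for ch in key:
        row = table[row][ord(ch) - ord("A")]
        if row == 0:
            return False
    return table[row][END] == 1


table = new_table()
insert(table, "MENDEL")
insert(table, "MENDELEEV")
```

With m rows, every cell holds a value below m, so a packed implementation could store each cell in ceil(lg(m)) bits; the Python lists here do not attempt that packing.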

  18. Patricia Tree: Another strategy for reducing space in a trie • Patricia tree • Practical Algorithm to Retrieve Information Coded in Alphanumeric • Eliminate nodes with only one nonempty child • Can now skip right from T to the end-of-string entry in TURING in our example • Skip from MA … to E or the end-of-string entry in the MENDEL, MENDELEEV chain • But need to store with each node the index of the character on which it discriminates • And need to store the key itself at the leaf
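The effect of removing one-child nodes can be illustrated with a small static construction. This is not the classic incremental Patricia insertion algorithm: it builds the compressed tree in one pass from a nonempty list of distinct keys, under the assumptions that "\0" serves as the end-of-string character and that no key contains it. Nodes are plain tuples, and all names are hypothetical.

```python
END = "\0"  # assumed end-of-string character; keys must not contain it


def build(keys):
    """Build a compressed trie from a nonempty list of distinct keys."""
    return _build([k + END for k in keys])


def _build(keys):
    if len(keys) == 1:
        return ("leaf", keys[0][:-1])  # the full key is stored at the leaf
    # First character position on which these keys disagree. It exists
    # because the END marker prevents any key from being a prefix of another.
    i = 0
    while all(k[i] == keys[0][i] for k in keys):
        i += 1
    groups = {}
    for k in keys:
        groups.setdefault(k[i], []).append(k)
    # Each internal node records the index i on which it discriminates.
    return ("node", i, {ch: _build(g) for ch, g in groups.items()})


def lookup(tree, key: str) -> bool:
    probe = key + END
    node = tree
    while node[0] == "node":
        _, i, children = node
        if i >= len(probe):
            return False
        node = children.get(probe[i])
        if node is None:
            return False
    return node[1] == key  # skipped characters are verified only here


tree = build(["MENDEL", "MENDELEEV", "TURING"])
```

Because characters between discriminating positions are skipped, the final comparison against the key stored at the leaf is what makes the answer correct.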

  19. Patricia tree

  20. de la Briandais trees • Another strategy to save space vs. standard tries • Use a linked list instead of a table at the node level • Each pointer labeled with the character it indexes • longer search time than tries; depends on size of character set • saves significant amounts of memory
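A sketch of the linked-list idea, with each node carrying its character label, a pointer to the next sibling, and a pointer to the list for the next level. The class and field names, and the use of a dummy root and an `is_end` flag in place of an end-of-string node, are assumptions made for this example.

```python
class DLBNode:
    def __init__(self, ch: str) -> None:
        self.ch = ch         # the character this node is labeled with
        self.next = None     # next sibling in the same level's linked list
        self.child = None    # first node of the list for the next character
        self.is_end = False  # True if a key ends here


class DLBTree:
    def __init__(self) -> None:
        self.root = DLBNode("")  # dummy node; its child list is the top level

    @staticmethod
    def _find(parent: DLBNode, ch: str):
        """Linear scan of parent's child list; cost grows with alphabet size."""
        node = parent.child
        while node is not None and node.ch != ch:
            node = node.next
        return node

    def insert(self, key: str) -> None:
        parent = self.root
        for ch in key:
            node = self._find(parent, ch)
            if node is None:
                node = DLBNode(ch)
                node.next = parent.child  # push onto the front of the list
                parent.child = node
            parent = node
        parent.is_end = True

    def member(self, key: str) -> bool:
        parent = self.root
        for ch in key:
            parent = self._find(parent, ch)
            if parent is None:
                return False
        return parent.is_end


dlb = DLBTree()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    dlb.insert(name)
```

Only characters that actually occur get a node, which is where the memory saving comes from; the price is the linear scan in `_find`.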

  21. de la Briandais

  22. Another strategy … • Use tries at the first few levels • Use ordinary BSTs or de la Briandais at the lower levels • reasoning: • speed advantage at the top, but not too much extra memory required • save space at lower levels

  23. Digital Search Trees • Treat keys as bit strings • (strings over the alphabet {0,1}) • Binary tree: search is directed left on 0, right on 1 • Each node contains not only two pointers, but also a key whose leading bits match the path to that node • Compare for equality before searching left or right • If frequencies are known, store higher-frequency keys nearer the root • Can be grown dynamically • Expected search time: O(log n)
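A sketch of insertion and search in a digital search tree, assuming keys are unsigned integers of a fixed width W = 8 bits, examined from the most significant bit down. The width and all names are choices made for the example.

```python
W = 8  # assumed key width in bits


class DSTNode:
    def __init__(self, key: int) -> None:
        self.key = key
        self.left = None   # taken when the current bit is 0
        self.right = None  # taken when the current bit is 1


def _bit(key: int, depth: int) -> int:
    """Bit number `depth` of key, counting from the most significant bit."""
    return (key >> (W - 1 - depth)) & 1


def insert(root, key: int):
    """Insert key and return the (possibly new) root."""
    if not 0 <= key < (1 << W):
        raise ValueError(key)
    if root is None:
        return DSTNode(key)
    node, depth = root, 0
    while node.key != key:  # compare for equality before branching
        side = "right" if _bit(key, depth) else "left"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, DSTNode(key))
            break
        node, depth = child, depth + 1
    return root


def search(root, key: int) -> bool:
    if not 0 <= key < (1 << W):
        return False
    node, depth = root, 0
    while node is not None:
        if node.key == key:
            return True
        node = node.right if _bit(key, depth) else node.left
        depth += 1
    return False


root = None
for k in (0b1010, 0b0001, 0b1111, 0b1000):
    root = insert(root, k)
```

The first key inserted stays at the root, so inserting keys in decreasing order of frequency places the most frequent ones nearest the top.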

  24. Digital Search Tree
