
Persistent data structures

Presentation Transcript


  1. Persistent data structures Yoav Rubin

  2. About me • Software engineer in IBM Research, Haifa • Worked on • From large scale products to small scale research projects • Domains • Software tools • Development environments • Simplified programming • Technologies • Frontend engineering • Java, Clojure • Teaches the course “Functional programming on the JVM” at Haifa University {:name "Yoav Rubin", :email "yoavrubin@gmail.com", :blog "http://yoavrubin.blogspot.com", :twitter "@yoavrubin"}

  3. Roadmap • Why • What • How

  4. Why

  5. A few assumptions • Modern software uses different kinds of data • Modern software requires various ways to work with data • We’re in a multi-core world • Mutability and concurrency don’t get along

  6. Modern software • Working with different kinds of data + requiring different ways to work with data ⇒ data structures • Concurrency and mutability don’t get along ⇒ immutability • Data structures + immutability = persistent data structures

  7. What

  8. What is a data structure • A way to organize data • Provides contracts for read / find • Provides contracts for update • Adding data elements • Removing data elements • A data structure may contain other data structures

  9. What’s in a contract • Information given by the requester • For reads: • Nothing • Some identifier of the data • For writes: • The data element itself • The data element alongside additional info • What is returned • For reads: • The data element / a not-found identifier • For writes: • Nothing • Some identifier of the data • The data structure itself • The cost of the operation

  10. E.g., • Hashtable • add(data-element, key), O(1) • read(key), O(1) • find(data-element), O(n) • Balanced search tree • add(data-element), O(log n) • find(data-element), O(log n) • remove(data-element), O(log n)

  11. E.g., • LIFO • push(data-element), O(1) • pop(), O(1) • contains(data-element), O(n) • FIFO • enqueue(data-element), O(1) • dequeue(), O(1) • contains(data-element), O(n)
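
These contracts map directly onto Clojure’s built-in persistent structures. An illustrative REPL sketch (the costs shown are the near-constant-time analogues discussed later in the talk):

   (def m (assoc {} :k 42))   ; map write, near O(1)
   (get m :k)                 ; keyed read, near O(1)      ;=> 42
   (def s (conj '() 1 2))     ; list as LIFO: push, O(1)   ;=> s is (2 1)
   (peek s)                   ; top of the stack, O(1)     ;=> 2
   (pop s)                    ; pop, O(1)                  ;=> (1)
   (def q (into clojure.lang.PersistentQueue/EMPTY [1 2])) ; FIFO enqueue
   (peek q)                   ; front of the queue, O(1)   ;=> 1
   (pop q)                    ; dequeue, amortized O(1)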

  12. What is the data in a data structure • Values • Or other data structures • At the leaves – only values • A value is something that cannot change • 7, \a, nil, “abc”

  13. Is a data structure a value • In a mutable world - No • In an immutable world – Yes • It cannot change !!!

  14. Persistency • A persistent data structure is a data structure • That acts as a value • Cannot be changed • While providing contracts for reading and writing • Without affecting other users of the data structure

  15. Writing to an immutable structure • You don’t write to an immutable structure • Copy-on-write? • Breaks performance guarantees • Maintaining persistency • Create a structure that is perceived as updated by the update requester • Perceived as immutable by everyone else

  16. Writing “together with” • There are two “participants” in the write • The “write-to” structure • The added value • Extract as much as possible from the “write-to” structure • Shared part • Non-shared part • Complete the new structure with the added value and duplicates of the non-shared part

  17. How

  18. Persistent data structures • Persistent list • Persistent vector • Persistent map

  19. Persistent list – conj • list1 is (a b c d) • (def list2 (conj list1 x)) • list2 is (x a b c d) • Note that list2 uses all of list1

  20. Persistent list – next (/ pop) • list2 is (x a b c d) • (def list3 (next list2)) • list3 is (a b c d) • Note that list3 is list1
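
The sharing described above is directly observable at the REPL. A minimal sketch, using list1/list2/list3 as in the two slides:

   (def list1 '(a b c d))
   (def list2 (conj list1 'x))   ;=> (x a b c d) – its tail is list1 itself
   (def list3 (next list2))      ;=> (a b c d)
   (identical? list1 list3)      ;=> true – list3 IS list1, nothing was copied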

  21. Persistent list • Complete structural sharing upon “modification” • Write (conj) – just adding the new value • Delete (pop) – returning a pointer to the next element

  22. Persistent lists • O(1) for insertion (at the front) • O(n) for going over the list • O(1) for popping

  23. Persistent vectors and maps

  24. Behind the scenes • A trie • A tree in which each node has an “alphabet map” that routes navigation to its children • Over the alphabet 0–31 • Vector – a balanced, dense trie • Map – a sparse trie

  25. Why trie • A trie allows holding both the data and the metadata of an element • The data is at the leaves • The metadata is derived from the structure of the path to the leaf • Deriving information from structure is a very powerful mechanism • Neural networks work that way • In persistent vectors the metadata is the index • In persistent maps the metadata is the hashcode

  26. Persistent vectors • Values are at the leaves • A trie of a given level can hold up to 32^level elements (e.g., 32² = 1024 at level 2) • If that number is exceeded, a new level is created, and the previous root is pointed to by entry 0

  27. Adding elements • [Diagram: elements 1–12 are added to a persistent vector; each time a level fills, a new root level is created and the previous root becomes its entry 0]

  28. Finding an element • Looking at the bit representation of the index • Each group of 5 bits corresponds to a position at a specific level • We know the trie’s height • The height tells us which bit quintet to start the find process with
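
A minimal sketch of this find process, assuming each trie node is itself a 32-wide vector of children (illustrative only, not Clojure’s actual PersistentVector internals):

   (defn trie-nth [node height i]
     ;; each 5-bit quintet of the index i selects a child at one level
     (if (zero? height)
       (nth node (bit-and i 31))                    ; leaf level: the value
       (recur (nth node (bit-and (unsigned-bit-shift-right i (* 5 height)) 31))
              (dec height)
              i)))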

  29. Finding an element – example • Assume that the trie height is 3 • [Diagram: at the top level we would look at the highest quintet, at the next level at the middle quintet, and at the leaves level at the lowest quintet]

  30. Persistent vector • Very efficient • Almost O(1) find and add • O(near-constant-time) = O(log₃₂ n) • A very strong narrowing factor • 1M elements => only 4 levels to handle • For all practical uses – think of it as O(1) • Subvec – O(1) • No dependency on the size of the subvector!

  31. Persistent map • A special trie – a Hash Array Mapped Trie • Based on the work of Phil Bagwell • Over the alphabet 0–31 • Not dense • Similar to the way the persistent vector works • Instead of indices, we use the 32-bit hash value of the key
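
A sketch of how a key’s hash is consumed five bits at a time (hash-segments is a hypothetical helper, not part of Clojure’s API):

   (defn hash-segments [k]
     (let [h (bit-and (hash k) 0xffffffff)]          ; the key's 32-bit hash
       (map #(bit-and (unsigned-bit-shift-right h (* 5 %)) 31)
            (range 7))))                             ; 7 segments route the walk

Each segment plays the role the index quintets played in the vector: it picks one of up to 32 children at one level of the trie.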

  32. What happens when modifying* • We want to do (assoc m x v1) • The big arrows represent the non-participating subtrees (each arrow can stand for several real arrows to several nodes) • [Diagram: the path from the root of m through node n to the existing entry x → v]

  33. [Diagram: (assoc m x v1) creates a new root m′ and copied path nodes r′ and n′ leading to the new entry x′ → v1; all other subtrees are shared with the original m (root r, node n, entry x → v)]

  34. Path copying • Performing an action on a structure does not create an entirely new structure • A new structure is created that shares a large portion of the old structure

  35. What should be created • Anything that is related to the new node • Remember – information about that node is found at the leaf (the content) and on the path (the metadata) • Recreate the path to the changed node, and the changed node itself

  36. Path copying – price • The number of nodes that need to be created is O(tree-height) • Using a very wide tree • 32 children for each parent • Tree height is log₃₂ n • n is the number of nodes in the tree
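
Path copying in miniature – a sketch reusing the dense vector trie from before (nodes as plain 32-wide vectors; a hypothetical helper, not Clojure’s internals):

   (defn trie-assoc [node height i v]
     ;; rebuild only the nodes on the path from the root to the changed leaf;
     ;; every untouched child is reused as-is
     (if (zero? height)
       (assoc node (bit-and i 31) v)
       (let [slot (bit-and (unsigned-bit-shift-right i (* 5 height)) 31)]
         (assoc node slot
                (trie-assoc (nth node slot) (dec height) i v)))))

Only O(tree-height) nodes are allocated per update; all other subtrees are shared between the old and the new structure.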

  37. Performance concerns • Very sparse structure • Each node should only point to existing children • Not structured as a full 32-entry node • Still, we need to locate a child efficiently • In constant time

  38. Processing within a node • Each node holds the following fields • A list of up to 32 “links” that point from the node to its children • Fewer cache misses • A bitmap of 32 bits (an integer) • A 1 in the ith bit means that the child whose segment value is i is present in the links list • Its index in the list is the number of 1s to the right of the ith bit • In a very sparse tree, most of the links lists will be very small
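
A sketch of that bitmap arithmetic (child-index is an illustrative helper; Long/bitCount is the JVM’s population-count primitive, discussed on the CTPOP slide below):

   (defn child-index [bitmap segment]
     (when (bit-test bitmap segment)                ; is the child present at all?
       (Long/bitCount                               ; count the 1s to the right
         (bit-and bitmap (dec (bit-shift-left 1 segment))))))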

  39. Example • A node is pointed by entry 0 in level 0 and has 4 children • In positions 2, 5, 14, 22 • How to assoc keys with the following hash codes • 00 00000 00000 00000 00100 00000 • 00 00000 00000 00000 01110 00000

  40. The node (in level 1): bitmap 00 00000 00100 00000 10000 00001 00100 • links list: [2, 5, 14, 22]

  41. First hash code: 00 00000 00000 00000 00100 00000 • Entry 0 in level 0 brought us to this node

  42. First hash code (cont.): the next quintet is 00100 = 4 • Looking for entry 4 at this level, we check the bitmap and see that it has a 0 there, therefore returning false (not found)

  43. Second hash code: 00 00000 00000 00000 01110 00000 • Entry 0 in level 0 brought us to this node

  44. Second hash code (cont.): the next quintet is 01110 = 14 • Checking the 14th bit in the map – there’s a 1 there • Counting the number of 1s to the right of that bit – getting 2 • Continue down the link in cell 2 (remember – zero-based indexing)
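
Running the child-index sketch from slide 38 on this node’s bitmap reproduces both outcomes:

   (def bitmap 2r00000000010000000100000000100100)   ; bits 2, 5, 14, 22 set
   (child-index bitmap 4)    ;=> nil – bit 4 is 0, not found
   (child-index bitmap 14)   ;=> 2 – two 1s (bits 2 and 5) to the right of bit 14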

  45. How many actions are taken • Looking at the ith bit • A simple masking • Counting the number of 1s • Using the processor instruction CTPOP • Population count • Found on many processors • Counts the number of 1 bits in a word • Masking and counting – a constant number of bit operations
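
On the JVM this is Long/bitCount (used in the child-index sketch above); HotSpot compiles it down to the hardware popcount instruction on processors that have one:

   (Long/bitCount 2r10110)   ;=> 3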

  46. That’s not all • We traveled down the link • If there’s an internal node (with a bitmap) there • Continue the same way with the next segment • If there’s a leaf node (a value) there • Return that value

  47. Disclaimer • There are other ways to implement the internal processing within a node • This specific way is a simplification of the process presented in the original “Ideal Hash Trees” paper • Which was once used in Clojure • It has since been replaced by a version with several more optimizations

  48. Why trie • A linear structure would need to copy the entire structure on every write • Too much of a performance hit • O(n) time for each action • Too much memory consumption • With path copying, only a few nodes need to be created for each change • The wider the trie, the fewer the nodes • 32 children per parent is a very wide trie

  49. Zippers

  50. Zippers • Generic tree-handling API • Walking • Editing • Purely functional data structure • Persistent • Excellent performance • Found in clojure.zip • First described by Gérard Huet in 1997
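
A small clojure.zip taste – walk into a nested vector and “edit” it, getting back a new tree that shares the untouched parts with the original:

   (require '[clojure.zip :as z])

   (-> (z/vector-zip [1 [2 3] 4])
       z/down           ; at 1
       z/right          ; at [2 3]
       z/down           ; at 2
       (z/edit inc)     ; "change" 2 to 3
       z/root)          ;=> [1 [3 3] 4]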
