
CS 3304 Comparative Languages


Presentation Transcript


  1. CS 3304 Comparative Languages • Lecture 11: Composite Data Types • 21 February 2012

  2. Distinguished Lecture Code as a Metaphor for Computational Thinking Owen Astrachan, Duke University Location: Torgerson 2150 Date: Friday, February 24, 2012 Time: 11:15am-12:30pm A Meet-the-Speaker session will be held 4:00pm-5:30pm in McBryde 106.

  3. Introduction • Supporting composite data types (arrays, strings, sets, pointers, lists, and files) involves additional syntactic, semantic, and pragmatic issues. • Pointer-related issues require a more detailed discussion of the value and reference models of variables and of heap management. • Input and output mechanisms are important when dealing with files.

  4. Records (Structures) • Record types allow related data of heterogeneous types to be stored and manipulated together. • Usually laid out contiguously. • Possible holes for alignment reasons. • Compilers keep track of the offset of each field within each record type. • Smart compilers may re-arrange fields to minimize holes (C compilers promise not to). • Records containing dynamic arrays cause implementation problems, but we won't go into that in any detail.
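  A minimal C sketch (not from the slides) that makes the alignment holes visible: it prints the offset of each field and the total size of the element record from the next slide. The exact numbers depend on the platform; on a typical machine, int needs 4-byte and double 8-byte alignment, so holes appear after name and after metallic.

      #include <stdio.h>
      #include <stddef.h>

      struct element {
          char   name[2];        /* 2 bytes, usually followed by a hole          */
          int    atomic_number;
          double atomic_weight;
          _Bool  metallic;       /* usually followed by trailing padding         */
      };

      int main(void) {
          printf("name:          %zu\n", offsetof(struct element, name));
          printf("atomic_number: %zu\n", offsetof(struct element, atomic_number));
          printf("atomic_weight: %zu\n", offsetof(struct element, atomic_weight));
          printf("metallic:      %zu\n", offsetof(struct element, metallic));
          printf("sizeof:        %zu\n", sizeof(struct element));
          return 0;
      }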

  5. Syntax Examples
  • C:
      struct element {
          char   name[2];
          int    atomic_number;
          double atomic_weight;
          _Bool  metallic;
      };
  • Pascal:
      type two_chars = packed array [1..2] of char;
      type element = record
          name          : two_chars;
          atomic_number : integer;
          atomic_weight : real;
          metallic      : Boolean
      end;
  • Java: classes.
  • The ordering of record fields is significant in most languages.
  • In ML the ordering is insignificant: tuples are abbreviations for records whose field names are small integers, so ("Cu", 29), {1 = "Cu", 2 = 29}, and {2 = 29, 1 = "Cu"} are all equivalent.

  6. Nested Records I
  • Nested definition (in C):
      struct ore {
          char name[30];
          struct {
              int    atomic_number;
              double atomic_weight;
              _Bool  metallic;
          } element_yielded;
      };
  • The "no-nesting" equivalent, using the separately declared struct element:
      struct ore {
          char name[30];
          struct element element_yielded;
      };
  • Fortran 90, Common Lisp: no-nesting only.
  • Naming for the nested record: • Record to field: . in Pascal or C. • Field to record: of in Cobol, # in ML.
  • Models of variables: • Value: nested records are naturally embedded in the parent record (large fields, word or double-word alignment). • Reference: fields are usually references to data in other locations.

  7. Other Features • Packed records: • Pascal: optimize for space. • Ada, Modula-3, C: more elaborate packing, bits per field. • Assignment (most): an entire record in a single operation. • Comparison: Ada allows but most languages do not. • Copy/comparison: use library routines (e.g., block_copy) but what about the holes (zeros, customized routines)? • A trade-off between packing (time) and holes (space). • Compilers “re-arrange” the field order: usually not a problem except when dealing with systems programs. • Ada, C++: non-standard alignment. • with statement for deeply nested records.

  8. Unions (Variants)
  • Unions (variant records):
      union {
          int    i;
          double d;
      };
  • If variables are not used at the same time, they can share the same memory space.
  • The size of the space is the size of the largest variable.
  • Main purpose: • System programs. • Alternative sets of fields within a record.
  • Problem for type checking: • Lack of a tag means you don't know what is there. • The ability to change the tag and then access fields is hardly better: • Can make fields “uninitialized” when the tag is changed (requires extensive run-time support). • Can require assignment of the entire variant, as in Ada.
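  A short C sketch (an illustration, not from the slides) of the usual workaround for the type-checking problem: pair the union with an explicit tag and check the tag before touching a member. Nothing in C enforces the check, which is why Ada's checked variants are safer.

      #include <stdio.h>

      enum shape_tag { CIRCLE, RECTANGLE };

      struct shape {
          enum shape_tag tag;               /* records which member is live */
          union {
              double radius;                /* valid when tag == CIRCLE     */
              struct { double w, h; } rect; /* valid when tag == RECTANGLE  */
          } u;
      };

      double area(const struct shape *s) {
          switch (s->tag) {                 /* check the tag first */
          case CIRCLE:    return 3.14159265358979 * s->u.radius * s->u.radius;
          case RECTANGLE: return s->u.rect.w * s->u.rect.h;
          }
          return 0.0;
      }

      int main(void) {
          struct shape c = { .tag = CIRCLE, .u.radius = 2.0 };
          printf("area = %f\n", area(&c));
          return 0;
      }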

  9. Arrays • Arrays are the most common and important composite data types. • Unlike records, which group related fields of disparate types, arrays are usually homogeneous. • Semantically, they can be thought of as a mapping from an index type to a component or element type: • Index type: integer or any discrete type. • Element type: scalar (Fortran 77) or any type. • Associative arrays (nondiscrete index types): • Implemented with hash tables or search trees. • Supported by the standard libraries of object-oriented languages.

  10. Array Syntax and Operations • Array element: referred to by appending a subscript delimited by parentheses (Fortran, Ada) or square brackets (C, Pascal) to the name of the array. • Declaring an array: • Appending subscript notation to the syntax used to declare a scalar. • Using an array constructor. • Slice (section): a rectangular portion of an array (Figure 7.4).

  11. Array Dimensions, Bounds, and Allocation • Global lifetime, static shape — If the shape of an array is known at compile time, and if the array can exist throughout the execution of the program, then the compiler can allocate space for the array in static global memory. • Local lifetime, static shape — If the shape of the array is known at compile time, but the array should not exist throughout the execution of the program, then space can be allocated in the subroutine’s stack frame at run time. • Local lifetime, shape bound at elaboration time — Space can still be allocated in the stack frame, in its variable-size part, once the bounds become known at elaboration time.

  12. Descriptors or Dope Vectors • The symbol table maintains dimension and bounds information for every array in the program. • When these values are not statically known, the compiler generates code to look them up in a dope vector at run time. • The dope vector contains the lower bound of each dimension and the size of each dimension other than the last. • Initialized at elaboration time or whenever the number or bounds of dimensions change. • Assignment (for an array) might require copying both the array data and the dope vector. • Languages with a value model of variables and arrays of dynamic shape may also use dope vectors for records of dynamic shape.
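  A rough C sketch (an assumed layout, not any particular compiler's format) of what a run-time descriptor for a 2-D array of doubles might contain, and how row-major indexing would use it.

      #include <stdio.h>

      /* Hypothetical dope vector for a 2-D array of doubles. */
      struct dope_vector_2d {
          double *data;       /* address of the first element         */
          int     lower[2];   /* lower bound of each dimension        */
          int     extent[2];  /* number of elements in each dimension */
      };

      /* Fetch a[i, j] using row-major address arithmetic. */
      double fetch(const struct dope_vector_2d *dv, int i, int j) {
          int row = i - dv->lower[0];
          int col = j - dv->lower[1];
          return dv->data[row * dv->extent[1] + col];
      }

      int main(void) {
          double storage[3 * 4];
          for (int k = 0; k < 12; k++) storage[k] = k;
          /* Describes a conceptual array a[1..3, 1..4]. */
          struct dope_vector_2d dv = { storage, {1, 1}, {3, 4} };
          printf("a[2,3] = %f\n", fetch(&dv, 2, 3));   /* prints 6.000000 */
          return 0;
      }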

  13. Stack Allocation • Arrays as subroutine parameters: • Early Pascal: required the shape to be specified statically. • Standard Pascal: bounds are symbolic names rather than constants. • Conformant arrays (arrays with such bound parameters): very useful in scientific applications (which depend on numeric libraries). • Can be passed by reference or by value. • Ada and C99: • Support conformant arrays and local arrays of dynamic shape. • Local array shape fixed at elaboration time. • Stack frame is divided into: • Fixed-size part: object’s size statically known. • Variable-size part: object’s size known at elaboration time.
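  A small C99 sketch (illustrative, not from the slides): a conformant-style array parameter plus a local variable-length array whose shape is fixed at elaboration time and lives in the variable-size part of the stack frame.

      #include <stdio.h>

      /* Conformant-style parameter: bounds are passed along with the array. */
      double total(int n, int m, double a[n][m]) {
          double s = 0.0;
          for (int i = 0; i < n; i++)
              for (int j = 0; j < m; j++)
                  s += a[i][j];
          return s;
      }

      int main(void) {
          int n = 3, m = 4;
          double a[n][m];                 /* C99 VLA: shape known only at run time */
          for (int i = 0; i < n; i++)
              for (int j = 0; j < m; j++)
                  a[i][j] = i * m + j;
          printf("total = %f\n", total(n, m, a));
          return 0;
      }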

  14. Heap Allocation • Fully dynamic arrays: can change shape at arbitrary times – must be allocated in the heap. • If the number of dimensions is statically known, the dope vector and the pointer to the data can be kept in the stack frame of the subroutine in which the array was declared. • If the number of dimensions is dynamic, the dope vector must generally be placed at the beginning of the heap-allocated space for the array. • The compiler (or run-time system) has to reclaim the space occupied by fully dynamic arrays. • Some languages (Snobol, Icon, scripting languages) allow strings to change size after elaboration time. • Java, C#: strings are immutable objects.

  15. Array’s Memory Layout • Contiguous elements (Figure 7.7): • Column major - only in Fortran. • Row major: • Used by everybody else. • Makes array [a..b, c..d] the same as array [a..b] of array [c..d].

  16. Array Layout Strategies • Two layout strategies for arrays (Figure 7.8): • Contiguous elements. • Row pointers. • Row pointers: • An option in C. • Allows rows to be put anywhere - nice for big arrays on machines with segmentation problems. • Avoids multiplication. • Nice for matrices whose rows are of different lengths: e.g. an array of strings. • Requires extra space for the pointers.
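  A tiny C illustration (not from the slides) of the two strategies: a contiguous 2-D array versus an array of row pointers, the latter being natural for rows of different lengths such as an array of strings.

      #include <stdio.h>

      int main(void) {
          /* Contiguous layout: one block; indexing multiplies by the row length. */
          int grid[3][4] = {0};
          grid[1][2] = 42;

          /* Row-pointer layout: each "row" is a separate object and may have
             its own length, e.g. an array of strings.                          */
          const char *days[] = { "Monday", "Tuesday", "Wednesday" };

          printf("%d %s\n", grid[1][2], days[1]);
          return 0;
      }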

  17. Array Allocations

  18. Accessing Array Elements
  • A : array [L1..U1] of array [L2..U2] of array [L3..U3] of elem
      D1 = U1 - L1 + 1
      D2 = U2 - L2 + 1
      D3 = U3 - L3 + 1
  • Let:
      S3 = size of elem
      S2 = D3 * S3
      S1 = D2 * S2
  • The address of A[i, j, k] is:
      address of A + ((i - L1) * S1) + ((j - L2) * S2) + ((k - L3) * S3)
  • We could compute all of that at run time, but we can make do with fewer subtractions:
      (i * S1) + (j * S2) + (k * S3) + address of A - [(L1 * S1) + (L2 * S2) + (L3 * S3)]
  • The part in square brackets is a compile-time constant that depends only on the type of A.
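  The same arithmetic as a runnable C sketch, with made-up bounds and an assumed 8-byte element size; the bracketed constant is folded into constant_part.

      #include <stdio.h>

      enum { L1 = 1, U1 = 10, L2 = 1, U2 = 20, L3 = 1, U3 = 30 };
      enum { D2 = U2 - L2 + 1, D3 = U3 - L3 + 1 };
      enum { S3 = 8, S2 = D3 * S3, S1 = D2 * S2 };     /* element size assumed 8 bytes */

      /* Byte offset of A[i, j, k] from the start of the array's storage. */
      long offset(int i, int j, int k) {
          const long constant_part = (long)L1 * S1 + (long)L2 * S2 + (long)L3 * S3;
          return (long)i * S1 + (long)j * S2 + (long)k * S3 - constant_part;
      }

      int main(void) {
          printf("%ld %ld %ld\n",
                 offset(1, 1, 1),    /* 0        */
                 offset(1, 1, 2),    /* S3 = 8   */
                 offset(1, 2, 1));   /* S2 = 240 */
          return 0;
      }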

  19. Strings • In many languages strings are really just arrays of characters. • They are often special-cased, to give them flexibility (like polymorphism or dynamic sizing) that is not available for arrays in general (Snobol, Icon, scripting languages). • Literal characters, literal strings, escape sequences. • Available operations on strings are tied to the implementation: • Pascal, Ada: assignment, comparison. • C: only a pointer to a string literal; other operations come from the library. • Dynamic-length strings: • Fundamental to a large number of applications. • It's easier to provide these things for strings than for arrays in general because strings are one-dimensional and non-circular. • Built-in type (ML, Lisp) or class (C++, Java, C#) – a string variable is a reference to a string.

  20. Sets • A set is an unordered collection of an arbitrary number of distinct values of a common type. • Pascal: sets of any discrete type. • Icon: only sets of characters. • Python: sets of arbitrary type. • Ada: a set package. • C++, Java, C#: standard libraries. • Possible implementations: • Arrays, hash tables, trees. • Bit vectors are what usually get built into programming languages. • Operations like intersection, union, and membership can be implemented efficiently with bitwise logical instructions. • Some languages place limits on the sizes of sets to make life easier for the implementor.
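  A minimal C sketch (illustrative) of a bit-vector set over the range 0..63, showing why union, intersection, and membership map directly onto bitwise instructions.

      #include <stdio.h>
      #include <stdint.h>

      typedef uint64_t set64;                 /* one bit per possible element */

      static set64 set_add(set64 s, int x)         { return s | ((set64)1 << x); }
      static int   set_member(set64 s, int x)      { return (int)((s >> x) & 1); }
      static set64 set_union(set64 a, set64 b)     { return a | b; }
      static set64 set_intersect(set64 a, set64 b) { return a & b; }

      int main(void) {
          set64 evens = 0, small = 0;
          for (int i = 0; i < 64; i += 2) evens = set_add(evens, i);
          for (int i = 0; i < 10; i++)    small = set_add(small, i);

          set64 both = set_intersect(evens, small);   /* {0, 2, 4, 6, 8} */
          printf("4? %d  5? %d  union has 0? %d\n",
                 set_member(both, 4), set_member(both, 5),
                 set_member(set_union(evens, small), 0));
          return 0;
      }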

  21. Pointers • Pointers serve two purposes: • Efficient (and sometimes intuitive) access to elaborated objects (C). • Dynamic creation of linked data structures, in conjunction with a heap storage manager. • Pointers are used with a value model of variables: • Pointers (a high-level concept) are not addresses (a low-level concept). • They are not needed with a reference model. • Several languages (e.g. Pascal) restrict pointers to accessing things in the heap: • How and when is storage reclaimed for objects no longer needed? • Many languages require the programmer to explicitly reclaim space: • Memory leak: failure to reclaim space for objects no longer needed. • Dangling reference: a pointer to an object that has been reclaimed while still in use. • Garbage collection: automatic storage reclamation.

  22. Pointer Syntax and Operations • Operations include: • Allocation/deallocation of objects on the heap. • Assignment of one pointer to another. • Functional languages: • A reference model for names; objects are allocated automatically. • Imperative languages, for example A := B: • Value model (C, Pascal, Ada): if B refers to an object, B is a pointer and A has to be a pointer to refer to that object. • Reference model (Clu, Smalltalk): always makes A refer to the same object to which B refers. • Mixed approach (Java): • Value model: built-in primitive data types. • Reference model: user-defined types. • Mixed approach (C#): mirrors Java but provides additional, “unsafe” features when pointers are needed.

  23. Reference Model
  • ML (static typing) - the datatype mechanism:
      datatype chr_tree = empty | node of char * chr_tree * chr_tree;
      node (#"Y", node (#"Z", empty, empty), node (#"W", empty, empty))
  • Lisp (dynamic typing): • Semantically, each list node is a pair of references, one to the head and one to the remainder of the list:
      (#\Y (#\Z () ()) (#\W () ()))
  • In purely functional languages, the data structures created with recursive types turn out to be acyclic: • New objects refer to old ones while old objects never change. • Circular structures can be defined only using the imperative features. • Mutually recursive types: • ML: types declared together in a group. • Lisp: trivial since it is dynamically typed.

  24. Value Model
  • Pascal:
      type chr_tree_ptr = ^chr_tree;
           chr_tree = record
               left, right : chr_tree_ptr;
               val : char
           end;
  • Ada:
      type chr_tree;
      type chr_tree_ptr is access chr_tree;
      type chr_tree is record
          left, right : chr_tree_ptr;
          val : character;
      end record;
  • C:
      struct chr_tree {
          struct chr_tree *left, *right;
          char val;
      };
  • Dereferencing: an explicit dereferencing operator (C) vs. automatic dereferencing (Ada).

  25. Pointers and Arrays
  • C pointers and arrays:
      int *a   ==   int a[]
      int **a  ==   int *a[]
  • But the equivalences don't always hold: • Specifically, a declaration allocates an array if it specifies a size for the first dimension. • Otherwise it allocates a pointer:
      int **a, int *a[]   pointer to pointer to int
      int *a[n]           n-element array of row pointers
      int a[n][m]         2D array
  • The compiler has to be able to tell the size of the things to which you point, so the following aren't valid:
      int a[][]     bad
      int (*a)[]    bad
  • C declaration rule: read right as far as you can (subject to parentheses), then left, then out a level and repeat:
      int *a[n]     n-element array of pointers to integer
      int (*a)[n]   pointer to n-element array of integers

  26. Dangling References • Problems with dangling pointers are due to: • Explicit deallocation of heap objects: • Only in languages that have explicit deallocation. • Implicit deallocation of elaborated objects. • Two implementation mechanisms to catch dangling pointers: • Tombstones: • An extra level of indirection on every pointer access. • When an object is reclaimed, the tombstone is marked to invalidate future references to the object. • Can be used in languages that permit pointers to nonheap objects. • How to reclaim the tombstones themselves? • Locks and keys: • Add a word to every pointer and to every object in the heap. • These words must match for the pointer to be valid. • Simpler, but works only for objects in the heap.
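  A rough C sketch (an assumption about how a checked implementation might work, not a real library) of the locks-and-keys idea: each checked pointer carries a key that must match the lock word stored in the object.

      #include <stdio.h>
      #include <stdlib.h>

      struct locked_obj  { unsigned lock; int payload; };
      struct checked_ptr { struct locked_obj *addr; unsigned key; };

      static unsigned next_key = 1;

      struct checked_ptr checked_alloc(int v) {
          struct locked_obj *o = malloc(sizeof *o);
          o->lock = next_key++;            /* fresh lock for this allocation */
          o->payload = v;
          return (struct checked_ptr){ o, o->lock };
      }

      /* Deallocation clears the lock, invalidating every outstanding key.
         (A real allocator would also recycle the space; omitted here.)    */
      void checked_free(struct checked_ptr p) { p.addr->lock = 0; }

      int checked_deref(struct checked_ptr p) {
          if (p.addr->lock != p.key) {     /* dangling reference caught */
              fprintf(stderr, "dangling pointer!\n");
              exit(1);
          }
          return p.addr->payload;
      }

      int main(void) {
          struct checked_ptr p = checked_alloc(42);
          printf("%d\n", checked_deref(p));
          checked_free(p);
          checked_deref(p);                /* reported at run time */
          return 0;
      }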

  27. Garbage Collection • Garbage collection: automatic reclamation of objects that are no longer used (difficult to implement): • Essential for functional languages. • Popular for imperative languages. • A classic trade-off between convenience/safety and performance. • Reference counts: each object has a counter for the number of pointers that point to it: • Each pointer must be initialized to null at elaboration time. • The implementation must identify the location of every pointer: relies on type descriptors generated by the compiler. • Useful object: an object may be useless even when references to it still exist (e.g., in a circular structure). • Tracing collection: a useful object can be reached by following a chain of valid pointers starting from something that has a name (i.e., outside the heap).
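  A bare-bones C sketch (illustrative only) of reference counting: retain/release adjust the counter, and the object is reclaimed when the count drops to zero. A real implementation also releases the pointers stored inside the object, and no reference-counting scheme can reclaim circular structures.

      #include <stdlib.h>

      struct rc_obj {
          int refcount;   /* number of pointers currently referring to this object */
          int payload;
      };

      struct rc_obj *rc_new(int v) {
          struct rc_obj *o = malloc(sizeof *o);
          o->refcount = 1;                 /* the creating pointer */
          o->payload = v;
          return o;
      }

      void rc_retain(struct rc_obj *o) { o->refcount++; }

      void rc_release(struct rc_obj *o) {
          if (--o->refcount == 0)          /* last pointer gone: reclaim */
              free(o);
      }

      int main(void) {
          struct rc_obj *o = rc_new(7);
          rc_retain(o);    /* a second pointer now refers to o */
          rc_release(o);   /* count 2 -> 1 */
          rc_release(o);   /* count 1 -> 0: freed */
          return 0;
      }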

  28. Garbage Collection Mechanisms I • Mark-and-sweep is the classic mechanism: • 1. Every block in the heap is marked “useless”. • 2. Starting from all pointers outside the heap, recursively explore all linked data structures and mark each newly discovered block as “useful”. • 3. Move the blocks that are still “useless” to the free list. • Steps 1 and 3 require variable-size blocks to begin with indications of their size and free/in-use status. • Step 2 must be able to find the pointers within each block. • Needs a stack with depth proportional to the heap size: stack and heap grow toward each other, so a full heap means no stack space. • Pointer reversal embeds the equivalent of the stack in already existing fields in heap blocks: • As it explores the path to a given block, it reverses the pointers. • Reversed pointers must be marked (usually with another bookkeeping field) to distinguish them from forward pointers. • At most one pointer in a block will be reversed at any given time.

  29. Garbage Collection Mechanisms II • Stop and Copy: reduce fragmentation by storage compaction: • The heap is divided into two halves: all allocation in the first half. • When it fills, all useful objects are copied to the second half. • Generational Collection: • Most dynamically allocated objects are short-lived. • The heap is divided into regions based on the “age” of objects. • Objects gradually progress to “older” regions (as in stop-and-copy): pointers need new values. • Write barrier: maintains a hidden list of old-to-new pointers. • Conservative Collection: • Mark-and-sweep without being able to find the pointers. • The heap spans a relatively small number of addresses. • Small probability that a non-pointer will contain a pattern that looks like one of those addresses. • Safe if the programmer does not “hide” pointers.

  30. Lists • A list is defined recursively as either: • The empty list. • A pair consisting of an object (a list or an atom) and another (shorter) list. • Suited to functional and logic languages but also used in imperative languages. • ML lists are homogeneous: a chain of blocks, each of which contains an element and a pointer to the next block. • Lisp lists are heterogeneous: a chain of cons cells, each containing two pointers, one to the element and one to the next cons cell. • List notation: • ML: enclosed in square brackets, with elements separated by commas, [a, b, c, d]. • Lisp: enclosed in parentheses, with elements separated by white space, (a b c d).
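  A small C sketch (not from the slides) of the cons-cell picture: each cell holds a pointer to its element and a pointer to the rest of the list, with NULL playing the role of the empty list.

      #include <stdio.h>
      #include <stdlib.h>

      struct cons {
          void        *car;   /* the element          */
          struct cons *cdr;   /* the rest of the list */
      };

      struct cons *make_cons(void *car, struct cons *cdr) {
          struct cons *c = malloc(sizeof *c);
          c->car = car;
          c->cdr = cdr;
          return c;
      }

      int main(void) {
          /* Builds the list (a b c), like (cons 'a (cons 'b (cons 'c nil))). */
          struct cons *lst = make_cons("a", make_cons("b", make_cons("c", NULL)));
          for (struct cons *p = lst; p != NULL; p = p->cdr)
              printf("%s ", (const char *)p->car);
          printf("\n");
          return 0;
      }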

  31. Basic List Operations
  • Constructing lists from, and extracting, their components (Lisp on the left, ML on the right):
      (cons 'a '(b))          => (a b)         a :: [b]         => [a, b]
      (car '(a b))            => a             hd [a, b]        => a
      (car nil)               => ??            hd []            => ??
      (cdr '(a b c))          => (b c)         tl [a, b, c]     => [b, c]
      (cdr '(a))              => nil           tl [a]           => nil
      (cdr nil)               => ??            tl []            => run-time exception
      (append '(a b) '(c d))  => (a b c d)     [a, b] @ [c, d]  => [a, b, c, d]
  • List comprehensions (Miranda, Haskell, Python, F#): • Adopted from traditional mathematical set notation. • A common form comprises an expression, an enumerator, and one or more filters.
  • Example: the list of the squares of all odd numbers less than 100:
      Mathematical: {i ⨉ i | i ∈ {1,…,100} ∧ i mod 2 = 1}
      Haskell:      [i*i | i <- [1..100], i `mod` 2 == 1]
      Python:       [i*i for i in range(1, 100) if i % 2 == 1]
      F#:           [for i in 1..100 do if i % 2 = 1 then yield i*i]

  32. Lisp: car and cdr • car gets the first element of a list. • cdr gets the remainder of a list. • The names derive from the original implementation of Lisp on the IBM 704: • The machine architecture included 15-bit “address” and “decrement” fields in some of the 36-bit loop-control instructions. • Additional instructions could load an index register from, or store it to, one of these fields within a 36-bit memory word. • The Lisp interpreter's designers mimicked the internal format of instructions to be able to exploit them. • CAR: contents of address of register. • CDR: contents of decrement of register. • Fortran was also developed on the IBM 704 (three-way IF): • The first commercial machine to include hardware floating point and magnetic core memory.

  33. Files and Input/Output • Input/output facilities allow a program to communicate with the outside world. • Interactive input/output is very platform-specific. • Files: off-line storage implemented by the operating system. • Temporary: exist for the duration of a single program run. • Persistent: exist before the program begins and/or after it ends. • Input/output is one of the most difficult aspects of a language to design, and one that varies most from language to language. • Built-in file data type and special syntactic constructs for I/O: • Ability to employ non-subroutine syntax. • Ability to perform operations not available to library routines. • Library packages providing a file type and a variety of input/output subroutines: • Keep the “clutter” out of the language definition.

  34. Equality Testing and Assignment • Primitive data types: equality testing and assignment are relatively straightforward. • Complex/abstract data types raise semantic and implementation issues. For example, when comparing two character strings, do we ask whether they: • Are aliases for one another? • Occupy storage that is bit-wise identical over its full length? • Contain the same sequence of characters? • Would appear the same if printed? • Distinction between l-values and r-values, for references: • Shallow comparison: do they refer to the same object? • Deep comparison: do they refer to equal objects (may need a recursive traversal)? • In imperative languages, for the assignment a := b: • Reference model: shallow (share the same reference) vs. deep (copy the object). • Value model: shallow (copy the value but not the objects it refers to).
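  A tiny C illustration (not from the slides) of the shallow/deep distinction for character strings: comparing pointers asks whether two names refer to the same object, while strcmp asks whether the objects contain the same characters.

      #include <stdio.h>
      #include <string.h>

      int main(void) {
          char  a[] = "hello";
          char  b[] = "hello";   /* same characters, different object */
          char *c   = a;         /* alias for a                       */

          /* Shallow: same object? */
          printf("a == b: %d   a == c: %d\n", a == b, a == c);     /* 0   1 */

          /* Deep: same contents? */
          printf("strcmp(a, b) == 0: %d\n", strcmp(a, b) == 0);    /* 1 */
          return 0;
      }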

  35. Language Implementations • Most programming languages employ both shallow comparisons and shallow assignments. • Some provide more than one option for comparison; Scheme has three general-purpose equality-testing functions:
      (eq? a b)      ; do a and b refer to the same object?
      (eqv? a b)     ; are they semantically equivalent?
      (equal? a b)   ; do they have the same recursive structure?
  • Deep assignments are relatively rare. • For user-defined abstractions, no single language-specified mechanism for equality testing/assignment is likely to work: • Allow the programmer to define the comparison/assignment operators for each new data type. • Allow the programmer to specify that equality testing and/or assignment is not allowed.

  36. Summary • Key issues for records include the syntax and semantics of variant records, whole-record operations, type safety, and related memory-layout issues. • For recursive data types, much depends on the choice between the value and reference models of variables/names. • Recursive types are generally used to create linked data structures. • Newer languages have improved semantics at the expense of complexity and cost, e.g. type-safe variant records (Ada), standard-length numeric types (Java, C#), and array slicing (Fortran 90).
