
Arrays and Strings





  1. Arrays and Strings CSCI 2720 University of Georgia Spring 2007

  2. The Array ADT • Stores a sequence of consecutively numbered objects • Each object can be accessed (selected) using its index

  3. More formally … • Given integers l and u with u >= l-1, the interval l..u is defined to be the set of integers i such that l <= i <= u (the interval is empty when u = l-1) • An array is a function from any interval (the index set of the array) to a set of objects or elements (the value set of the array)

  4. Formally, continued … • If X is an array and i is a member of its index set, • We write X[i] to denote the value of X at i • The members of the range of X are known as the elements of X

  5. The Array ADT • Access(X,i) • Length(X) • Assign(X,i,v) • Initialize(X,v) • Iterate(X,F)

  6. Access(X,i) • Return X[i]

  7. Length(X) • Return u – l + 1, the number of elements in I (the interval on X)

  8. Assign(X,i,v) • Replace array X with a function whose value on i is v (and whose value on all other arguments is unchanged). • We also write this as: • X[i] <- v

  9. Initialize(X,v) • Assign v to every element of array X

  10. Iterate(X,F) • Apply F to each element of array X in order, from smallest index to largest index. • F is an action on a single array element. • for i = l to u do F(X[i])
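The five operations on slides 5 to 10 can be sketched in Python as follows. This is a minimal illustration rather than anything from the original slides: the class name `IntervalArray` and its method names are hypothetical, and an ordinary Python list stands in for the underlying storage.

```python
class IntervalArray:
    """Array ADT over the index interval l..u (hypothetical sketch)."""

    def __init__(self, l, u, v=None):
        if u < l - 1:
            raise ValueError("need u >= l - 1")
        self.l, self.u = l, u
        # One cell per index in l..u, every cell starting out as v.
        self._cells = [v] * (u - l + 1)

    def _offset(self, i):
        # Map an index in l..u to a position in the backing list.
        if not (self.l <= i <= self.u):
            raise IndexError(f"index {i} is outside {self.l}..{self.u}")
        return i - self.l

    def access(self, i):
        """Access(X, i): return X[i]."""
        return self._cells[self._offset(i)]

    def length(self):
        """Length(X): u - l + 1."""
        return self.u - self.l + 1

    def assign(self, i, v):
        """Assign(X, i, v): X[i] <- v, all other elements unchanged."""
        self._cells[self._offset(i)] = v

    def initialize(self, v):
        """Initialize(X, v): assign v to every element."""
        for k in range(len(self._cells)):
            self._cells[k] = v

    def iterate(self, f):
        """Iterate(X, F): apply F to each element, smallest index first."""
        for i in range(self.l, self.u + 1):
            f(self.access(i))
```

For example, `IntervalArray(3, 6, 0)` has index set 3..6 and length 4, and `IntervalArray(1, 0)` is the empty array allowed by the condition u >= l-1.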

  11. String • A special type of array • If Σ is any finite set, then a string over Σ is an array whose value set is Σ and whose index set is 0..n-1 for some non-negative n • The set Σ is called an alphabet • Each element of Σ is called a character • Σ often consists of the Roman alphabet, plus digits, the space, and common punctuation marks

  12. Strings • If w is a string, then • Length(w) = n • Also written |w| • If w = TREE, then • w is a string of length 4 • w[0] = T, w[1] = R • The null string is the string whose domain is the empty interval • Has no elements • Written ε

  13. String-specific operations • Substring(w,i,m) • Concat(w1,w2)

  14. Substring(w,i,m) • w is a string; i, m integers • Returns the string of length m containing the portion of w that starts at i • Formally: • returns a string w’ with indices 0..m-1 such that w’[k] = w[i+k] for each k satisfying 0 <= k < m • only applies if • 0 <= i <= |w| and • 0 <= m <= |w| - i • otherwise, returns the null string ε

  15. Substring … • Example: w = SNICKERING • Substring(w,2,3) returns ICK • Substring(w,3,0) returns the null string ε • Substring(w,10,3) returns the null string ε • Prefix • each Substring(w,0,j) for 0 <= j <= |w| is a prefix of w • Suffix • each Substring(w,j,|w| - j) for 0 <= j <= |w| is a suffix of w

  16. Concat(w1,w2) • returns a string • of length |w1| + |w2| • whose characters are the characters of w1 followed by those of w2 • Concat(w,ε) = Concat(ε,w) = w, where ε is the null string • Example: • w1 = BIRD, w2 = DOG, • Concat(w1,w2) = BIRDDOG • Concat(w2,w1) = DOGBIRD
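The two string-specific operations can be sketched in Python as below. The function names are hypothetical, Python `str` stands in for the string ADT, and the empty string `""` plays the role of the null string. The bounds check assumes the reading 0 <= i <= |w| and 0 <= m <= |w| - i, which is the one that makes the prefix and suffix definitions on slide 15 work.

```python
def substring(w, i, m):
    """Return the m characters of w starting at index i.

    Returns the null string "" when the request falls outside w.
    """
    if 0 <= i <= len(w) and 0 <= m <= len(w) - i:
        # w'[k] = w[i + k] for each k in 0..m-1
        return "".join(w[i + k] for k in range(m))
    return ""


def concat(w1, w2):
    """Return the characters of w1 followed by those of w2."""
    return w1 + w2


def prefixes(w):
    """All prefixes of w: Substring(w, 0, j) for 0 <= j <= |w|."""
    return [substring(w, 0, j) for j in range(len(w) + 1)]


def suffixes(w):
    """All suffixes of w: Substring(w, j, |w| - j) for 0 <= j <= |w|."""
    return [substring(w, j, len(w) - j) for j in range(len(w) + 1)]
```

With w = SNICKERING, `substring(w, 2, 3)` gives ICK, while `substring(w, 3, 0)` and `substring(w, 10, 3)` both give the null string, matching the examples on slide 15.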

  17. Tables vs. Arrays • Table = physical organization of memory into sequential cells • Array = an abstract data type, with specific operations • Arrays frequently implemented using tables, but may be implemented in other ways

  18. Multi-dimensional arrays • a function whose range is any set V and whose domain is the Cartesian product of any number of intervals • the Cartesian product of intervals I1, I2, … Id, written as I1 x I2 x … x Id, is the set of all d-tuples <i1, i2, … id> such that ik ∈ Ik for each k.

  19. Multi-D arrays • if C is a multidimensional array and if i = <i1, i2, … id> then C[i1, i2, … id] is the value of C at i • The dimension of a multi-D array is the number of intervals whose Cartesian product makes up the index set • The size of the kth dimension of such an array is the number of elements in Ik

  20. Contiguous Representation of Arrays: Why Computer Scientists start counting at 0 • Store elements in a table:
        address:  x     x+4   x+8   x+12  x+16  x+20
        element:  x[0]  x[1]  x[2]  x[3]  x[4]  x[5]
        value:    17    43    87    94    101   143
      • Each element begins at x + 4i (with indices starting at 1, the formula would instead be x + 4(i-1)) • x = starting address of the array • 4 = sizeof(element) • i = index of element of interest

  21. More generally • if X is the address of the first cell in memory of an array with indices l..u, and if each element has size L, then • the element with index i is stored at address X + L * (i - l) • the element can be retrieved in constant time

  22. When iterating through the array • can save a few operations by doing “pointer arithmetic” • just add L to current address to get next element • don’t have to subtract, multiply, add • still linear in number of elements, but faster linear
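The address calculation and the pointer-arithmetic shortcut can be compared in a small Python sketch. Addresses are modelled as plain integers, the function and parameter names are hypothetical, and the formula assumes the element with the lowest index l sits at the base address, so the offset is measured as i - l.

```python
def element_address(base, elem_size, lower, i):
    """Address of the element with index i in an array indexed lower..u.

    One subtraction, one multiplication, one addition: constant time.
    """
    return base + elem_size * (i - lower)


def addresses_by_increment(base, elem_size, count):
    """Yield the address of each element in order using only additions.

    This mirrors pointer arithmetic: step from one element to the next
    by adding the element size to the current address.
    """
    addr = base
    for _ in range(count):
        yield addr
        addr += elem_size
```

With a base address of 1000 and 4-byte elements, both approaches visit 1000, 1004, …, 1020 for a six-element array; the second avoids the subtraction and multiplication on every step but is still linear overall.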

  23. Where’s the needed info stored? • Could store L, l, and u at the starting address of X .. but would need to adjust the formula to calculate the location of individual cells. • If language is strongly typed, some or all of L, l, and u may be part of the definition of X and stored elsewhere • C/C++ -- L part of typing info, l assumed to be 0, u not stored (programmer needs to keep track)

  24. Where’s the needed info stored? • Can use a sentinel value after the last element of the array • C/C++ -- we do this with strings. Store a ‘\0’ at the end • means that you need to iterate through to find Length, no longer O(1)

  25. What if the elements have different lengths? • Option 1: allot Max (the largest element size) to all elements • wasted space • can still access in O(1) time • Option 2: store pointers to elements • pointers require memory • need 2 accesses (calculate location of pointer, then follow it), but still O(1) • pointer to the element with index i is at X + P * (i - l), where P is the size of a pointer • easy to swap even large or complex elements … just swap their pointers

  26. 2D arrays • can also represent in contiguous memory … but do we keep rows together or do we keep columns together?? • Example: array with logical ordering
        A B C D
        E F G H
        I J K L

  27. Row-major v. column-major
        Row-major:    A B C D E F G H I J K L
        Column-major: A E I B F J C G K D H L

  28. Where are 2D elements stored? • Row-major: R[i,j] stored at: • R + L * (NPR * (i-1) + (j-1)), where • R is starting address of the array • L is the size of each element • NPR is the number of elements per row • i is the row number (rows numbered from 1) • j is the column number (columns numbered from 1)

  29. Where are 2D elements stored? • Column-major: C[i,j] stored at: • C + L * (NPC * (j-1) + (i-1)), where • C is starting address of the array • L is the size of each element • NPC is the number of elements per column • i is the row number (rows numbered from 1) • j is the column number (columns numbered from 1)
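The two layouts can be checked against the A to L example with a short Python sketch. The function names are hypothetical, rows and columns are numbered from 1 as the (i-1) and (j-1) terms imply, and offsets are counted in elements, so the memory address would be the base address plus L times the offset.

```python
def row_major_offset(i, j, per_row):
    """Element offset of [i, j] when rows are kept together (1-based)."""
    return per_row * (i - 1) + (j - 1)


def column_major_offset(i, j, per_column):
    """Element offset of [i, j] when columns are kept together (1-based)."""
    return per_column * (j - 1) + (i - 1)


# The 3 x 4 example array from slide 26.
grid = ["ABCD", "EFGH", "IJKL"]
rows, cols = len(grid), len(grid[0])

row_major = [None] * (rows * cols)
col_major = [None] * (rows * cols)
for i in range(1, rows + 1):
    for j in range(1, cols + 1):
        value = grid[i - 1][j - 1]
        # Elements per row is the column count; per column, the row count.
        row_major[row_major_offset(i, j, cols)] = value
        col_major[column_major_offset(i, j, rows)] = value

print("".join(row_major))  # ABCDEFGHIJKL
print("".join(col_major))  # AEIBFJCGKDHL
```

The two printed orderings match the row-major and column-major sequences on slide 27.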

  30. Multi-dimensional arrays

  31. Constant-time initialization
        procedure Initialize(ptr M, value v)
          // Initialize each element of M to v
          Count(M) <- 0
          Default(M) <- v

        function Valid(int i, ptr M): boolean
          // Return true if M[i] has been modified since the last Initialize
          return (0 <= When(M)[i] < Count(M)) and (Which(M)[When(M)[i]] == i)

  32. Constant-time initialization
        function Access(int i, ptr M): value
          // Return M[i]
          if Valid(i, M) then return Data(M)[i]
          else return Default(M)

        procedure Assign(ptr M, int i, value v)
          // Set M[i] <- v
          if not Valid(i, M) then
            When(M)[i] <- Count(M)
            Which(M)[Count(M)] <- i
            Count(M) <- Count(M) + 1
          Data(M)[i] <- v

  33. But requires 3x memory … • three tables per array: Which(M), When(M), Data(M)
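The pseudocode on slides 31 and 32 translates to Python roughly as follows. The class name `LazyArray` is hypothetical. Python has no uninitialized memory, so the constructor deliberately fills all three tables with random garbage to imitate it; that setup loop is linear and is only scaffolding for the demonstration, whereas `initialize` itself touches two fields regardless of the array size.

```python
import random


class LazyArray:
    """Array with O(1) Initialize, using Data/When/Which tables (sketch)."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)

        def garbage():
            # Stand-in for whatever bits happen to be in fresh memory.
            return [rng.randrange(-10**6, 10**6) for _ in range(n)]

        self.data = garbage()    # Data(M): the element values
        self.when = garbage()    # When(M)[i]: position of i in Which(M)
        self.which = garbage()   # Which(M)[k]: k-th index assigned so far
        self.count = 0           # Count(M): indices assigned since Initialize
        self.default = None      # Default(M)

    def initialize(self, v):
        # Constant time: no table is touched.
        self.count = 0
        self.default = v

    def valid(self, i):
        # True only if i was assigned since the last initialize().
        w = self.when[i]
        return 0 <= w < self.count and self.which[w] == i

    def access(self, i):
        return self.data[i] if self.valid(i) else self.default

    def assign(self, i, v):
        if not self.valid(i):
            self.when[i] = self.count
            self.which[self.count] = i
            self.count += 1
        self.data[i] = v
```

The range check in `valid` runs first, so garbage in `when` can never index outside `which`; and a garbage value that happens to land inside 0..count-1 is rejected because `which` at that position names a different index.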

  34. Sparse Arrays • Definitions • List Representations • Hierarchical Tables • Arrays with Special Shapes

  35. Sparse Arrays • some arrays contain only a few elements … wouldn’t it be more efficient to store only the non-null values? • same idea when only a few values differ from the majority • some arrays have a special shape, e.g. an upper-triangular matrix or a symmetric matrix • sparse array: an array in which only a small fraction of the elements are significant in some way • null element: doesn’t need to be stored; is either actually null, or well-known, or easily calculated

  36. List representations

  37. Hierarchical tables

  38. Upper-triangular matrix

  39. Representation of Strings • Background • Huffman Encoding • Lempel-Ziv Encoding

  40. Representing Strings • How much space do we need? • Assume we represent every character. • How many bits to represent each character? • Depends on |Σ|, the size of the alphabet

  41. Bits to encode a character • Two-character alphabet {A,B} • one bit per character: • 0 = A, 1 = B • Four-character alphabet {A,B,C,D} • two bits per character: • 00 = A, 01 = B, 10 = C, 11 = D • Six-character alphabet {A,B,C,D,E,F} • three bits per character: • 000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused

  42. More generally • The bit sequence representing a character is called the encoding of the character. • There are 2^n different bit sequences of length n, so ceil(lg |Σ|) bits are required to represent each character in Σ • if we use the same number of bits for each character then the length of the encoding of a word w is |w| * ceil(lg |Σ|)
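The fixed-length bit counts above can be reproduced with a few lines of Python. The function names are hypothetical, and the code assignment (character k in the alphabet gets the binary form of k) is just the natural one used on slide 41. The ceiling of the base-2 logarithm is computed with integer arithmetic via `int.bit_length` to avoid floating-point rounding.

```python
def bits_per_character(alphabet_size):
    """ceil(lg(alphabet_size)) for alphabet_size >= 1."""
    if alphabet_size < 1:
        raise ValueError("alphabet must be non-empty")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    return (alphabet_size - 1).bit_length()


def encoded_length(w, alphabet_size):
    """Bits needed for w when every character uses the same width."""
    return len(w) * bits_per_character(alphabet_size)


def encode(w, alphabet):
    """Encode w over an alphabet of at least two characters."""
    width = bits_per_character(len(alphabet))
    return "".join(format(alphabet.index(c), f"0{width}b") for c in w)


print(bits_per_character(2), bits_per_character(4), bits_per_character(6))
print(encode("FAD", "ABCDEF"))  # 101 000 011 run together
```

The first line printed is `1 2 3`, matching the two-, four-, and six-character alphabets on slide 41; the second is `101000011`.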

  43. Can we do better?? • If Σ is very small, might use run-length encoding

  44. What if … • the string we encode doesn’t use all the letters in the alphabet? • then ceil(log2(|set_of_characters_used|)) bits per character are enough • But then also need to store / transmit the mapping from encodings to characters • … and the set of characters used is typically close to the size of the alphabet

  45. Huffman Encoding: • Still assumes encoding on a per-character basis • Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings • requires assigning longer codes to rarely used characters • Problem: • when decoding, need to know how many bits to read off for each character. • Solution: • Choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.

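  46. A Huffman Encoding Tree (node weights in parentheses; left edges labeled 0, right edges labeled 1)
                    (21)
                  0/    \1
               E (9)    (12)
                      0/    \1
                    (5)      (7)
                  0/  \1   0/  \1
               A (3) T (2) R (3) N (4)

  47. (Figure: the same encoding tree as on slide 46.)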

  48. Weighted path length
        Weighted path length = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
          = (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
          = 9 + 6 + 9 + 12 + 9
          = 45
      • Claim (proof in text): no other encoding can result in a shorter weighted path length
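The weighted path length can be recomputed directly from the encoding tree. In the Python sketch below, the tree shape and the frequencies (A 3, T 2, R 3, N 4, E 9) are read off the flattened figure on slide 46 and the arithmetic on slide 48, so treat them as a reconstruction; the nested-tuple representation and the function names are hypothetical. A small decoder is included to show why the prefix property matters.

```python
# Internal node = (left, right); leaf = the character itself.
# Taking the left branch emits bit 0, the right branch bit 1.
tree = ("E", (("A", "T"), ("R", "N")))
freq = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}


def codes(node, prefix=""):
    """Map each character to the bit string on its root-to-leaf path."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table


def weighted_path_length(table, freq):
    """Sum of code length times frequency over all characters."""
    return sum(len(table[c]) * freq[c] for c in table)


def decode(bits, tree):
    """Walk the tree bit by bit; reaching a leaf ends one character."""
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if isinstance(node, str):
            out.append(node)
            node = tree
    return "".join(out)


table = codes(tree)
print(table)
print(weighted_path_length(table, freq))  # 45
```

Because no code is a prefix of another, `decode("10111000", tree)` recovers TREE without any separators between the characters.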

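  49. Building the Huffman Tree • Initial nodes: A (3), T (4), R (4), E (5)

  50. Building the Huffman Tree • After the first merge: a new node of weight 7 with children A (3) and T (4); R (4) and E (5) remain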
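The merge step shown on slides 49 and 50 is the usual Huffman construction: repeatedly combine the two lightest nodes until one tree remains. The sketch below uses Python's `heapq` module as the priority queue; the function names and the nested-tuple tree are hypothetical, and the tie-breaking rule (earlier-created nodes first) is an assumption chosen so that the first merge pairs A with T as on slide 50.

```python
import heapq
from itertools import count


def build_huffman(freq):
    """Return a tree of nested (left, right) tuples with characters as leaves."""
    tie = count()  # Breaks weight ties and keeps nodes from being compared.
    heap = [(w, next(tie), ch) for ch, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # lightest node
        w2, _, b = heapq.heappop(heap)   # second lightest node
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))
    return heap[0][2]


def code_lengths(node, depth=0):
    """Map each character to its depth in the tree, i.e. its code length."""
    if isinstance(node, str):
        return {node: depth}
    lengths = {}
    for child in node:
        lengths.update(code_lengths(child, depth + 1))
    return lengths


tree = build_huffman({"A": 3, "T": 4, "R": 4, "E": 5})
print(tree)                # (('A', 'T'), ('R', 'E'))
print(code_lengths(tree))  # every character ends up with a 2-bit code
```

Running the same function on the frequencies A 3, T 2, R 3, N 4, E 9 gives E a 1-bit code and the other four characters 3-bit codes, for a weighted path length of 45.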
