- 69 Views
- Uploaded on
- Presentation posted in: General

Random Number Generators.

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Random Number Generators.

There is no such thing as a "random number generator", just processes that will generate sequences of numbers that will satisfy more and more statistical tests for "randomness".

In some cases the process will generate an unrepeatable sequence - not really desirable from the point of view of creating "repeatable tests", but probably desirable from the point of view of satisfying tests for randomness - while in other cases the sequence will be completely determined by its first element - and still pass an acceptable battery of randomness tests.

Basically, what one wants is a way of producing numbers that appear to be independent draws from the U(0,1) distribution, in that they pass a series of statistical tests.

Characteristics of a good random number generator.

1) The numbers produced should appear uniformly distributed on [0, 1] and uncorrelated.

2) The generator should be fast and not require much memory - the first condition can be satisfied by a large table, but the second can't.

3) The stream of random numbers should be reproducible, both for debugging purposes and for comparison between different systems (or ones where just some parameters have been changed).

4) There should be an ability to generate several independent streams.

One of the earliest generators, credited to von Neumann and Morgenstern (the originators of the theory of games - although John von Neumann was involved in the origination of much that has existed only since 1940), uses the so-called mid-square method.

It simply starts from a four digit number, squares it, takes the middle four digits of the result, squares that, etc. It is not very good - tends to zero out fairly quickly - but was the starter.

Linear Congruential Generators.

A sequence of numbers Z1, Z2, …, is defined recursively

Zi = (aZi-1 + c) (mod m),

where m (the modulus), a (the multiplier), c (the increment) and Z0 (the seed) are positive integers.

To obtain a sequence of random numbers in [0, 1], we need to divide by m: Ui = Zi/m.

In addition to non-negativity, the integers m, a, c and Z0 should satisfy the inequalities:

0 < m, a < m, c < m, Z0 < m.

There are at least two obvious problems with this scheme.

problem 1:

The sequence is completely determined from the four parameters - not surprising, but the formula is surprisingly simple…

problem 2: the only values attainable are

0, 1/m, 2/m, 3/m, …, i/m, …, (m - 1)/m,

and this may be already unduly optimistic since choice for the other parameters and the approximations introduced by floating point computations may substantially reduce this set.

This implies that we should expect cycles: sequences Z0, Z1, …, Zn, where Z0 = Zn, and such that no other pair 0 ≤ j < k ≤ n satisfies Zj = Zk. The length of the cycle is called the period of the generator. It should be obvious that the period is at mostm.

Theorem: The Linear Congruence Generator Zi = (aZi-1 + c) (mod m),

has full period (= m) if and only if:

a) m and c are relatively prime;

b) if q is a prime factor of m, then q divides a - 1;

c) if 4 divides m, then 4 divides a - 1.

Prescription: take m to be the largest prime representable in the hardware and go. Noooo…. Why not? Efficiency of operations: the remainder after division by m needs to be obtained, and division is, usually, a slow operation. Choosing m to be of the form 2b, allows one to use hardware operations like shifts and make use of controlled overflow effects, thus speeding up the generation. Choosing a = 2l + 1 allows for a faster multiplication by a. Unfortunately, this has undesired effects on the properties of the generator - even though one can still have full period. The "Law of no free lunch" strikes again.

If c = 0, then we lose the full period property.

It appears still possible to obtain period m - 1 - with a careful choice of m and a.

The choice m = 2b appears to have various negatives: either much shorter periods, large gaps in [0, 1], or other bad statistical properties.

If m is chosen to be the largest representable prime, choosing a such that the smallest integer l for which al - 1 is divisible by m is l = m - 1 (a is a primitive element modulom), will give period m - 1. This kind of generator is called a prime modulus multiplicative LCG.

A difficulty with this method is the determination of a: no good recipes seem to exists, and trial and error can be rather time consuming...

Multiple Recursive Generators.

A general pattern for a generator would be provided by a function of the type

Zi = g(Zi-1, Zi-2, …, Zi-q) (mod m), q ≥ 1.

The linear congruence generators are a special case; a few generators are given as follows.

Quadratic Congruential Generator:

g(Zi-1, Zi-2, …, Zi-q) = a Z2i-1 + b Zi-1 + c

Multiple Recursive Generator:

g(Zi-1, Zi-2, …, Zi-q) = a1 Zi-1 + a2 Zi-2 + ... +aq Zi-q

"Fibonacci" Generator: (Fibonacci numbers for q = 2)

g(Zi-1, Zi-2, …, Zi-q) = Zi-1 + Zi-q

Obviously, the choices of the parameters determine the periods and the statistical properties. The first method is related to the mid-square one; the second gives rise to large periods (up to mq - 1); the third has reasonably large periods but unacceptable statistical properties.

A last possibility is to change both the multiplier a and the addend c according to two linear recursive generator streams each time we want a new random variate (Zi). It appears that both large periods and good statistical properties can be obtained: the cost is the extra time and complexity - nothing new here.

Composite Generators.

One can try to take two or more separate generators and combine them in some way to generate the final sequence of random numbers. The hope is that combining two bad generators in some relatively inexpensive way will give a good sequence - longer period and better statistical behavior.

Shuffling: use the first generator to generate [V1, V2, …, Vk]. Use the second generator to generate an integer in {1, 2, …, k}, say J. Now pick VJ. Replace VJ by the next number generated by the first generator, and repeat the process.

Additivity: let {Z1i} and {Z2i} denote the sequences generated by two different generators with different moduli. Let Zi = (Z1i - Z2i) (mod m) for some integer m. Finally set Ui = Zi/m for the (hopefully) uniform variate in [0, 1]. These appear to have long period and good statistical properties, while each of the individual generators does not need large multipliers.

Tweaking MRGs.

J different MRGs, with functions g1, g2, …, gJ, are implemented simultaneously to to form the sequences {Z1,k}, {Z2,k}, …, {ZJ,k}. Let m1 be the modulus from the first one; let d1, d2, …, dk be specified constants, and define

Yj = (d1Z1,j + d2Z2,j +…+ dkZJ,j) (mod m1)

Uj =Yj/m1.

See code - for a portable implementation with period approximately 3.1*1057, and excellent statistical properties

Testing Random Number Generators.

We will limit ourselves to introducing some empirical tests - to be applied to actual sequences produced by a generator. No theoretical discussion will be pursued.

1) Uniformity. This is just a special case of the c2 test with all parameters known.

Divide [0, 1] into k subintervals of equal length and generate U1, U2, …, Un, where k is fairly large (k ≥ 100) and n is large enough so that n/k ≥ 5 (recall that it is recommended that each interval contain at least 5 sample points). Let fj be the number of the Uj's in the j-th subinterval, and compute

which, for large n, will have an approximate chi-square distribution with k - 1 degrees of freedom, under the null hypothesis that the Ui's are IID U(0, 1) random variables.

We reject the null hypothesis at level a, if c2 ≥ c2k-1,1-a wherec2k-1,1-a is the upper 1 -a critical point of the chi-square distribution with k - 1 degrees of freedom.

Example:

2) Serial Test.

This is a generalization of the chi-square test to higher dimensions. Pick a positive integer d and construct d-dimensional vectors U1 = (U1, …, Ud), U2 = (Ud+1, …, U2d), …, Un, points in the d-dimensional hypercube [0, 1]d. Divide [0, 1] into k equal subintervals, and let fj1j2...jd be the number of Uj's having first component in subinterval j1, second in subinterval j2, etc.. Let

which has an approximate chi-square distribution with kd - 1 degrees of freedom. Again, make sure that each small hypercube is likely to contain at least 5 samples.

Why this test? If the individual Ui's are correlated, the distribution of the d-vectors Ui will deviate from d-dimensional uniformity. If d = 2, this means that the pairs (Ui, Ui+1) will tend to cluster along a diagonal - the SW-NE one if positively correlated..

Example:

3) Runs Test. (or Runs-Up test)

This is a test of independence, not of uniformity.

A run-up is a sequence Ui ≤ Ui+1 ≤ Ui+2 ≤ … ≤ Ui+n, of maximal length (i.e., any longer one, from either end, will violate the inequality). Now count the number of runs:

The test statistic is given by the formula

where

and the matrix of the aij is given by:

A derivation, and appropriate references can be found in D. Knuth, The Art of Computer Programming, Vol II, Seminumerical Algorithms.

For n ≥ 4000 (or larger), R will have an approximate chi-square distribution with 6 degrees of freedom, under the null hypothesis that the Ui's are IID random variables.

Example:

4) Correlation Test.

Do the generated Uj's exhibit discernible correlation? Simply compute estimates of the correlation at lags k = 1, 2, …, l.

Recall that the correlation at lag j in a sequence of random variables X1, X2, …, is defined as

We are interested in determining whether we have a uniform distribution, so that Xi = Ui (the samples are - hopefully -uniformly distributed in [0, 1]). From this hypothesis we have E(Ui) = 1/2, Var(Ui) = 1/12. We have Cj = E(Ui Ui+j) - 1/4, C0 = 1/12. Finally,

rj = 12 E(Ui Ui+j) -3.

How do we estimate it? Or, how do we estimate E(Ui Ui+j) ?

If n is large, we can use a subsequence, and we should obtain an acceptable estimate. What is this subsequence? Let

h = floor((n - 1)/j) - 1.

Then:

so that we are just using the sequence U1, U1+j, U1+2j,…

If the Uj's are independent, one can also show that

Finally, under the null hypothesis that rj = 0 - assuming n large - one can show that the test statistic

has an approximate standard normal distribution. This gives us a test for zero correlation with lag j at level a by rejecting the zero correlation hypothesis if |Aj| > z1-a/2.

The test should be carried out at several values of j.

Example:

Some observations on empirical tests.

The primary objection is that they all suffer from testing only what has been generated: what happens after one stops? Is there any guarantee that we did not stop just before our generator "goes bad"?

What does "going bad" mean?

There were a large number of studies about environmental causes for cancers - triggered by geographical clusters of cancers. Unfortunately, clustering is the way we would expect random occurrences to present themselves. The clustering could be noticeable enough to cause a town to start all kinds of studies, while the reality is that there are no unusual factors acting on the appearance of the cancers: statistics caused a panic.

Theoretical Tests: See Knuth, Vol. 2 - and, to some extent, our textbook.

Discussion of the C code.