I've always found probabilities and randomness counterintuitive in many ways, and I'm sure I'm not alone. Getting good quality random data out of computers has always been difficult. Normally computer programmers have easy access to various pseudo-random number generators (or PRNGs), but their quality varies a lot. These days it's possible to download good quality random numbers from the internet, and one of the Bangspace projects I'm interested in involves building a random number source.
A question that comes up, of course, is: how do you tell if a number is random? The short answer is that you can't. If you say “pick a number between one and ten” and I say 3, is that random? Can you tell? Unfortunately you can't. And if that's true, what does it mean to talk about the quality of random numbers?
We can only really tell something about random data when there is a lot of it, and the way to tell is to examine the data and look for patterns. If we find any sort of pattern then the data probably isn't really random. It's more or less the case that the easier it is to spot a pattern, the less random the data is, or at least the lower the quality of the random data. For example, with a PRNG the only thing you need to know to establish every number in a sequence is the starting seed for the generator. Every sequence that starts with the same seed will always contain the same numbers, and PRNGs normally don't have that many possible seeds (32 bits is very common, and whilst four billion sounds like a lot of starting positions, it isn't really for a modern computer).
Let's say we have a megabyte of data. A very simple test is that the number zero should be present, on average, in only one out of every 256 bytes of data* [*The division of the data into bytes is entirely arbitrary and the same test works in the same way if you split the data into 4 bit nibbles, 16 bit words or anything else — the difference is in the expected frequency (1/16 for nibbles and 1/65536 for words, versus 1/256 for bytes).].
Write a program to determine how many zero bytes are in the file
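Here's a minimal sketch of how such a program might look, in Python. The filename random.bin and the 64 KB chunk size are just placeholders; the only real point is that the whole file gets read.

    # Count how many zero bytes appear in a file.
    # "random.bin" is a placeholder name for the megabyte of data.
    def count_zero_bytes(path):
        zeros = 0
        with open(path, "rb") as f:
            while chunk := f.read(65536):      # read the file in 64 KB chunks
                zeros += chunk.count(0)        # count zero-valued bytes in this chunk
        return zeros

    print(count_zero_bytes("random.bin"))

For a megabyte of good random data you'd expect the answer to be somewhere around 4096 (1048576 / 256), though not exactly that.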
As an aside, the program is almost certainly going to do a linear scan of the data. Running in sub-linear time would mean the program finishes in less time than it takes to read all of the data.
Can you think of a way to get your program to run in sub-linear time?
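One possible direction, sketched below rather than offered as the answer: instead of scanning every byte, pick a fixed number of positions at random and estimate the proportion of zeros from the sample. The work done is then proportional to the sample size rather than the file size, at the cost of getting an estimate instead of an exact count.

    import os
    import random

    # Estimate the fraction of zero bytes by sampling random positions.
    # The sample size of 10000 is arbitrary; a bigger sample gives a
    # tighter estimate but takes longer.
    def estimate_zero_fraction(path, samples=10000):
        size = os.path.getsize(path)
        zeros = 0
        with open(path, "rb") as f:
            for _ in range(samples):
                f.seek(random.randrange(size))   # jump to a random byte
                if f.read(1) == b"\x00":
                    zeros += 1
        return zeros / samples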
Of course what is true for zeros in the file should be true for every other number as well.
Change the program so that it checks the frequency of all of the values from zero to 255 for the bytes in the file
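A sketch of the extended version, again treating random.bin as a stand-in for the real data; it returns a list of 256 counts, one per byte value.

    from collections import Counter

    # Count how often each byte value from 0 to 255 appears in the file.
    def byte_frequencies(path):
        counts = Counter()
        with open(path, "rb") as f:
            while chunk := f.read(65536):
                counts.update(chunk)             # iterating bytes yields ints 0..255
        return [counts[value] for value in range(256)]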
Once we have this we can use all of this data together to do a further test on the randomness of the numbers.
If you plot a graph of the numbers against their frequency what should you see?
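If you'd rather look at the real shape than guess, a few lines of plotting will do it. This sketch reuses the byte_frequencies function from above and assumes matplotlib is installed.

    import matplotlib.pyplot as plt

    freqs = byte_frequencies("random.bin")
    plt.bar(range(256), freqs)                   # one bar per byte value
    plt.xlabel("byte value")
    plt.ylabel("frequency")
    plt.show()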
Finally, have a think about doing this in sub-linear time as well.
Can you get this program to run in sub-linear time?