Thursday, September 29, 2011

The Birthday "Paradox" and Guid Collisions

The Birthday Paradox asks the question: how many people do you need in a room before you have a 50/50 chance of at least one pair sharing a birthday? It is a "paradox" only in the sense that the answer is a surprisingly small number to most.

Here's how it works.

Let `p` = "the probability of at least one shared birthday among `n` people".

We assume a year of `T = 365` days and, as usual with probability problems, we recast it by looking for the complement of the answer.

Let `p = 1 - q`, (i.e., `q` is the chance of no shared birthdays at all).

Well, there are `((T),(n))` ways of picking `n` distinct days from `T` and there are `n!` ways of arranging each combination and there are `T^n` possible combinations. Hence,

`q = (n!((T),(n)))/(T^n)`

We expand `q` as follows:

`q = (n!T!) / ((T - n)! n! T^n)`
`qquad = (T!) / ((T - n)! T^n)`
`qquad = ((T - 0) / T) . ((T - 1) / T) . ((T - 2) / T) ... ((T - (n-1)) / T)`
`qquad = (1 - 0/T) . (1 - 1/T) . (1 - 2/T) ... (1 - (n-1)/T)`

When `x` is small, we can use the approximation `e^x ~~ 1 + x`. This give us

`q ~~ e^(-0/T) . e^(-1/T) . e^(-2/T) ... e^(-(n-1)/T)`
`qquad = e^[-(n(n-1))/(2T)]`

And, since we're approximating, `n(n-1) ~~ n^2`, hence

`q ~~ e^(-n^2/(2T))`

Through the magic of logarithms we can infer that

`p = 1 - q`
`qquad ~~ 1 - e^(-n^2/(2T))`

`e^(-n^2/(2T)) ~~ 1 - p`

`-n^2 ~~ 2T ln (1 - p)`

`n^2 ~~ -2T ln (1 - p)`

`n ~~ sqrt[-2T ln (1 - p)]`

Now, for `p ~~ 1/2`, we need `n = 23`.

So, at any given soccer match, there is a 50/50 chance that at least one pair drawn from the players and the referee will share a birthday.


Guid collisions

A guid is, essentially, a 128-bit random number. Guids are used as an easy way of giving statistically (globally!) unique names to things without ever having to look at any of the other things that we have named, or anyone else has named, or will ever be named.

How many things do we have to name before we risk, say, a one in a billion chance of a naming collision?

`n ~~ sqrt(-2^129*ln(1 - 10^-9))`
`qquad ~~ 10^15`

That is, we can assign guids to about a million billion things before we have a one in a billion chance of a collision. Having even a million records in any application is rare. To put this in perspective, a billion seconds is roughly thirty years.

For almost all applications, this means you don't have to worry about guid collisions! [There's a much higher chance of something else being wrong with your program than a guid collision causing you trouble -- indeed, adding code to work around guid collisions is quite likely to increase the likelihood of a bug in your program.]