Monday, November 14, 2011

SnippetCompiler

From time to time I need to try something out in C# in a few lines of code. It's always a bit of a pain setting up a new Visual Studio solution for the purpose (I've traditionally used a PlayPen project for this).  I've recently discovered this little gem: SnippetCompiler.  It's essentially a mini-IDE for just this situation.  Smashing!

Thursday, September 29, 2011

The Birthday "Paradox" and Guid Collisions

The Birthday Paradox asks the question: how many people do you need in a room before you have a 50/50 chance of at least one pair sharing a birthday? It is a "paradox" only in the sense that the answer is a surprisingly small number to most.

Here's how it works.

Let `p` = "the probability of at least one shared birthday among `n` people".

We assume a year of `T = 365` days and, as usual with probability problems, we recast it by looking for the complement of the answer.

Let `p = 1 - q`, (i.e., `q` is the chance of no shared birthdays at all).

Well, there are `((T),(n))` ways of picking `n` distinct days from `T` and there are `n!` ways of arranging each combination and there are `T^n` possible combinations. Hence,

`q = (n!((T),(n)))/(T^n)`

We expand `q` as follows:

`q = (n!T!) / ((T - n)! n! T^n)`
`qquad = (T!) / ((T - n)! T^n)`
`qquad = ((T - 0) / T) . ((T - 1) / T) . ((T - 2) / T) ... ((T - (n-1)) / T)`
`qquad = (1 - 0/T) . (1 - 1/T) . (1 - 2/T) ... (1 - (n-1)/T)`

When `x` is small, we can use the approximation `e^x ~~ 1 + x`. This give us

`q ~~ e^(-0/T) . e^(-1/T) . e^(-2/T) ... e^(-(n-1)/T)`
`qquad = e^[-(n(n-1))/(2T)]`

And, since we're approximating, `n(n-1) ~~ n^2`, hence

`q ~~ e^(-n^2/(2T))`

Through the magic of logarithms we can infer that

`p = 1 - q`
`qquad ~~ 1 - e^(-n^2/(2T))`

`e^(-n^2/(2T)) ~~ 1 - p`

`-n^2 ~~ 2T ln (1 - p)`

`n^2 ~~ -2T ln (1 - p)`

`n ~~ sqrt[-2T ln (1 - p)]`

Now, for `p ~~ 1/2`, we need `n = 23`.

So, at any given soccer match, there is a 50/50 chance that at least one pair drawn from the players and the referee will share a birthday.


Guid collisions

A guid is, essentially, a 128-bit random number. Guids are used as an easy way of giving statistically (globally!) unique names to things without ever having to look at any of the other things that we have named, or anyone else has named, or will ever be named.

How many things do we have to name before we risk, say, a one in a billion chance of a naming collision?

`n ~~ sqrt(-2^129*ln(1 - 10^-9))`
`qquad ~~ 10^15`

That is, we can assign guids to about a million billion things before we have a one in a billion chance of a collision. Having even a million records in any application is rare. To put this in perspective, a billion seconds is roughly thirty years.

For almost all applications, this means you don't have to worry about guid collisions! [There's a much higher chance of something else being wrong with your program than a guid collision causing you trouble -- indeed, adding code to work around guid collisions is quite likely to increase the likelihood of a bug in your program.]

Tuesday, July 19, 2011

Robustness of the thermometer record

This is an interesting post up at The Blackboard: it turns out that taking random 10% samples of the weather stations and computing global means gives you a very consistent picture (also, check out Steven Mosher's comment).  This seems to align with Anthony Watts' Surfacestations project, which found that (somewhat to my surprise, at least) the poor siting of the majority of US weather stations doesn't actually affect the numbers very much.

These are both instances of good, basic fact checking carried out and reported honestly by eeevil sceptics.

Tuesday, July 05, 2011

Climate Model Accuracy on Absolute Temperature

It has always bugged me that the output of climate models are reported as trends rather than as absolutes.  Lucia Liljegren at The Blackboard putClimate model absolute temperatures together some time ago, which I think is illuminating.  The climate models are way off base, it would seem.

Tuesday, May 03, 2011

Adding Files Etc. as "Resources" to Visual Studio Projects

I recently wrote a unit test that needed some data that was too large to fit into a string literal.  To fix the problem, I added the file containing the string data as a resource to my project:

  • In Solution Explorer, right click on the project Properties;
  • Click on Open;
  • Click on the Resources tab;
  • Click on Add Resource and follow the dialogue for adding a file.
In the source code I can refer to the string contents of my resource file simply as

  • MyNameSpace.Properties.Resources.MyResourceName
Of course, this scheme also works for data needed in other situations (e.g., logo images), not just unit tests.

Friday, April 15, 2011

Regular Expressions and State Machines

I recently found out that some of my colleagues did not know (much) about regular expressions or state machines.  These are amazingly useful concepts that I use on a daily basis, so I thought I'd try to set down everything you need to know about these subjects in one page (or thereabouts).

Regular Expressions for Searching

The most common use of regular expressions is for searching through text files.  For example, if I wanted to search my source code for all class definitions that include the word 'Collection' in their class names, I would search on the following regular expression: 'class .*Collection'.  This is read as "the word class, followed by a space, followed by zero or more of (*) any character (.), followed by the word Collection".  That is, most characters are interpreted literally in a regular expression, but some are treated specially (e.g., '.' and '*').

Here's the typical core grammar of regular expressions (I use a, b, c to denote non-special characters and x, y, z to denote arbitrary regular expressions):
  • a -- the regular expression matching the (non-special) character 'a';
  • xy -- the regular expression matching x followed by y;
  • (x) -- the regular expression matching x (it is sometimes useful to treat a complex regular expression as a single item);
  • x|y -- the regular expression matching x or y;
  • x* -- the regular expression matching zero or more instances of x;
  • nil -- the regular expression matching the empty string (which in practice is just written as the empty string, rather than 'nil').
So, for example, the regular expression '(aa|bb)*c' matches any sequence of aa or bb pairs, followed by a c.  That is, the language recognised by '(aa|bb)*c' consists of the strings {c, aac, bbc, aaaac, aabbc, bbaac, bbbbc, ...}.

In practice, the language of regular expressions is extended with plenty of syntactic sugar.  Here are some standard extensions:
  • x+ -- the regular expression matching one or more instances of x (i.e., x+ = xx*);
  • x? -- the regular expression matching zero or one instance of x (i.e., x? = (x|nil);
  • [abc] -- the regular expression matching any of the characters a, b, or c;
  • [a-f] -- the regualar expression matching any of the characters in the range a..f;
  • [^abc] -- the regular expression matching any character that is not a, b, or c;
  • [^a-f] -- the regular expression matching any character that is not in the range a..f;
  • ^ -- the regular expression matching the start of a line of text;
  • \$ -- the regular expression matching the end of a line of text;
  • \a -- the regular expression matching the escape sequence a (e.g., \* matches a literal *, \t matches a tab character, \( matches an open parenthesis, etc.)
There are many more (see, e.g., http://msdn.microsoft.com/en-us/library/az24scfc(v=VS.100).aspx for the .NET 4 regular expression language).

Learning these is easy and really, really useful!

State Machines

Imagine a machine with a set of lights.  Each light is called a state and the machine is called a state machine.  When the machine is started, it always starts with a particular light on (all other lights are off); this is the start state.

The machine takes as input a sequence of symbols of some type (e.g., characters, GUI events, messages, whatever you like.).  As each symbol is read by the machine, which light is shining changes depending on which light was shining just before the new symbol was read and on what particular symbol was just read in.  For example, if the machine is in state 3, say (i.e., light number three is on) and the next symbol read in is a 'q', then light 3 might turn off and light 7 might come one (i.e., reading a 'q' when in state 3 causes the machine to change to state 7).  The machine has a transition rule like this for every possible input for every possible state.

Each of the lights on the machine is either red or green.  The green lights denote accepting states and the red lights denote non-accepting states.  If the input sequence finishes with the machine showing a green light, then the input sequence was accepted by the state machine.  Otherwise the input sequence was not accepted.

The language accepted by the state machine is the set of input sequences that leave it in an accepting state.  In fact, every regular expression has a corresponding state machine and vice versa -- they are two ways of describing the same thing!

For example, here's a state machine to match our earlier pattern (aa|bb)*c:

State Input Next state
s1 [start] a s2
b s3
c s4
(...) s5
s2 a s1
(...) s5
s3 b s1
(...) s5
s4 [accepting] (...) s5
s5 [error] (...) s5

Observe the inclusion of an error state (s5) which, like the Hotel California, you can never leave. This state just ensures that the state machine never enters an accepting state once the input has differed from the pattern we are trying to match.

[Aside: the fancy name for what I've just described is a deterministic finite automaton or DFA.]

Deterministic and Non-deterministic State Machines

The state machine described above is deterministic, meaning it is only ever in one state and each state transition is defined entirely by the current state and the next symbol to be read.  This is a very useful notion for designing programs that have to process sequences.

One can also imagine a nondeterministic state machine.  Such a machine would works like this: at any point, any collection of lights can be on (i.e., the machine could be in any one of these states given the input so far). The transition rules are relaxed so that each state can have multiple possible successors for each input, plus have the option of having successor states that can be reached without consuming any input at all.  For example, the transition rules for state 3 might say "move to state 7 after reading a 'q' or move to state 8 after reading a 'q' or move to state 9 without reading anything".  So, for our non-deterministic state machine, if light 3 is one of the lights that is currently on, then light 9 must also be on (because it is a "no input" successor to 3); also, if a 'q' is read next, then both lights 7 and 8 will come on.

Why is this idea useful?  Well, here's the magic: you can always turn a non-deterministic state machine into a deterministic state machine (how this magic happens is a little beyond the scope of this article, but the starting point is to imagine every possible subset of lights on the non-deterministic state machine as corresponding to a single state on the equivalent deterministic state machine).

It's easy to see how to construct a deterministic state machine that handles simple sequences of symbols or repetition of simple sequences, the tricky bit is handling alternation (i.e., regular expressions of the form x|y).  The easiest way of constructing a state machine to match x|y is to construct a non-deterministic state machine from the two state machines for recognising x and y (hopefully you can imagine how this might be done without too much difficulty), then we can turn that into a deterministic state machine.

In practice, textual regular expressions are typically just compiled into representations of their non-deterministic state machines and the matching algorithm simulates all possible state transitions at each point.  This is actually quite effective, since the cost of searching a string of m characters with an n state machine is just O(mn) (and typically much less).

What Can't Regular Expressions Match?

Roughly speaking, regular expressions can't match any pattern that requires counting to arbitrarily large numbers.  For example, you can't write a regular expression that checks whether the parentheses ( ) in a program are properly balanced, because a program can contain any number of parentheses.  

[What follows is for extra credit.]

There is something called the pumping lemma, which says that for any state machine with n states that accepts some input of length n or more symbols, then that input can be written somehow as xyz (where y is a non-empty sequence and the length of xy is at most n) and the machine must also accept all patterns of the form xy*z (i.e., the machine had to go through at least one loop while recognising the y part in xyz).  

How does this relate to counting patterns?  Well, let's say we did claim to have a state machine with 4 states that matched all and only programs with properly balanced parentheses.  Now consider the balanced input sequence '((((a))))', which our state machine duly accepts.  The pumping lemma tells us that this must be decomposable as some xyz where y is non-empty and the length of xy is at most 4.  Well, we only have four possibilities for x and y here -- it doesn't matter which one we end up with.  Let's say the answer happens to be x = '(((' and y = '(' -- leaving z = 'a))))'.  The pumping lemma says that our machine must accept all strings of the form xy*z, which clearly includes just xz, namely '(((a))))', which is clearly not balanced!  Therefore our claim must be wrong: our 7 state machine does not match all and only well balanced programs.  Since there is nothing in our argument that really depends on the 4 states starting point, the same thing must be true for any number of states that a parenthesis-matching state machine might have.  Therefore, there is no regular expression (or state machine) that matches all and only well balanced programs.  If you don't believe me, I encourage you to have a go at designing such a state machine!

I find all this stuff really rather beautiful.

What About Counting Support in Modern Regular Expressions?


Modern textual regular expressions usually include some support for counting, such as x{40,70} meaning "between 40 and 70 instances of the pattern x".  This functionality comes at a cost: since the size of the equivalent ordinary regular expressions for these "counting" patterns quickly grows out of hand, they are implemented using backtracking pattern matchers.  These are exponentially slower than non-deterministic state machines and it is easy to construct pathological search patterns that will cause these matching algorithms to take forever.

Further Reading


Here is a brilliant paper on implementing regular expression matchers.

Thursday, March 24, 2011

Bayes' Theorem

I recently read this beautiful explanation of Bayes’ theorem.  I’d always thought it was a statement of philosophy, but it isn’t: it comes from plain old probabilities.

The formula for the conditional probability of A being true given that B is true is

P(A | B) = P(A & B) / P(B)

That is, the proportion of things that are A that are in B is equal to the fraction of the proportion of things that are A and B over the proportion of things that are B (I like to think of these things in terms of Venn diagrams).

We can rearrange the above to get

P(A & B) = P(A | B).P(B)

Now for Bayes’ theorem: let’s write H for our hypothesis and E for our evidence.  We want to know how seeing the evidence E affects the probability of our hypothesis H being true.  From the first rule above we have

P(H | E) = P(H & E) / P(E)

Now we can apply the second rule to P(H & E) to get

P(H | E) = P(E | H).P(H) / P(E)

Et voila!  No magic at all.

As an aside, the raven paradox has convinced me that Bayesianism is philosophically superior to frequentism.

Monday, January 10, 2011

Debugging WCF

I'm relatively new to the world of Microsoft foundation libraries.  For the most part I am impressed, but I cannot understand the idea of development libraries doing their best to hide errors from you unless you first speak the magic rites (I would have thought there ought to be magic rites to cause error hiding, not the other way around).  Anyway, WCF -- Windows Communications Foundation -- is a pretty good RPC system which has this property.  To turn on tracing so you can debug your application, follow this advice:
After some Live Searching (i was actually googling, but let’s make Microsoft think at least someone outside of Redmond uses Live Search) i discovered that you can enable WCF tracing. Bingo! Why didn’t Juval Lowy’s book mention this? It’s supposed to be the WCF Bible…. Oh well, thanks to Google, i mean Live Search, we now know how to enable WCF’s tracing:

  <system.diagnostics>
    <trace autoflush="true" />
    <sources>
      <source name="System.ServiceModel"
              switchValue="Information, ActivityTracing"
              propagateActivity="true">
        <listeners>
          <add name="wcfTraceListener" type="System.Diagnostics.XmlWriterTraceListener"initializeData="WcfTrace.svclog" />
        </listeners>
      </source>
    </sources>
  </system.diagnostics>

You can also use the Service Configuration Editor tool which is available in the Windows SDK, but spare yourself the pain of that tool and just copy/paste this xml in your config file.
Now run the service again, and do whatever it was that triggered the weird client-side exception. After the exception occurred, open the WcfTrace.svclog file with either an editor (it’s not very readable though) or with the Microsoft Service Trace Viewer tool (which is not too bad actually).