The lemonade stand game in 2011 had a more complex structure than in past years. An iteration (or repeat) of the competition consisted of choosing a random game and having every permutation of three bots play 100 rounds; with eight teams that is 8 × 7 × 6 = 336 seatings, or 33,600 rounds per iteration. There were 81,000 iterations in all: after an initial run of 1,000 iterations, some of the differences between teams were still not statistically significant, so the competition was left running for about a week.
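As a sanity check on those counts, here is a minimal sketch (ordinary Python, not the tournament harness) that enumerates the seatings; the team names are taken from the results table below.

```python
from itertools import permutations

BOTS = ["Rutgers", "Harvard", "Alberta", "Brown",
        "Pujara", "BMJoe", "Chapman", "GATech"]
ROUNDS_PER_MATCH = 100

seatings = list(permutations(BOTS, 3))           # every ordered triple of distinct bots
print(len(seatings) * ROUNDS_PER_MATCH)          # 336 seatings -> 33600 rounds per iteration

# Rounds in which one particular bot participates during a single iteration.
rutgers_seatings = [s for s in seatings if "Rutgers" in s]
print(len(rutgers_seatings) * ROUNDS_PER_MATCH)  # 126 seatings -> 12600 rounds
```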
Because a single random game is selected for each iteration, the events within an iteration are strongly correlated, so one has to be very careful when comparing random variables whose samples come from such correlated data. Thus, I am measuring the standard error of the average, over iterations, of a team's total utility per iteration. However, I have scaled both the total and the error by a factor of 12,600, the number of rounds in which a player is involved during one iteration, so that the numbers below are in utility per round. On average, there are 144 units of utility available per round, or 48 per player; some games have more or less utility. (A sketch of this calculation appears after the table.)
Place | Team | Score +/- Standard Error |
---|---|---|
First | Rutgers | 50.397+/-0.022 |
Second | Harvard | 48.995+/-0.020 |
Third | Alberta | 48.815+/-0.022 |
Fourth | Brown | 48.760+/-0.023 |
Fifth | Pujara | 47.883+/-0.020 |
Sixth | BMJoe | 47.242+/-0.021 |
Seventh | Chapman | 45.943+/-0.019 |
Eighth | GATech | 45.271+/-0.021 |
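To make the scaling concrete, here is a minimal sketch of how one team's per-round score and standard error could be computed; the `iteration_totals` array is synthetic stand-in data, not the real tournament results.

```python
import numpy as np

# Synthetic stand-in for one team's per-iteration totals: 81,000 entries, each
# the team's total utility over the 12,600 rounds it played in that iteration.
rng = np.random.default_rng(0)
iteration_totals = rng.normal(loc=48.0 * 12_600, scale=5_000.0, size=81_000)

ROUNDS_PER_PLAYER = 12_600  # 126 seatings involving the player x 100 rounds each

# Mean per-iteration total, rescaled to utility per round.
per_round_score = iteration_totals.mean() / ROUNDS_PER_PLAYER

# Standard error of that mean across iterations, rescaled the same way.
per_round_stderr = (iteration_totals.std(ddof=1)
                    / np.sqrt(len(iteration_totals))
                    / ROUNDS_PER_PLAYER)

print(f"{per_round_score:.3f} +/- {per_round_stderr:.3f}")
```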
Because of the correlations, I measured the utility difference per iteration (again scaled by 12,600) between pairs of teams. Since a game can have more or less total utility, and the third player at the table may be better or worse at a given game, taking the difference controls for these factors and gives a tighter analysis. Since there are 81,000 repeats, one can assume that the averages are roughly Gaussian, and so the z-scores tell the story of how accurate the measurements are. To put this plainly, the chance of getting a different ordering by re-running this competition is less than 1/300 for third versus fourth place, and far less than one in a million for every other adjacent pair. Of course, such confidence says nothing about a compilation bug, configuration mistake, or other error that cannot be caught by re-running the experiment. (A sketch of the paired-difference calculation appears after the table.)
Pair | Difference +/- Standard Error | z-score |
---|---|---|
Rutgers-Harvard | 1.402+/-0.007 | 204.855 |
Harvard-Alberta | 0.180+/-0.010 | 18.583 |
Alberta-Brown | 0.055+/-0.017 | 3.230 |
Brown-Pujara | 0.877+/-0.016 | 54.797 |
Pujara-BMJoe | 0.641+/-0.007 | 96.873 |
BMJoe-Chapman | 1.299+/-0.007 | 189.280 |
Chapman-GATech | 0.672+/-0.006 | 106.624 |
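Here is the corresponding sketch for one pair, again with synthetic stand-in data (the offset and noise level are chosen only so the output lands near the Alberta-Brown row).

```python
import numpy as np

# Synthetic per-iteration, per-round scores for two teams, aligned so that
# index i refers to the same iteration -- the same random game and the same
# third opponents -- for both teams.
rng = np.random.default_rng(1)
team_a = rng.normal(loc=48.8, scale=1.0, size=81_000)
team_b = team_a - 0.055 + rng.normal(loc=0.0, scale=5.0, size=81_000)

# Differencing within each iteration cancels iteration-level effects such as
# how much total utility the sampled game happens to offer.
diff = team_a - team_b

mean_diff = diff.mean()
stderr = diff.std(ddof=1) / np.sqrt(len(diff))
z_score = mean_diff / stderr

print(f"{mean_diff:.3f} +/- {stderr:.3f}, z = {z_score:.1f}")
```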
Finally, some high-level discussion of the meaning of the gaps: first of all, while the difference between second and fourth place is statistically significant, and the techniques those teams used were very different, the difference in score is still quite small. More notable are the wider gap between first and second and the larger gaps between the bots below fourth place. The most interesting question raised by these numbers is what distinguishes the Rutgers bot from the rest of the field.