LSG 2011 Results

The lemonade stand game in 2011 had a more complex structure than in past years. An iteration (or repeat) of the competition consisted of choosing a random game and having every permutation of three bots play 100 rounds of it, for 33,600 rounds per iteration (8 × 7 × 6 = 336 permutations, times 100 rounds). There were 81,000 iterations in all: after an initial run of 1,000 iterations, the differences were found to be statistically insignificant, so the competition was run for about a week.
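
For concreteness, one iteration of that structure might look roughly like the sketch below; play_match and games are hypothetical stand-ins for the actual game engine and the pool of random games, not the real competition harness.

    import itertools
    import random

    TEAMS = ["Rutgers", "Harvard", "Alberta", "Brown",
             "Pujara", "BMJoe", "Chapman", "GATech"]
    ROUNDS_PER_MATCH = 100

    def run_iteration(play_match, games):
        """One iteration: draw a random game, then have every ordered
        triple of bots play 100 rounds of it."""
        game = random.choice(games)          # one random game per iteration
        totals = {team: 0.0 for team in TEAMS}
        # 8 * 7 * 6 = 336 permutations, times 100 rounds = 33,600 rounds
        for triple in itertools.permutations(TEAMS, 3):
            utilities = play_match(game, triple, ROUNDS_PER_MATCH)
            for team, utility in zip(triple, utilities):
                totals[team] += utility
        return totals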

Because a single random game is selected for each iteration, the events within an iteration are strongly correlated, so one has to be very careful when comparing random variables drawn from such correlated samples. Thus, I am measuring the standard error, across iterations, of each team's average total utility per iteration. However, I have divided both the total and the error by 12,600, the number of rounds in which a player is involved during an iteration (126 permutations times 100 rounds), so that the numbers below are in utility per round. On average, there is 144 utility available per round, or 48 per player; some games have more or less utility.

Place     Team      Score     +/- Standard Error
First     Rutgers   50.397    +/- 0.022
Second    Harvard   48.995    +/- 0.020
Third     Alberta   48.815    +/- 0.022
Fourth    Brown     48.760    +/- 0.023
Fifth     Pujara    47.883    +/- 0.020
Sixth     BMJoe     47.242    +/- 0.021
Seventh   Chapman   45.943    +/- 0.019
Eighth    GATech    45.271    +/- 0.021
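
As a rough illustration of the scaling described above (assuming a team's per-iteration totals are available as a plain list; this is not the actual analysis code), the per-round score and its standard error could be computed as:

    import math

    # Each team appears in 3 * 7 * 6 = 126 of the 336 permutations,
    # so it plays 126 * 100 = 12,600 rounds per iteration.
    ROUNDS_PER_TEAM_PER_ITERATION = 126 * 100

    def per_round_stats(iteration_totals):
        """iteration_totals: one total-utility value per iteration for a team.
        Returns (mean utility per round, standard error of that mean)."""
        scaled = [t / ROUNDS_PER_TEAM_PER_ITERATION for t in iteration_totals]
        n = len(scaled)
        mean = sum(scaled) / n
        variance = sum((x - mean) ** 2 for x in scaled) / (n - 1)
        return mean, math.sqrt(variance / n)   # standard error of the mean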

Because of the correlations, I measured the per-iteration utility difference (again divided by 12,600) between pairs of teams. Because a game can have more or less utility available, and a third party may be better or worse at a particular game, this controls for those factors and gives a tighter analysis. Since there are 81,000 repeats, one can assume that the averages are roughly Gaussian, so the z-scores tell the story of how reliable the measurements are. To put this plainly, the chance of getting a different result by re-running this competition is less than 1/300 for third versus fourth place, and far less than one in a million for the other adjacent pairs. Of course, such confidence says nothing about a compilation bug, configuration mistake, or other error that would not be caught by re-running the experiment.

Pair              Score    +/- Standard Error   z-score
Rutgers-Harvard   1.402    +/- 0.007            204.855
Harvard-Alberta   0.180    +/- 0.010            18.583
Alberta-Brown     0.055    +/- 0.017            3.230
Brown-Pujara      0.877    +/- 0.016            54.797
Pujara-BMJoe      0.641    +/- 0.007            96.873
BMJoe-Chapman     1.299    +/- 0.007            189.280
Chapman-GATech    0.672    +/- 0.006            106.624
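
A minimal sketch of that paired comparison, assuming the per-iteration per-round scores for two teams are aligned by iteration (again, illustrative rather than the actual analysis code):

    import math

    def paired_difference_z(scores_a, scores_b):
        """scores_a, scores_b: per-round utility per iteration for two teams,
        taken from the same iterations so game and opponent effects cancel."""
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        n = len(diffs)
        mean = sum(diffs) / n
        variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
        std_err = math.sqrt(variance / n)
        return mean, std_err, mean / std_err   # (difference, std error, z)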

Finally, some high-level discussion of the meaning of the gaps: first of all, while the differences between second and fourth place are statistically significant, and the techniques used were very different, the gaps themselves are quite small. More significant are the wider gap between first and second place, and the larger gaps among the bots below fourth. The most interesting question raised by these numbers is what distinguishes the Rutgers bot from the rest of the field.