## Performance and strength metrics in Elise
Although Elise is an extremely strong Scrabble® player, I am still interested in improving its play further. The simplest way to improve Elise's playing strength is to improve its static move evaluator -- that is, the move-evaluation algorithms used by Elise's 'quick move'. It is easy to test changes to this code and determine exactly how much improvement (if any) they make in Elise's 'quick move' play, because you can play many thousands of 'quick move' games in minutes; if the changes make any difference in the quality of play, it will become apparent quickly.
But there are other, trickier ways to improve Elise's strength -- changes in the simulation code, for example, or clock management
changes, or improvements in Elise's tile inferences. These sorts of changes are not easily tested in the same way, since
they require many simulation games to be played, and each game might take 20 or 30 minutes to complete.
With the newest version of Elise (0.1.8), there is now the ability to run strength tests on your computer. These tests play games of Scrabble® using experimental new features in Elise's move evaluation and tile inference code. The results are sent back to this website (the GCG files are copied into this folder for all to see and enjoy). This feature allows me to split the testing work among many computers and accumulate simulation games much faster.

These tests will let me determine which of these features improve play strength, and roughly by how much. Just as importantly, this data can help me measure the trade-off between better evaluations and more time spent in the move evaluation code. Some changes might improve the quality of move evaluation significantly, but at the same time slow the evaluator down to the point where the feature loses to simply increasing the simulation ply. (We can always boost strength by increasing simulation ply, and if that performs better than an experimental feature, then the feature doesn't actually help.)

But there are many random factors in Scrabble®, and in comparing the performance of two algorithms it is important to be able to control for these random factors somehow. In a small sample of games, the random element of tile draws can dominate any skill difference between two players. Since, in these new tests, the two players are both just Elise with different settings, the difference in play strength is likely to be (relatively) small. I could always wait until I have many thousands of games before evaluating the experimental features, of course, but even then there may be too much noise to draw reliable conclusions unless we first have some ways to control for the random elements in Scrabble® play. I have developed several such methods, and I'd like to explain them briefly on this page.
To illustrate the application of these methods, we'll use a dataset I have available right now: the testing games Elise has played against Quackle at various stages in its development over the past few months. In most of these games, Elise was set at either a fixed 3-ply simulation or a 10-minute clock game (on a several-year-old Dell Optiplex 755 with 2 GB RAM and a quad-core Intel Core 2 Q6600 clocked at 2.4 GHz). There are 302 games total in this dataset, all English-language: 142 are TWL06, 106 are CSW12, and 54 are TWL98.

Since both Elise and Quackle have perfect word knowledge and board vision -- neither player will miss any legal move -- we can rule out any disparities in these factors. This is one way in which examining computer v. computer games makes this sort of analysis easier. If we were examining games between human players, both of these factors would be possible skill elements differing among players, and they would surely be difficult to isolate without many, many games. Among human players, I suspect that differences in word knowledge and board vision are actually larger than other skill-based differences; among computer players, there is no difference in these factors at all. You can see the results of these 302 games here. GCG files for most of these games are available in the game collection.
There is a measure implemented in Elise for estimating tile luck, but it is computationally expensive, so I intend to use it only sparingly in examining the strength test results -- only when it's necessary. This is the "rate current rack draw" feature in Elise, available from the "Rack" menu. It quantifies rack draw luck by starting with the player's rack leave from their previous move (or an empty rack at the beginning of the game), filling it with tiles drawn from the pool of unseen tiles, determining the best move on the board, and repeating that thousands of times. By comparing the player's actual draw against that distribution of simulated draws, it estimates how lucky the draw was. Since this takes some time to evaluate, it's best reserved for when you can't get a reasonable answer using any of the several simpler methods I'll explain here. It's available if the data is "too close to call", but it's not the first tool to take out of the toolbox.

There is an extremely simple measure that partially controls for tile luck by adjusting for its most important element, blank draws. Simply break the player's record into three categories: their record when holding no blanks, their record when holding exactly one blank, and their record with both blanks. If a player's record is winning when blanks are even, that player is probably superior to his or her opponents. If a player wins more games with zero blanks than they lose when drawing both blanks, that is also an indication that they are probably a better player than their opposition. Let's try this with the 302 testing games in our example dataset:
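A tally like this can be computed with a few lines of code. Here is a minimal sketch, assuming each game has already been reduced to a (blanks drawn, result) pair extracted from the GCG files; the function names and sample data are mine, for illustration only, not the real 302-game set.

```python
# Blank-record tally: result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

def blank_record(games):
    """Tally [wins, losses, draws] grouped by number of blanks drawn (0, 1, 2)."""
    record = {0: [0, 0, 0], 1: [0, 0, 0], 2: [0, 0, 0]}
    for blanks, result in games:
        if result == 1.0:
            record[blanks][0] += 1
        elif result == 0.0:
            record[blanks][1] += 1
        else:
            record[blanks][2] += 1
    return record

def win_pct(wins, losses, draws):
    """Winning percentage, counting draws as half a win."""
    total = wins + losses + draws
    return 100.0 * (wins + 0.5 * draws) / total if total else 0.0

# Illustrative sample, not real data:
games = [(0, 0.0), (0, 1.0), (1, 1.0), (1, 0.5), (2, 1.0), (2, 1.0)]
for blanks, (w, l, d) in sorted(blank_record(games).items()):
    print(f"{blanks} blanks: {w}-{l}-{d} ({win_pct(w, l, d):.0f}%)")
```

With real GCG data, the interesting comparison is the middle row (blanks even) against the player's overall record.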
Elise's record, with exactly one blank, is almost the same as its overall record: just over 61% winning versus Quackle. (Draws are counted as half a win in the win percentages above.) Elise wins 30% of the games in which it draws no blanks.
This is a simple way to roughly control for tile luck, but it is only useful if there is a fairly large difference in strength between the players.

Another metric, also easy to calculate, is to look at the average value of each player's racks. This value can be estimated using Elise's rack statistics: for each rack, Elise can produce a good estimate of its value relative to the overall average rack. Let's call this metric the "average relative rack value", or ARRV.
Example of ARRV calculation

The ARRV is mostly random from game to game. Sometimes a player will get both blanks, the Z and X, and three esses, and have an ARRV of 10. Sometimes a player will repeatedly draw six vowels and end up with an ARRV of -8. In most games the ARRV will be somewhere between -3 and 5.
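Concretely, a player's ARRV for a game is just the average of Elise's per-rack value estimates over that game's racks. A minimal sketch, with made-up per-turn rack values (the real values come from Elise's rack statistics):

```python
# ARRV: average of per-rack values relative to the overall average rack.
# The rack values below are invented for illustration.

def arrv(relative_rack_values):
    """Average relative rack value over one player's racks in one game."""
    return sum(relative_rack_values) / len(relative_rack_values)

# Hypothetical per-turn relative rack values for one player in one game:
rack_values = [4.5, -2.1, 0.8, 12.3, -1.5, 3.0, -0.7, 1.9, 6.2, -2.4, 0.5]
print(f"ARRV: {arrv(rack_values):.2f}")  # a decent-tiles game, ARRV around 2
```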
Although ARRV is mostly random, there is some skill component involved: most expert players (and all competent computer players) will have an average ARRV, over
many games, greater than zero (in fact, for most expert players the long-term average will be somewhere around 2). This is because they take care to give themselves good tile leaves -- they do not turn over good tiles unless they have an excellent play. They are also quick to exchange when their racks are poor.

No matter how good a player you are, though, over hundreds of games (in English-language play) it is impossible to have an average ARRV greater than approximately 3 (assuming, of course, you are playing to win, and not fishing for better tiles almost every turn or passing with a fantastic rack). This is because the random component of ARRV is much greater than any skill component. You can't consistently have tile luck that good over hundreds of games (if you're actually playing the game): in about 25% of games you aren't going to get any blanks, and individual blank-less racks with relative values greater than 20 are rare. Blank-less games will often have an ARRV well below zero. On the other side of the scale, if you're an expert player, you are also highly unlikely to have an average ARRV below approximately 1.2 over any large number of games. (Novice players might, by poor rack management, sustain lower averages than that.)
ARRV, by itself, is not a measure of play strength, even over large numbers of games. A player who fishes for tiles too often will have a large average ARRV. A player with poor word knowledge may miss plays with good racks, and have a large average ARRV. An extremely skillful player may get more out of bad or mediocre tiles than other players, and have higher turnover with good tiles by playing lots of bingos, and accordingly have a relatively small average ARRV. And given the primarily-random nature of ARRV from game to game, any smallish differences in average ARRV are more likely to be attributable to chance than any sort of skill.
But ARRV, interpreted the right way, can point to a strength difference. Among players of comparable skill, the ARRV gives you information about the result of the game: given two players of equal skill, the player with the higher ARRV will win most of the time.

To illustrate this relationship, I had Elise play 1,000 games against itself. Both Elise players used the 'hi-leave' algorithms to determine their moves, so both players were equally strong. This table illustrates how the difference in ARRV corresponds with win probability and point spread:
You might notice there is a slight asymmetry here, as player 1 wins about 80% of the games in which it is ahead in ARRV, while
player 2 wins about 70% of the games in which it is ahead in ARRV. This is simply because "player 1" is the player
who always plays first, and playing first is an advantage in Scrabble®.

In any case, if you know which player has the higher ARRV, and you know the players are about equally strong, you can name the winner of the game and be right about three times in four. If you know that one player had a substantial advantage in ARRV, you can be even more confident about the outcome of the game.
If we have a set of games of interest that is sufficiently large, then we can compare performance in those games to what we would expect
based on differences in ARRV.
If one player wins most of the games in which it is at an ARRV disadvantage, that player is almost certainly the stronger one.

In version 0.1.8, Elise will show you the ARRV for each player via the "Game statistics" option in the "Game" menu. Can we apply ARRV to our Elise v. Quackle dataset? Sure we can. In that set, there are 157 games in which Elise had a lower ARRV than Quackle. In those 157 games, Elise played first 82 times and played second 75 times. In games where Quackle played first and was ahead on ARRV, we'd expect it to win about 80% of the time; in such games where Elise played first, we'd expect Quackle to win about 70% of the time. In other words, in this set of 157 games, we'd expect Quackle to win approximately 117.4 games -- stated another way, Elise's record in this group would be expected to be somewhere around 40-117, assuming no strength difference between the two players.

Elise's actual record in this set was 83-74, 53% winning. This is quite a bit worse than its overall record against Quackle, but Elise is working against bad tile luck here. Even in a sample of 157 games we can safely reject the hypothesis that Elise wins only ~25.5% of the time. That strongly suggests, like the simpler 'who has the blanks?' metric above, that there is a strength difference between the two engines.
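The arithmetic above, and the claim that 83 wins is safely inconsistent with a ~25.5% win rate, are easy to verify with the standard library. This sketch uses only the numbers quoted in the text; the exact binomial tail sum is my own check, not something Elise computes.

```python
from math import comb

# 157 games where Elise trailed in ARRV: 75 with Quackle playing first
# (expected Quackle win rate ~80%), 82 with Elise first (~70%).
expected_quackle_wins = 75 * 0.80 + 82 * 0.70
p = (157 - expected_quackle_wins) / 157  # Elise's expected win rate (~0.252)
print(f"Expected Quackle wins: {expected_quackle_wins:.1f}")

# Exact one-sided binomial tail: P(Elise wins >= 83 of 157 | win rate p).
# If this is vanishingly small, the "no strength difference" hypothesis fails.
p_value = sum(comb(157, k) * p**k * (1 - p)**(157 - k) for k in range(83, 158))
print(f"P(83+ wins at p = {p:.3f}): {p_value:.1e}")
```

The tail probability comes out many orders of magnitude below any conventional significance threshold, which is why even this modest 157-game subset is decisive.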
Applying the ARRV to games played with experimental Elise features will be more fruitful, in most cases, than the simple blank comparison, as it adjusts for much more tile luck than the blank draws alone. It also gives us the ability to look at games with extreme differences in tile luck between the players.
Another nice feature of the ARRV measure is that, if we apply it to a set of games, it helps us to identify many of the best-played and worst-played games. For example, this game
against Quackle is a fine effort by Elise according to ARRV: in this game, Elise's ARRV was -3.42 (quite poor), while Quackle's ARRV was 14.25 (enormous). Quackle also played first. Given those facts alone, you would expect Quackle to win comfortably, and by a wide average spread. The actual outcome of the game was an Elise victory by 19 points. If you examine the game and the viable alternative plays move by move, an important factor was Elise's ability to infer the tiles on its opponent's rack: Elise knew Quackle's racks were good (or excellent), and played a tight defensive game. Quackle, for much of the game, had better tiles than Elise did, but nowhere to make the plays; or, on a few turns, the only place for its bingo was a spot that also opened up a big scoring reply for Elise.
Now, sometimes you have great tiles but nowhere to play a bingo. Sometimes you might have several racks in a row that are excellent in the abstract but poorly suited to the board; ARRV cannot account for that sort of 'board luck'.
But there is something more important than being able to determine the "best" games -- the ability of
the ARRV measure to distinguish games that are unusually poorly played is potentially even more useful. For example, if we can
identify the set of games Elise loses despite an advantage in ARRV, those games are likely to contain serious mistakes worth examining.

Being able to identify such instances is important for any future testing of Elise -- especially testing of differences due to experimental features that are likely to be much subtler than the strength difference between Elise and Quackle.

Here is another simple, independent way to compare play strength, one which also controls for tile draw luck. Elise's win probability estimates for each (simulated) move are output in the GCG files it creates. When Elise plays games against itself, these estimates are very accurate: if you look at the set of all games in which Elise, as player 1, at some point during the game estimated that it would win 60% of the time, you will find that player 1 in fact wins very close to 60% of the games in that set. If you find all of the games where Elise, as player 2, estimated that it would win 82% of the time, you'd find player 2 wins around 82% of the games in that set. There might be a couple of percentage points of difference (though the more games you play, the closer these estimates get), but on the whole these win probability estimates are accurate. That should not come as a surprise -- the win probability tables were created by having Elise play millions of games against itself.
If you have a large set of games with win probability estimates given for a player's moves, and that player wins more of those games than the estimates predict, that player is probably stronger than the player the estimates were calibrated on.

Put another way: a player better than Elise will win more games that Elise expects to win, than it will lose games that Elise expects to lose. A player worse than Elise will lose more games that Elise expects to lose, than it will win games that Elise expects to win. To figure out whether our opponent is better or worse than us, all we really have to do is compare our estimates of win probability to the actual results; that should give a quite accurate assessment. One advantage of this measure is that it controls for luck without making any assumptions about "average" tile or rack values.
The simplest way to examine this would be to regress the outcomes of the games against the win probabilities reported in the games. If Elise is playing against itself using the same settings (i.e. equal player strength), we should get the identity, GameOutcome = WinProbEstimate, or something very close to it, as the best-fit line. So we can compare the best-fit line to this. If Elise is playing an inferior opponent, we expect the best-fit line to lie above GameOutcome = WinProbEstimate over the interval [0, 100] (that is, we won more games than we expected to win). If Elise is playing a superior opponent, we expect the best-fit line to lie below GameOutcome = WinProbEstimate over [0, 100] (we lost more games than we expected to lose). If the intercept of the best-fit line is very close to zero, we could just compare the slope of the best-fit line to 1, but we can't expect the regression to fit that well unless we're looking at opponents that are very close in strength.

There are other ways we could look at this data. If we have a sufficient number of games, we can divide the estimates into 'buckets' (say, all the games containing Elise win probabilities of 70-71% in bucket A, 71-72% in bucket B, etc.), and then compare the estimated win probabilities in each bucket with the actual winning percentage. This accomplishes basically the same thing as a simple regression, but may substantially reduce noise by collapsing the many estimates around 50% into only one or two data points (almost every game, at the start, will have a win probability somewhere in the neighborhood of 50%). Then again, the win probabilities become more accurate as the game progresses... so couldn't we also divide the win probabilities according to when in the game they were produced?
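The bucket comparison itself is easy to sketch. This assumes each data point is an (estimated win probability, game outcome) pair for one player, with outcome 1.0 for a win, 0.5 for a draw, 0.0 for a loss; the function name, bucket width, and sample points are mine, purely for illustration.

```python
from collections import defaultdict

def bucket_calibration(points, width=10):
    """Group win-probability estimates into buckets `width` points wide and
    compare each bucket's mean estimate with the actual winning percentage.
    Returns {(lo, hi): (mean_estimate, actual_win_pct, n_points)}."""
    buckets = defaultdict(list)
    for estimate, outcome in points:
        lo = int(estimate // width) * width
        buckets[lo].append((estimate, outcome))
    report = {}
    for lo in sorted(buckets):
        pairs = buckets[lo]
        mean_est = sum(e for e, _ in pairs) / len(pairs)
        actual = 100.0 * sum(o for _, o in pairs) / len(pairs)
        report[(lo, lo + width)] = (mean_est, actual, len(pairs))
    return report

# Illustrative sample, not the real estimates:
points = [(52.0, 1.0), (55.0, 0.0), (58.0, 1.0), (71.0, 1.0), (74.0, 1.0),
          (77.0, 0.5), (83.0, 1.0), (88.0, 1.0)]
for (lo, hi), (est, act, n) in bucket_calibration(points).items():
    print(f"{lo:3d}-{hi:3d}%: mean estimate {est:.1f}%, actual {act:.1f}% (n={n})")
```

If the actual winning percentage consistently sits above the mean estimate across buckets, the player is outperforming the model's expectations, i.e. facing a weaker opponent.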
Eliminating early game win probabilities may eliminate a considerable amount of noise and increase the fit of the regression line (or however we decide to examine the data) but then we may also lose sight of potential misevaluations by either player early in the game. That would not be desirable. Our 302-game set is too small to divide very finely, especially since not all of the GCG files have Elise's win probability estimates, but we can still examine the win probabilities in the games and compare them to the actual results. There are a total of 256 games in this dataset with win probability estimates, and 2,443 win probability estimates total. You can find the data here.
This data is consistent with the conclusions drawn from the blank records and ARRV, that there is a strength difference between the
two players in these games. The best-fit line modelling the game outcome lies above GameOutcome = WinProbEstimate: Elise won more games than its own estimates predicted. The table on the right breaks the win probability estimates into 'buckets' and compares the estimated win probability to the actual winning percentage for each bucket. Elise underestimated its actual winning chances in every bucket.

In summary: these three measures are all relatively simple to calculate, provided GCG files with full racks for both players and Elise's win probability estimates are available. Since the strength tests include this data in their output GCGs, all three can be used going forward. The blanks method is very simple, and can quickly and easily point out strength differences between players given a large enough disparity. ARRV, along with addressing play strength questions, has the helpful ability to distinguish groups of games that are well-played from those that contain mistakes, and it also lets us examine results in games with extreme differences in tile luck. The win-probability method controls for luck without making any assumptions about "average" tile or rack values, which is valuable for controlling for what might be called 'board luck'.

— CMS