While Roller Derby still has a huge problem with the availability of public statistics relative to most other sports, there are quite a few ranking systems around to rank teams on score.
Most of them, however, draw from WFTDA’s own points-based ranking scheme, with the sole exception being Flat Track Stats’ Elo-based ranking. (The Scottish Roller Derby ranking mechanisms are also totally different to WFTDA’s, but we don’t believe that we have much mindshare.)
Meanwhile, there’s a whole field of research in statistical ranking approaches for sports, partly because of the financial importance of betting in sports, and partly because of the widespread availability of public datasets for the popular American college Football, Basketball and Baseball leagues, which makes testing models easy.
Of course, as these models are trained against American Football, it’s not obvious that they will work as well against Roller Derby – both because the level of “randomness” is different in different sports, and because Roller Derby’s mechanics are fairly unique*.
As always, we’ll be attempting to rank as large a number of teams as possible, rather than limiting ourselves to WFTDA members, or members of higher or lower brackets. This makes it particularly hard for any ranking system, as the skill disparity between the lowest and highest ranks is particularly large (and those teams have no competitors in common)†. We’ll also, for comparison, run the same algorithms against the WFTDA Top 40 teams [calculated as of 30 April 2016], to give them an easier run.
The ranking models we will be covering are the “Offense/Defense Model” of Govan, extended Massey rankings and Keener rankings, Principal Value Decomposition, and a novel physically inspired spring model which we introduce in this article.
As a brief overview:
Govan’s Offense/Defense model assumes that teams have two ratings – “offense” (ability to score) and “defense” (ability to reduce opponent score) – which it attempts to calculate on the basis of an iterative series of estimations. This turns out to be mathematically equivalent to the Sinkhorn-Knopp matrix balancing procedure. Uniquely amongst the rating schemes here, the ODM is concerned with the actual scores produced by both teams in a game, rather than score differences or ratios.
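To make the iteration concrete, here is a minimal sketch of the alternating offense/defense update, not the implementation used for our results. The function name, the fixed iteration count and the example scores are all our own assumptions. It also assumes every team has scored and conceded at least one point, otherwise the divisions fail.

```python
def odm_ratings(scores, iterations=100):
    """Alternating Offense/Defense update (illustrative sketch).

    scores[i][j] = points team i scored against team j (0 if they never met).
    Returns (offense, defense); a LOWER defense value means a better defense.
    """
    n = len(scores)
    defense = [1.0] * n
    offense = [1.0] * n
    for _ in range(iterations):
        # Points count for more when scored against a strong (low) defense.
        offense = [sum(scores[i][j] / defense[j] for j in range(n))
                   for i in range(n)]
        # Points conceded count for less when conceded to a strong offense.
        defense = [sum(scores[i][j] / offense[i] for i in range(n))
                   for j in range(n)]
    return offense, defense


# Hypothetical three-team round robin (made-up scores).
example_scores = [
    [0, 200, 250],   # team 0 scored 200 against team 1, 250 against team 2
    [100, 0, 180],
    [80, 120, 0],
]
offense, defense = odm_ratings(example_scores)
# A single rating is commonly taken as offense divided by defense.
overall = [o / d for o, d in zip(offense, defense)]
```

Only the ordering of `overall` is meaningful here; the absolute scale depends on the starting values.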
Massey’s Least Squares Ranking approach has been covered in the blog before, as it is the basis of the simpler of the two current SRD rankings.
Keener’s paper actually covers multiple ranking methods which he had experimented with. However, the most commonly attributed ranking mechanism uses a modified score ratio [(score + n) / (total score + 2n), where n can be chosen] to form a matrix of values. The Perron vector (eigenvector with the largest eigenvalue) for this matrix is then taken to be the ranking of the teams. This is therefore a special case of the PVD approach we discuss next.
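As a rough sketch of that procedure, the code below builds the modified score ratio matrix and finds its Perron vector by plain power iteration. The function name, the bout tuple layout and the example scores are our own assumptions, and Keener’s paper includes further refinements that are not shown. It also assumes every team is linked to every other through some chain of bouts; plain power iteration can fail to settle on some schedules (a strictly bipartite one, for instance).

```python
def keener_ratings(bouts, n=4.0, iterations=200):
    """bouts: list of (team_a, team_b, score_a, score_b) tuples (assumed layout)."""
    teams = sorted({t for a, b, _, _ in bouts for t in (a, b)})
    index = {team: k for k, team in enumerate(teams)}
    size = len(teams)

    # points[i][j] = total points team i scored against team j.
    points = [[0.0] * size for _ in range(size)]
    played = set()
    for a, b, score_a, score_b in bouts:
        i, j = index[a], index[b]
        points[i][j] += score_a
        points[j][i] += score_b
        played.update({(i, j), (j, i)})

    # Modified score ratio: (score + n) / (total score + 2n); 0 if never met.
    matrix = [[(points[i][j] + n) / (points[i][j] + points[j][i] + 2 * n)
               if (i, j) in played else 0.0
               for j in range(size)] for i in range(size)]

    # Power iteration towards the Perron vector, normalised to sum to 1.
    vec = [1.0 / size] * size
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(size))
               for i in range(size)]
        total = sum(nxt)
        vec = [v / total for v in nxt]
    return dict(zip(teams, vec))


# Hypothetical bouts (made-up scores).
ratings = keener_ratings([("A", "B", 200, 100),
                          ("B", "C", 180, 120),
                          ("A", "C", 250, 80)])
```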
Principal Value Decomposition methods are also matrix mathematics approaches, based on the concept that the largest eigenvalue of a matrix of observations corresponds to the most significant signal in those observations – so its associated eigenvector must give the most significant ranking values for the underlying variables (the teams). In our implementation, we borrow from a multidimensional graph layout algorithm to fill out our matrix of results more completely first.
The novel Spring ranking mechanism uses a physically inspired approach to ranking teams. Model each team as a “puck” on a frictionless rod, connected to other pucks by springs which represent bouts played. Each spring has a relaxed length corresponding to the score ratio or score difference (or other metric) for the bout it represents, and a stiffness (resistance to compression and stretching) proportional to the recency of the bout (older bouts provide less resistance). We perform an optimisation to minimise the total energy of the spring system, and return the resulting positions of the pucks on the rod as their ranking (relative to the topmost puck). Unlike the other ranking mechanisms, this approach has a natural way to represent the different confidence/significance of a bout, due to age or other factors.
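The spring picture can be sketched in a few lines. This is an illustration under assumptions, not the optimiser we actually ran: we assume ideal (Hookean) springs, so the energy is half the stiffness times the squared stretch, and we relax one puck at a time to the point where the forces on it balance. The function name, tuple layout and example values are hypothetical.

```python
def spring_ranking(springs, sweeps=500):
    """springs: list of (team_a, team_b, rest_length, stiffness).

    rest_length is the preferred value of position[team_a] - position[team_b];
    the energy of a spring is 0.5 * stiffness * (actual - rest_length) ** 2.
    """
    teams = sorted({t for a, b, _, _ in springs for t in (a, b)})
    pos = {team: 0.0 for team in teams}
    for _ in range(sweeps):
        for team in teams:
            weighted_sum = 0.0
            total_stiffness = 0.0
            for a, b, length, stiffness in springs:
                if a == team:
                    # This spring would like the puck at pos[b] + length.
                    weighted_sum += stiffness * (pos[b] + length)
                    total_stiffness += stiffness
                elif b == team:
                    # This spring would like the puck at pos[a] - length.
                    weighted_sum += stiffness * (pos[a] - length)
                    total_stiffness += stiffness
            # Energy-minimising position for this puck, others held fixed.
            pos[team] = weighted_sum / total_stiffness
    top = max(pos.values())
    # Report positions relative to the topmost puck.
    return {team: x - top for team, x in pos.items()}


# Hypothetical bouts: the older A-C bout gets a weaker spring.
ranking = spring_ranking([("A", "B", 0.7, 1.0),
                          ("B", "C", 0.4, 1.0),
                          ("A", "C", 1.3, 0.5)])
```

In this made-up example the three bouts disagree slightly (0.7 + 0.4 is not 1.3), and the weaker A-C spring gives way most: the pucks settle at A = 0, B = -0.75, C = -1.2.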
Massey, PVD and Spring ranking schemes all require a choice of metric to use for a bout. Conventionally, you could use Score Difference, or the logarithm of Score Ratio [not pure Score Ratio, as we need additive quantities]. We test both, as well as a modified “Keener” Ratio derived from his approach, with n chosen as 4‡.
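For illustration, the three metrics might be written as below. The first two follow directly from their definitions. The third is an assumption on our part about how to turn the Keener ratio into an additive quantity: it takes the logarithm of the ratio of the two teams’ Keener values, in which the shared denominator cancels to leave (score + n) / (opponent score + n). All function names are hypothetical.

```python
import math


def score_difference(score, opponent_score):
    return score - opponent_score


def log_score_ratio(score, opponent_score):
    # Additive, unlike the raw ratio; undefined if either team scored 0.
    return math.log(score / opponent_score)


def keener_log_ratio(score, opponent_score, n=4):
    # Assumed form: log of [(s + n) / (total + 2n)] / [(o + n) / (total + 2n)].
    # The padding n keeps the value finite even for a shutout.
    return math.log((score + n) / (opponent_score + n))


# A 200-100 bout under each metric.
diff = score_difference(200, 100)
ratio = log_score_ratio(200, 100)
keener = keener_log_ratio(200, 100)
```

All three change sign when the two teams are swapped, which is what lets a ranking scheme add them up along a chain of bouts.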
As we’re sampling over a long period of time, Offense/Defense, Massey, Keener and PVD can also have “aging” parameters added to reduce the significance of older bouts to their calculation. This is somewhat ad hoc, but we perform an optimisation process in order to get the best value for this in terms of prediction¶.
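One simple form such an aging parameter could take is an exponential decay with a tunable half-life. This is purely an assumed example to show the idea; the decay shape, the function name and the 90-day default are hypothetical, not the values or form used for our results.

```python
from datetime import date


def age_weight(bout_date, as_of, half_life_days=90.0):
    """Weight in (0, 1]: 1.0 for a bout played today, halving every half-life."""
    age_days = (as_of - bout_date).days
    return 0.5 ** (age_days / half_life_days)


# A bout played 90 days before the end of the sampling period counts half.
weight = age_weight(date(2016, 2, 6), date(2016, 5, 6))
```

Each bout’s contribution to a ranking scheme would then be multiplied by its weight, and the half-life becomes the parameter to optimise.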
We ran the rankings on data derived from the Flat Track Stats bout database, sampling all bouts from the largest connected group of teams playing in the period [06 December 2015] to [06 May 2016]. (And also on the WFTDA Top 40, which is a subset of this group, for the same period – we have to extend our limits back to November to get a fully connected group here, due to the particular scheduling arrangements for T1 level teams!)
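Finding the “largest connected group” is a standard graph search. A minimal sketch, assuming bouts are given simply as pairs of team names (the function name and input layout are ours, not those of the Flat Track Stats database):

```python
from collections import defaultdict, deque


def largest_connected_group(bouts):
    """bouts: iterable of (team_a, team_b) pairs. Returns a set of team names."""
    neighbours = defaultdict(set)
    for a, b in bouts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    best = set()
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects every team reachable from `start`.
        group = {start}
        queue = deque([start])
        while queue:
            team = queue.popleft()
            for other in neighbours[team]:
                if other not in group:
                    group.add(other)
                    queue.append(other)
        seen |= group
        if len(group) > len(best):
            best = group
    return best


# D and E only ever played each other, so they fall outside the main group.
group = largest_connected_group([("A", "B"), ("B", "C"), ("D", "E")])
```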
(The rankings and test results are on Page 2)
*Differences between roller derby and football include the fact that Roller Derby allows both teams to score simultaneously, includes simultaneous offense and defence, and allows rounds of scoring to be called off asymmetrically by the “dominant” scorer.
†In more technical terms, all ranking systems do better the more fully-connected the graph of team games is. (That is: the more games each team has played with other teams.) The silo-ized nature of roller derby communities means that the widest possible graph for any sampling period also has those silos connected by only one or two bouts, meaning that the stability of ranking between those silos is very low – if that game was anomalous, then it affects the relative rankings of many other teams who never played each other directly.
‡The value of n here was optimised, approximately, from tests against the Spring model, but handily also matches the value of a single pass, which is nice for theoretical reasons. As n increases, the relative value of a massive blowout is reduced (and the difference between scoring 0 points and 1 point is also reduced). Roughly speaking, n interpolates between “score difference” and “score ratio” measures of importance.
¶Essentially, this is an iterative optimisation on the aging parameter based on the accuracy of predictions on the test dataset mentioned later. In other words, we optimise for forward predictive accuracy, specifically. A similar method was used to derive our optimal “home advantage” factor of 0.96.