One of the big problems with single-elimination tournaments, like the one used in the World Cup, is that they are exceptionally poor at producing absolute rankings for teams that don’t get through to the final round or so. In their preliminary ranking, Blood & Thunder attempted to use the score differences from the Top 16 to rank the losers of those matchups (which included Team Scotland). This is problematic, as it doesn’t take into account the large skill differential at the top end of the tournament (even England, breaking records against USA, still only scored *half* as much as their opponents) – the *same* team placed against USA or Canada would record a radically different score differential, and this effect cannot simply be disentangled from the relative performance of the “second 8”.

Driven by this, we have performed some statistical analysis on the World Cup scores as a whole, in order to attempt to provide a firmer basis for ranking the 30 teams in terms of their actual strength, rather than their performance in a tournament.

Our approach is based upon the idea that the *ratio* of the scores in a bout is a better model for the relative skill of two teams than the difference. This is intuitively reasonable: a team twice as strong as its opponent should outscore it by roughly the same factor whether the bout is high- or low-scoring, whereas a score difference scales with the overall pace of the game.

The other advantage of choosing a relative skill measure based on ratios is that we can infer the relative skill of two teams who have not played each other by comparison via a common opponent. If Team A’s relative skill against Team C is X, and Team B’s against Team C is Y, then the relative skill of Team A against Team B should be X/Y.
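As a concrete sketch of this common-opponent inference (the scores and team labels here are invented for illustration, not taken from the tournament):

```python
# Sketch of the common-opponent inference, with invented scores.

def bout_ratio(score_for, score_against):
    """Relative skill of one team over another, from a single bout."""
    return score_for / score_against

# Suppose Team A beat Team C 200-100, and Team B beat Team C 150-100.
ratio_a_vs_c = bout_ratio(200, 100)  # X = 2.0
ratio_b_vs_c = bout_ratio(150, 100)  # Y = 1.5

# Team A's inferred skill relative to Team B is then X / Y.
ratio_a_vs_b = ratio_a_vs_c / ratio_b_vs_c  # 2.0 / 1.5, about 1.33
```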

We can also build longer chains of inference on the same model, involving more intermediate opponents, but clearly the accumulated error also increases as the distance between the compared teams increases. In order to reduce this, we can average over all of the possible chains of a given length to produce a composite relative rating for that pair, assuming that the errors will partially cancel.

(For example, if Teams A and B have not played each other, but have both played Teams C,D and E, then we average the ratios from A-C-B, A-D-B and A-E-B to produce the final “length 1” ratio comparison.)

Examining the structure of the chains of inference, we can always build a “length N” chain of inference by combining the results of “length N-1” chains and shorter. Relying on the precalculated results for the previous chains improves the efficiency of our calculation by orders of magnitude and reduces the margin of error in our implementation.
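One step of this bootstrapping might look like the following sketch, in which existing ratio estimates are composed with direct bout ratios, and all chains found for an unplayed pairing are combined with a geometric mean (the function name and the data are our own illustration, not the tool’s actual API):

```python
import math

def extend_ratios(known, direct):
    """One inference step: compose existing ratio estimates ('known')
    with direct bout ratios ('direct', keyed by (team, opponent)) to
    estimate ratios for pairings not yet covered. Multiple chains for
    a pair are combined with a geometric mean so that errors partially
    cancel. Illustrative sketch only, with invented data."""
    new = dict(known)
    teams = {t for pair in direct for t in pair}
    for a in teams:
        for b in teams:
            if a == b or (a, b) in known:
                continue
            chains = [known[(a, c)] * direct[(c, b)]
                      for c in teams
                      if (a, c) in known and (c, b) in direct]
            if chains:
                # geometric mean over all chains through one intermediary
                new[(a, b)] = math.exp(
                    sum(math.log(r) for r in chains) / len(chains))
    return new
```

Starting from the direct bout ratios and applying this step repeatedly yields the “length 2”, “length 3”, … estimates, each reusing the previous level’s results.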

We need to build chains of inference up to four teams long in order to estimate ratios for all of the possible pairings of teams in the World Cup, because the teams in the Consolation rounds had the most limited competition in the Cup.

In order to provide a check on the accuracy of the inferences as chains become long, we also calculate the “self-ranking” of each team, where it is available for a given chain length. This is the strength that the inference would assign to the team if it were playing itself – clearly, for perfect inferences, this should always be 1. The deviation of the self-rankings from 1 is a measure of how much error the inference chains have accumulated so far. We used the self-rankings to select the best performing chain combination process (taking the geometric mean rather than the arithmetic mean as our average produces much better stability, as well as being theoretically justified). For the World Cup data, only the self-rankings for Sweden deviate significantly from the expected value (having a value of about 1.3 at rank 3). (We suspect that this is because Sweden is also the only team to have achieved a perfect shutout in the tournament, against Japan, and thus it encounters an inevitable error from the uncertainty this produces in the relative rankings between the two teams.)
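The self-ranking check itself is simple: compose the round-trip chains a team has against itself through each available intermediary, and see how far the combined value drifts from 1. A sketch (names and ratios invented):

```python
import math

def self_ranking(team, ratios, intermediaries):
    """Skill ratio the inference assigns a team against itself, via
    round-trip chains (team -> c -> team) through other teams. For
    perfect inferences this is exactly 1.0; its deviation from 1.0
    measures accumulated chain error. Chains are combined with a
    geometric mean. Illustrative sketch with invented data."""
    chains = [ratios[(team, c)] * ratios[(c, team)] for c in intermediaries]
    return math.exp(sum(math.log(r) for r in chains) / len(chains))
```

With perfectly consistent ratios (e.g. A vs C is 2.0 and C vs A is 0.5) the round trip gives exactly 1.0; inconsistent data, such as that produced by a shutout, pushes the value away from 1.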

Given a matrix of all of the ratios of skill between two teams, we can sort the list of teams via the full matrix (choosing the shortest chains of inference for each ratio we need).

We choose to use a topological sort for our data: a sorting approach that builds an ordered list from a tree of *dependencies* – in our case, the requirement that winners of a bout are ranked above losers. As topological sorts are not dependent on having a complete ordering for all items, we can perform topological sorts even at earlier rankings in the data, which returns an ordered set of “equivalence classes”: lists of teams that we can say are all superior to teams in the classes preceding them, and inferior to teams in the classes after them, but which we cannot separate using data available at this inference level.
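A layered topological sort of this kind can be sketched as follows: repeatedly peel off the set of teams that beat no remaining team, so each peeled layer becomes one equivalence class, listed weakest class first (the function and data are our own illustration, not the tool’s code):

```python
def topological_classes(teams, beats):
    """Layered topological sort over 'winner ranks above loser'
    dependencies. 'beats' is a set of (winner, loser) pairs. Each
    layer peeled off is an 'equivalence class' the data cannot
    separate, returned weakest class first. If a cycle prevents any
    peel, the remainder is returned as one unordered class.
    Illustrative sketch with invented fixtures."""
    remaining = set(teams)
    classes = []
    while remaining:
        # teams that beat no team still unranked
        bottom = {t for t in remaining
                  if not any((t, l) in beats for l in remaining)}
        if not bottom:
            # cyclic dependency: return the rest unordered
            classes.append(sorted(remaining))
            break
        classes.append(sorted(bottom))
        remaining -= bottom
    return classes
```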

At Rank *0* (using only data directly from bouts in the World Cup), the topological equivalence classes are:

******* RANK 0 *******

Japan, Switzerland, PuertoRico, SouthAfrica

Portugal, Mexico, Italy, Wales, Netherlands, Spain, Chile

Denmark, Norway, Germany, Greece, Brazil, WestIndies

Belgium, France, Ireland, NewZealand, Colombia

Sweden, Argentina, Scotland

Finland

Canada

England, Australia

USA

At each higher rank, we sort only within the topological classes provided at the rank before, to prevent higher inference levels from destroying orderings of better provenance.

As our inference chains extend, we cannot guarantee that cyclic dependencies will not form in our topological graph. These occur when a valid ordering cannot be constructed between teams, assuming that all of their priors are accurate (i.e., we derive Team A > Team B > Team C, but also have Team C > Team A). We find that these occur only rarely, and only for inference chains of at least rank 2 (our first such example is West Indies > Norway > Brazil > Greece > West Indies at that rank). Our procedure is to return the equivalence class unordered to the pool, to allow a higher order of inference to attempt to break the cycle. (In the case of the example above, this does not happen at any inference level, leading us to believe that the teams should be equally ranked.)

The progressive rankings produced from the original data set at increasing levels of inference are:

*************** RANK 1 ***************

Japan, Switzerland, PuertoRico

SouthAfrica

Portugal, Mexico, Spain

Netherlands, Italy, Chile

Wales

WestIndies, Norway, Greece, Brazil

Germany, Denmark

Ireland, France, Colombia

Belgium

NewZealand

Sweden, Scotland, Argentina

Finland

Canada

Australia

England

USA

*************** RANK 2 ***************

Japan, PuertoRico

Switzerland

SouthAfrica

Portugal, Mexico, Spain

Italy, Chile

Netherlands

Wales

WestIndies, Norway, Greece, Brazil

Denmark

Germany

Ireland, France, Colombia

Belgium

NewZealand

Argentina

Scotland

Sweden

Finland

Canada

Australia

England

USA

*************** RANK 3 ***************

Japan

PuertoRico

Switzerland

SouthAfrica

Mexico

Portugal

Spain

Italy, Chile

Netherlands

Wales

WestIndies, Norway, Greece, Brazil

Denmark

Germany

Colombia

Ireland

France

Belgium

NewZealand

Argentina

Scotland

Sweden

Finland

Canada

Australia

England

USA


As with all rankings, this ranking is somewhat vulnerable to random chance on the day. In particular, blowouts are problematic for all rankings, as they contain almost no information (other than that one team was exceptionally better than the other), because the losing team’s ability to score at all is essentially dominated by noise. At the opposite end of the scale, knife-edge games give the impression that two teams are very close in skill, but do not unambiguously provide a measure of which is the better (just one jam different might have changed the winner). Knife-edge games are problematic for our topological sorter, as they of course imply partitions that might not have existed if the other team had won, while blowouts present problems for our inference engine, as they return less information to the chains they participate in than other games.

In order to guard against the accumulation of error from the sample games we have, we added “fuzz factors” to the scores of each game, and recalculated the predicted rankings with the slightly perturbed scores. (That is, we construct a new set of scores, where all the games went very slightly differently, and then see what ranking we would have got in that case.) We repeated the inference and topological sort over 10,000 sets of perturbed scores. We then combined the resulting sets of rankings to gain an impression of how the ranking for a given team would vary.
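The perturbation step might be sketched like this (the 5% fuzz scale and the game data are our own assumptions for illustration, not the values used in the original analysis):

```python
import random

def perturb_scores(games, fuzz=0.05, rng=random):
    """Apply a 'fuzz factor' to every score: each score is multiplied
    by a random factor in [1 - fuzz, 1 + fuzz], simulating a set of
    games that all went very slightly differently. fuzz=0.05 is an
    assumed scale, not the value from the original analysis.
    'games' is a list of (team_a, team_b, score_a, score_b)."""
    return [(a, b,
             max(1, round(sa * (1 + rng.uniform(-fuzz, fuzz)))),
             max(1, round(sb * (1 + rng.uniform(-fuzz, fuzz)))))
            for (a, b, sa, sb) in games]
```

Running the full inference-plus-sort pipeline once per perturbed score set gives one simulated ranking per run.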

This has two effects: random perturbations in scores tend to break the cyclic dependencies in the original data, allowing “drawn” teams to be separated statistically; and our higher-order, and more vulnerable, inferences are allowed to vary across their uncertainty ranges, allowing us to measure the actual certainty in their predictions.

Calculating the mean and variance for these rankings allows us to calculate a final ranking synthesised over all of the simulations.
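The synthesis step can be sketched as follows: order teams by their mean simulated position, with the spread of positions giving the confidence in each placing (data invented; this is our illustration of the idea, not the tool’s code):

```python
from statistics import mean, pstdev

def synthesise_ranking(simulated):
    """Combine many simulated rankings ({team: position} dicts, one
    per perturbed score set) into a final ordering by mean position.
    The standard deviation of a team's position measures the
    confidence in its final placing. Illustrative sketch."""
    teams = list(simulated[0])
    stats = {t: (mean(s[t] for s in simulated),
                 pstdev(s[t] for s in simulated))
             for t in teams}
    order = sorted(teams, key=lambda t: stats[t][0])
    return order, stats
```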

Below is a “heatmap” from the completed run of simulations, showing the distribution of rankings for the various teams. The teams are listed in order of final ranking. The horizontal bar beside each team shows how likely the team was to be ranked in a given place, over all of the 10,000 possible rankings calculated: a completely white box shows that 100% of the rankings placed the team in that position; a completely black box shows that 0% of the rankings placed the team in that position.

We have also coloured each team name based on the estimated confidence in the team’s final ranking – light blue is high confidence in that ranking, and dark blue is low confidence.

The first thing that is clear from the ranking is that the results of the semifinal and finals are in no serious doubt. While an England v Australia bout would be an exciting affair, we have high confidence that England would win the matchup.

It is striking that Team Germany, who were eliminated from the tournament at the Group stage, are consistently ranked at 14th place by our analysis, with high confidence. Germany were hit by a particularly hard Group, facing both eventual-2nd-place England and eventual-9th-place Ireland, and there was no room for a third team in the Top 16 after the latter had qualified. This is a big problem with Group-selection processes into single-elimination tournaments, and we would have preferred to have seen a more “global” playoff scheme for the single-elimination phase (for example, three rounds of Swiss-selection, which pairs off teams first randomly, and then against teams who have won or lost as many times as they have) to avoid the “local minimum” problem.

There is some statistical spread in the middle part of the ordering, which is caused by the same effect as above. Teams on the border of the Top 16 selection can pop in and out of the Top 16 depending on small perturbations of their performance, especially with poorly seeded Groups.

However, the spread is not terrible for any particular team, so we are happy to pronounce our ranking for the Blood & Thunder World Cup 2014 as:

1. USA

2. England

3. Australia

4. Canada

5. Finland

6. Sweden

7. Scotland

8. Argentina

9. NewZealand

10. Belgium

11. Ireland

12. France

13. Colombia

14. Germany

15. Denmark

16. Greece

17. Brazil

18. Norway

19. WestIndies

20. Wales

21. Chile

22. Netherlands

23. Italy

24. Spain

25. Portugal

26. Mexico

27. SouthAfrica

28. Switzerland

29. PuertoRico

30. Japan

As an additional check, you can see that our international ranking is compatible with the results of Super Brawl of Roller Derby, Road to Dallas and relatively compatible with the European Championships (as the latter was a single elimination tournament, precise compatibility is less of an issue). (The Blood & Thunder World Cup Tournament ranking is not, as the effects of the tournament are convolved with the power rankings themselves.)

[The source code for the inference tool is available from: https://code.google.com/p/ranking-chain-inference/ ]
