How the ranking works

Every time you pick a kit, two numbers move. Here is what is actually happening under the hood, and why the leaderboard is sorted the way it is.

The rating system: Glicko-2

Every kit starts at a rating of 1500. When two kits go head to head and you pick a winner, the winner gains points and the loser loses points. How far each rating moves depends on how surprising the result was.

A heavily favoured kit beating an outsider gains very little, because the result is expected.
An upset where the underdog wins moves the ratings a lot, because the system just learned something new.
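
To make "surprising" concrete, here is a minimal sketch of a simplified, fixed-step update. It is not the site's actual code: the step size `K` and both function names are made up for illustration, and the uncertainty described further down is ignored.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that kit A beats kit B on an Elo-style logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def points_gained(winner_rating: float, loser_rating: float, k: float = 32.0) -> float:
    """Points the winner gains; k is an assumed, illustrative step size."""
    # The less likely the win looked beforehand, the bigger the gain.
    return k * (1.0 - expected_score(winner_rating, loser_rating))


favourite_wins = points_gained(1800, 1400)  # expected result, small move
underdog_wins = points_gained(1400, 1800)   # upset, large move
print(round(favourite_wins, 1), round(underdog_wins, 1))
```

With these assumed numbers the favourite gains about 3 points for the expected win, while the underdog would gain about 29 for the upset.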

This is the same idea as the Elo system used in chess, but with one important addition. Each kit also tracks an uncertainty number, formally called the Rating Deviation, or RD. A brand new kit has high uncertainty (RD around 350). As it collects votes, its rating settles into a stable value and its uncertainty shrinks.

The full algorithm is called Glicko-2, and it was designed by Mark Glickman as an improvement on Elo for exactly this kind of problem, where some players have lots of games and others have only a few.
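
For the curious, the sketch below follows the update steps from Glickman's published description of Glicko-2. It is an illustration, not the code running on this site. The function name, the starting volatility of 0.06 and the constant `tau=0.5` are assumptions taken from Glickman's own worked example, and whether ratings here update after every vote or in batches is not something this page specifies.

```python
import math

SCALE = 173.7178  # converts between the 1500-based scale and Glicko-2's internal scale


def glicko2_update(rating, rd, volatility, games, tau=0.5, eps=1e-6):
    """Return (new_rating, new_rd, new_volatility) for one kit.

    games: non-empty list of (opp_rating, opp_rd, score), score 1.0 = win, 0.0 = loss.
    tau and the starting volatility are assumed values, not the site's.
    """
    if not games:
        raise ValueError("need at least one game")
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE

    v_inv = 0.0        # accumulates 1 / (estimated variance)
    delta_sum = 0.0    # accumulates g * (actual - expected)
    for opp_rating, opp_rd, score in games:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        g = 1.0 / math.sqrt(1.0 + 3.0 * phi_j ** 2 / math.pi ** 2)
        e = 1.0 / (1.0 + math.exp(-g * (mu - mu_j)))  # expected score
        v_inv += g * g * e * (1.0 - e)
        delta_sum += g * (score - e)
    v = 1.0 / v_inv
    delta = v * delta_sum

    # Solve for the new volatility with the iterative procedure from the description.
    a = math.log(volatility ** 2)

    def f(x):
        ex = math.exp(x)
        top = ex * (delta ** 2 - phi ** 2 - v - ex)
        return top / (2.0 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    new_vol = math.exp(A / 2.0)

    # Widen the uncertainty by the volatility, then shrink it with the new evidence.
    phi_star = math.sqrt(phi ** 2 + new_vol ** 2)
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    new_mu = mu + new_phi ** 2 * delta_sum
    return SCALE * new_mu + 1500.0, SCALE * new_phi, new_vol


# A brand new kit (1500, RD 350) wins its first matchup against another new kit.
print(glicko2_update(1500.0, 350.0, 0.06, [(1500.0, 350.0, 1.0)]))
```

After that single win the rating rises above 1500 and the RD drops below 350, which is the "settling" behaviour described above.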

Why we rank by "rating minus 2x uncertainty"

Imagine two kits:

Kit A has 2 votes, both wins. Its mean rating is sky high.
Kit B has 80 wins and 20 losses. Its mean rating is a bit lower, but we are very confident in it.

Sorting by raw rating would put Kit A above Kit B, which is clearly wrong. Two votes is not enough evidence to crown anything. So instead we sort the leaderboard by rating minus 2 times the uncertainty, which is a conservative lower bound on the true rating.

Kit A's lower bound is low because its uncertainty is huge. Kit B's lower bound is high because its uncertainty is small. As Kit A collects more votes, its uncertainty shrinks and it climbs the board on merit, not on luck. Reddit's "best" comment sort relies on a similar idea: it ranks by the lower bound of a confidence interval rather than by the raw score.
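
Here is a small sketch of that sort. The `Kit` class, the function names and the ratings and RDs are invented for illustration; only the "rating minus 2 times RD" rule comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Kit:
    name: str
    rating: float
    rd: float  # Rating Deviation, the uncertainty


def conservative_score(kit: Kit) -> float:
    """Lower bound used for sorting: rating minus 2 times the uncertainty."""
    return kit.rating - 2.0 * kit.rd


def leaderboard(kits: list[Kit]) -> list[Kit]:
    """Highest conservative score first."""
    return sorted(kits, key=conservative_score, reverse=True)


# Made-up numbers: Kit A has the higher raw rating but is barely tested.
kit_a = Kit("Kit A", rating=1750.0, rd=280.0)  # lower bound 1190
kit_b = Kit("Kit B", rating=1700.0, rd=60.0)   # lower bound 1580

for kit in leaderboard([kit_a, kit_b]):
    print(kit.name, conservative_score(kit))
```

Kit B lands first even though its raw rating is lower, because far less of its score is eaten up by uncertainty.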

How we pick which kits to show you

Random pairing would waste your time. Showing Brazil home next to an obscure away kit teaches the system almost nothing, because we already know the outcome. So pair selection is weighted toward matchups that are actually informative.

Cold start. Roughly half the time, the picker prioritises kits that have fewer than five votes so far. Every kit gets exposure quickly.
Close ratings. Among warmed-up kits, the picker prefers pairs whose ratings are near each other. A genuine 50/50 matchup teaches the system more than a foregone conclusion.
High uncertainty. Pairs where at least one kit still has a wide RD are favoured, because that vote will move the rating more.
No repeats. If you have already voted on a pair in this session, it gets a heavy penalty and is unlikely to show up again.
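
The four rules above could be combined roughly like this. Treat it as a sketch under stated assumptions rather than the real picker: the weighting formula, `RATING_SCALE`, `REPEAT_PENALTY` and all names are hypothetical. Only the "fewer than five votes" threshold and the "roughly half the time" figure come from the text.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Kit:
    name: str
    rating: float
    rd: float
    votes: int


COLD_START_VOTES = 5      # from the text: fewer than five votes
COLD_START_CHANCE = 0.5   # from the text: roughly half the time
RATING_SCALE = 200.0      # assumed: how quickly the closeness bonus fades
REPEAT_PENALTY = 0.01     # assumed: multiplier for pairs already seen


def pair_weight(a: Kit, b: Kit, seen: set) -> float:
    """Higher weight means the pair is more likely to be shown."""
    closeness = 1.0 / (1.0 + abs(a.rating - b.rating) / RATING_SCALE)
    uncertainty = max(a.rd, b.rd) / 350.0  # at least one wide RD is enough
    weight = closeness * (0.5 + uncertainty)
    if frozenset((a.name, b.name)) in seen:
        weight *= REPEAT_PENALTY  # heavy penalty, but not impossible
    return weight


def pick_pair(kits: list, seen: set, rng: random.Random) -> tuple:
    """Pick one matchup; needs at least two kits."""
    pairs = list(itertools.combinations(kits, 2))
    if rng.random() < COLD_START_CHANCE:
        cold = [p for p in pairs if min(p[0].votes, p[1].votes) < COLD_START_VOTES]
        if cold:
            pairs = cold  # restrict to pairs that include a barely-seen kit
    weights = [pair_weight(a, b, seen) for a, b in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


kits = [
    Kit("home", 1720.0, 70.0, 90),
    Kit("away", 1690.0, 80.0, 75),
    Kit("third", 1400.0, 90.0, 60),
    Kit("retro", 1500.0, 350.0, 0),
]
first, second = pick_pair(kits, seen=set(), rng=random.Random(7))
print(first.name, "vs", second.name)
```

Using weights rather than hard filters means every pair stays possible, so a repeat or a lopsided matchup can still appear occasionally instead of being locked out.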

Keeping things fair

Votes are anonymous, tied only to a session cookie. To keep the leaderboard honest without forcing logins, the server enforces a short cooldown between votes and a daily cap per session. Each matchup is also issued a one-shot token, so old or replayed votes get rejected.
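
A minimal in-memory sketch of those three checks is shown below. The class name, the cooldown length and the cap are placeholders rather than the server's real values, and a real deployment would keep this state somewhere more durable than a Python dict.

```python
import secrets
import time

COOLDOWN_SECONDS = 2.0  # assumed value
DAILY_CAP = 200         # assumed value


class VoteGate:
    """Hypothetical gatekeeper: one-shot tokens, cooldown, daily cap."""

    def __init__(self, now=time.time):
        self._now = now        # injectable clock, handy for testing
        self._tokens = {}      # token -> session id it was issued to
        self._last_vote = {}   # session id -> time of last accepted vote
        self._daily = {}       # (session id, day number) -> accepted votes

    def issue_token(self, session_id: str) -> str:
        """Called when a matchup is shown to a session."""
        token = secrets.token_urlsafe(16)
        self._tokens[token] = session_id
        return token

    def accept_vote(self, session_id: str, token: str) -> bool:
        """Return True if the vote should count."""
        if self._tokens.get(token) != session_id:
            return False  # unknown, already used, or someone else's token
        del self._tokens[token]  # single use: a replay will now fail

        now = self._now()
        last = self._last_vote.get(session_id)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return False  # voting too fast

        day = int(now // 86400)
        count = self._daily.get((session_id, day), 0)
        if count >= DAILY_CAP:
            return False  # daily cap reached

        self._last_vote[session_id] = now
        self._daily[(session_id, day)] = count + 1
        return True


gate = VoteGate()
token = gate.issue_token("session-123")
print(gate.accept_vote("session-123", token))  # first use
print(gate.accept_vote("session-123", token))  # replay of the same token
```

In this sketch a token is consumed even when the cooldown or cap rejects the vote, which is one possible design choice: the client simply asks for a fresh matchup.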

Get in touch

Spot a bug, have a question about the maths, or just want to argue that your favourite kit was robbed? Email khimor@osbdata.com.
