October 26, 2006

Early Predictions

I’ve been having some fun predicting the outcome of the season with very little data. As many readers realized, these aren’t really predictions as such: they have far too much error and predict outcomes everyone knows won’t happen, like Phoenix getting fewer than 30 points. This is largely because one game’s goals against can significantly affect the average (a single 9-goals-against game contributes 9/9 = 1 goal to a 9-game average, which is a lot in terms of winning percentage). So I needed a formula to get a more accurate average for teams with anomalies in their data set (results that should only happen 1% of the time occurring in 10% of their games). The goal is to scale game scores that are too many standard deviations away from the team’s average so they count for less in that average.
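To see the sensitivity concretely, here is a quick sketch in Python with made-up game scores (none of these numbers come from a real team):

    # One blowout's effect on a small-sample goals-against average.
    ordinary = [3, 2, 4, 2, 3, 2, 3, 2]    # eight typical games (illustrative)
    with_blowout = ordinary + [9]          # add one 9-goals-against game

    print(sum(ordinary) / len(ordinary))          # 2.62 goals against per game
    print(sum(with_blowout) / len(with_blowout))  # 3.33: one game shifted the average by ~0.7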

The nice thing about these predictions is that I generally know the standard deviation for goals against (the problem data). If a team has a game that’s more than 2 standard deviations away from its average, that should only happen about 4 times in a season, and a game 3 standard deviations away about once every 5 seasons. In hockey, goals against average 2.85 per game with a standard deviation of 1.7. Thus an average team should have about 8 games in a season with 5 or more goals against; a bad team probably 16. So I take any game that’s beyond 1 standard deviation and give it a smaller weight. For a standard equal-weight average you multiply every term by 1/n; for my weights I multiply by ci/(Σci), where ci = 1 if the game is within 1 standard deviation of the average and ci = √(1.7/|μ − gai|) otherwise. The square root keeps more weight on the outliers than the raw ratio would; I used it because it produced slightly better results.
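Here is a minimal sketch of that weighted average (the function name and sample scores are mine, and I’m assuming the deviation inside the square root is taken as an absolute value):

    import math

    SIGMA = 1.7  # league-wide standard deviation of single-game goals against

    def robust_goals_against_avg(games):
        """Weighted goals-against average that down-weights outlier games.

        A game within one standard deviation of the team's raw average keeps
        weight 1; a game farther out gets weight sqrt(1.7 / |mu - ga_i|) < 1.
        """
        mu = sum(games) / len(games)  # raw team average
        weights = [
            1.0 if abs(ga - mu) <= SIGMA else math.sqrt(SIGMA / abs(mu - ga))
            for ga in games
        ]
        return sum(w * ga for w, ga in zip(weights, games)) / sum(weights)

    # The nine-game sample from above: the blowout now counts as about half a game.
    games = [3, 2, 4, 2, 3, 2, 3, 2, 9]
    print(f"raw average:    {sum(games) / len(games):.2f}")          # 3.33
    print(f"robust average: {robust_goals_against_avg(games):.2f}")  # 3.03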

The neat thing about predictions in sports is that if you can define an algorithm, you can test it on past data (and hope it works in the future). Since I have 2005-2006 data, I can test this algorithm on the first few games of that season and see how well it performs. The first test for any sort of regression is how much error a plain average has: the total sum of squared error for the average is 7927. My original model actually increased the error (one team had an error of 68, contributing almost half of the total error). So I reapplied the normalizing algorithm mentioned above and got a sum of squared errors of 5317, about 67% of the baseline, for an r² of 33%. Put another way, with 12% of the games played I was able to explain 33% of the final variability. 9 teams were within 5 points of their predictions (almost a third), and 19 were within 10. The standard deviation was 13 (for a ±26 confidence interval 95% of the time), which is better than the 20 I predicted before. The worst prediction was Dallas, who got 38 more points than predicted (they had a lot of goals against early on but won games), accounting for 27% of my error. While this is a regression-style analysis, the prediction is not based on a regression, but simply on the assumption that goals for and goals against correlate with winning (which is known).
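For anyone who wants to reproduce the scoring, here is a sketch of the evaluation (the function is mine; feeding it the thirty PTS/EPTS pairs from the tables below should land near the numbers above, give or take rounding in the tables):

    def evaluate(actual, predicted):
        """Score predictions: sum of squared errors, r^2 against the
        all-teams-average baseline, and teams within 5 and 10 points."""
        mean_actual = sum(actual) / len(actual)
        sse_baseline = sum((a - mean_actual) ** 2 for a in actual)    # error of a plain average
        sse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
        r2 = 1 - sse_model / sse_baseline
        within5 = sum(abs(a - p) <= 5 for a, p in zip(actual, predicted))
        within10 = sum(abs(a - p) <= 10 for a, p in zip(actual, predicted))
        return sse_model, r2, within5, within10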

The problem with comparing this year’s predictions to the 2005-2006 ones is that the start of the 2005-2006 season appears to have been more competitive than this year’s. So the best prediction was 109 points and the lowest was 60, creating reasonable minimums and maximums. This season teams don’t seem to want to be competitive: there are a significant number of good teams to start the season (Dallas, Anaheim, Ottawa, and Atlanta), not to mention the bad teams (Philadelphia, Chicago, Columbus, and Phoenix). The question of course is whether this season will be less competitive than past seasons (possibly a direction encouraged by the new CBA – I’ll look into that later). What I’m basically trying to say is that the best algorithm for 2005-2006 won’t produce the best results for 2006-2007, but it should be usable.

So without further ado, here are the 2005-2006 results (PTS = actual points, EPTS = predicted points, ERR = |PTS − EPTS|).

WEST:
Team                    PTS  EPTS      ERR
Detroit Red Wings       124  102.554   21.4
Dallas Stars            112   74.1465  37.9
Calgary Flames          103   89.7303  13.3
Nashville Predators     106   89.1933  16.8
San Jose Sharks          99   98.1116   0.9
Anaheim Ducks            98   91.138    6.9
Edmonton Oilers          95   89.3213   5.7
Colorado Avalanche       95   91        4.0
Vancouver Canucks        92   90.5752   1.4
Los Angeles Kings        89   88.5126   0.5
Minnesota Wild           84   85.8979   1.9
Phoenix Coyotes          81   89.3955   8.4
Columbus Blue Jackets    74   80.4389   6.4
Chicago Blackhawks       65   83.1687  18.2
St. Louis Blues          57   77.0534  20.1

EAST:
Team                    PTS  EPTS      ERR
Ottawa Senators         113  107.501    5.5
Carolina Hurricanes     112  102.183    9.8
New Jersey Devils       101   79.0032  22.0
Buffalo Sabres          110   92.838   17.2
Philadelphia Flyers     101   92.983    8.0
New York Rangers        100   98.7356   1.3
Montreal Canadiens       93   88.2429   4.8
Tampa Bay Lightning      92   98.0173   6.0
Toronto Maple Leafs      90   80.0578   9.9
Atlanta Thrashers        90   84.3788   5.6
Florida Panthers         84   94.902   10.9
New York Islanders       78   81.3242   3.3
Boston Bruins            74   89.3152  15.3
Washington Capitals      70   70.1098   0.1
Pittsburgh Penguins      58   80.2557  22.3

Of course I cannot predict, or for that matter know, what teams will do to solve problems early on. Boston, for example (projected for 89 points), traded their top forward, Joe Thornton, to San Jose. San Jose ended up predicted almost exactly, but only because two errors compensated for each other: the bad Dallas prediction (which over-predicts San Jose to win its games vs. Dallas) and the Thornton trade (better player, better team). This may not be all that useful at this point, but it’s a start. Unlike most predictions, I’m at least testing my hypothesis!
