WARNING: This post is going to skew long and won't contain any directly relevant fantasy advice to help you make your transfers this week. It might get a bit nerdy at times, though I'm far from classically trained in statistics, so the concepts, if not always the terminology, should be accessible to all. I also won't reach any definitive conclusion, as these models are a work in progress. If there's anything I haven't considered below, or anywhere you disagree with my conclusions, please post in the comments or over at Shots on Target, where you'll find a handy forum to discuss these issues. We're also only looking at forecasting goals here; assists will need a separate post.
Tracking historic success
The way I look at it, there are three distinct ways to track historic success:
- Historic classic stats (goals, assists etc)
- Historic fantasy points (similar to classic stats but accounting for all fantasy relevant events)
- Historic underlying stats (a much deeper understanding of a player's performance including his involvement in different areas of the field, his shots taken, where those shots were taken from, his passes completed etc).
I must admit that I struggle a bit here as it seems odd to totally ignore production to date when we still aren't totally certain about the relationships we'll explore below. If you were forecasting the chance of a die landing on '4' then, of course, historic data is totally useless, the odds are still 1/6. However, if you were unsure whether or not the die was loaded you might want to adjust your 1/6 estimate at some point once the data sample became significant. We don't have 'loaded' fantasy players but we do have outliers who have consistently shown an ability to out (or under) perform their underlying stats and thus taking some account of the historic production could act as a safety net to make sure we don't judge these players incorrectly. It's not an ideal solution and I'm still not sure if it's required but I do think this data at least deserves to be addressed rather than being simply discounted as unreliable.
Projecting future success
So what do we want to know about an individual player to help us forecast his future success? A few considerations:
1. How many, and what kind of, scoring opportunities is he getting each game?
Through three games this year, Michu had registered 8 shots, 4 of which were on target, and all 4 of those hit the back of the net. We will often comment that such a conversion rate is 'unsustainable', but what exactly does this mean? Well, the fact that Michu has hit the target 50% of the time looks about right and shouldn't be of concern. Last season midfielders hit the target around 44% of the time and it's reasonable to suggest that Michu is at, or above, league average. We'll get into individual adjustments below in point 2, but for now, we can conclude that this rate is roughly acceptable. The issue, however, is Michu's 4 goals from just 4 shots on target. Last season, midfielders converted shots on target to goals at a rate of 25%, so we would have expected Michu to have just a single goal, not the four he has registered at this point. We would therefore conclude that, if Michu continues to get chances at his current rate, he should regress to the mean in the coming weeks and won't perform at the same rate as he has to date. Note that we are not saying everything will equal out so that over the season he will necessarily have converted 25% of his shots on target into goals, only that 25% is the expected rate from here on.
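For anyone who wants to see the Michu arithmetic spelled out, here is a minimal sketch. The function and variable names are my own inventions for illustration; the only inputs are the figures quoted above.

```python
# League-average rate at which midfielders turned shots on target into
# goals last season, as quoted in the text.
MIDFIELDER_SOT_CONVERSION = 0.25


def expected_goals_from_sot(shots_on_target, conversion_rate):
    """Goals we would expect if the player finished at the given rate."""
    return shots_on_target * conversion_rate


# Michu through three games: 8 shots, 4 on target, 4 goals.
michu_shots, michu_sot, michu_goals = 8, 4, 4

on_target_rate = michu_sot / michu_shots
expected = expected_goals_from_sot(michu_sot, MIDFIELDER_SOT_CONVERSION)
surplus = michu_goals - expected  # goals scored above expectation

print(f"on-target rate: {on_target_rate:.0%}")
print(f"expected goals: {expected:.1f}, actual: {michu_goals}, surplus: {surplus:.1f}")
```

The surplus is the part we would not expect to repeat; the on-target rate itself sits close to the league figure.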
Now, the next issue to consider is what kind of shots a player is getting. This is intuitive as shots in the box will obviously be converted at a higher rate than those from long range, but this point really needs to be emphasised when you consider the differences. The below table shows the percentage of different shot types converted to goals last season:
Shot type | Midfielders | Forwards |
Total shots - inside box | 18% | 20% |
Total shots - outside box | 5% | 7% |
Shots on target - inside box | 35% | 39% |
Shots on target - outside box | 14% | 17% |
Table 1 - Conversion rates of shot types by position.
We can see that the differences are dramatic and thus we need to be careful when looking at total shots for players like Cazorla, who are prone to take a pop from well outside the area. Of course, he's very capable of hitting the back of the net from 30 yards, but even the most optimistic of Cazorla fans would have to concede that Fellaini's 30 total shots are quite a lot stronger than Cazorla's 29, when you factor in that 25 of Fellaini's were taken inside the area compared to just 10 for the Spaniard. Indeed, using the averages above, and ignoring shots on target for a second, Cazorla would be expected to have scored 2.75 goals (10 shots inside the box*18% + 19 shots outside the box*5%), compared to 4.75 for Fellaini (25*18% + 5*5%).
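The Cazorla and Fellaini comparison can be reproduced with a few lines. This is only a sketch: the dictionary and function names are mine, and the rates are the midfielder 'total shots' rows from table 1.

```python
# Share of ALL shots (on target or not) that midfielders converted to
# goals last season, split by shot location (table 1).
MIDFIELDER_SHOT_CONVERSION = {"inside": 0.18, "outside": 0.05}


def expected_goals(shots_inside, shots_outside, rates=MIDFIELDER_SHOT_CONVERSION):
    """Expected goals from shot counts, weighting each location by its rate."""
    return shots_inside * rates["inside"] + shots_outside * rates["outside"]


cazorla = expected_goals(shots_inside=10, shots_outside=19)
fellaini = expected_goals(shots_inside=25, shots_outside=5)

print(f"Cazorla:  {cazorla:.2f} expected goals")
print(f"Fellaini: {fellaini:.2f} expected goals")
```

Two players with near-identical shot totals end up two goals apart purely because of where the shots were taken.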
One potential solution to the above dilemma is to look purely at shots on target, which have the strongest correlation to goals over the course of a season. The correlations between different player stats and goals last season are shown below:
Player Stat | Correlation |
Shots on target | 91% |
'Big chances' (per Opta) | 90% |
Shots inside the box | 87% |
Total shots | 86% |
Touches in opponents' box | 78% |
Table 2 - Correlation between different player stats and goals.
Long term I think that exclusively looking at 'shots on target' could be the right answer, but I believe a small adjustment is needed when dealing with small sample sizes. Consider, for example, Papiss Cisse through seven weeks this season. He's registered a very useful 16 shots (12th among forwards), but has managed to hit the target just 4 times (t28th), not scoring in the process. Looking purely at shots on target would suggest that he 'should' have scored somewhere between one and two goals, depending on how clinical you believe he can really be (league average is somewhere around 34% but last season Cisse scored with 57% of all shots on target). The issue is that last season he hit the target with 54% of his shots, and did so with 46% of his shots in the Bundesliga with Freiburg. Therefore, it's likely that his 25% hit-the-target rate should also improve, perhaps to as high as 50%, which would give him a projected eight shots on target for the year and thus an expected goal haul of between two and four for the year to date. Either way the data suggests he is due for some positive regression; the way we split it just dictates how much.
I would understand if others were keen to just look at shots on target but, given the above, so long as we're dealing with small sample sizes I plan to add a thin layer to the projection model to account for total shots. The adjustment will use the lower of a player's own hit-the-target rate and the league average (this should hopefully take care of the likes of Suarez, who's never seen a shot he wouldn't take and historically has a poor on-target rate of 36% while at Liverpool).
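As a rough sketch of that 'thin layer', the cap might look like the following. The function name is mine, the shot count of 20 is a made-up figure purely for illustration, and the 44% league rate is the midfielder number quoted earlier standing in for a forwards figure that I haven't given here.

```python
def projected_shots_on_target(total_shots, player_on_target_rate, league_on_target_rate):
    """Project shots on target from total shots.

    The player's own historic rate is used only when it is BELOW the
    league average; otherwise the league average acts as a cap.
    """
    rate = min(player_on_target_rate, league_on_target_rate)
    return total_shots * rate


# Assumption: 0.44 is the midfielder rate from the text, used as a
# placeholder league figure. 20 shots is a hypothetical sample.
LEAGUE_ON_TARGET_RATE = 0.44
HYPOTHETICAL_SHOTS = 20

# A high-volume, low-accuracy shooter (36% historic rate) keeps his own rate.
low_accuracy = projected_shots_on_target(HYPOTHETICAL_SHOTS, 0.36, LEAGUE_ON_TARGET_RATE)

# A player with a 54% historic rate is capped at the league average.
high_accuracy = projected_shots_on_target(HYPOTHETICAL_SHOTS, 0.54, LEAGUE_ON_TARGET_RATE)

print(f"36% shooter: {low_accuracy:.1f} projected shots on target")
print(f"54% shooter: {high_accuracy:.1f} projected shots on target")
```

Taking the lower of the two rates is deliberately conservative: volume shooters are not given credit for accuracy they have never shown, and nobody is projected above the league rate on a small sample.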
2. How has he converted these chances in the past?
Let's go back to Cisse for a second. He has 16 shots with 4 on target but has yet to register a goal. We've acknowledged that the outcome likely doesn't match the process if we took his data over a larger sample size, but how can we adjust it? In short we have two options:
- adjust player data based on league average conversion rates
- adjust player data based on their own individual historic rates
For better or worse, I think for players like Cisse we're left with no choice but to simply use a league average rate. We could consider having different rates for players with varying profiles, but then you get into a potential mess where we're applying judgements on whether Cisse (recently deployed out wide) is a wide forward or a true 'striker' and thus the whole system could get clouded.
To continue using Cisse as the subject, we'd get the below 'expected' goals:
- League average rate (table 1): 14 shots inside the box*20% + 2 shots outside the box*7% = 2.9 goals
- Cisse's individual on-target rate: 16 total shots*50% on target rate*39% (table 1) = 3.1 goals
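Here is a small sketch of the two calculations side by side, using only the inputs written in the bullets and the forwards column of table 1. The function names are my own, and applying the inside-the-box 39% rate to every projected shot on target is a simplification carried over from the bullet itself.

```python
# Forwards, last season (table 1).
FORWARD_SHOT_CONVERSION = {"inside": 0.20, "outside": 0.07}  # all shots
FORWARD_SOT_CONVERSION_INSIDE = 0.39  # shots on target, inside the box


def goals_from_shot_location(inside, outside, rates):
    """Option 1: league-average conversion of all shots, split by location."""
    return inside * rates["inside"] + outside * rates["outside"]


def goals_from_on_target_rate(total_shots, on_target_rate, sot_conversion):
    """Option 2: the player's own on-target rate, then a league-average
    conversion of the resulting shots on target."""
    return total_shots * on_target_rate * sot_conversion


# Cisse through seven weeks: 16 shots, 14 inside the box and 2 outside.
by_location = goals_from_shot_location(14, 2, FORWARD_SHOT_CONVERSION)
by_on_target = goals_from_on_target_rate(16, 0.50, FORWARD_SOT_CONVERSION_INSIDE)

print(f"league average by location: {by_location:.2f}")
print(f"individual on-target rate:  {by_on_target:.2f}")
```

Both routes put him well above his actual total of zero, which is the point of the exercise.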
The observant reader will note that, even when looking at Cisse's own individual on-target rate, I have still used the league average conversion rate to see how many goals he ultimately is forecast to score. I've settled on that approach because, in my research to date which I will repost soon, I've generally found the amount of control players have over that rate is fairly low. See also some great work here from James Grayson (h/t to 11tegen11 for the tip).
One of the landmark pieces of research in baseball asserted a similar fact about what happened after the ball left the bat: balls tended to land fair or be out at a fairly random rate for an individual player, but at an approximate constant for the league. Many didn't - and don't - believe the data to this day but it's been shown that year-on-year players can rank very highly and then very low in terms of getting the ball to land in the field of play and I believe a fuller investigation into shots on target will show a similar result (before any baseball fans jump in here, I understand BABIP is more complicated than that, but for simplicity's sake, I think that's a fair summary).
Now, kicking a ball is obviously not the same as hitting a ball, but there are stark similarities between the two events. First, many shots take place with very little thinking time, especially those played into the box. The skill to get these on target is undeniable, but the ability to 'place' them in the corner? Less convincing. Second, the positioning of the defense and particularly the goalkeeper is outside of an attacking player's control. This can be in the form of a great save in the top corner, but also from hard shots ricocheting off defenders' knees or poor headers looping over a diving keeper. Given that we're often only talking about ~100-140 shots and 10-15 goals in a season, these few anomalous and 'lucky' events can have a huge bearing on the outcome.
Until I see reason to change it, I will therefore use a league average conversion rate of shots on target into goals, splitting chances between those inside and outside the box.
Now we've established what a player has done and what he should have done to date on an individual basis, let's turn our attention to what his data means to his team and how this translates to future success.
3. Who has he faced?
In the past I have accounted for this simply based on goals scored/conceded, but given the reliance on shot data for individuals, shot data seems like the best path to take for teams too.
The question, yet again, becomes whether we should look at total shots, shots inside the box or shots on target. The answer really lies in a chicken-and-egg discussion about which does more to dictate the type of shots a team takes during a game: the attacking team's desire to take shots inside the box or the defending team's ability to force long-range efforts? That needs a whole other case study, so for now I'm going to crudely assume it's somewhere in the middle. We can generate an expectation of total shots, shots inside the box (and hence outside) as well as shots on target by looking at what, on average, a player/team has done against each opponent compared to the league average. For example, let's assume Southampton are playing West Ham at home this week. The calculation would look something like:
Note: those opponent averages are as of GW7 and are not backdated to when each game took place. Given the risk of small samples, I'm okay with this.
So, to date, Southampton are underperforming the league average by 8% in total shots (3% inside the box, 13% outside). This means that when forecasting games, we would reduce the average shots surrendered by their opponents by 3% for those inside the box and 13% for those outside. With the inside-the-box adjustment being so small, we can essentially conclude that Southampton are getting about as many shots inside the box as their opponents typically allow, at least at home.
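One way to read the adjustment is as a ratio between the shots a team actually took and the shots its opponents concede on average. The sketch below is my own interpretation rather than the exact spreadsheet logic, and the three games of input data are invented placeholders, not Southampton's real numbers.

```python
def shot_adjustment(shots_taken, opponents_avg_conceded):
    """Fractional over/under-performance versus what opponents usually allow.

    shots_taken: shots the team took in each game.
    opponents_avg_conceded: the season average each of those opponents
    concedes per game (taken as of today, not backdated).
    Returns e.g. -0.03 for a team running 3% below expectation.
    """
    if len(shots_taken) != len(opponents_avg_conceded):
        raise ValueError("need one opponent average per game")
    return sum(shots_taken) / sum(opponents_avg_conceded) - 1.0


# HYPOTHETICAL inputs for three home games (placeholders only).
inside_taken = [9, 12, 8]
inside_opponent_avg = [10.0, 11.0, 9.0]

adjustment = shot_adjustment(inside_taken, inside_opponent_avg)
print(f"inside-the-box adjustment: {adjustment:+.1%}")
```

The same function would be run separately for shots outside the box, which is how you end up with two different adjustments for the same team.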
We also need to think about how a team's success impacts an individual player. Previously I have somewhat crudely looked at the percentage of goals a player has 'accounted' for and then used a team's weekly forecast to estimate a player's own success. Instead of goals we can look at shots, but then it starts to get a touch tricky. Let's look at an example (through 7 weeks this season):
TOTAL SHOTS | Home | Away |
Lambert | 13 | 3 |
Southampton | 61 | 31 |
Lambert % | 21% | 10% |
INSIDE BOX | Home | Away |
Lambert | 8 | 3 |
Southampton | 35 | 17 |
Lambert % | 23% | 18% |
ON TARGET | Home | Away |
Lambert | 5 | 2 |
Southampton | 20 | 10 |
Lambert % | 25% | 20% |
Table 4 - Percentage of each shot type taken by Rickie Lambert to date.
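The percentages in table 4, plus the outside-the-box share that the table doesn't show directly, can be derived from the raw counts like so. This is a sketch with names of my own choosing; outside-the-box shots are taken as total shots minus shots inside the box.

```python
# Raw counts from table 4 (through 7 weeks).
lambert = {
    "home": {"total": 13, "inside": 8, "on_target": 5},
    "away": {"total": 3, "inside": 3, "on_target": 2},
}
southampton = {
    "home": {"total": 61, "inside": 35, "on_target": 20},
    "away": {"total": 31, "inside": 17, "on_target": 10},
}


def share(player_count, team_count):
    """Player's fraction of the team total, or 0.0 if the team has none."""
    return player_count / team_count if team_count else 0.0


def shot_shares(player, team):
    """Per-venue share of each shot type, including derived outside-box shots."""
    shares = {}
    for venue in player:
        p, t = player[venue], team[venue]
        shares[venue] = {
            "total": share(p["total"], t["total"]),
            "inside": share(p["inside"], t["inside"]),
            "outside": share(p["total"] - p["inside"], t["total"] - t["inside"]),
            "on_target": share(p["on_target"], t["on_target"]),
        }
    return shares


lambert_shares = shot_shares(lambert, southampton)
for venue, values in lambert_shares.items():
    summary = ", ".join(f"{name} {value:.0%}" for name, value in values.items())
    print(f"{venue}: {summary}")
```

The home and away rows differ enough that keeping them separate looks sensible, even if the away sample is tiny.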
What to do with this data is a dilemma. Should we use all three averages? Just look at shots on target? Create some sort of average? Based on Lambert alone we clearly need to differentiate between home and away data but after that it's less clear. I'd be open to suggestions here, but for ease, if nothing else, my plan is to look at the percentage of a team's shots inside/outside the box a player has and then use his own individual on-target rate (or, where unavailable, the league average) to determine how many will hit the target. We then apply this to who the individual player faces this week, or beyond . . .
The actual forecast
We can summarise the above points with an example for how we might forecast a player's totals for the upcoming week. Let's stick with Lambert as we have his data to hand, and we'll assume he's facing West Ham at home.
First, we work out how we think Southampton will fare in the game. To date, away from home, West Ham are surrendering 11 shots inside the box per game and 5.7 outside. Our adjustments from above (table 3) suggest that these totals should be slightly revised downwards, giving us forecast totals of 10.7 shots inside the box and 4.9 outside.
Of these, Lambert is forecast to have 23% of those shots inside the box, so 2.45, and 19% of those outside the box, so 0.9 (table 4).
At this stage, we could get involved in Lambert's individual conversion rates, but given that he's spent a good portion of his career to date knocking around the Second Division, he's going to get the league average rate. That means (from table 1), we're giving him 2.45 shots inside the box*20% + 0.9 outside the box*7% for a total of 0.6 forecast goals, which by the way is an excellent number (after all that would equate to 23 goals over a 38 game season).
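Putting the three steps together, the whole Lambert forecast fits in one short function. Treat this as a sketch: the function and argument names are mine, and because it works from the rounded figures quoted in the text, the intermediate values can differ slightly from those above.

```python
def forecast_goals(opp_inside, opp_outside, adj_inside, adj_outside,
                   share_inside, share_outside, conv_inside, conv_outside):
    """Forecast a player's goals for one game.

    1. Start from the shots the opponent usually concedes.
    2. Scale by the team's adjustment (e.g. -0.03 means 3% fewer).
    3. Give the player his share of the team's shots.
    4. Convert shots to goals at the league-average rate by location.
    """
    team_inside = opp_inside * (1 + adj_inside)
    team_outside = opp_outside * (1 + adj_outside)
    player_inside = team_inside * share_inside
    player_outside = team_outside * share_outside
    return player_inside * conv_inside + player_outside * conv_outside


lambert_goals = forecast_goals(
    opp_inside=11.0, opp_outside=5.7,      # West Ham away, shots conceded per game
    adj_inside=-0.03, adj_outside=-0.13,   # Southampton's home adjustments
    share_inside=0.23, share_outside=0.19, # Lambert's home shares
    conv_inside=0.20, conv_outside=0.07,   # forwards, all shots (table 1)
)

print(f"forecast goals this week: {lambert_goals:.2f}")
print(f"pace over 38 games:       {lambert_goals * 38:.0f}")
```

Swapping in a different opponent or a different player only means changing the arguments, which is the main appeal of laying the forecast out this way.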
So that's how the new player forecast data will work for goals, with a similar approach for assists which I will write up shortly. I realise that this isn't rocket science and probably isn't doing much more than a lot of you already do on your own, but I thought it was important to set out the starting point for a new model that will hopefully continue to develop over the season.
On that, I now pass it over to you for a while. How can the model be improved? What extra factors should we include? Should any of the above process be changed? Be gentle, and I look forward to reading your suggestions. Next I'll look at assists and then move on to tweaking some of the ratios we're going to use, such as the league averages, player historic rates etc. Oh, and congrats for getting through all that if you made it this far.