How to Interpret a KPI
Why am I posting this? (Interesting for soccer fans)
I watched the Man United vs. Bournemouth game today, and as a Man United fan I am finally enjoying watching soccer again. One thing about this game bothered me, though: next year, when VAR (video assistant referee) comes to the Premier League, Lukaku’s goal would have been ruled offside.
Let’s assume the number crunchers at any soccer stats database put in one goal and call it a day. That stat doesn’t represent the reality of the moment. The goal was offside. It shouldn’t have counted, and Romelu Lukaku shouldn’t be credited with a goal. Thus that KPI for Lukaku is now “inaccurate”. Whose fault is that? Is it the people at the soccer databases? How will the people analyzing this data correct for this fundamental mistake? Do we blame the referees? Do we even care about this mistake? Imagine a hypothetical player who only scored goals from offside positions, and then VAR came along: how would that change his measured ability?
Anyway, my point is this: KPIs need to track the reality of the situation if they are to be useful. Not only that, but it is the role of your data team to invent new and useful KPIs, ways in which the data tells us useful information about the real world. A simple number like goals is not sufficient to track player performance, because goals are a random variable, determined by many more important underlying factors.
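To make the "goals are a random variable" point concrete, here is a minimal sketch. Everything in it is an assumption for illustration, not real data: a hypothetical striker who takes 100 shots a season, each with a 12% chance of going in. His underlying quality never changes, yet his goal tally swings a lot from one simulated season to the next.

```python
import random
import statistics


def simulate_season(shot_probs, rng):
    """Goals in one season: each shot is an independent coin flip."""
    return sum(1 for p in shot_probs if rng.random() < p)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical striker (made-up numbers): 100 shots, 12% chance each.
shot_probs = [0.12] * 100
expected_goals = sum(shot_probs)

# Replay the same season many times with the same underlying quality.
seasons = [simulate_season(shot_probs, rng) for _ in range(10_000)]

print("expected goals per season:", round(expected_goals, 2))
print("average simulated goals:  ", round(statistics.mean(seasons), 2))
print("std dev of goals:         ", round(statistics.pstdev(seasons), 2))
print("worst / best season:      ", min(seasons), "/", max(seasons))
```

The spread between the worst and best season comes purely from luck, which is why a raw goal count on its own says less about the player than it seems to.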
Everyone raves about expected goals as the most useful statistic in soccer. But if you asked me what the most useful statistic for a soccer team is, I would say it is how much a given action increases the team’s probability of winning.
A brief aside: Manchester City played Southampton today, and the expected goals (or at least my guess at them) would have Man City as a clear favorite. That did not tell the story of the game, however. Zinchenko (the Manchester City player) was lucky to get away with a foul on Ward-Prowse when the game was tied at 1-1, and Southampton then instantly conceded a goal, perhaps because they were still thinking about the refereeing error. The fact is that the probability of Man City winning this game was not nearly as high as the expected goals analysis suggests, because goals are not independent of one another: had Southampton taken a 2-1 lead, the probability of Man City scoring would have gone down drastically, because Southampton would have fundamentally changed their play style.
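Here is a rough sketch of why that dependence matters. It is a toy simulation, not a real expected goals model, and every number in it is an assumption: a favorite worth 2.0 goals per 90 minutes, an underdog worth 0.8, and a made-up factor of 0.4 applied to the favorite's scoring rate whenever the underdog is ahead (the underdog sits back and defends). Setting that factor to 1.0 gives the usual "goals are independent of the score" world.

```python
import random


def favorite_win_rate(n_matches, fav_rate, dog_rate, fav_factor_when_trailing, seed=0):
    """Estimate how often the favorite wins a 90-minute match.

    Rates are expected goals per 90 minutes. While the underdog leads,
    the favorite's rate is multiplied by fav_factor_when_trailing
    (1.0 means scoring does not depend on the current score).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_matches):
        fav, dog = 0, 0
        for _minute in range(90):
            # The score at the start of the minute decides the favorite's rate.
            factor = fav_factor_when_trailing if dog > fav else 1.0
            if rng.random() < fav_rate * factor / 90:
                fav += 1
            if rng.random() < dog_rate / 90:
                dog += 1
        wins += fav > dog
    return wins / n_matches


# Hypothetical rates and factor, chosen only to illustrate the effect.
independent = favorite_win_rate(10_000, 2.0, 0.8, 1.0)
state_dependent = favorite_win_rate(10_000, 2.0, 0.8, 0.4)

print("favorite win rate, independent goals:    ", independent)
print("favorite win rate, score-dependent goals:", state_dependent)
```

Both runs use the same expected goals for each team, but the favorite wins less often once the score is allowed to change how the teams play. How big the real effect is would have to be estimated from match data; the 0.4 here is just a placeholder.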
How do you use this info as a data scientist?
You are given some number, data, or feature for your neural network, Bayesian statistic, regression, hidden feature, super duper, complex, genetic algorithm, latent Dirichlet allocation, clustering, principal component analysis, stochastic model. (Read: I don’t care what model you use.) Your model represents a small world of logic that the computer follows, and your computer is pretty damn good at that. But the small logic world of the computer is not reality or real-world logic. Reality has a lot of intricacies that are really tough to capture in a simple number or feature. So make sure you understand what your data and features actually say about the real world and where they might go wrong, because your model needs to account for the fact that your features are not accurate. Alternatively, transform the features so they more accurately represent the things you wish to capture.
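One small illustration of what an inaccurate feature does to a model, as a sketch with synthetic data (all numbers are made up). The target depends on a "true" feature with a slope of 2.0, but the model only gets to see a noisy recording of that feature. A plain least-squares fit on the noisy version underestimates the slope, and with equal variance in the signal and the measurement error the theory says it shrinks to about half.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares line fit of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Synthetic "reality": the target depends on the true feature with slope 2.0.
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * x + rng.gauss(0, 0.5) for x in true_x]

# What actually lands in the database: the true feature plus recording error.
observed_x = [x + rng.gauss(0, 1) for x in true_x]

slope_true = ols_slope(true_x, y)
slope_observed = ols_slope(observed_x, y)

print("slope using the true feature: ", round(slope_true, 3))
print("slope using the noisy feature:", round(slope_observed, 3))
```

The model fit on the noisy feature is perfectly consistent inside its own small world; it is just answering a different question than the one you care about. Knowing how the feature was recorded is what tells you to expect, and correct for, that gap.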