The normal distribution is fully parameterised by its mean $\mu$ and variance $\sigma^2$:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
With the (unbiased) estimators
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,$$
where $c_4$ is a bias-correction factor for the sample standard deviation, because $\mathbb{E}[s] = c_4\sigma \neq \sigma$. For large samples $c_4$ is close to one, so this factor is typically ignored, but exact and approximate values for the normal distribution can be found. [5][6][7]
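For the normal distribution the exact factor has the closed form $c_4(n) = \sqrt{2/(n-1)}\,\Gamma(n/2)/\Gamma((n-1)/2)$; a minimal sketch in R:

```r
# Exact bias-correction factor c4(n) for the normal distribution,
# computed via log-gamma for numerical stability.
c4 <- function(n) sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
c4(c(2, 10, 100))  # approximately 0.798, 0.973, 0.997; approaches 1
```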
The log-normal distribution is similar, with a change of variables $y = \ln x$:
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$$
We note that $\ln X$ is normally distributed, which we will make use of later, as the normal distribution is simpler to work with.
The multivariate form of the normal distribution, often abbreviated to MVN, is
$$f(\mathbf{x}) = (2\pi)^{-k/2} \det(\Sigma)^{-1/2} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
with the parameters
$$\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}], \qquad \Sigma = \mathbb{E}\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^{\mathsf{T}}\right].$$
$\Sigma$ is usually referred to as a covariance matrix.
We will consider a special parametrisation of the covariance matrix. The motivation is to remove restrictions on the entries. First of all, the covariance matrix is symmetric, so almost half of the entries are redundant when specifying the matrix. A more complex restriction is that it must also be positive definite, i.e. $\mathbf{x}^{\mathsf{T}}\Sigma\mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$, which in particular means its diagonal entries are strictly positive.
First we consider the relation between the covariance matrix and the correlation matrix:
$$\Sigma = D R D, \qquad D = \operatorname{diag}(\sigma_1, \dots, \sigma_k),$$
where $R$ is the correlation matrix, which has the nice property that its diagonal consists of units. We can find a lower-triangular square root $L$, such that $LL^{\mathsf{T}} = R$. However, $L$ does not have a unit diagonal; for this we multiply by the inverse of its diagonal to obtain $\tilde{L} = \operatorname{diag}(L)^{-1}L$, as shown in equation (12)
We let $\boldsymbol{\ell}$ be the vectorisation of the strictly lower triangular part of $\tilde{L}$, i.e. $\boldsymbol{\ell} = (\tilde{L}_{21}, \tilde{L}_{31}, \tilde{L}_{32}, \dots)$.
We may also consider the elementwise root-log transform of the diagonal of $\Sigma$, i.e. the log standard deviations:
$$\boldsymbol{\rho} = \log\sqrt{\operatorname{diag}(\Sigma)} = (\log\sigma_1, \dots, \log\sigma_k).$$
Thus, we may parametrise $\Sigma$ from the vector $(\boldsymbol{\rho}, \boldsymbol{\ell})$ of length $k + T_{k-1}$, where $T_{k-1} = k(k-1)/2$ is the $(k-1)$-th triangular number, by reversing the above steps. [8]
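As an illustration of reversing these steps, here is a minimal R sketch (function and variable names are our own) mapping an unconstrained vector to a valid covariance matrix; it uses row normalization of the triangular factor, a common variant of the diagonal normalization described above:

```r
# Map an unconstrained vector theta of length d + d(d-1)/2 to a d x d
# covariance matrix: first d entries are log standard deviations, the
# rest fill the strictly lower triangle of L.
vec_to_cov <- function(theta, d) {
  L <- diag(d)
  L[lower.tri(L)] <- theta[-(1:d)]
  L <- L / sqrt(rowSums(L^2))     # rows scaled so R has unit diagonal
  R <- L %*% t(L)                 # a valid correlation matrix
  D <- diag(exp(theta[1:d]), d)   # standard deviations
  D %*% R %*% D
}
vec_to_cov(c(0, 0.5, -0.3), d = 2)  # any real input gives a valid matrix
```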
In determining the outcome of a match, we are either right or wrong. If we have a rule determining the correct outcome of a match with probability $p$, we have a binary (Bernoulli) distribution.
Repeated determination of matches results in a binomial distribution, where we count the number of correctly guessed outcomes $k$, such that $k \sim \operatorname{Bin}(n, p)$.
The limiting distribution of a binomial variable as $n \to \infty$ can be approximated by a normal distribution. This theorem is sometimes referred to as the de Moivre–Laplace theorem, which is a special case of the central limit theorem.
Typically, this approximation is practical when $np(1-p)$ is sufficiently large; a common rule of thumb requires it to be greater than 10.
The estimator for $p$, $\hat{p}$, is simply the number of correct guesses over the total number of outcomes, $\hat{p} = k/n$. So the confidence interval, assuming normality, is of the form
$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$
We reiterate that this is unreliable for a small sample size, and other confidence intervals also exist. A list of other methods can be found in [9].
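As a worked example (the counts here are hypothetical), the interval in R:

```r
# Normal-approximation (Wald) confidence interval for a proportion.
k <- 130; n <- 240               # hypothetical: 130 correct of 240 matches
p_hat <- k / n
z <- qnorm(0.975)                # 95 % confidence
p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
```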
The Poisson distribution is defined as the number of events occurring in a fixed interval. It is described by the mass function in equation (17)
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots$$
A remarkable property of the Poisson distribution is the simple relation between the mean and variance:
$$\mathbb{E}[X] = \operatorname{Var}(X) = \lambda.$$
The difference between two independent Poisson distributed variables is distributed by the Skellam distribution. [10] The Skellam distribution is of the form shown in equation (19)
$$P(K = k) = e^{-(\mu_1+\mu_2)}\left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_{|k|}\!\left(2\sqrt{\mu_1\mu_2}\right),$$
where $I_{|k|}$ is the modified Bessel function of the first kind. The formula is complicated, but is shown below [11, page 375]
$$I_\nu(x) = \sum_{m=0}^{\infty}\frac{1}{m!\,\Gamma(m+\nu+1)}\left(\frac{x}{2}\right)^{2m+\nu}.$$
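Since base R exposes the modified Bessel function of the first kind as `besselI()`, the Skellam mass function can be sketched directly from equation (19):

```r
# Skellam probability mass function for the difference K = X1 - X2,
# where X1 ~ Pois(mu1) and X2 ~ Pois(mu2) are independent.
dskellam <- function(k, mu1, mu2) {
  exp(-(mu1 + mu2)) * (mu1 / mu2)^(k / 2) * besselI(2 * sqrt(mu1 * mu2), abs(k))
}
dskellam(0, 1.5, 1.2)  # probability of a tie for these (hypothetical) rates
```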
The VAR (vector autoregressive) series is the multivariate generalization of the one-dimensional AR series, which is a special case of the ARMA series. The one-dimensional first-order process is
$$x_t = c + a x_{t-1} + \varepsilon_t,$$
where $\varepsilon_t \sim N(0, q)$. [12, page 84] In our case, we consider this first-order process, AR(1). We will also be assuming that $c = 0$. The distribution is stable as long as $|a| < 1$.
The unconditional expectation and covariance are given by [13]
$$\mathbb{E}[x_t] = \frac{c}{1-a}, \qquad \operatorname{Var}(x_t) = \frac{q}{1-a^2},$$
and the unconditional distribution of $x_t$ is
$$x_t \sim N\!\left(\frac{c}{1-a},\; \frac{q}{1-a^2}\right).$$
The conditional expectation and covariance are given by
$$\mathbb{E}[x_t \mid x_{t-1}] = c + a x_{t-1}, \qquad \operatorname{Var}(x_t \mid x_{t-1}) = q,$$
and the conditional distribution of $x_t$ on $x_{t-1}$ is given by
$$x_t \mid x_{t-1} \sim N(c + a x_{t-1},\; q).$$
The multivariate version of the unconditional and conditional distribution is: [14]
$$\mathbf{x}_t \sim N\!\left((I - A)^{-1}\mathbf{c},\; \Sigma_\infty\right), \qquad \operatorname{vec}(\Sigma_\infty) = (I - A \otimes A)^{-1}\operatorname{vec}(Q),$$
$$\mathbf{x}_t \mid \mathbf{x}_{t-1} \sim N(\mathbf{c} + A\mathbf{x}_{t-1},\; Q).$$
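The vec/Kronecker identity above is straightforward to use in practice; a small sketch with made-up inputs:

```r
# Stationary covariance of a VAR(1) process x_t = A x_{t-1} + e_t,
# e_t ~ N(0, Q), solved from vec(S) = (I - A (x) A)^{-1} vec(Q).
stationary_cov <- function(A, Q) {
  k <- nrow(A)
  matrix(solve(diag(k^2) - kronecker(A, A), as.vector(Q)), k, k)
}
A <- matrix(c(0.5, 0.1, 0.0, 0.4), 2, 2)  # eigenvalues inside the unit circle
Q <- diag(2)
S <- stationary_cov(A, Q)
max(abs(A %*% S %*% t(A) + Q - S))  # verifies S = A S A' + Q, near zero
```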
We will later be using equation (28) and equation (29), with zero-shifted mean.
The continuous version is often attributed to Ornstein, Uhlenbeck and Vašíček. [15][16] The differential form and the formal solution are shown in equations (30–31).
$$d\mathbf{x}_t = \Theta(\boldsymbol{\mu} - \mathbf{x}_t)\,dt + \Sigma\,dW_t,$$
$$\mathbf{x}_t = e^{-\Theta t}\mathbf{x}_0 + (I - e^{-\Theta t})\boldsymbol{\mu} + \int_0^t e^{-\Theta(t-s)}\Sigma\,dW_s,$$
where the eigenvalues of $\Theta$ should have strictly positive real part for the process to be stable. [17, page 11] The relation between these two forms can be found in the appendix, in equations (120–130)
The unconditional expectation and covariance are given by [18]
$$\mathbb{E}[x_t] = \mu, \qquad \operatorname{Var}(x_t) = \frac{\sigma^2}{2\theta}.$$
We will be assuming $\mu = 0$, but we state the general forms for completeness.
and the unconditional distribution of $x_t$ is [19]
$$x_t \sim N\!\left(\mu,\; \frac{\sigma^2}{2\theta}\right).$$
The conditional expectation and covariance are given by [20]
$$\mathbb{E}[x_t \mid x_s] = \mu + e^{-\theta(t-s)}(x_s - \mu), \qquad \operatorname{Var}(x_t \mid x_s) = \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta(t-s)}\right),$$
and the conditional distribution of $x_t$ on $x_s$ is given by [21, page 11]
$$x_t \mid x_s \sim N\!\left(\mu + e^{-\theta(t-s)}(x_s - \mu),\; \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta(t-s)}\right)\right).$$
The multivariate version of the unconditional and conditional distribution is: [22]
$$\mathbf{x}_t \sim N(\boldsymbol{\mu},\; \Sigma_\infty), \qquad \operatorname{vec}(\Sigma_\infty) = (\Theta \oplus \Theta)^{-1}\operatorname{vec}(\Sigma\Sigma^{\mathsf{T}}),$$
$$\mathbf{x}_t \mid \mathbf{x}_s \sim N\!\left(\boldsymbol{\mu} + e^{-\Theta(t-s)}(\mathbf{x}_s - \boldsymbol{\mu}),\; \Sigma_\infty - e^{-\Theta(t-s)}\Sigma_\infty e^{-\Theta^{\mathsf{T}}(t-s)}\right).$$
As with the discrete process, we will later be using equation (38) and equation (39), with zero-shifted mean.
The relation to the AR(1) model is not immediately obvious, but equality can be shown with the following substitutions: [23][24, page 8]
$$a = e^{-\theta\Delta t}, \qquad q = \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta\Delta t}\right),$$
so the following two equations are equivalent
$$x_{t+\Delta t} = a x_t + \varepsilon_t,\;\; \varepsilon_t \sim N(0, q) \qquad\text{and}\qquad x_{t+\Delta t} \mid x_t \sim N\!\left(e^{-\theta\Delta t}x_t,\; \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta\Delta t}\right)\right).$$
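This equivalence gives a direct way to simulate the OU process exactly; a sketch with arbitrary example parameters:

```r
# Exact simulation of a one-dimensional OU process via its AR(1) form.
theta <- 2; sigma <- 1; dt <- 0.1; n <- 10000
a <- exp(-theta * dt)                   # AR coefficient
q <- sigma^2 / (2 * theta) * (1 - a^2)  # innovation variance
x <- numeric(n)
for (t in 2:n) x[t] <- a * x[t - 1] + rnorm(1, sd = sqrt(q))
var(x)  # close to the stationary variance sigma^2 / (2 * theta) = 0.25
```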
In some cases it may be useful to transform the variables, such as $y = g(x)$, because we may have more tools available for the transformed density, for example log-transforming a log-normal variable to obtain a normal variable.
To be precise, this is only valid as long as $g$ is a strictly increasing function. For the decreasing case we add a minus sign to either side; the two cases can be unified by applying the absolute value to both sides. [25]
and the relation between $f_Y$ and $f_X$ is given by
$$f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{d}{dy}g^{-1}(y)\right|.$$
If we let $y = g(x) = \ln x$, then
$$f_Y(y) = f_X(e^y)\,e^y.$$
And for the opposite case, $y = g(x) = e^x$, then
$$f_Y(y) = f_X(\ln y)\,\frac{1}{y}.$$
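A quick numerical check of the last identity, using the standard normal and log-normal densities built into R:

```r
# The density of Y = exp(X) for X ~ N(0, 1) should match the log-normal
# density: f_Y(y) = f_X(log y) / y.
y <- 1.7
c(by_formula = dnorm(log(y)) / y, builtin = dlnorm(y))  # identical values
```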
The most general form of the likelihood function is defined as a mass/density function of a parameter given an outcome
$$\mathcal{L}(\theta \mid x) = f_\theta(x).$$
In many cases, $x$ is a tuple of i.i.d. variables, and $f$ is a joint probability distribution of independent variables, and can be written as a product. So if $x$ is a vector of size $n$, we have:
$$\mathcal{L}(\theta \mid x) = \prod_{i=1}^{n} f_\theta(x_i).$$
The likelihood function represents the probability of obtaining $x$ for a given $\theta$. A reasonable assumption is then that $x$ is realized from a distribution where it has a high likelihood of being observed. So the $\theta$ that yields the highest likelihood is a natural candidate for determining the distribution of $x$. We seek to determine
$$\hat{\theta} = \operatorname*{arg\,max}_{\theta}\, \mathcal{L}(\theta \mid x).$$
Products are often cumbersome, and to simplify the above computation, we can apply the logarithm to $\mathcal{L}$ to obtain a sum. Because the logarithm is a strictly increasing function, the maximum of $\log\mathcal{L}$ is also the maximum of $\mathcal{L}$. We denote this logarithm by $\ell$:
$$\ell(\theta \mid x) = \log\mathcal{L}(\theta \mid x) = \sum_{i=1}^{n} \log f_\theta(x_i).$$
From which we conclude that $x$ most likely derives from the distribution $f_{\hat{\theta}}$.
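A minimal numerical MLE sketch (the data are made up), using a Poisson model where the closed-form answer is the sample mean:

```r
# Maximum likelihood for a Poisson rate via the log-likelihood;
# optimizing over log(lambda) keeps the rate positive.
x <- c(2, 1, 0, 3, 1, 2)
nll <- function(log_lambda) -sum(dpois(x, exp(log_lambda), log = TRUE))
opt <- optimize(nll, c(-5, 5))
exp(opt$minimum)  # equals mean(x) = 1.5, the closed-form MLE
```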
Bayesian hierarchical modelling is a way to describe a hierarchy of distributions, where the parameters of the upper layers are dependent on the distributions of the lower layers.
The simplest such model is the two-stage hierarchical model, as shown in equations (56–58)
$$x \mid \theta \sim f(x \mid \theta), \qquad \theta \mid \phi \sim \pi(\theta \mid \phi),$$
where $x$ is the observed data, $\theta$ is a parameter, and $\phi$ is a hyperparameter. $\pi(\theta \mid \phi)$ is the prior distribution. In Bayesian hierarchical modelling the hyperparameter is given a hyperprior $\pi(\phi)$, a distribution on the hyperparameter itself.
In empirical Bayes the hyperparameter is instead a fixed value. The estimate of this hyperparameter will be found using MLE.
The posterior theorem, or Bayes' theorem, is shown below in equation (59)
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{f(x)}.$$
The distribution for $x$ can then be found by marginalization over the parameters, or random effects,
$$f(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta.$$
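As a minimal sketch of the empirical Bayes idea (the names and data are our own): a Poisson observation with a normal prior on the log-rate, where the hyperparameters are chosen by maximizing the marginal likelihood:

```r
# Marginal log-likelihood: the random effect u (the log-rate) is
# integrated out numerically; TMB does this with the Laplace method.
marg_loglik <- function(hyper, x) {
  mu <- hyper[1]; sd <- exp(hyper[2])
  sum(sapply(x, function(xi)
    log(integrate(function(u) dpois(xi, exp(u)) * dnorm(u, mu, sd),
                  -10, 10)$value)))
}
x <- c(0, 1, 1, 2, 3, 1)                            # hypothetical goal counts
opt <- optim(c(0, 0), function(h) -marg_loglik(h, x))
exp(opt$par[1])                                     # estimated median rate
```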
The Bradley–Terry model [26] is used to make paired comparisons of individuals in a transitive way, using a single parameter per individual:
$$P(i \text{ beats } j) = \frac{p_i}{p_i + p_j} = \sigma(\beta_i - \beta_j),$$
where $\beta_i = \log p_i$ and $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function (a sigmoid function). The relation between the logit function and the logistic function is that they are inverses, i.e. $\operatorname{logit}(\sigma(x)) = x$, so we also have the identity:
$$\operatorname{logit}\bigl(P(i \text{ beats } j)\bigr) = \beta_i - \beta_j.$$
An example where this model is used is in Elo ranking, most commonly known from chess. A player's rating is given by a number $R$, which is related to $\beta$ by $\beta = R\ln(10)/400$. So a player with rating $R_A$ playing against a player with rating $R_B$ will yield the following probabilities:
$$P(A \text{ beats } B) = \frac{1}{1 + 10^{(R_B - R_A)/400}}.$$
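A one-line R version of this probability, with arbitrary example ratings:

```r
# Elo win probability implied by the Bradley-Terry identity above.
elo_win_prob <- function(r_a, r_b) 1 / (1 + 10^((r_b - r_a) / 400))
elo_win_prob(1600, 1400)  # about 0.76 for a 200-point favourite
```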
A win or a loss updates the Elo rating of each player, but the Bradley–Terry model has no mechanism for updating. Nor does it tell us how to initially calculate ratings, which must be found with inference.
We also note that the Bradley–Terry model does not handle ties, only binary win/loss outcomes.
The winner is the team with the highest score in a match. If the scores are Poisson distributed, we can determine the result by checking the value of the difference $D = X_1 - X_2$, and checking whether it is strictly positive, zero or strictly negative.
By defining the best team as the one with the highest expected score, we could use this to solve for a ranking score for all teams. Instead we will use the point system.
Template Model Builder (TMB) is an R/C++ library which greatly simplifies the estimation of hierarchical models. [27][28][29] It makes use of three key concepts: automatic differentiation, the Laplace method, and the (generalized) delta method.
Understanding these concepts is not necessary to make use of the program, but it can be helpful.
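For orientation, the typical TMB workflow from the R side looks roughly as follows (the model file and parameter names are placeholders):

```r
library(TMB)
compile("model.cpp")           # C++ template defining the likelihood
dyn.load(dynlib("model"))
obj <- MakeADFun(data = data_list, parameters = par_list,
                 random = "u",  # random effects, integrated out by Laplace
                 DLL = "model")
opt <- nlminb(obj$par, obj$fn, obj$gr)  # maximize the marginal likelihood
rep <- sdreport(obj)                    # delta-method standard errors
summary(rep)
```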
Dual numbers are 2-dimensional vectors with the unit vectors denoted by $1$ and $\epsilon$, written $a + b\epsilon$, with the added property that the square of the second component is identified with zero: $\epsilon^2 = 0$. [30, page 41–43]
This has applications for differentiation, where it can be used to find derivatives without differentiating. We define two dual vectors $x = a + b\epsilon$ and $y = c + d\epsilon$.
$$x \pm y = (a \pm c) + (b \pm d)\epsilon, \qquad xy = ac + (ad + bc)\epsilon, \qquad \frac{x}{y} = \frac{a}{c} + \frac{bc - ad}{c^2}\epsilon,$$
$$f(x) = f(a) + f'(a)\,b\epsilon.$$
We note the similarity between the second component and the rules from differentiation. While the first four rules are trivial, the last requires an explanation. This is found by the tangent expansion of the function:
$$f(a + b\epsilon) = f(a) + f'(a)\,b\epsilon + \frac{f''(a)}{2}(b\epsilon)^2 + \dots = f(a) + f'(a)\,b\epsilon,$$
since $\epsilon^2 = 0$ removes all higher-order terms.
The chain rule is also applicable:
$$f(g(a + b\epsilon)) = f(g(a)) + f'(g(a))\,g'(a)\,b\epsilon.$$
By mapping a variable to $x + 1\epsilon$ and a constant to $c + 0\epsilon$, any function can be differentiated by applying the chain rule until sufficiently elementary functions can be computed in order.
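A toy forward-mode implementation in R (our own illustration, not TMB's machinery), representing a dual number as a (value, derivative) pair:

```r
# Each operation propagates the derivative in the second component.
dual <- function(a, b) c(val = a, der = b)
dmul <- function(x, y) dual(x[1] * y[1], x[1] * y[2] + x[2] * y[1])
dsin <- function(x) dual(sin(x[1]), cos(x[1]) * x[2])

x <- dual(2, 1)    # a variable is seeded with derivative 1
dmul(x, dsin(x))   # derivative component: sin(2) + 2 * cos(2)
```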
For a thorough introduction to automatic differentiation, we refer to the paper on the Stan math library. [31] However, this is only useful for understanding the underlying math of TMB, not for using TMB, so it can safely be ignored.
The second tool is the Laplace method, which helps us approximate the marginal distribution and estimate the mean of the fixed parameters and random effects.
The Laplace method is based on the tangent series and a quadratic-exponential integral identity:
$$\int_{-\infty}^{\infty} e^{-a(x-b)^2}\,dx = \sqrt{\frac{\pi}{a}}.$$
We also assume that $f$ is twice differentiable and that the integral of $e^{Mf(x)}$ converges (or that the function decays sufficiently fast away from its peak). And last, we assume that it achieves its peak at $x_0$, i.e. $f'(x_0) = 0$ and $f''(x_0) < 0$.
The proof then follows by expanding $f$ around $x_0$:
$$\int e^{Mf(x)}\,dx \approx \int e^{M\left(f(x_0) + \frac{1}{2}f''(x_0)(x - x_0)^2\right)}\,dx = e^{Mf(x_0)}\sqrt{\frac{2\pi}{M\,|f''(x_0)|}}.$$
We will use the special case of $M = 1$, thus
$$\int e^{f(x)}\,dx \approx e^{f(x_0)}\sqrt{\frac{2\pi}{|f''(x_0)|}}.$$
The multivariate form is slightly different. [32, equation 4] TMB uses this to approximate the posterior distribution.
$$\int e^{f(\mathbf{x})}\,d\mathbf{x} \approx e^{f(\mathbf{x}_0)}\sqrt{\frac{(2\pi)^k}{\det\left(-H(\mathbf{x}_0)\right)}},$$
where $H$ is the Hessian of $f$ at the mode $\mathbf{x}_0$.
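A quick numerical sanity check of the one-dimensional case, on an arbitrary test function of our choosing:

```r
# Compare the Laplace approximation against numerical integration.
f <- function(x) -x^2 / 2 + sin(x)
x0 <- optimize(f, c(-5, 5), maximum = TRUE)$maximum
h <- (f(x0 + 1e-4) - 2 * f(x0) + f(x0 - 1e-4)) / 1e-8   # f''(x0), negative
laplace <- exp(f(x0)) * sqrt(2 * pi / abs(h))
exact <- integrate(function(x) exp(f(x)), -Inf, Inf)$value
c(laplace = laplace, exact = exact)  # the two values are close
```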
A related result is the posterior central limit theorem. We leave out the details and the assumptions, but state the result, often attributed to Bernstein and von Mises: [33][34]
$$\theta \mid x \;\approx\; N\!\left(\hat{\theta},\; J(\hat{\theta})^{-1}\right),$$
where $J$ is the observed information (i.e. $J = nI$, where $I$ is called the information) and $n$ is the number of observations.
The third tool is the (generalized) delta method, [35, page 240–243] which helps us estimate the standard deviation of the fixed parameters and random effects.
The regular delta method states that if there exists a sequence of random variables $X_n$ such that
$$\sqrt{n}\,(X_n - \mu) \xrightarrow{d} N(0, \sigma^2),$$
where $\xrightarrow{d}$ denotes convergence in distribution, then for a differentiable function $g$ with $g'(\mu) \neq 0$,
$$\sqrt{n}\,\bigl(g(X_n) - g(\mu)\bigr) \xrightarrow{d} N\!\left(0,\; \sigma^2 g'(\mu)^2\right).$$
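A small simulation sketch of the statement, with $g = \exp$ and made-up parameters:

```r
# Compare the simulated variance of g(X) = exp(X) against the
# first-order delta-method approximation g'(mu)^2 * Var(X).
set.seed(1)
mu <- 1; sigma <- 0.5; n <- 100
x <- rnorm(1e5, mu, sigma / sqrt(n))       # distribution of the estimator
c(simulated = var(exp(x)),
  delta     = exp(mu)^2 * sigma^2 / n)     # the two agree closely
```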
The method used in TMB is a variant that approximates the distribution using the Laplace method.
If the posterior distribution of $\theta$ is asymptotically normal with mode $\hat{\theta}$, then
$$\theta \mid x \;\approx\; N\!\left(\hat{\theta},\; \tfrac{1}{n}\Sigma\right),$$
where $n$ is the number of observations and $\Sigma$ is the inverse of the negative Hessian of the log posterior, $\Sigma = \left(-\nabla^2\log\pi(\hat{\theta} \mid x)\right)^{-1}$. Instead of the posterior, one may substitute the product $f(x \mid \theta)\,\pi(\theta)$, as the remaining factor $f(x)$ is constant (and thus its derivative zero). [36]
In the more general form, where $g$ also depends on the random effects $u$, TMB uses a more general estimate, involving $H_{uu}$, the random-effects part of the Hessian of the objective function, and $J_u$, the Jacobian of $g$ with respect to $u$. [37]
TMB does have some quirks and unexpected behaviour. We list some of those here:
These are practically non-issues with simple workarounds, but one should be aware of them. TMB is mature enough to have practical applications, as we demonstrate.
The model we will be using is a two-stage empirical Poisson-log-prior hierarchical distribution. Read that twice. We will be using different distributions on the priors, but they will look similar.
The hierarchical model is shown below
Before we describe the priors for the different models, it's important to note the strengths and weaknesses of this model, so we can set realistic expectations of the model performance.
Some of these issues may be resolved by modifying the model, but with a limited data set, and no way to produce more, it's important to keep the model simple to prevent too much overfitting.
Now, for the priors, we make the assumption that these are either discretely or continuously distributed.
The simplest model is letting the team parameters be constant throughout the season, i.e. the parameters are drawn once, $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$, and kept fixed for all rounds. We will not study this model in detail, but we mention it.
In the discrete case we can write the general function
$$\mathbf{x}_{t+1} = A\mathbf{x}_t + \boldsymbol{\varepsilon}_t, \qquad \boldsymbol{\varepsilon}_t \sim N(\mathbf{0}, Q),$$
where the absolute values of the eigenvalues of $A$ are smaller than or equal to one. In the initial (unconditional) case, we have
$$\mathbf{x}_0 \sim N(\mathbf{0}, \Sigma_\infty),$$
where $\Sigma_\infty$ is the stationary covariance, as in equation (28).
We consider three cases for conditions of $A$: $A = 0$ (white noise), a general stable $A$ (vector autoregressive), and $A = I$ (random walk).
In the continuous case we can write the general function
$$d\mathbf{x}_t = -\Theta\mathbf{x}_t\,dt + \Sigma\,dW_t,$$
where the eigenvalues of $\Theta$ should have strictly positive real part for the process to be stable. In the initial (unconditional) case, we have
$$\mathbf{x}_0 \sim N(\mathbf{0}, \Sigma_\infty),$$
where the random integral has a normal distribution, with $\Sigma_\infty$ as in equation (38).
We consider two cases for conditions of $\Theta$: a general stable $\Theta$ (vector autoregressive) and the limiting case $\Theta = 0$ (random walk).
While the model itself describes the distribution of the score of a single team, that alone won't help us decide the match winner. For each match we sample from two Poisson distributions, or from one Skellam distribution.
As usual, the winner is the team with the highest score. If the score of each team is drawn from two Poisson distributions, we get $X_1 \sim \operatorname{Pois}(\lambda_1)$ and $X_2 \sim \operatorname{Pois}(\lambda_2)$. It is then a simple matter of comparing the two, i.e. checking which of $X_1 > X_2$, $X_1 = X_2$ or $X_1 < X_2$ holds.
The equivalent condition in terms of the Skellam distribution is to define $D = X_1 - X_2 \sim \operatorname{Skellam}(\lambda_1, \lambda_2)$ and check the sign of $D$. So either of these may be used to determine the winner.
Winning a match gives the winning team three points, and the loser none. A tie gives each team one point. This system is known as three points for a win, and is common in football.
Each team plays against every other team twice, in a double round robin system. If there are $n$ teams, then each team plays $2(n-1)$ matches. The score for each match is added up to a final score, from which the overall seasonal winner is determined.
After setting up the model, we want to find the optimal parameters
$$\hat{\theta} = \operatorname*{arg\,max}_{\theta}\, \mathcal{L}(\theta \mid x).$$
In the independent case the product of the priors reduces to a product of univariate densities.
With the accompanying log-likelihood:
By using TMB to apply the Laplace approximation to the log-likelihood of the model, we obtain a function of the hyperparameters (the discrete or, in the continuous case, the continuous model parameters), which we can maximize. By maximization we obtain $\hat{\theta}$, the arguments maximizing the likelihood, and $\hat{u}$, the mode of the posterior, which we use to find the expected value of $\lambda$ for each team.
The average of the score parameters for the matches, given time-independent parameters, is given by the exponential of the averaged underlying parameters. We remark that $\mathbb{E}[e^X] \neq e^{\mathbb{E}[X]}$, so this is only an approximation, nor does it account for dependencies.
This value is not very interesting, as it doesn't help us determine the better team; all $\lambda$-values are equal here. To get comparable $\lambda$s, we instead look at the team-specific parameters. They don't have any nice-looking expressions, so we instead use numerical methods to approximate the mean and the variance.
Underfitting means to have a model that isn't sufficiently complex to model the target, i.e. the assumed model is too simple to accurately describe the target.
Overfitting is the opposite: having a model too complex for the target. This often tends to model the noise instead of the underlying distribution.
In both cases the fitted models fail to predict new data points. The aim in model selection is to find an optimal model that is neither too strict nor too flexible.
In figure 6 we have an example of a target distribution that is modelled by three different polynomials of different orders.
Visually, we can see that the dashed blue model is "best". To find this mathematically we use information criterion values, such as AIC.
AIC is a goodness-of-fit value, giving a lower value for better models. [39] A related value, which we call AIC$^*$, is shown in the below equation:
$$\mathrm{AIC}^* = \ell(\hat{\theta}) - k,$$
where $\ell(\hat{\theta})$ is the maximized log-likelihood and $k$ is the number of parameters.
For small samples, one may subtract a correction term to obtain the corrected criterion $\mathrm{AIC}^*_c$, but as this is approximately zero for large samples (assuming $k$ is small relative to $n$), we may ignore it.
The theoretical derivation of AIC includes the quantity $-2\ell$, known as the deviance. So the usual definition of AIC is $\mathrm{AIC} = 2k - 2\ell(\hat{\theta})$. Because the right-hand side of the expression becomes easier to interpret, and it doesn't affect the ranking of the models, we omit the factor $-2$.
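In code the criteria are one-liners; a sketch with assumed argument names, where the small-sample correction follows the usual AICc term rescaled to match AIC$^*$:

```r
# logL: maximized log-likelihood, k: number of parameters, n: sample size.
aic       <- function(logL, k) 2 * k - 2 * logL          # lower is better
aic_star  <- function(logL, k) logL - k                  # higher is better
aic_starc <- function(logL, k, n) logL - k - k * (k + 1) / (n - k - 1)
```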
For the model in figure 6, we have information criteria for the three fitted models, and for five other polynomial models, in figure 7.
Using the $\mathrm{AIC}^*$, we would correctly select the model with the same order as the target, but the coefficients would be different. The $\mathrm{AIC}^*_c$ would choose a fourth-order curve, which is also close to the true order. Simply relying on the likelihood would have selected an overfitted model; this is the reason for introducing penalizing terms.
The usefulness of AIC comes from its simple assumptions. It doesn't assume anything about the model; as long as we know the likelihood and the number of parameters, we can calculate the criterion value.
Using the normal posterior distribution for the attack and defence parameters, we can simulate new values by drawing from the distributions. This can be used to numerically estimate the mean and deviations of the $\lambda$s, and the scores.
From repeated sampling of a season's result, we may obtain a distribution for a team's score for the given attack/defence parameters.
While not unexpected, home teams usually have an advantage in matches, often referred to as the home advantage. We model this by a dedicated parameter in our models. Statistically, the home team wins roughly 45 % of the matches in football. [40] In comparison, the home team won 47 % of matches in Eliteserien 2019, [41] so without any knowledge about the teams, a good strategy would be to just bet on the home team. This will be correct almost half the time.
The match result is either a win, tie or a loss.
Using the estimated $\lambda$ values for each team in a given match, we can calculate the loss probability $P(D < 0)$, the tie probability $P(D = 0)$, and the win probability $P(D > 0)$, using the Skellam distribution. We then select the most likely result as our guess.
This method has a flaw in that $P(D = 0)$ will always be smaller than either $P(D > 0)$ or $P(D < 0)$, as long as $P(D = 0) < 1/3$, so it will never guess a tie. The proof is simple: the three probabilities sum to one, and $1/3$ is the maximum the smallest of three probabilities can attain. I.e. increasing either $P(D > 0)$ or $P(D < 0)$ will make this value smaller. So either $P(D > 0)$ or $P(D < 0)$ must be greater than $P(D = 0)$, and be our guess.
However, around a third of matches result in a tie, [42] but figure 2 shows the average of $P(D = 0)$ to be below this threshold, so unless the model is really accurate and can produce tie probabilities above $1/3$, this method will be wrong approximately one third of the time.
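Putting the rule into code, reusing the `dskellam()` sketch from earlier (the rates are hypothetical):

```r
# Win/tie/loss probabilities for a match from the Skellam distribution.
result_probs <- function(lambda_home, lambda_away, kmax = 50) {
  k <- (-kmax):kmax
  p <- dskellam(k, lambda_home, lambda_away)
  c(loss = sum(p[k < 0]), tie = p[k == 0], win = sum(p[k > 0]))
}
result_probs(1.6, 1.1)
names(which.max(result_probs(1.6, 1.1)))  # "win": a tie is never the guess
```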
The match score is a pair of numbers, e.g. $(2, 1)$ or $(0, 0)$.
To make tie guesses more likely, we may want to use the mode instead. The mode represents the most likely score outcome, so instead of looking at what's most likely of a win, tie or loss, we look at each score outcome individually.
This makes sense because we fit the model to the scores, not the results of the matches. However, it will more heavily favour ties, even when the winning results combined would be more likely.
Another method that tries to correct the win bias is to apply weights to the win, tie and loss probabilities. We then select the result with the highest weighted probability.
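The weighted rule is then a one-line change on top of `result_probs()`; the weights below are placeholders, not the fitted ones:

```r
# Scale each probability before picking the most likely result.
w <- c(loss = 1.5, tie = 1.3, win = 1.0)     # hypothetical weights
p <- result_probs(1.6, 1.1)
names(which.max(w * p))                      # the weighted guess
```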
The resulting scores from 2019 are taken from 43. The scores are displayed in table 1.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 Bodø/Glimt | 1-2 | 2-2 | 3-0 | 4-0 | 0-0 | 3-2 | 3-0 | 5-1 | 2-0 | 1-1 | 3-3 | 2-0 | 4-0 | 2-1 | 4-0 | |
2 Brann | 1-1 | 0-0 | 2-1 | 1-0 | 0-0 | 0-0 | 1-0 | 0-1 | 0-1 | 2-1 | 2-1 | 1-1 | 2-3 | 1-5 | 1-1 | |
3 Haugesund | 1-1 | 1-1 | 0-0 | 0-2 | 0-0 | 1-2 | 4-1 | 0-1 | 2-1 | 1-1 | 3-0 | 2-2 | 5-1 | 1-0 | 1-4 | |
4 Kristiansund | 1-2 | 1-0 | 2-2 | 5-2 | 4-0 | 3-2 | 1-1 | 0-0 | 2-2 | 4-0 | 0-1 | 1-2 | 1-0 | 4-2 | 2-0 | |
5 Lillestrøm | 0-0 | 1-3 | 1-0 | 1-1 | 3-2 | 0-2 | 0-3 | 2-1 | 1-1 | 0-0 | 1-3 | 2-1 | 4-0 | 0-2 | 0-0 | |
6 Mjøndalen | 4-5 | 2-1 | 1-4 | 1-1 | 2-2 | 1-3 | 2-0 | 3-1 | 1-2 | 0-0 | 1-0 | 1-1 | 1-1 | 1-1 | 1-0 | |
7 Molde | 4-2 | 1-1 | 3-1 | 2-0 | 2-1 | 1-0 | 2-2 | 2-0 | 3-0 | 2-1 | 3-0 | 4-0 | 3-0 | 5-1 | 4-1 | |
8 Odd | 3-1 | 3-2 | 3-1 | 2-0 | 2-1 | 3-2 | 2-2 | 1-0 | 1-1 | 3-0 | 2-1 | 2-1 | 2-1 | 1-0 | 1-1 | |
9 Ranheim TF | 1-1 | 0-3 | 0-2 | 1-2 | 2-1 | 1-1 | 2-3 | 4-1 | 2-3 | 0-2 | 0-2 | 1-0 | 1-2 | 5-2 | 1-5 | |
10 Rosenborg | 3-2 | 0-0 | 0-2 | 1-0 | 3-1 | 3-2 | 3-1 | 1-1 | 3-2 | 1-0 | 3-2 | 0-0 | 5-2 | 5-1 | 3-0 | |
11 Sarpsborg 08 | 1-1 | 1-1 | 1-1 | 0-1 | 1-0 | 1-1 | 1-1 | 2-0 | 1-3 | 1-1 | 0-0 | 2-2 | 3-2 | 2-2 | 1-0 | |
12 Stabæk | 2-0 | 0-1 | 1-1 | 2-0 | 1-1 | 4-2 | 1-2 | 0-0 | 0-0 | 3-1 | 3-3 | 2-1 | 0-1 | 0-0 | 1-1 | |
13 Strømsgodset | 1-3 | 6-0 | 3-2 | 2-3 | 1-1 | 2-3 | 0-4 | 2-3 | 1-0 | 3-3 | 2-1 | 0-2 | 3-1 | 0-0 | 3-2 | |
14 Tromsø | 1-2 | 1-2 | 2-2 | 5-0 | 1-1 | 2-2 | 2-1 | 1-2 | 4-2 | 1-0 | 2-0 | 1-1 | 0-1 | 0-2 | 0-0 | |
15 Viking | 3-4 | 2-1 | 0-0 | 2-0 | 3-0 | 4-1 | 0-2 | 2-0 | 2-2 | 2-2 | 2-1 | 3-0 | 4-0 | 2-1 | 1-1 | |
16 Vålerenga | 6-0 | 1-0 | 1-2 | 1-1 | 0-3 | 2-0 | 2-4 | 1-0 | 1-1 | 1-1 | 1-1 | 0-2 | 2-0 | 4-1 | 0-4 |
After using the data, a minor inconsistency was found: rounds are not in order, so the number of games each team has played up until a match may differ. This is due to scheduling issues, so "round 2" may be moved to after "round 12", but the round numbering is not renamed to account for this.
We first look at the estimated results for the rankings. These are summary statistics of the predictions of each match, which we will go through next.
The estimated ranks are shown below. The first table is the time-independent model, shown in figure 8. The next three are the discrete models, shown in figure 9, figure 10 and figure 11. The last two are the continuous models, shown in figure 12 and figure 13.
We note that most estimated medians are drawn towards around 45 points, away from the extremes.
In table 2 we see the estimated parameter values for each model. In table 3 we have the AIC values for each model. By this measure, we see that the discrete random walk model is the better one.
T.I. | Model | Parameter | Estimate | Standard error |
---|---|---|---|---|
Noise | ||||
Discrete | Model | Parameter | Estimate | Standard error |
White noise | ||||
Vector autoregressive | ||||
Random walk | ||||
Continuous | Model | Parameter | Estimate | Standard error |
Vector autoregressive | ||||
Random walk | ||||
T.I. | Model | |
---|---|---|
Noise | ||
Discrete | Model | |
White noise | ||
Vector autoregressive | ||
Random walk | ||
Cont. | Model | |
Vector autoregressive | ||
Random walk |
Using the discrete random walk model, we can obtain the expected final standings, as shown in table 4. Only the top three teams were correctly predicted; however, most other scores weren't statistically different. With three teams at 40 points and four teams at 30 points, discrepancies ought to be expected.
Rank | 2019 result | TMB RW prediction |
---|---|---|
1 | Molde (68) | Molde (54.707) |
2 | Bodø/Glimt (54) | Bodø/Glimt (49.680) |
3 | Rosenborg (52) | Rosenborg (48.989) |
4 | Odd (52) | Viking (47.537) |
5 | Viking (47) | Stabæk (42.080) |
6 | Kristiansund (41) | Haugesund (41.905) |
7 | Haugesund (40) | Odd (41.779) |
8 | Stabæk (40) | Kristiansund (40.890) |
9 | Brann (40) | Strømsgodset (39.674) |
10 | Vålerenga (34) | Tromsø (38.654) |
11 | Strømsgodset (32) | Mjøndalen (37.138) |
12 | Sarpsborg 08 (30) | Vålerenga (37.105) |
13 | Mjøndalen (30) | Ranheim TF (36.674) |
14 | Lillestrøm (30) | Sarpsborg 08 (35.836) |
15 | Tromsø (30) | Lillestrøm (35.748) |
16 | Ranheim TF (27) | Brann (35.538) |
There are two ways to estimate results. One way is to use data from the entire season and "predict the past". Or we can use all matches up to a certain date and predict the following match(es), that is, "predict the future". We first present the past predictions.
For these predictions we have used the continuous VAR model, for no particular reason; the discrete random walk model might have been a better choice.
For the most likely result rule, we can see the predicted result of each match in table 5. The table is very information dense, so we have the confusion matrices below.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 Bodø/Glimt | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
2 Brann | A | H | H | H | H | A | H | H | H | H | H | H | H | A | H | |
3 Haugesund | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
4 Kristiansund | H | H | H | H | H | A | H | H | H | H | H | H | H | H | H | |
5 Lillestrøm | A | H | H | H | H | A | H | H | H | H | H | H | H | A | H | |
6 Mjøndalen | H | H | H | H | H | A | H | H | A | H | H | H | H | H | H | |
7 Molde | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
8 Odd | H | H | H | H | H | H | A | H | H | H | H | H | H | H | H | |
9 Ranheim TF | A | H | H | H | H | H | A | H | A | H | H | H | H | H | H | |
10 Rosenborg | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
11 Sarpsborg 08 | A | H | H | H | H | H | A | H | H | A | H | H | H | A | H | |
12 Stabæk | H | H | H | H | H | H | A | H | H | H | H | H | H | H | H | |
13 Strømsgodset | A | H | H | H | H | H | A | H | H | H | H | H | H | H | H | |
14 Tromsø | A | H | H | H | H | H | A | H | H | H | H | H | H | H | H | |
15 Viking | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
16 Vålerenga | A | H | H | H | H | H | A | H | H | A | H | H | H | H | A |
Correct predictions are marked with a v in the printed version; incorrect predictions are orange or marked with an x.

We see the confusion matrices of the past predictions below in table 6, table 7 and table 8. For the weighted rule, we found weights that weigh the loss probabilities more.
Predicted | ||||
---|---|---|---|---|
Away | Tie | Home | ||
Actual | Away | 13 | 0 | 41 |
Tie | 10 | 0 | 63 | |
Home | 3 | 0 | 110 |
Predicted | ||||
---|---|---|---|---|
Away | Tie | Home | ||
Actual | Away | 1 | 46 | 7 |
Tie | 1 | 49 | 23 | |
Home | 0 | 50 | 63 |
Predicted | ||||
---|---|---|---|---|
Away | Tie | Home | ||
Actual | Away | 30 | 0 | 24 |
Tie | 24 | 0 | 49 | |
Home | 17 | 0 | 96 |
The weighted rule outperformed the two other rules, but the weights are likely to be biased, and may not apply to other datasets.
The future predictions involve fitting the model to all matches before a certain date, and making a prediction for the next match using the fitted model. So the model parameters change for each prediction.
As with the past predictions, we have the predicted result of each match in table 9, using the most likely result rule. Most of this table is uninteresting, but we note the first two matches, Odd–Brann and Vålerenga–Mjøndalen. Because we have no prior knowledge of their performance, we are unable to make predictions for these matches, so we only predict 238 matches.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 Bodø/Glimt | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
2 Brann | H | H | H | H | H | A | H | H | H | H | H | H | H | A | H | |
3 Haugesund | H | H | H | H | H | H | H | H | A | H | H | H | H | H | H | |
4 Kristiansund | H | H | H | A | H | H | H | H | H | H | H | H | H | H | H | |
5 Lillestrøm | A | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
6 Mjøndalen | H | H | H | H | H | A | H | H | A | H | H | H | H | H | H | |
7 Molde | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
8 Odd | H | NA | H | H | H | H | H | H | H | H | H | H | H | H | H | |
9 Ranheim TF | A | H | H | H | H | H | H | A | H | H | H | H | H | H | H | |
10 Rosenborg | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
11 Sarpsborg 08 | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
12 Stabæk | H | H | H | H | H | H | A | H | H | H | H | H | H | A | H | |
13 Strømsgodset | H | H | H | H | H | H | H | A | H | H | H | H | H | H | H | |
14 Tromsø | H | H | H | A | H | H | H | H | H | H | H | H | H | H | H | |
15 Viking | H | H | H | H | H | H | H | H | H | H | H | H | H | H | H | |
16 Vålerenga | H | H | H | H | H | NA | A | H | H | H | H | H | H | H | H |
Correct predictions are marked with a v in the printed version; incorrect predictions are orange or marked with an x.

While we are mostly interested in the proportion of correct guesses at the end of the season, it's also interesting to see how this evolves during the season, as shown in figure 14. It's relatively stable, but weaker than the past prediction methods.
We see the confusion matrices of the future predictions below in table 10, table 11 and table 12. We use the same weights for the weighted rule as we did for the past predictions; this introduces some bias. This could be remedied by updating the weights for each round as well.
Predicted | ||||
---|---|---|---|---|
Away | Tie | Home | ||
Actual | Away | 6 | 0 | 48 |
Tie | 4 | 0 | 69 | |
Home | 4 | 0 | 107 |
Predicted | ||||
---|---|---|---|---|
Away | Tie | Home | ||
Actual | Away | 0 | 41 | 13 |
Tie | 0 | 48 | 25 | |
Home | 0 | 68 | 43 |
Predicted | ||||
---|---|---|---|---|
Away | Tie | Home | ||
Actual | Away | 9 | 0 | 45 |
Tie | 15 | 0 | 58 | |
Home | 16 | 0 | 95 |
We note that all the results of the future predictions are worse than the result of past predictions. This is expected, as we have less data to rely on.
We have seen that football matches are hard to predict, but that TMB is a useful tool for modelling them. TMB made it easy to create and fit multiple models, without prior knowledge of dual numbers or the Laplace approximation, simply through the likelihood function.
While the time series models had some issues, and most were not better than a time-independent model (as shown in table 3), they were interesting to study and model.
This paper has mostly been exploratory, and points to several directions for further work. We go through a few paths one may use for continued study.
We can change the values we fit the model to. A drawback of the points system in football is that wins are given extra weight with three points. So a single goal may add two points. When fitting, we tuned our parameters to the goal counts, not the result (win, tie, loss). One could instead use the result for fitting.
The model can be extended to include multiple seasons. This may improve the prediction precision and give more accurate parameter estimates. Though this may be difficult, as some teams are removed and new ones added every season.
While predicting match and standing results is interesting, there are other quantities which would be interesting to predict, such as the odds. These are usually also available from betting sites, and could be interesting to compare.
On the other end, while it may be "obvious" to include more data to fit the model, such as player age, match length, or travel distance between matches, these may also be redundant, and lead to overfitting.
Variables such as seasonal change were unaccounted for; while this falls under the same category as being prone to overfitting, it's possible for a team to progressively become better or worse during a season.
During the idea stage of the paper, the notion of intransitive orderings between teams came up. This would have been interesting to study further.
The Kronecker product and sum are useful for solving matrix equations. The product has a simple definition, but the sum is more complicated, and is related to the product by the matrix exponential.
$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$$
Before we define the sum, we will give two more examples to show that the Kronecker product is not commutative:
$$I_2 \otimes M = \begin{pmatrix} M & 0 \\ 0 & M \end{pmatrix}, \qquad M \otimes I_2 = \begin{pmatrix} m_{11} & 0 & m_{12} & 0 \\ 0 & m_{11} & 0 & m_{12} \\ m_{21} & 0 & m_{22} & 0 \\ 0 & m_{21} & 0 & m_{22} \end{pmatrix}.$$
While the left-hand sides of the two equations only differ in the order of the factors, the structure of the resulting matrices is clearly different.
The Kronecker sum is defined as $A \oplus B = A \otimes I_n + I_m \otimes B$, for $A$ of size $m \times m$ and $B$ of size $n \times n$.
This is notably not commutative either, which is unexpected for something called a sum.
The relation to the matrix exponential is given by $e^{A \oplus B} = e^A \otimes e^B$.
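Both the definition and the exponential identity are easy to verify numerically in R (the example matrices are arbitrary); `expm()` is available from the recommended Matrix package:

```r
library(Matrix)
kron_sum <- function(A, B)  # A (+) B = A (x) I_n + I_m (x) B
  kronecker(A, diag(nrow(B))) + kronecker(diag(nrow(A)), B)

A <- matrix(c(1, 0, 1, 2), 2, 2)
B <- matrix(c(0, 1, -1, 0), 2, 2)
lhs <- as.matrix(expm(Matrix(kron_sum(A, B))))
rhs <- kronecker(as.matrix(expm(Matrix(A))), as.matrix(expm(Matrix(B))))
max(abs(lhs - rhs))  # numerically zero: exp(A (+) B) = exp(A) (x) exp(B)
```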