Performance prediction with a hierarchical Poisson model using Template Model Builder

Eivind Baltzersen
Norwegian University of Science and Technology
Department of Mathematical Sciences

Sammendrag

In this project we use a hierarchical Poisson–log-normal attack and defence model for Eliteserien 2019 to find the expected final result in the standings. We use different autoregressive time series models to model the change in the attack and defence parameters between matches. The models turn out to be poor at predicting individual matches, but better at predicting the teams' final positions in the standings. The models are implemented in TMB, an R/C++ library, and plotted in Python. We look at some weaknesses of the models, and finally we suggest some changes that could improve them.

Abstract

In this project we use a hierarchical Poisson–log-normal attack and defence model for Eliteserien 2019 (the Norwegian top-flight football league) in order to predict the expected final results in the standings. We use different autoregressive time series models to model the change in attack and defence parameters between matches. We show the models to be unsuitable for predicting individual matches, but better at predicting a team's position in the final standings. The models are implemented using TMB, an R/C++ library, and the results are plotted in Python. We look at weaknesses of the models, and finally we suggest changes to improve them.


Notation and terms

ē            The inverse of e, e⁻¹ or 1/e
cosh(x)      The hyperbolic cosine function, (eˣ + e⁻ˣ)/2
[a..b]       Discrete closed interval, {a, a+1, …, b−1, b}
ℓ(x), L(x)   (Log-)likelihood function
τ            2π
positive/negative            Includes zero
strictly positive/negative   Excludes zero

1 Introduction

Ranking is a way to compare different objects given their properties. It is ideally transitive, meaning that for n objects, we may label them with i ∈ [1..n] to order them. An important aspect of competitive sports and games is to rank teams and players in order to determine who is to be labeled 1 (also known as the winner).

In a sports context, the ranking often depends on a score, which is determined based on the performance of the team or player against another team or player. This is achieved through each team or player playing against other teams and players, for instance in a round robin or a knockout tournament. Examples from football in Norway include the knockout cup NM i fotball and the top level round robin tournament Eliteserien.

To determine the winner of a football match, we simply look at which team has the most goals after the two 45 minute halves of a 90 minute match. Some tournaments also have overtime if the teams are tied after 90 minutes, but we will mainly study Eliteserien, which does not include overtime. The winner of a match gets three points, the loser none. A tie awards both teams one point. The winner of the league is the team with the most points.

We make note of earlier work, such as [1], which proposes an attack–defence model similar to the one we use here, but with a different hierarchical structure. A simple (non-hierarchical) bivariate Poisson model has been attempted in [2], and a model using the Skellam distribution in [3]. A Poisson scoring model is compared with a goal shots model in [4]. There are many other papers on similar models, both in football and in other sports and games competitions.

The ideas in this paper are mostly based on earlier works. The novel contribution is applying TMB as a tool to implement the models, and comparing different time series models.

As a small disclaimer, I will mention that I have little understanding of football and have not watched a single football match during the writing of this thesis, so any statements about football may be incorrect.

1.1 Tools

The main programming languages have been R and C++, with the TMB library. Results were stored in JSON format. Plots have been made using Python, with matplotlib, NumPy and SciPy. For more details, see the TMB workflow in the appendix.

2 Theory

We will go through several statistical concepts; it will be assumed that the reader has some knowledge of linear algebra. We will not go through each theorem in depth, nor the derivation of them. For a comprehensive explanation, a textbook should be consulted.

2.1 Distributions

2.1.1 Normal and log-normal distribution

The normal distribution, also known as the Gaussian distribution, is described by the density function in equation (1):

(1) $X \sim f(x;\mu,\sigma^2) = \mathcal{N}(\mu,\sigma^2) = \frac{1}{\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)dx$

figure 1: Example normal distribution
The standard normal distribution N(0,1) on [−4,4].

The normal distribution is fully parameterised by its mean and variance:

(2) $E(X) = \mu$
(3) $\operatorname{Var}(X) = \sigma^2$

With the (unbiased) estimators

(4) $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$
(5) $\hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat{\mu}\right)^2$
(6) $\hat{\sigma} = c(n)\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \hat{\mu}\right)^2}$

where c(n) is a bias-correction factor, because E(s) < σ. For large samples this factor is close to one, so it is typically ignored, but exact and approximate values for the normal distribution can be found. [5][6][7]

The log-normal distribution is similar, with a change of variables $x \mapsto \log(x)$:

(7) $Y = \exp(X) \sim f(y;\mu,\sigma^2) = \operatorname{Lognormal}(\mu,\sigma^2) = \frac{1}{y\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{\log(y)-\mu}{\sigma}\right)^{2}\right)dy$

We note that log(Y) is normally distributed, which we will make use of later, as the normal distribution is simpler to work with.

The multivariate form of the normal distribution, often abbreviated to MVN, is

(8) $X \sim f(x;\mu,\Sigma) = \mathcal{N}(\mu,\Sigma) = \frac{1}{\sqrt{|\Sigma|\,\tau^{k}}}\exp\!\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)dx$

with the parameters

(9) $E(X) = \mu$
(10) $\operatorname{Var}(X) = \Sigma$

Σ is usually referred to as a covariance matrix.

We will consider a special parametrisation of the covariance matrix. The motivation is to remove restrictions on the entries. First of all, the covariance matrix is symmetric, so almost half of the entries are redundant when specifying the matrix. A more subtle restriction is that it must also be positive definite, i.e. $x^{\top}\Sigma x > 0$ for all $x \neq 0$, which implies that its diagonal entries are strictly positive.

First we consider the relation between the covariance matrix and the correlation matrix:

(11) $P = \operatorname{diag}(\Sigma)^{-1/2}\,\Sigma\,\operatorname{diag}(\Sigma)^{-1/2}$

where P is the correlation matrix, which has the nice property that its diagonal consists of units. We can find a lower-triangular square root L such that $P = LL^{\top}$. However, L does not have a unit diagonal; for this we multiply by the inverse of its diagonal to obtain $\Theta = \operatorname{diag}(L)^{-1}L$, as shown in equation (12)

(12) $\Theta = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ \theta_1 & 1 & 0 & \cdots & 0 \\ \theta_2 & \theta_3 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ \theta_k & \theta_{k+1} & \cdots & \theta_{k+n} & 1 \end{pmatrix}$

We let θ be the vectorisation of the strictly lower triangular part, i.e. $\theta = (\theta_1, \theta_2, \ldots, \theta_{k+n})$.

We may also consider the elementwise root-log transform of the diagonal of Σ:

(13) $\log(\sigma) = \left(\log\sqrt{\sigma_1^2}, \log\sqrt{\sigma_2^2}, \ldots, \log\sqrt{\sigma_k^2}\right) = \left(\log\sigma_1, \log\sigma_2, \ldots, \log\sigma_k\right)$

Thus, we may parametrise Σ by the unrestricted vector $(\theta, \log(\sigma)) \in \mathbb{R}^{T(k)}$, where T(k) is the k-th triangular number, by reversing the above steps. [8]
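As a concrete illustration, reversing the steps above can be sketched in a few lines of NumPy. The function name and test values below are hypothetical, but the same construction underlies the unstructured correlation densities found in TMB.

import numpy as np

def build_cov(theta, log_sigma):
    """Reverse the parametrisation: map unconstrained (theta, log(sigma))
    back to a valid covariance matrix."""
    k = len(log_sigma)
    # Unit-diagonal lower-triangular matrix Theta with free entries theta
    Theta = np.eye(k)
    Theta[np.tril_indices(k, -1)] = theta
    # Normalise each row so that P = L L^T gets a unit diagonal
    L = Theta / np.linalg.norm(Theta, axis=1, keepdims=True)
    P = L @ L.T                      # correlation matrix
    D = np.diag(np.exp(log_sigma))   # standard deviations from log(sigma)
    return D @ P @ D                 # covariance matrix

Sigma = build_cov(theta=[0.3], log_sigma=np.array([0.0, -0.5]))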

2.1.2 Binomial distribution and the fundamental theorem of statistics

In determining the outcome of a match, we are either right or wrong. If we have a rule determining the correct outcome of a match with probability p, we have a binary (Bernoulli) distribution.

Repeated determination of n matches results in a binomial distribution, where we expect the number of correctly guessed outcomes m to be such that $m/n \approx p$.

(14) $f(k; n, p) = \binom{n}{k} p^{k} (1-p)^{n-k}$

The limiting distribution of a binomial variable $X_n$ as $n \to \infty$ can be approximated by a normal distribution. This theorem is sometimes referred to as the de Moivre–Laplace theorem, which is a special case of the central limit theorem.

(15) $\frac{X_n - np}{\sqrt{np(1-p)}} \xrightarrow{D} \mathcal{N}(0, 1), \quad \text{i.e.} \quad X_n \approx \mathcal{N}(np,\ np(1-p))$

Typically, this approximation is practical for n greater than 30.

The estimator for p is simply the number of correct guesses m over the total number of outcomes, $\hat{p} = m/n$. So the confidence interval, assuming normality, is of the form

(16) $p \in \left[\hat{p} - z_{\alpha}\, c(n)\sqrt{\hat{p}(1-\hat{p})/n},\ \hat{p} + z_{\alpha}\, c(n)\sqrt{\hat{p}(1-\hat{p})/n}\right]$

We reiterate that this is unreliable for small sample sizes, and other confidence intervals also exist. A list of other methods can be found in [9].
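As a sketch of how equation (16) is computed in practice (ignoring the bias-correction factor c(n)), consider the following Python function; the sample numbers are made up for illustration.

import numpy as np
from scipy import stats

def wald_interval(m, n, alpha=0.05):
    # Normal-approximation (Wald) interval for p = m/n, cf. equation (16)
    p_hat = m / n
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(wald_interval(m=113, n=240))  # hypothetical: 113 correct out of 240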

2.1.3 Poisson distribution

The Poisson distribution describes the number of events occurring in a fixed interval. It is described by the mass function in equation (17):

(17) $Y \sim f(y;\lambda) = \operatorname{Pois}(\lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}$

figure 2: Example Poisson distribution
The left solid blue poles with round hats make up the score distribution of Eliteserien 2019. The right orange dashed poles with diamond hats are a fitted Poisson distribution with λ ≈ 1.46.

A remarkable property of the Poisson distribution is the simple relation between the mean and variance:

(18) $E(Y) = \operatorname{Var}(Y) = \lambda$

2.1.4 Skellam distribution

The difference between two independent Poisson distributed variables follows the Skellam distribution. [10] The Skellam distribution is of the form shown in equation (19):

(19) $p(k \mid \lambda_1, \lambda_2) = e^{-(\lambda_1+\lambda_2)}\left(\frac{\lambda_1}{\lambda_2}\right)^{k/2} I_{|k|}\!\left(2\sqrt{\lambda_1\lambda_2}\right)$

where $I_{|k|}$ is the modified Bessel function of the first kind. The formula is complicated, but is shown below. [11, page 375]

(20) $I_{\alpha}(x) = i^{-\alpha}J_{\alpha}(ix) = \sum_{m=0}^{\infty}\frac{1}{m!\,\Gamma(m+\alpha+1)}\left(\frac{x}{2}\right)^{2m+\alpha}$
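Equation (19) is straightforward to evaluate numerically. As a sanity check, the sketch below (with arbitrary rates) compares a direct implementation via the Bessel function against SciPy's built-in Skellam distribution.

import numpy as np
from scipy.special import iv        # modified Bessel function of the first kind
from scipy.stats import skellam

def skellam_pmf(k, lam1, lam2):
    # Equation (19), evaluated directly
    return (np.exp(-(lam1 + lam2))
            * (lam1 / lam2) ** (k / 2)
            * iv(np.abs(k), 2 * np.sqrt(lam1 * lam2)))

k = np.arange(-5, 6)
print(np.allclose(skellam_pmf(k, 1.8, 1.2), skellam.pmf(k, 1.8, 1.2)))  # True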

2.1.5 Discrete VAR(1) model

The VAR(1) series is the multivariate generalisation of the one-dimensional AR(1) series, which in turn is a special case of the AR(p) series:

(21) $x_t = c + \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + \varepsilon_t$

where $\varepsilon_t \sim \operatorname{WN}(0, \Sigma_{\varepsilon})$. [12, page 84] In our case we consider p = 1, and we will also assume that c = 0. The distribution is stable as long as $|\phi_1| < 1$.

figure 3: Example discrete VAR(1) model
The conditional expectation and confidence interval of a VAR(1) process, with φ = 0.7, Σ_ε ≈ 0.71 and X₀ = 2. Each interval at each point represents one, two and three standard deviations from the expected mean.

The unconditional expectation and covariance are given by [13]

(22) $E(x_t) = \frac{c}{1-\phi} = \mu$
(23) $\operatorname{Var}(x_t) = \frac{\Sigma_{\varepsilon}}{1-\phi^2}$

and the unconditional distribution of $x_t$ is

(24) $x_t \sim \mathcal{N}\!\left(\mu,\ \frac{\Sigma_{\varepsilon}}{1-\phi^2}\right)$

The conditional expectation and covariance are given by

(25) $E(x_t \mid x_{t-1}) = c + \phi x_{t-1}$
(26) $\operatorname{Var}(x_t \mid x_{t-1}) = \Sigma_w$

and the conditional distribution of $x_t$ given $x_{t-1}$ is

(27) $x_t \mid x_{t-1} \sim \mathcal{N}(c + \phi x_{t-1},\ \Sigma_w)$

The multivariate versions of the unconditional and conditional distributions are [14]

(28) $x_t \sim \mathcal{N}\!\left(\mu,\ \operatorname{vec}^{-1}\!\left((I - \phi_1 \otimes \phi_1)^{-1}\operatorname{vec}(\Sigma_w)\right)\right)$
(29) $x_t \mid x_{t-1} \sim \mathcal{N}(c + \phi_1 x_{t-1},\ \Sigma_w)$

We will later be using equation (28) and equation (29), with zero-shifted mean.
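The stationary covariance in equation (28) is easy to compute numerically. The following sketch, with made-up parameter values, solves the vectorised equation and verifies the fixed point Σ* = ΦΣ*Φᵀ + Σ_w.

import numpy as np

def var1_stationary_cov(Phi, Sigma_w):
    # Solve vec(Sigma*) = (I - Phi kron Phi)^{-1} vec(Sigma_w), cf. equation (28)
    k = Phi.shape[0]
    vec = np.linalg.solve(np.eye(k * k) - np.kron(Phi, Phi),
                          Sigma_w.reshape(-1, order="F"))
    return vec.reshape(k, k, order="F")

Phi = np.array([[0.7, 0.1], [0.0, 0.5]])      # eigenvalues inside the unit circle
Sigma_w = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma_star = var1_stationary_cov(Phi, Sigma_w)
print(np.allclose(Sigma_star, Phi @ Sigma_star @ Phi.T + Sigma_w))  # True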

2.1.6 Continuous VAR model

The continuous version is often attributed to Ornstein, Uhlenbeck and Vašíček. [15][16] The differential form and the formal solution are shown in equations (30-31).

(30) $dx_t = \theta(\mu - x_t)\,dt + \sigma_w\,dW$
(31) $x_t = x_0 e^{-\theta t} + \mu(1 - e^{-\theta t}) + \sigma_w\int_{s=0}^{t} e^{-\theta(t-s)}\,dW_s$

where the eigenvalues of θ must have strictly positive real parts for the process to be stable. [17, page 11] The relation between these two forms can be found in the appendix, in equations (120-130).

figure 4: Example continuous VAR model
The conditional expectation and confidence interval of a continuous VAR process, with θ = −log(0.7) ≈ 0.35, Σ_ε = 1 and x₀ = 2. Each band represents one, two and three standard deviations from the expected mean. This is the continuous extension of figure 3.

The unconditional expectation and covariance are given by [18]

(32) $E(x_t) = \mu$
(33) $\operatorname{Var}(x_t) = \frac{\sigma_w^2}{2\theta}$

We will be assuming μ = 0, but we state the general forms for completeness. The unconditional distribution of $x_t$ is [19]

(34) $x_t \sim \mathcal{N}\!\left(\mu,\ \frac{\sigma_w^2}{2\theta}\right)$

The conditional expectation and covariance are given by [20]

(35) $E(x_t \mid x_{t-1}) = e^{-\theta\Delta t}x_{t-1} + \mu(1 - e^{-\theta\Delta t})$
(36) $\operatorname{Var}(x_t \mid x_{t-1}) = \frac{\sigma_w^2}{2\theta}\left(1 - e^{-2\theta\Delta t}\right)$

and the conditional distribution of $x_t$ given $x_{t-1}$ is [21, page 11]

(37) $x_t \mid x_{t-1} \sim \mathcal{N}\!\left(e^{-\theta\Delta t}x_{t-1} + \mu(1 - e^{-\theta\Delta t}),\ \frac{\sigma_w^2}{2\theta}\left(1 - e^{-2\theta\Delta t}\right)\right)$

The multivariate versions of the unconditional and conditional distributions are [22]

(38) $x_t \sim \mathcal{N}\!\left(\mu,\ \operatorname{vec}^{-1}\!\left((\theta \oplus \theta)^{-1}\operatorname{vec}(\sigma_w\sigma_w^{\top})\right)\right)$
(39) $x_t \mid x_{t-1} \sim \mathcal{N}\!\left(e^{-\theta\Delta t}x_{t-1} + (I - e^{-\theta\Delta t})\mu,\ \operatorname{vec}^{-1}\!\left((\theta \oplus \theta)^{-1}\left(I - e^{-(\theta\oplus\theta)\Delta t}\right)\operatorname{vec}(\sigma_w\sigma_w^{\top})\right)\right)$

where ⊕ is the Kronecker sum, defined in the appendix.

As with the VAR(1) process, we will later be using equation (38) and equation (39), with zero-shifted mean.

The relation to the VAR(1) model is not immediately obvious, but equality can be shown with the following substitutions: [23][24, page 8]

(40) $X_0 = x_0$
(41) $\varphi = e^{-\theta\Delta t}$
(42) $\varepsilon_t \sim \mathcal{N}\!\left(0,\ \frac{\sigma_w^2}{2\theta}\left(1 - e^{-2\theta\Delta t}\right)\right)$

so the following two equations are equivalent

(43) $x_t = e^{-\theta\Delta t}x_{t-1} + \sigma_w\int_{s=0}^{t} e^{-\theta(t-s)}\,dW_s$
(44) $X_t = \varphi X_{t-1} + \varepsilon_t$
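In the one-dimensional case this equivalence is a two-line computation; the sketch below maps hypothetical OU parameters to the matching AR(1) parameters of equations (40-42).

import numpy as np

def ou_to_ar1(theta, sigma_w, dt=1.0):
    # Equations (41) and (42): discretise the OU process at step dt
    phi = np.exp(-theta * dt)
    var_eps = sigma_w**2 / (2 * theta) * (1 - np.exp(-2 * theta * dt))
    return phi, var_eps

# theta = -log(0.7) recovers phi = 0.7, tying figure 4 back to figure 3
print(ou_to_ar1(theta=-np.log(0.7), sigma_w=1.0))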

2.2 Change of variables in a density function

In some cases it may be useful to transform the variables, such as X = g(Y), because we may have more tools available for the transformed density; for instance, log-transforming a log-normal variable to obtain a normal variable.

(45) $f_Y(y)\,dy = f_X(g(y))\,d(g(y)) = f_X(x)\,dx$

To be precise, this is only valid as long as g is a strictly increasing function. For the decreasing case we add a sign to either side; the two cases can be unified by applying the absolute value to both sides. 25

and the relation between fY and fX is given by

(46) $f_Y(y) = f_X(g(y))\,\frac{d}{dy}g(y)$

If we let $Y = \exp(X) \sim \operatorname{Lognormal}(\mu,\sigma^2)$, then

(47) $f_Y(y) = \frac{d}{dy}\log(y)\cdot\frac{1}{\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{\log(y)-\mu}{\sigma}\right)^{2}\right)$
(48) $\quad\;\; = \frac{1}{y\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{\log(y)-\mu}{\sigma}\right)^{2}\right)$

And for the opposite case, $X = \log(Y) \sim \mathcal{N}(\mu,\sigma^2)$, then

(49) $f_X(x) = \frac{d}{dx}\exp(x)\cdot\frac{1}{\exp(x)\,\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{\log(\exp(x))-\mu}{\sigma}\right)^{2}\right)$
(50) $\quad\;\; = \frac{\exp(x)}{\exp(x)\,\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)$
(51) $\quad\;\; = \frac{1}{\sigma\sqrt{\tau}}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)$

2.3 Likelihood function

The most general form of the likelihood function is defined as a mass/density function of a parameter θ given an outcome x:

(52) $L(\theta \mid x) = p_{\theta}(x)$

In many cases x is a tuple of i.i.d. variables, so $p_{\theta}(x)$ is a joint probability distribution of independent variables and can be written as a product. If x is a vector of size n, we have:

(53) $L(\theta \mid x) = \prod_{i=1}^{n} p_{\theta}(x_i)$

2.3.1 Maximization and log-likelihood

The likelihood function represents the probability of obtaining x for a given θ. A reasonable assumption is then that x is realized from a distribution where it has a high likelihood to be observed. So the θ that yields the highest likelihood is a natural candidate for determining the distribution of x. We seek to determine

(54) $\hat{\theta} = \underset{\theta\in\Theta}{\operatorname{argmax}}\; L(\theta \mid x) = \underset{\theta\in\Theta}{\operatorname{argmax}} \prod_{i=1}^{n} p_{\theta}(x_i)$

Products are often cumbersome, and to simplify the above computation we can apply the logarithm to L to obtain a sum. Because the logarithm is a strictly increasing function, the maximum of L is also the maximum of log(L). We denote this logarithm by ℓ:

(55) $\hat{\theta} = \underset{\theta\in\Theta}{\operatorname{argmax}}\; \ell(\theta \mid x) = \underset{\theta\in\Theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log p_{\theta}(x_i)$

From this we conclude that x was most likely realized from the distribution $p(x \mid \hat{\theta})$.
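As a minimal numerical illustration of equation (55), the sketch below fits a Poisson rate to simulated goal counts by minimising the negative log-likelihood. The data are simulated, and for the Poisson distribution the result simply reproduces the sample mean.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(0)
y = rng.poisson(lam=1.5, size=240)          # simulated goal counts

def nll(log_lam):
    # Negative log-likelihood; parameterised by log(lambda) so lambda > 0
    return -poisson.logpmf(y, np.exp(log_lam)).sum()

opt = minimize(nll, x0=np.array([0.0]))
print(np.exp(opt.x[0]), y.mean())           # the MLE equals the sample mean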

2.4 Hierarchical modelling and the empirical Bayes method

Bayesian hierarchical modelling is a way to describe a hierarchy of distributions, where the parameters of the upper layers depend on the distributions of the lower layers.

The simplest such model is the two-stage hierarchical model, as shown in equation (56-58)

(56) $Y \mid \theta, \phi \sim p(Y \mid \theta, \phi)$
(57) $\theta \mid \phi \sim p(\theta \mid \phi)$
(58) $\phi \sim p(\phi)$

where Y is the observed data, θ is a parameter, and φ is a hyperparameter; p(θ, φ) is the prior distribution. In Bayesian hierarchical modelling the hyperparameter is given a hyperprior, a distribution on the hyperparameter, p(φ).

In empirical Bayes the hyperparameter is instead a fixed value. The estimation of this hyperparameter will be found using MLE.

The posterior theorem, or Bayes' theorem, is shown below in equation (59)

(59) $f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{f(x)} = \frac{f(x \mid \theta)\,f(\theta)}{\sum_i f(x \mid \theta_i)\,f(\theta_i)}$

The distribution for Y can then be found by marginalisation over the parameters, or random effects:

(60) $p(Y \mid \phi) = \int p(Y \mid \theta, \phi)\, p(\theta \mid \phi)\, d\theta$

2.5 Bradley–Terry model

The Bradley–Terry model [26] is used to make paired comparisons of individuals in a transitive way, using a single parameter per individual.

(61) $P(i>j) = \frac{p_i}{p_i + p_j} = \sigma(\beta_i - \beta_j)$

where $p_i = e^{\beta_i}$ and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the logistic function (a sigmoid function). The relation between the logit function $\operatorname{logit}(x) = \log\!\left(\frac{x}{1-x}\right)$ and the logistic function is that they are inverses, i.e. $\operatorname{logit}(\sigma(x)) = x$, so we also have the identity:

(62) $\operatorname{logit} P(i>j) = \operatorname{logit}(\sigma(\beta_i - \beta_j)) = \beta_i - \beta_j$

An example where this model is used is in Elo rating, most commonly known from chess. A player's rating is given by a number $r_i$, which is related to $\beta_i$ by $\beta_i = \log(10)/400 \cdot r_i$. So a player with $r_1 = 1700$ playing against a player with $r_2 = 1900$ will yield the following probabilities:

(63) $P(1>2) = \frac{1}{1 + 10^{(1900-1700)/400}} \approx 24\,\%$
(64) $P(1<2) = \frac{1}{1 + 10^{(1700-1900)/400}} \approx 76\,\%$
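These probabilities take one line each to compute; a small sketch:

def elo_win_prob(r1, r2):
    # Bradley-Terry with beta_i = log(10)/400 * r_i, i.e. the Elo expected score
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))

print(elo_win_prob(1700, 1900))  # ~0.24
print(elo_win_prob(1900, 1700))  # ~0.76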

A win or a loss updates the Elo rating of each player, but the Bradley–Terry model has no mechanism for updating. Nor does it tell us how to initially calculate ratings, which must be found with inference.

We also note that the Bradley–Terry model does not handle ties, only binary win/loss outcomes.

The winner is the team with the highest score in a match. If the scores are Poisson distributed, we can determine the result by checking whether $y_{ij}^{\text{Home}} - y_{ij}^{\text{Away}}$ is strictly positive, zero or strictly negative.

2.5.1 Skellam distribution and the Bradley–Terry model

By defining the best team as having the score $p_M = 1$, we could use this to solve

(65) $P(i<j) = \frac{p_i}{p_i + p_M}$

for any $p_i$ to obtain a ranking score for all teams. Instead, we will use the point system.

2.6 Template Model Builder (TMB, R library)

Template Model Builder is an R/C++ library which greatly simplifies the estimation of hierarchical models. [27][28][29] It makes use of three key concepts: automatic differentiation, the Laplace method and the (generalized) delta method.

Understanding these concepts is not necessary to make use of the program, but it can be helpful.

2.6.1 Automatic differentiation and dual numbers

Dual numbers are 2-dimensional vectors with the unit vectors 1 and ε, denoted by (a, b) or a + bε, with the added property that the square of the second unit is identified with 0: [30, page 41-43]

(66) $\varepsilon^2 = 0$

This has applications for differentiation, where it can be used to find derivatives without symbolic differentiation. We define two dual vectors $u = (u, u')$ and $v = (v, v')$.

(67) $u + v = (u, u') + (v, v') = (u + v,\ u' + v')$
(68) $u - v = (u, u') - (v, v') = (u - v,\ u' - v')$
(69) $uv = (u, u')(v, v') = (uv,\ u'v + uv')$
(70) $\frac{u}{v} = \frac{(u, u')}{(v, v')} = \left(\frac{u}{v},\ \frac{u'v - uv'}{v^2}\right)$
(71) $f(u) = f((u, u')) = (f(u),\ f'(u)\,u')$

We note the similarity between the second component and the rules of differentiation. While the first four rules are trivial, the last requires an explanation. This is found by the tangent expansion of the function, where all higher order terms vanish because $\varepsilon^2 = 0$:

(72) $f(u) = f(u + u'\varepsilon) = \sum_{n=0}^{\infty}\frac{f^{(n)}(u)\,(u'\varepsilon)^n}{n!} = f(u) + f'(u)\,u'\varepsilon$

The chain rule is also applicable:

(73) $f(g(u)) = f(g((u, u'))) = f((g(u),\ g'(u)\,u')) = (f(g(u)),\ f'(g(u))\,g'(u)\,u')$

By mapping a variable to $x \mapsto (x_0, 1)$ and a constant to $c \mapsto (c, 0)$, any function can be differentiated by applying the chain rule until sufficiently elementary functions can be computed in order.
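A minimal forward-mode implementation of these rules fits in a few lines of Python. The class below is a hypothetical sketch to make the rules concrete, not how TMB (which relies on C++ automatic differentiation) actually implements them.

import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # u,  the value
    der: float   # u', the derivative carried along

    def __add__(self, o):      # rule (67)
        return Dual(self.val + o.val, self.der + o.der)
    def __sub__(self, o):      # rule (68)
        return Dual(self.val - o.val, self.der - o.der)
    def __mul__(self, o):      # rule (69)
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
    def __truediv__(self, o):  # rule (70)
        return Dual(self.val / o.val,
                    (self.der * o.val - self.val * o.der) / o.val**2)

def dual_exp(u):               # rule (71) with f = exp
    return Dual(math.exp(u.val), math.exp(u.val) * u.der)

x = Dual(2.0, 1.0)             # variable:  x -> (x0, 1)
c = Dual(3.0, 0.0)             # constant:  c -> (c, 0)
y = dual_exp(x * x) * c        # f(x) = 3 exp(x^2)
print(y.der, 12 * math.exp(4.0))  # both equal f'(2) = 12 e^4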

For a thorough introduction to automatic differentiation, we refer to the paper on the Stan math library. [31] However, this is only useful for understanding the underlying math of TMB, not for using TMB, so it can safely be skipped.

2.6.2 Laplace method

The second tool is the Laplace method, which helps us approximate the marginal distribution and estimate the mean of the fixed parameters and random effects.

figure 5: Normal approximation
A normal approximation of a Lognormal(0, 0.1²) distribution. The blue dashed area is the log-normal distribution, and the red plain area is the normal approximation. This was not computed using the method in this section, but the idea is similar.

The Laplace method is based on the tangent series and a quadratic-exponential integral identity:

(74) $\int_{-\infty}^{\infty} e^{-a(x+b)^2}\,dx = \sqrt{\frac{\pi}{a}}$

We also assume that $f(\theta, u)$ achieves its maximum at $\hat{u}$, i.e. $\partial_u f(\theta, \hat{u}) = 0$; that the integration limits are infinite, $a = -\infty$ and $b = \infty$ (or that the function decays sufficiently fast away from $\hat{u}$); and, last, that it achieves a strict peak at $\hat{u}$, i.e. $\partial_u^2 f(\theta, \hat{u}) < 0$.

The proof then follows.

(75) $L(\theta) = \int_a^b L(\theta \mid u)\,du$
(76) $\quad = \int_a^b \exp\!\left(M\left(M^{-1}\log(L(\theta \mid u))\right)\right)du$
(77) $\quad = \int_a^b \exp(M f(\theta, u))\,du$
(78) $\quad \approx \int_a^b \exp\!\left(M\left(f(\theta,\hat{u}) + \partial_u f(\theta,\hat{u})(u-\hat{u}) + \tfrac{1}{2}\partial_u^2 f(\theta,\hat{u})(u-\hat{u})^2\right)\right)du$
(79) $\quad = \int_a^b \exp\!\left(M\left(f(\theta,\hat{u}) + \tfrac{1}{2}\partial_u^2 f(\theta,\hat{u})(u-\hat{u})^2\right)\right)du$
(80) $\quad = \exp(M f(\theta,\hat{u}))\int_a^b \exp\!\left(\tfrac{1}{2}M\,\partial_u^2 f(\theta,\hat{u})(u-\hat{u})^2\right)du$
(81) $\quad = \exp(M f(\theta,\hat{u}))\int_a^b \exp\!\left(-\tfrac{1}{2}M\left|\partial_u^2 f(\theta,\hat{u})\right|(u-\hat{u})^2\right)du$
(82) $\quad = \exp(M f(\theta,\hat{u}))\int_{-\infty}^{\infty} \exp\!\left(-\tfrac{1}{2}M\left|\partial_u^2 f(\theta,\hat{u})\right|(u-\hat{u})^2\right)du$
(83) $\quad = \exp(M f(\theta,\hat{u}))\sqrt{\frac{\pi}{\tfrac{1}{2}M\left|\partial_u^2 f(\theta,\hat{u})\right|}}$
(84) $\quad = \exp(M f(\theta,\hat{u}))\sqrt{\frac{2\pi}{M\left|\partial_u^2 f(\theta,\hat{u})\right|}}$
(85) $\quad = \exp(M f(\theta,\hat{u}))\sqrt{\frac{\tau}{-M\,\partial_u^2 f(\theta,\hat{u})}}$

We will use the special case M = −1, thus $f(\theta, u) = -\log(L(\theta, u))$ and

(86) $-\log(L(\theta)) \approx -\log(L^{*}(\theta)) = -\tfrac{1}{2}\log(\tau) + \tfrac{1}{2}\log\!\left(\partial_u^2 f(\theta, \hat{u})\right) + f(\theta, \hat{u})$
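As a numerical sanity check of equation (85) with M = −1, the sketch below (for a made-up one-dimensional f) compares the Laplace approximation against direct quadrature; the approximation would be exact if f were quadratic.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

def f(u):
    # A made-up negative log-likelihood in one random effect u
    return u**2 / 2 + 0.1 * u**4

u_hat = minimize_scalar(f).x                     # mode of exp(-f)
h = 1e-4
f2 = (f(u_hat + h) - 2 * f(u_hat) + f(u_hat - h)) / h**2   # f''(u_hat)

tau = 2 * np.pi
laplace = np.exp(-f(u_hat)) * np.sqrt(tau / f2)  # equation (85), M = -1
exact, _ = quad(lambda u: np.exp(-f(u)), -np.inf, np.inf)
print(laplace, exact)                            # ~2.51 vs ~2.15 here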

The multivariate form is slightly different. [32, equation 4] TMB uses this to approximate the posterior distribution.

A related result is the posterior central limit theorem. We leave out the details and the assumptions, but state the result, often attributed to Bernstein and von Mises: [33][34]

(87) $p(u \mid \theta) \approx \mathcal{N}\!\left(\hat{u},\ \left(n\bar{I}(\hat{u})\right)^{-1}\right)$

where $\bar{I}(\hat{u}) = -\frac{1}{n}\,\partial_u^2 \log(p(u \mid \theta))$ is the observed information (i.e. $E(\bar{I}) = I$, where I is called the information) and n is the number of observations.

2.6.3 Delta method

The third tool is the (generalized) delta method, [35, page 240-243] which helps us estimate the standard deviation of the fixed parameters and random effects.

The regular delta method states that if there exists a sequence of random variables $X_n$ such that

(88) $\sqrt{n}(X_n - \theta) \xrightarrow{D} \mathcal{N}(0, \sigma^2)$

where $\xrightarrow{D}$ denotes convergence in distribution, then for a differentiable function g

(89) $\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{D} \mathcal{N}\!\left(0,\ g'(\theta)^2\sigma^2\right)$

The method used in TMB is a variant that approximates the distribution using the Laplace method.

If the posterior distribution of λ|y is asymptotically normal with mode $\tilde{\lambda}$, then

(90) $E(g(\lambda)) = g(\tilde{\lambda}) + O(n^{-1})$
(91) $\operatorname{Var}(g(\lambda)) = \nabla g(\tilde{\lambda})^{\top}\,\Sigma\,\nabla g(\tilde{\lambda}) + O(n^{-2})$

where n is the number of observations and $\operatorname{Var}(\lambda) \approx \Sigma = \left(-\partial_{\lambda}^2 \log(p(\lambda = \tilde{\lambda} \mid y))\right)^{-1}$ is the inverse of the negative Hessian of the log posterior. Instead of the posterior, one may substitute $L(\lambda \mid y) \times p(\lambda)$, as the factor p(y) is constant (and thus its derivative zero). [36]

In the more general form where $g(\theta, u)$ also depends on the random effects u, TMB uses a more general estimate:

(92) $\operatorname{Var}(g(\hat{\theta},\hat{u})) \approx \nabla g(\hat{\theta},\hat{u})^{\top}\left(\begin{pmatrix} H^{-1} & 0 \\ 0 & 0 \end{pmatrix} + J\operatorname{Var}(\hat{\theta})J^{\top}\right)\nabla g(\hat{\theta},\hat{u})$

where $H = \partial_u^2 f(\hat{u}(\theta), \theta)$ is the random effects block of the Hessian of the objective function, and $J = D_{\theta}(\hat{u}(\theta), \theta)$ is the Jacobian of $(\hat{u}(\theta), \theta)$ with respect to θ. [37]

2.6.4 Bugs

TMB does have some quirks and unexpected behaviour, but these are practically non-issues with simple workarounds, which one should nevertheless be aware of. TMB is mature enough to have practical applications, as we demonstrate.

2.7 Attack and defence model

The model we will be using is a two-stage empirical hierarchical model: Poisson-distributed scores with log-normally distributed rate priors. Read that twice. We will be using different distributions on the priors, but they will look similar.

The hierarchical model is shown below

(93) $y_{ijt} \sim \operatorname{Poisson}(\lambda_{ijt})$
(94) $\log(\lambda_{ijt}) = \alpha_{it} - \beta_{jt} + \gamma x_{ij} + \mu$

where $\alpha_{it}$ is the attack strength of team i in round t, $\beta_{jt}$ is the defence strength of the opposing team j, $x_{ij} = \pm 1$ indicates home or away, γ is the home advantage, and μ is the intercept.

Before we describe the priors for the different models, it's important to note the strengths and weaknesses of this model, so we can set realistic expectations of the model performance.

Some of the model's issues may be resolved by modifying it, but with a limited data set, and no way to produce more, it's important to keep the model simple to prevent too much overfitting.

Now, for the priors, we make the assumption that these are discretely VAR(1) distributed, or continuously VAR distributed.

2.7.1 Time independent model

The simplest model is letting the team parameters be constant throughout the season. We will not study this model in detail, but we mention it.

(95) $\begin{pmatrix} a_i \\ b_i \end{pmatrix} = w_i$

where $w_i \sim \mathcal{N}(0, \Sigma)$.

2.7.2 Discrete models

In the discrete case we can write the general function

(96) $\begin{pmatrix} a_i \\ b_i \end{pmatrix}_t = \Phi \begin{pmatrix} a_i \\ b_i \end{pmatrix}_{t-1} + w_t$

where the eigenvalues of Φ are at most 1 in absolute value, and $w_t \sim \operatorname{WN}(0, \Sigma)$. In the initial (unconditional) case, we have

(97) $\begin{pmatrix} a_i \\ b_i \end{pmatrix}_0 = w_0^{*}$

where $w_0^{*} \sim \mathcal{N}(0, \Sigma^{*})$, with $\Sigma^{*} = \operatorname{vec}^{-1}\!\left((I - \Phi\otimes\Phi)^{-1}\operatorname{vec}(\Sigma_w)\right)$ as in equation (28).

We consider three cases for the conditions on Φ: white noise (Φ = 0), a freely estimated vector autoregressive matrix Φ, and a random walk (Φ = I); see the simulation sketch below.
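A few lines of Python suffice to simulate these priors forward. The covariance below is loosely based on the random walk estimate in table 2, and the path is started at zero rather than drawn from the unconditional distribution (97); a sketch only.

import numpy as np

rng = np.random.default_rng(1)

def simulate_prior(phi, Sigma_w, rounds=30):
    # One team's (a_i, b_i) path under equation (96);
    # phi = 0 gives white noise, phi = I a random walk
    L = np.linalg.cholesky(Sigma_w)
    x = np.zeros((rounds, 2))
    for t in range(1, rounds):
        x[t] = phi @ x[t - 1] + L @ rng.standard_normal(2)
    return x

Sigma_w = np.array([[3.5e-3, 0.8e-3], [0.8e-3, 0.2e-3]])  # cf. table 2
path = simulate_prior(phi=np.eye(2), Sigma_w=Sigma_w)     # random walk case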

2.7.3 Continuous models

In the continuous case we can write the general function

(98) $\begin{pmatrix} a_i \\ b_i \end{pmatrix}_t = e^{-\theta\Delta t}\begin{pmatrix} a_i \\ b_i \end{pmatrix}_{t-1} + \sigma_w\int_{s=0}^{t} e^{-\theta(t-s)}\,dW_s$

where the eigenvalues of θ must have strictly positive real parts for the process to be stable. In the initial (unconditional) case, we have

(99) $\begin{pmatrix} a_i \\ b_i \end{pmatrix}_0 = \sigma_w\int_{s=-\infty}^{0} e^{\theta s}\,dW_s$

where the random integral has the distribution $\mathcal{N}(0, \Sigma^{*})$, with $\Sigma^{*} = \operatorname{vec}^{-1}\!\left((\theta\oplus\theta)^{-1}\operatorname{vec}(\sigma_w\sigma_w^{\top})\right)$ as in equation (38).

We consider two cases for the conditions on θ: a freely estimated matrix θ (continuous VAR), and a random walk (θ = 0).

2.8 Result score

2.8.1 Match

While the model itself describes the distribution of the score of a single team, that alone won't help us decide the match winner. For each match we sample from two Poisson distributions, or from one Skellam distribution.

As usual, the winner is the team with the highest score. If the score of each team is drawn from two Poisson distributions, we get $y_{ti}$ and $y_{tj}$. It is then a simple matter of comparing the two, i.e. checking which of $y_{ti} > y_{tj}$, $y_{ti} = y_{tj}$ or $y_{ti} < y_{tj}$ holds.

The equivalent condition in terms of the Skellam distribution is to define $k_{tij} = y_{ti} - y_{tj}$ and check the sign of $k_{tij}$. So either of these may be used to determine the winner.

Winning a match gives the winning team three points, and the loser none. A tie gives each team one point. This system is known as three points for a win, and is common in football.

2.8.2 Season

Each team plays against every other team twice, in a double round robin system. If there are n teams, then each team plays 2(n−1) matches. The points from each match are added up to a final score, from which the overall seasonal winner is determined.

2.9 Likelihood function

After setting up the model, we want to find the optimal parameters.

(100) $L(\gamma,\mu,\Sigma \mid Y,\lambda,\alpha,\beta) = P(Y,\lambda,\alpha,\beta \mid \gamma,\mu,\Sigma)$
(101) $\quad = P(Y \mid \lambda,\alpha,\beta,\gamma,\mu,\Sigma)\, P(\lambda,\alpha,\beta \mid \Sigma)$
(102) $\quad = P(Y \mid \lambda,\alpha,\beta,\gamma,\mu,\Sigma)\, P(\alpha,\beta \mid \Sigma)$
(103) $\quad = \prod_{i=0}^{n} P(y_i \mid \lambda,\alpha,\beta,\gamma,\mu,\Sigma)\; P(\alpha_0,\beta_0 \mid \Sigma^{*}) \prod_{j=1}^{m} P(\alpha_j,\beta_j \mid \alpha_{j-1},\beta_{j-1},\Sigma)$
(104) $\quad = \prod_{i=0}^{n} P(y_i \mid \lambda_i,\alpha_i,\beta_i,\gamma,\mu,\Sigma)\; \phi_{\Sigma^{*}}(\alpha_0,\beta_0) \prod_{j=1}^{m} \phi_{\Sigma}(\alpha_j,\beta_j \mid \alpha_{j-1},\beta_{j-1})$

In the independent case the product of the priors reduces to $\prod_{j=0}^{m} P(\alpha_j,\beta_j \mid \Sigma)$.

With the accompanying log-likelihood:

(105) $\ell_{\text{Poisson}}(\lambda,\gamma,\mu,\Sigma \mid Y,\alpha,\beta) = \log(P(Y,\alpha,\beta \mid \lambda,\gamma,\mu,\Sigma))$
(106) $\quad = \sum_{i=1}^{n} \log(P(y_i \mid \lambda_i,\alpha_i,\beta_i,\gamma,\mu,\Sigma)) + \sum_{j=1}^{m} \log(\phi_{\Sigma}(\alpha_j,\beta_j))$

By using TMB to apply the Laplace approximation to the log-likelihood of the model, we obtain a function of (μ, γ, φ, Σ) (or θ in the continuous case), which we can maximize. By maximization we obtain $(\hat{\mu}, \hat{\gamma}, \hat{\phi}, \hat{\Sigma})$, the arguments maximizing the likelihood, and $(\hat{\alpha}, \hat{\beta})$, the mode of the posterior, which we use to find the expected value of $\lambda_i$ for team i.

2.10 Measuring model quality

2.10.1 Average score parameter

The average of the score parameters for a team's matches, given time-independent parameters, is

(107) $\operatorname{avg}(\lambda_j) = \frac{1}{n-1}\sum_{i=1, i\neq j}^{n} \lambda_{ji}$
(108) $\quad = \frac{1}{n-1}\sum_{i=1, i\neq j}^{n} \exp(\log(\lambda_{ji}))$
(109) $\quad = \frac{1}{n-1}\sum_{i=1, i\neq j}^{n} \frac{\exp(\mu + \gamma + \alpha_j - \beta_i) + \exp(\mu - \gamma + \alpha_j - \beta_i)}{2}$
(110) $\quad = \frac{1}{n-1}\exp(\mu + \alpha_j)\cosh(\gamma)\left(\sum_{i=1}^{n}\exp(-\beta_i) - \exp(-\beta_j)\right)$

with $E(\operatorname{avg}(\lambda_j)) \approx e^{\mu}$ and $\operatorname{Var}(\operatorname{avg}(\lambda_j)) \approx \exp(2\cdot 0)(\Sigma_{11} + \Sigma_{22}) = \operatorname{trace}(\Sigma)$. We remark that $E(f(X)) \neq f(E(X))$ in general, so this is only an approximation, nor does it account for dependencies.

This value is not very interesting, as it doesn't help us determine the better team; all λ values are treated alike here. To get comparable λs, we instead look at $\lambda_j \mid y$. These don't have any nice looking expressions, so we instead use numerical methods to approximate the mean and the variance.

2.10.2 Model selection and fitting

Underfitting means to have a model that isn't sufficiently complex to model the target, i.e. the assumed model is too simple to accurately describe the target.

Overfitting is the opposite: having a model too complex for the target. This often tends to model the noise instead of the underlying distribution.

In both cases the fitted models fail to predict new data points. The aim in model selection is to find an optimal model that is neither too strict nor too flexible.

In figure 6 we have an example of a target distribution that is modelled by three different polynomials of different orders.

figure 6: Regression fitting
The orange solid line is the true underlying third order polynomial curve. The shaded orange area is 1, 2, and 3 standard deviations off the mean. The orange dots are the observations, or data, on which we perform polynomial regression. The light blue dotted line is an underfitted model, and is too simple. The blue dashed third order curve is a good model, and follows a similar trajectory to the true curve. The dark blue dash-dotted eighth order curve is an overfitted model, as it assumes a too complex model, and is biased towards observations rather than the true curve.

Visually, we can see that the dashed blue model is "best". To find this mathematically we use information criterion values, such as AIC.

2.10.3 Akaike information criterion

AIC is a goodness-of-fit value, giving a lower value for better models. [39] A related value, which we call AIC star, is shown in the equation below:

(111) $\operatorname{AIC}^{*} = \log(\hat{L}) - k$

For small samples, one may subtract a correction term $\frac{k(k+1)}{n-k-1}$ to obtain the corrected criterion AICc*, but as this is approximately zero for large samples (assuming k is small), we may ignore it.

The theoretical derivation of AIC includes the quantity $-2\log(\hat{L})$, known as the deviance. So the usual definition of AIC is $\operatorname{AIC} = -2\cdot\operatorname{AIC}^{*}$. Because the right hand side of the expression becomes easier to interpret, and it doesn't affect the ranking of the models, we omit the factor −2.
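Computing AIC* from a fitted model is then a one-liner; the helper below also includes the optional small-sample correction. Note that with this sign convention, higher is better.

def aic_star(log_lik, k, n=None):
    # AIC* = log(L-hat) - k, equation (111); higher is better.
    # If n is given, subtract the small-sample correction term.
    value = log_lik - k
    if n is not None:
        value -= k * (k + 1) / (n - k - 1)
    return value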

For the model in figure 6, we have information criterion values for the three fitted models, and for five other polynomial models, in figure 7.

figure 7: Information criterion
The values of $\log(\hat{L})$ (light green circles), AIC* (green triangles) and AICc* (dark green diamonds). They have a peak at 8, 4 and 3, respectively, in [1,8]. The blue lines correspond to the fitted polynomials in figure 6 with the same pattern, indicating their fitness value.

Using the AICc*, we would correctly select the model with the same order as the target, but the coefficients would be different. The AIC would choose a fourth order curve, which is also close to the true order. Simply relying on the likelihood would have selected an overfitted model; this is the reason for introducing penalizing terms.

The usefulness of AIC comes from its simple assumptions. It doesn't assume anything about the model; as long as we know the likelihood and the number of parameters, we can calculate the criterion value.

2.10.4 Numerical simulation and estimation of parameters

Using the normal posterior distribution for the attack and defence parameters, we can simulate new values by drawing from the distributions. This can be used to numerically estimate the mean and deviations of the λs, and the scores.

From repeated sampling of a season's result, we may obtain a distribution for a team's score for the given attack/defence parameters.

listing 1: Simulating a season
import numpy as np

def season_result(A, gamma, mu):
    """Simulate one season and return each team's points.
    `teams` and `matches` are assumed to be defined elsewhere;
    A[round, team, 0/1] holds the attack/defence parameters."""
    scores = np.zeros(shape=teams)

    result_point = {
        -1: 0,
         0: 1,
         1: 3,
    }

    for home, away, hround, around in matches:
        # Score rates from equation (94); gamma enters with
        # opposite signs for the home and away team
        hlamda = np.exp(A[hround, home, 0] - A[around, away, 1] + gamma + mu)
        alamda = np.exp(A[around, away, 0] - A[hround, home, 1] - gamma + mu)

        hscore = np.random.poisson(lam=hlamda)
        ascore = np.random.poisson(lam=alamda)

        result = np.sign(hscore - ascore)

        scores[home] += result_point[result]
        scores[away] += result_point[-result]

    return scores
Example code using Python to simulate the result of a season.

2.11 Prediction

2.11.1 Naive prediction

While not unexpected, home teams usually have an advantage in matches, often referred to as the home advantage. We model this with the γ parameter in our models. Statistically, the home team wins roughly 45 % of football matches. [40] In comparison, the home team won 47 % of the matches in Eliteserien 2019, [41] so without any knowledge about the teams, a good strategy would be to just bet on the home team. This will be correct almost half the time.

2.11.2 Most likely result outcome

The match result is either a win, tie or a loss.

Using the estimated values $\lambda_i$ for each team in a given match, we can calculate the loss probability $P_L = P(k > 0 \mid \lambda_{i1}, \lambda_{i2})$, the tie probability $P_T = P(k = 0 \mid \lambda_{i1}, \lambda_{i2})$, and the win probability $P_W = P(k < 0 \mid \lambda_{i1}, \lambda_{i2})$, using the Skellam distribution. We then select the most likely result as our guess.

This method has a flaw in that $P_T$ will always be smaller than either $P_L$ or $P_W$ as long as $\lambda_{i1}, \lambda_{i2} \geq 1$, so it will never guess a tie. The proof is simple: $P_T(\lambda_{i1} = 1, \lambda_{i2} = 1) < \frac{1}{3}$, and this is a maximum; increasing either $\lambda_{i1}$ or $\lambda_{i2}$ makes the value smaller. So either $P_L$ or $P_W$ must be greater than $\frac{1}{3}$, and be our guess.

However, around a third of matches result in a tie, [42] but figure 2 shows the average of $\lambda_i$ to be around 1.5, so unless the model is really accurate and can produce estimates below 1, this method will be wrong approximately one third of the time.
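The claim is easy to verify numerically with SciPy's Skellam distribution; with both rates at the typical value 1.5, the tie probability indeed stays below one third.

from scipy.stats import skellam

lam_home, lam_away = 1.5, 1.5
p_tie = skellam.pmf(0, lam_home, lam_away)
p_home = 1 - skellam.cdf(0, lam_home, lam_away)   # P(k > 0)
p_away = skellam.cdf(-1, lam_home, lam_away)      # P(k < 0)
print(p_home, p_tie, p_away)                      # ~0.38, ~0.24, ~0.38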

2.11.3 Most likely score outcome

The match score is a pair of numbers, e.g. 1-4, 2-2 or 7-1.

To make tie guesses more likely, we may want to use the mode instead. The mode represents the most likely score outcome, so instead of looking at whether a win, tie or loss is most likely, we look at each score outcome individually.

This makes sense because we fit the model to the score, not the result of a match. However, it more heavily favours ties, even when the winning results combined would be more likely.

2.11.4 Weighted outcome

Another method that tries to correct the win bias is to apply weights to the win, tie and loss probabilities. We then select the result with the highest weighted probability:

(112) $(n_W p_W,\ n_T p_T,\ n_L p_L)$
(113) $n_W + n_T + n_L = 1$
(114) $(n_W, n_T, n_L) \geq 0$

3 Data analysis

3.1 Eliteserien 2019

The resulting scores from 2019 are taken from [43]. The scores are displayed in table 1.

table 1: Results Eliteserien 2019

                     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
 1 Bodø/Glimt        —  1-2  2-2  3-0  4-0  0-0  3-2  3-0  5-1  2-0  1-1  3-3  2-0  4-0  2-1  4-0
 2 Brann           1-1    —  0-0  2-1  1-0  0-0  0-0  1-0  0-1  0-1  2-1  2-1  1-1  2-3  1-5  1-1
 3 Haugesund       1-1  1-1    —  0-0  0-2  0-0  1-2  4-1  0-1  2-1  1-1  3-0  2-2  5-1  1-0  1-4
 4 Kristiansund    1-2  1-0  2-2    —  5-2  4-0  3-2  1-1  0-0  2-2  4-0  0-1  1-2  1-0  4-2  2-0
 5 Lillestrøm      0-0  1-3  1-0  1-1    —  3-2  0-2  0-3  2-1  1-1  0-0  1-3  2-1  4-0  0-2  0-0
 6 Mjøndalen       4-5  2-1  1-4  1-1  2-2    —  1-3  2-0  3-1  1-2  0-0  1-0  1-1  1-1  1-1  1-0
 7 Molde           4-2  1-1  3-1  2-0  2-1  1-0    —  2-2  2-0  3-0  2-1  3-0  4-0  3-0  5-1  4-1
 8 Odd             3-1  3-2  3-1  2-0  2-1  3-2  2-2    —  1-0  1-1  3-0  2-1  2-1  2-1  1-0  1-1
 9 Ranheim TF      1-1  0-3  0-2  1-2  2-1  1-1  2-3  4-1    —  2-3  0-2  0-2  1-0  1-2  5-2  1-5
10 Rosenborg       3-2  0-0  0-2  1-0  3-1  3-2  3-1  1-1  3-2    —  1-0  3-2  0-0  5-2  5-1  3-0
11 Sarpsborg 08    1-1  1-1  1-1  0-1  1-0  1-1  1-1  2-0  1-3  1-1    —  0-0  2-2  3-2  2-2  1-0
12 Stabæk          2-0  0-1  1-1  2-0  1-1  4-2  1-2  0-0  0-0  3-1  3-3    —  2-1  0-1  0-0  1-1
13 Strømsgodset    1-3  6-0  3-2  2-3  1-1  2-3  0-4  2-3  1-0  3-3  2-1  0-2    —  3-1  0-0  3-2
14 Tromsø          1-2  1-2  2-2  5-0  1-1  2-2  2-1  1-2  4-2  1-0  2-0  1-1  0-1    —  0-2  0-0
15 Viking          3-4  2-1  0-0  2-0  3-0  4-1  0-2  2-0  2-2  2-2  2-1  3-0  4-0  2-1    —  1-1
16 Vålerenga       6-0  1-0  1-2  1-1  0-3  2-0  2-4  1-0  1-1  1-1  1-1  0-2  2-0  4-1  0-4    —
The results from 2019. The left number is the home team's score (row), the right number the away team's score (column). The numbers in the column header correspond to the teams with the same number in the row header.

3.1.1 Data quirks

After using the data, a minor inconsistency was found: rounds are not in order, so the number of games each team has played up until a match may differ. This is due to scheduling issues, so "round 2" may be moved to after "round 12", but the round numbering is not renamed to account for this.

4 Results

We first look at the estimated results for the rankings. These are summary statistics of the predictions of each match, which we will go through next.

4.1 Ranking

The estimated ranks are shown below. The first table is the time-independent model, shown in figure 8. The next three are the discrete models, shown in figure 9, figure 10 and figure 11. The last two are the continuous models, shown in figure 12 and figure 13.

We note that most estimated medians are drawn to around 45 points, away from the extremes.

4.1.1 Discrete models

figure 8: Time independent noise ranking
Box plot of the estimated points from 100K simulated seasons. The orange line is the median, and the blue cross is the actual points the team got in the season.
figure 9: Discrete WN ranking
Box plot of the estimated points from 100K simulated seasons. The orange line is the median, and the blue cross is the actual points the team got in the season. Apparently this model was ill-defined, as the deviation was undefined.
figure 10: Discrete VAR(1) ranking
Box plot of the estimated points from 100K simulated seasons. The orange line is the median, and the blue cross is the actual points the team got in the season.
figure 11: Discrete RW ranking
Box plot of the estimated points from 100K simulated seasons. The orange line is the median, and the blue cross is the actual points the team got in the season.

4.1.2 Continuous models

figure 12: Continuous VAR(1) ranking
Box plot of the estimated points from 100K simulated seasons. The orange line is the median, and the blue cross is the actual points the team got in the season.
figure 13: Continuous RW ranking
Box plot of the estimated points from 100K simulated seasons. The orange line is the median, and the blue cross is the actual points the team got in the season.

4.1.3 Comparison of models

In table 2 we see the estimated parameter values for each model. In table 3 we have the AIC values for each model. By this measure, the discrete random walk model is the better one.

table 2: Parameter comparison
Time independent
  Model                  Parameter  Estimate                                        Standard error
  Noise                  μ          3.349e-01                                       5.138e-02
                         γ          1.926e-01                                       3.852e-02
                         Σw         [3.759e-02, 1.313e-02; 1.313e-02, 5.694e-03]    [2.067e-02, 1.138e-02; 1.138e-02, 1.031e-02]

Discrete
  Model                  Parameter  Estimate                                        Standard error
  White noise            μ          3.181e-01                                       5.348e-02
                         γ          1.927e-01                                       3.863e-02
                         φ          0                                               0
                         Σw         [2.083e-02, 1.709e-03; 1.709e-03, 2.988e-02]    [NaN, 2.446e-02; 2.446e-02, NaN]
  Vector autoregressive  μ          3.285e-01                                       5.393e-02
                         γ          1.925e-01                                       3.857e-02
                         φ          [1.025e+00, 5.838e-02; -1.401e-01, 8.442e-01]   [1.058e-01, 9.009e-02; 2.460e-01, 2.085e-01]
                         Σw         [1.274e-03, -4.291e-04; -4.291e-04, 1.445e-04]  [2.802e-03, 8.156e-04; 8.156e-04, 5.687e-04]
  Random walk            μ          3.181e-01                                       5.319e-02
                         γ          1.927e-01                                       3.863e-02
                         φ          1                                               0
                         Σw         [3.484e-03, 8.236e-04; 8.236e-04, 1.947e-04]    [1.780e-03, 4.143e-04; 4.143e-04, 9.643e-05]

Continuous
  Model                  Parameter  Estimate                                        Standard error
  Vector autoregressive  μ          3.181e-01                                       5.348e-02
                         γ          1.927e-01                                       3.863e-02
                         θ          [2.302e-06, -9.381e-06; -9.381e-06, 3.975e-05]  [1.043e-03, 4.286e-03; 4.286e-03, 1.815e-02]
                         D          [5.889e-10, 1.274e-10; 1.274e-10, 3.029e-11]    [2.380e-07, 5.144e-08; 5.144e-08, 1.234e-08]
  Random walk            μ          3.215e-01                                       5.134e-02
                         γ          1.923e-01                                       3.863e-02
                         θ          0                                               0
                         Σw         [4.459e-04, 1.081e-04; 1.081e-04, 2.620e-05]    [2.289e-04, 5.465e-05; 5.465e-05, 1.230e-05]
These are the estimated parameters for each model; matrices are written as [row 1; row 2]. We have omitted the hundreds of values of α and β, but they were of a similar order of magnitude as μ and γ. While μ and γ are significantly different from 0, the same cannot be said for the parameters of the time series.
table 3: Information criterion value comparison
Class        Model                   AIC
T.I.         Noise                   1462.915
Discrete     White noise             1476.872
Discrete     Vector autoregressive   1469.526
Discrete     Random walk             1461.973
Continuous   Vector autoregressive   1467.973
Continuous   Random walk             1462.855
T.I. is the time independent model. The best model, according to the calculated AIC values, is the discrete random walk model.

4.1.4 Ranking

Using the discrete random walk model, we can obtain the expected final standings, as shown in table 4. Only the top three teams were correctly predicted; however, most other scores weren't statistically different. With three teams at 40 points and four teams at 30 points, discrepancies ought to be expected.

table 4: Final standings

Rank   2019 result         TMB RW prediction
1      Molde (68)          Molde (54.707)
2      Bodø/Glimt (54)     Bodø/Glimt (49.680)
3      Rosenborg (52)      Rosenborg (48.989)
4      Odd (52)            Viking (47.537)
5      Viking (47)         Stabæk (42.080)
6      Kristiansund (41)   Haugesund (41.905)
7      Haugesund (40)      Odd (41.779)
8      Stabæk (40)         Kristiansund (40.890)
9      Brann (40)          Strømsgodset (39.674)
10     Vålerenga (34)      Tromsø (38.654)
11     Strømsgodset (32)   Mjøndalen (37.138)
12     Sarpsborg 08 (30)   Vålerenga (37.105)
13     Mjøndalen (30)      Ranheim TF (36.674)
14     Lillestrøm (30)     Sarpsborg 08 (35.836)
15     Tromsø (30)         Lillestrøm (35.748)
16     Ranheim TF (27)     Brann (35.538)
Because most teams got almost the same number of points, the predictions wouldn't be statistically different. There are also special rules for deciding the ranking of tied point scores, but we ignore these as we use real-valued points, truncated to three decimals.

4.2 Predicting the past

There are two ways to estimate results. One way is to use data from the entire season, and "predict the past". Or we can use all matches up to a certain date and predict the following match(es), and "predict the future". We first present the past predictions.

For these predictions we have used the continuous VAR model, for no particular reason; the discrete random walk model might have been a better choice.

For the most likely result rule, we can see the predicted result of each match in table 5. The table is very information dense, so we have the confusion matrices below.

table 5: Past predicted results

                     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
 1 Bodø/Glimt        —   H   H   H   H   H   H   H   H   H   H   H   H   H   H   H
 2 Brann             A   —   H   H   H   H   A   H   H   H   H   H   H   H   A   H
 3 Haugesund         H   H   —   H   H   H   H   H   H   H   H   H   H   H   H   H
 4 Kristiansund      H   H   H   —   H   H   A   H   H   H   H   H   H   H   H   H
 5 Lillestrøm        A   H   H   H   —   H   A   H   H   H   H   H   H   H   A   H
 6 Mjøndalen         H   H   H   H   H   —   A   H   H   A   H   H   H   H   H   H
 7 Molde             H   H   H   H   H   H   —   H   H   H   H   H   H   H   H   H
 8 Odd               H   H   H   H   H   H   A   —   H   H   H   H   H   H   H   H
 9 Ranheim TF        A   H   H   H   H   H   A   H   —   A   H   H   H   H   H   H
10 Rosenborg         H   H   H   H   H   H   H   H   H   —   H   H   H   H   H   H
11 Sarpsborg 08      A   H   H   H   H   H   A   H   H   A   —   H   H   H   A   H
12 Stabæk            H   H   H   H   H   H   A   H   H   H   H   —   H   H   H   H
13 Strømsgodset      A   H   H   H   H   H   A   H   H   H   H   H   —   H   H   H
14 Tromsø            A   H   H   H   H   H   A   H   H   H   H   H   H   —   H   H
15 Viking            H   H   H   H   H   H   H   H   H   H   H   H   H   H   —   H
16 Vålerenga         A   H   H   H   H   H   A   H   H   A   H   H   H   H   A   —
The predicted result given by the most likely result rule; H is a predicted home win and A a predicted away win. Correct predictions are marked with a green cell (same shade as the top-right corner) on the web or with a v in the printed version. Incorrect predictions are orange or marked with an x.

We see the confusion matrices of the past predictions below in table 6, table 7 and table 8. For the weighted rule, we found the weights (0.5, 0.3, 0.2), so this weighs loss probabilities more.

table 6: Most likely result confusion matrix

              Predicted
              Away   Tie   Home
Actual Away     13     0     41
       Tie      10     0     63
       Home      3     0    110
The confusion matrix using the entire dataset to fit the model and predict the matches using the most likely result rule (e.g. win or lose). The diagonal contains the correct guesses; around 51 % were correct.

table 7: Most likely score confusion matrix

              Predicted
              Away   Tie   Home
Actual Away      1    46      7
       Tie       1    49     23
       Home      0    50     63
The confusion matrix using the entire dataset to fit the model and predict the matches using the most likely score rule (e.g. 1-1 or 3-2). The diagonal contains the correct guesses; around 47 % were correct.

table 8: Weighted result confusion matrix

              Predicted
              Away   Tie   Home
Actual Away     30     0     24
       Tie      24     0     49
       Home     17     0     96
The confusion matrix using the entire dataset to fit the model and predict the matches using a weighted result rule. The diagonal contains the correct guesses; 52 % of the total were correct.

The weighted result rule outperformed the two other rules, but the weights are likely to be biased, and may not apply to other datasets.

4.3 Predicting the future

The future predictions involved fitting the model to all matches before a certain date, and making a prediction for the next match using the fitted model. So the model parameters change for each prediction.

As with the past predictions, we have the predicted result of each match in table 9, using the most likely result rule. Most of this table is uninteresting, but we note the two matches Odd-Brann and Vålerenga-Mjøndalen, the first match of each team involved. Because we have no prior knowledge of their performance, we are unable to make predictions for these matches, so we only predict 238 matches.

table 9: Future predicted result

                     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
 1 Bodø/Glimt        —   H   H   H   H   H   H   H   H   H   H   H   H   H   H   H
 2 Brann             H   —   H   H   H   H   A   H   H   H   H   H   H   H   A   H
 3 Haugesund         H   H   —   H   H   H   H   H   H   A   H   H   H   H   H   H
 4 Kristiansund      H   H   H   —   A   H   H   H   H   H   H   H   H   H   H   H
 5 Lillestrøm        A   H   H   H   —   H   H   H   H   H   H   H   H   H   H   H
 6 Mjøndalen         H   H   H   H   H   —   A   H   H   A   H   H   H   H   H   H
 7 Molde             H   H   H   H   H   H   —   H   H   H   H   H   H   H   H   H
 8 Odd               H  NA   H   H   H   H   H   —   H   H   H   H   H   H   H   H
 9 Ranheim TF        A   H   H   H   H   H   H   A   —   H   H   H   H   H   H   H
10 Rosenborg         H   H   H   H   H   H   H   H   H   —   H   H   H   H   H   H
11 Sarpsborg 08      H   H   H   H   H   H   H   H   H   H   —   H   H   H   H   H
12 Stabæk            H   H   H   H   H   H   A   H   H   H   H   —   H   H   A   H
13 Strømsgodset      H   H   H   H   H   H   H   A   H   H   H   H   —   H   H   H
14 Tromsø            H   H   H   A   H   H   H   H   H   H   H   H   H   —   H   H
15 Viking            H   H   H   H   H   H   H   H   H   H   H   H   H   H   —   H
16 Vålerenga         H   H   H   H   H  NA   A   H   H   H   H   H   H   H   H   —
The predicted result given by the most likely result rule; H is a predicted home win, A a predicted away win, and NA marks the two matches that could not be predicted. Correct predictions are marked with a green cell (same shade as the top-right corner) on the web or with a v in the printed version. Incorrect predictions are orange or marked with an x.

While we are mostly interested in the proportion of correct guesses at the end of the season, it's also interesting to see how this evolves during the season, as shown in figure 14. It's relatively stable, but weaker than the past prediction methods.

figure 14: Precision
Most likely result precision over time. The normality assumption is violated near the first day, as the interval extends beyond 1, and p must be contained in [0,1]. The confidence interval shrinks with more observations (matches). The intervals are four standard errors long.

We see the confusion matrices of the future predictions below in table 10, table 11 and table 12. We use the same weights for the weighted rule as we did for the past predictions, i.e. (0.5, 0.3, 0.2); this introduces some bias. This could be remedied by updating the weights for each round as well.

table 10: Most likely result confusion matrix

              Predicted
              Away   Tie   Home
Actual Away      6     0     48
       Tie       4     0     69
       Home      4     0    107
The confusion matrix from fitting the model to all earlier matches and predicting the next, using the most likely result rule (e.g. win or lose). The diagonal contains the correct guesses; around 47 % were correct.

table 11: Most likely score confusion matrix

              Predicted
              Away   Tie   Home
Actual Away      0    41     13
       Tie       0    48     25
       Home      0    68     43
The confusion matrix from fitting the model to all earlier matches and predicting the next, using the most likely score rule (e.g. 1-1 or 3-2). The diagonal contains the correct guesses; around 38 % were correct.

table 12: Weighted result confusion matrix

              Predicted
              Away   Tie   Home
Actual Away      9     0     45
       Tie      15     0     58
       Home     16     0     95
The confusion matrix from fitting the model to all earlier matches and predicting the next, using a weighted result rule. The diagonal contains the correct guesses; 43 % of the total were correct.

We note that all the results of the future predictions are worse than the result of past predictions. This is expected, as we have less data to rely on.

5 Discussion & conclusion

We have seen that football matches are hard to predict, but that TMB is a useful tool for investigating this. TMB made it easy to create and fit multiple models, without prior knowledge of dual numbers or the Laplace approximation, simply through the likelihood function.

While the time series models had some issues, and most were not better than a time independent model (as shown in table 3), they were interesting to study and model.

5.1 Further work

This paper has mostly been exploratory, and points to several directions that can be explored. We go through a few paths one may use for continued study.

We can change the values we fit the model to. A drawback of the points system in football is that wins are given extra weight with three points. So a single goal may add two points. When fitting, we tuned our parameters to the goal counts, not the result (win, tie, loss). One could instead use the result for fitting.

The model can be extended to include multiple seasons. This may improve the prediction precision and give more accurate parameter estimates, though it may be difficult, as some teams are relegated and new ones promoted every season.

While predicting match and standing results is interesting, there are other quantities which would be interesting to predict, such as the odds. These are usually available from betting sites, and could be interesting to compare against.

On the other end, while it may seem "obvious" to include more data in the model, such as player age, match length, or travel distance between matches, such variables may also be redundant, and lead to overfitting.

Variables such as seasonal change were unaccounted for; while this falls under the same category as being prone to overfitting, it's possible for a team to progressively become better or worse during a season.

During the idea stage of the paper, the idea of intransitive ordering between teams came up. This would have been interesting to study further.

6 Appendix

6.1 Kronecker product and sum

The Kronecker product and sum are useful for solving matrix equations. The product has a simple definition, while the sum is more complicated and is related to the product by the matrix exponential.

(115) $A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$
(116) $\begin{pmatrix} a & b \\ c & d \end{pmatrix} \otimes \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae & af & be & bf \\ ag & ah & bg & bh \\ ce & cf & de & df \\ cg & ch & dg & dh \end{pmatrix}$

Before we define the sum, we will give two more examples to show that the Kronecker product is not commutative:

(117) $\begin{pmatrix} a & b \\ c & d \end{pmatrix} \otimes I = \begin{pmatrix} a & 0 & b & 0 \\ 0 & a & 0 & b \\ c & 0 & d & 0 \\ 0 & c & 0 & d \end{pmatrix}$
(118) $I \otimes \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} e & f & 0 & 0 \\ g & h & 0 & 0 \\ 0 & 0 & e & f \\ 0 & 0 & g & h \end{pmatrix}$

Although both left hand sides pair a 2×2 matrix with the identity, the structure of the resulting matrices is clearly different.

The Kronecker sum is defined as $A \oplus B = A \otimes I + I \otimes B$.

(119) $\begin{pmatrix} a & b \\ c & d \end{pmatrix} \oplus \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a & 0 & b & 0 \\ 0 & a & 0 & b \\ c & 0 & d & 0 \\ 0 & c & 0 & d \end{pmatrix} + \begin{pmatrix} e & f & 0 & 0 \\ g & h & 0 & 0 \\ 0 & 0 & e & f \\ 0 & 0 & g & h \end{pmatrix} = \begin{pmatrix} a+e & f & b & 0 \\ g & a+h & 0 & b \\ c & 0 & d+e & f \\ 0 & c & g & d+h \end{pmatrix}$

This is notably not commutative either, which is unexpected for something called a sum.

The relation to the matrix exponential is given by $\exp(A)\otimes\exp(B) = \exp(A \oplus B)$.
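Both identities are easy to check numerically; a small sketch:

import numpy as np
from scipy.linalg import expm

def kron_sum(A, B):
    # A kronsum B = A kron I + I kron B
    return (np.kron(A, np.eye(B.shape[0]))
            + np.kron(np.eye(A.shape[0]), B))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.5, 0.0], [1.0, 0.5]])
# exp(A) kron exp(B) = exp(A kronsum B)
print(np.allclose(np.kron(expm(A), expm(B)), expm(kron_sum(A, B))))  # True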

6.2 Solving the continuous VAR(1) differential

(120) $d(x_t e^{\theta t}) = \theta x_t e^{\theta t}\,dt + e^{\theta t}\,dx_t$
(121) $\quad = \theta x_t e^{\theta t}\,dt + e^{\theta t}\left(\theta(\mu - x_t)\,dt + \sigma\,dW_t\right)$
(122) $\quad = \theta x_t e^{\theta t}\,dt + \theta\mu e^{\theta t}\,dt - \theta x_t e^{\theta t}\,dt + \sigma e^{\theta t}\,dW_t$
(123) $\quad = \theta\mu e^{\theta t}\,dt + \sigma e^{\theta t}\,dW_t$
(124) $\int_{s=0}^{t} d(x_s e^{\theta s}) = \int_{s=0}^{t}\left(\theta\mu e^{\theta s}\,ds + \sigma e^{\theta s}\,dW_s\right)$
(125) $\quad = \int_{s=0}^{t}\theta\mu e^{\theta s}\,ds + \int_{s=0}^{t}\sigma e^{\theta s}\,dW_s$
(126) $\quad = \mu\int_{s=0}^{t} d(e^{\theta s}) + \sigma\int_{s=0}^{t} e^{\theta s}\,dW_s$
(127) $\left[x_s e^{\theta s}\right]_{s=0}^{t} = \mu\left[e^{\theta s}\right]_{s=0}^{t} + \sigma\int_{s=0}^{t} e^{\theta s}\,dW_s$
(128) $x_t e^{\theta t} - x_0 = \mu(e^{\theta t} - 1) + \sigma\int_{s=0}^{t} e^{\theta s}\,dW_s$
(129) $x_t e^{\theta t} = x_0 + \mu(e^{\theta t} - 1) + \sigma\int_{s=0}^{t} e^{\theta s}\,dW_s$
(130) $x_t = x_0 e^{-\theta t} + \mu(1 - e^{-\theta t}) + \sigma\int_{s=0}^{t} e^{-\theta(t-s)}\,dW_s$

6.3 TMB workflow

listing 2: Example program

# Data values
data <- list(...)

# Random & fixed effects
parameters <- list(...)


# Objective function
MODEL = "discrete_rw"

# Compile and link the template
. <- TMB::compile(paste0("models-tmb/", MODEL, ".cpp"))
dyn.load(TMB::dynlib(paste0("models-tmb/", MODEL)))


# Make Automatic Differentiation Function
obj <- TMB::MakeADFun(data, parameters, random=c("A"), DLL=MODEL, silent=TRUE)

# NonLinear MINimization subject to Box constraints
system.time(opt <- nlminb(obj$par, obj$fn, obj$gr))


# Result summary
report <- TMB::sdreport(obj)
Excerpt of a TMB program. The ellipses would be replaced with the model values and parameters. A complete example may be found in the TMB paper [44, page 11-13] and its documentation [45, chapter 7, Example collection]. For the program used in this thesis, we refer to its GitHub repository. [46]