Multinomial logit models are a workhorse tool in marketing, economics, political science, etc. One easy and flexible way to estimate these models is in Stan. The reason I like Stan is that it allows you extend beyond the standard multinomial logit model to hierarchical models, dynamic models and all sorts of fun stuff. (If you are interested in hierarchical-Bayes, you should also check out the tutorial that Kevin Van Horn and I wrote.)

However, for beginners, the learning curve can be a bit steep. I set many students to the task for fitting a basic multinomial logit model in Stan and they usually master the Stan syntax get the modeling running pretty quickly. But often the posterior draws and estimates they are getting from Stan don’t make sense.

Almost always, the problem is that they have misspecified the intercepts in the model. You see, multinomial logit models are a bit quirkly in that *relative* differences in utility between alternatives are identified, but not absolute differences in utility. In practice, this means that if you observe choices from among `K`

alternatives, then you can only estimate `K-1`

intercept parameters.

This series of examples illustrates what happens when you specify `K`

intercept parameters. I then show you several different ways to specify `K-1`

intercept parameters. Along the way, I hope to teach you something about how to test Stan code to see that it recovers the parameters of synthetic data (but see Michael Betancourt’s blog for a much more through treatment) and how to diagnose identification problems.

Before we get started, here are the packages and options I use:

```
library(rstan)
library(bayesplot)
library(gtools) # rdirchlet
rstan_options(auto_write=TRUE) # writes a compiled Stan program to the disk to avoid recompiling
options(mc.cores = parallel::detectCores()) # uses multiple cores for stan
```

## Data generation for multinomial logit model

The data setup I’ll be working with is typical of many marketing problems where we observe a series of choices from a fixed set of alternatives (often brands of fast-moving consumer goods) and want to estimate intercepts for each of the alternatives (which we sometimes interpret as “Brand Equity”.

I start every modeling project by writing a function to simulate data from the model. Here is a handy function for generating synthetic data from the multinomial logit.

```
generate_mnl_data <- function(N=1000, K=2, J=3, beta=c(1, -2), alpha=c(1,0,-1)){
if(length(beta) != K) stop ("incorrect number of parameters")
Y <- rep(NA, N)
X <- list(NULL)
for (i in 1:N) {
X[[i]] <- matrix(rnorm(J*K), ncol=K)
Y[i] <- sample(x=J, size=1, prob=exp(alpha+X[[i]]%*%beta))
}
list(N=N, J=J, K=K, Y=Y, X=X, beta=beta, alpha=alpha)
}
d0 <- generate_mnl_data()
```

This function creates data where the alternatives for each choice task are stored as a matrix. For instance in `d0`

, there are `J=3`

alternatives each with `K=2`

attributes in every choice task. These matrices are stored together in a list of length `N`

for the `N`

choice observations. The attributes for the first observed choice are:

`d0$X[[1]]`

```
## [,1] [,2]
## [1,] 0.95298844 -1.748092
## [2,] -1.34196209 1.422230
## [3,] -0.02610931 0.339079
```

We also have a vector `Y`

that indicates which alternative was chosen for each of the `N`

choices. In the first choice, alternative 1 was chosen:

`d0$Y[1]`

`## [1] 1`

## Basic multinomial logit model without intercepts

We start by estimating a basic multinomial logit model without intercepts. The Stan code below specifies that model with uniform priors on the parameters.

```
m1 <-
"data {
int<lower=2> J; // of alternatives/outcomes
int<lower=1> N; // of observations
int<lower=1> K; // of covariates
int<lower=0,upper=J> Y[N];
matrix[J,K] X[N];
}
parameters {
vector[K] beta; // attribute effects
}
model {
for (i in 1:N)
Y[i] ~ categorical_logit(X[i]*beta);
}"
```

We can create some synthetic data using `generate_mnl_data()`

and then pass it into the `stan()`

function along with our model code `m1`

to produce draws from the posterior:

```
d1 <- generate_mnl_data(N=1000, beta=c(-1,1), alpha=c(0,0,0))
p1 <- stan(model_code=m1, data=d1, iter=1000, chains=2, seed=20030601)
```

From the Stan output above, we can see that this code is very fast (<3 sec / 1000 interaations for N=1000) and does not produce any warnings. We can make a traceplot of the draws for `beta`

to see that Stan’s HMC algorithm is performing well:

`mcmc_trace(As.mcmc.list(p1, pars="beta"))`

We can also see that the the “true” beta parameters are well within the posterior range, suggesting that the code is working. (There are more through tests we could do, but we’ll skip those for now. (Again, see Betancourt’s blog for more detail.)

`mcmc_recover_hist(As.mcmc.list(p1, pars="beta"), true=d1$beta)`

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

## The bad way to do MNL with intercepts

The model we estimated above does not include intercepts. In a logit model for data like this, we might want to specify an intercept for each alternative, indicating how much each alternative is preferred. These are sometimes called “alternative specific constants.” I’m going to start be naively specifying a vector `alpha`

, which contains the three intercepts for the three alternatives. Like any other intercepts, I add them to the linear preditor: `alpha + X[i]*beta`

. (If you are having trouble with the matrix algebra, remember `alpha`

is 3 x 1, `X`

is 3 x 2 and `beta`

is 2 x 1. The resulting linear predictor is 3 x 1, i.e. a utility for each alternative.)

```
m2 <- "
data {
int<lower=2> J; // of alternatives/outcomes
int<lower=1> N; // of observations
int<lower=1> K; // of covariates
int<lower=0,upper=J> Y[N];
matrix[J,K] X[N];
}
parameters {
vector[J] alpha; // unconstrained intercepts
vector[K] beta;
}
model {
for (i in 1:N)
Y[i] ~ categorical_logit(alpha + X[i]*beta);
}"
```

Then we can generate data and run the model:

```
d2 <- generate_mnl_data(N=1000, beta=c(1,-1), alpha=c(1, 0, -2))
p2 <- stan(model_code = m2, data=d2, iter=1000, chains=2, seed=19980103)
```

```
## Warning: There were 744 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 1.61, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
```

Uh-oh. This looks bad! The code runs very slowly runs slowly (~10 min / 1000 draws for `N=1000`

) with lots of warnings from Stan about how the HMC sampler is performing. The traeplots below show that the values of the intercepts in `alpha`

don’t coverge; they just wander around to any value. The `beta`

s seem to converge, but are mixing poorly. What’s going on here?

`mcmc_trace(As.mcmc.list(p2, pars=c("alpha", "beta")))`

The problem is that we can add any value to all three intercepts in `alpha`

and the conditional logit likelihood will be the same. This means that `alpha`

is not identified.

## MNL with prior on intercepts

We are Bayesian, so one way to solve this identification problem is to put informative priors to identify intercepts. Here I specify a fairly tight normal prior for `alpha`

which will keep all three values of `alpha`

fairly close to zero. (Actually, since we didn’t specify a prior in the models above, they all have an implicit uniform prior on `alpha`

and `beta`

. This is a really bad prior, but it didn’t matter because the *data* identified the parameters.)

```
m3 <- "
data {
int<lower=2> J; // of alternatives/outcomes
int<lower=1> N; // of observations
int<lower=1> K; // of covariates
int<lower=0,upper=J> Y[N];
matrix[J,K] X[N];
}
parameters {
vector[J] alpha;
vector[K] beta;
}
model {
alpha ~ normal(0,1); // prior contrains alpha
for (i in 1:N)
Y[i] ~ categorical_logit(alpha + X[i]*beta);
}"
```

And now when I run the model, things go much more smoothly.

`p3 <- stan(model_code = m3, data=d2, iter=1000, chains=2, seed=19730715)`

This runs faster (~30s / 1000 iterations) without producting errors and we can see from traceplots that that the values of `alpha`

are now converging and the `beta`

parameters are mixing nicely:

`mcmc_trace(As.mcmc.list(p3, pars=c("alpha", "beta")))`

And now it looks like the true values we used to generate the data fall within the estimated posteriors.

`mcmc_recover_hist(As.mcmc.list(p3, pars=c("alpha", "beta")), true=c(d2$alpha, d2$beta))`

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

Hurray for priors!

Well, sort-of. This solution works with this data, but you may have to adjust the priors on `alpha`

depending on your data. That would be a hassle and it generally is better practice to fit models that are fully-identified only by the model and the data.

## MNL with fix-one-to-zero constraint

Instead of using priors to constrain the intercepts, a more common identification strategy is to fix one of the intercepts in `alpha`

to zero. (This also works if you are using other estimation procedures like maximum likelihood.) This is accomplished in Stan creating a vector of length `J-1`

which we will call `alpha_raw`

and then appending a zero to it in the `transformed parameters`

block.

```
m4 <- "
data {
int<lower=2> J; // of alternatives/outcomes
int<lower=1> N; // of observations
int<lower=1> K; // of covariates
int<lower=0,upper=J> Y[N];
matrix[J,K] X[N];
}
parameters {
vector[J-1] alpha_raw;
vector[K] beta;
}
transformed parameters {
vector[J] alpha;
alpha = append_row(0, alpha_raw);
}
model {
for (i in 1:N)
Y[i] ~ categorical_logit(alpha + X[i]*beta);
}"
```

This is identified (with just the default uniform priors) and runs quite fast (~7s per 1000 iterations with N=1000).

`p4 <- stan(model_code = m4, data=d2, iter=1000, chains=2, seed=19730715)`

The traceplots now look great even without specifying the priors: Note that all the values for `alpha[1]`

are zero, because that’s how we defined `alpha`

in the `tranformed parameters`

block.

`mcmc_trace(As.mcmc.list(p4, pars=c("alpha")))`

Now, one thing that may be bothring you is that the estimates for alpha are *not* the values we used to generate the data. The estimates are actually *differences* between the arbitrarily chosen intercept that was set to zero and the other intercepts. We can compute these differences and then compare them to the posteriors.

```
shifted_alpha <- d2$alpha - d2$alpha[1]
mcmc_recover_hist(As.mcmc.list(p4, pars="alpha"), true=shifted_alpha)
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

```
## Warning: Computation failed in `stat_bin()`:
## `binwidth` must be positive
```

All looks good and it would be perfectly fine to use this sampler. In fact, this is the specification the `mlogit`

package and in most of the commercial software for the multinomial logit. I see lots of papers in marketing and econ that use this specification.

But it isn’t what I use in my own work. The tricky thing about setting one value of `alpha`

to zero is that it is becomes harder to put priors on `alpha_raw`

. Our specirfication defines `alpha_raw`

as the `differences`

in utility. To put priors on `alpha_raw`

, you have to specify your beliefs about the *differences* between the alternatives. That can be very hard to do, especially when you are using hierarchical priors to account for heterogeneity in the intercepts across different decision makers. (It is especially bad, if you are trying to compute Bayesian d-optimal designs, as Kevin Van Horn and I once found out the hard way.)

## MNL with sum-to-zero constraint (but with a very bad Stan implementation)

Instead, I prefer to use a sum-to-zero constraint on `alpha`

, which avoids making one of the alternatives *special*. The idea behind a sum-to-zero constraint is that we want to constrain `sum(alpha) = 1`

. Here is one very bad way to implement it in Stan. It does recover the true values of the parameters, but because the parameter `alpha_raw`

is unidentified, the HMC sampling is slow and generates lots of warnings.

```
m5 <- "
data {
int<lower=2> J; // of alternatives/outcomes
int<lower=1> N; // of observations
int<lower=1> K; // of covariates
int<lower=0,upper=J> Y[N];
matrix[J,K] X[N];
}
parameters {
vector[J] alpha_raw; // unconstrained UPC intercepts
vector[K] beta;
}
transformed parameters{
vector[J] alpha;
alpha = alpha_raw - (1.0/J) * sum(alpha_raw); // sum to zero constraint
}
model {
// model
for (i in 1:N)
Y[i] ~ categorical_logit(alpha + X[i]*beta);
}"
p5 <- stan(model_code = m5, data=d2, iter=1000, chains=2, seed=19730715)
```

```
## Warning: There were 816 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 1.99, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
```

We are getting the true values of `alpha`

back (after shifting them to account for the sum-to-zero constraint):

`r shifted_alpha = d2$alpha - (1/d2$J)*sum(d2$alpha) mcmc_recover_hist(As.mcmc.list(p5, pars="alpha"), true=shifted_alpha)`

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

but these traceplots for `alpha_raw`

look awful and it took a long time to get them! (For those of you who have build Gibbs samplers, it actually works very well to sample the unidentified `alpha_raw`

in a Gibbs sampler, but for HMC, not so much.)

`mcmc_trace(As.mcmc.list(p5, pars=c("alpha_raw", "alpha")))`

## MNL with sum-to-zero constraint (better implementation)

Here is a better implementation of the sum-to-zero constraint. Let’s specify `alpha_raw`

as a `J-1`

vector. By putting this `J-1`

parameter in the `parameters`

block, we are telling Stan to focus on estimating the `J-1`

identified parameter in `alpha_raw`

. Then we will create an `J`

vector `alpha`

in the transformed parameters, but instead of putting in 0 for the left-out parameter in `alpha`

, we will use `-sum(alpha_raw)`

. This results in an `alpha`

that sums to zero, which we then use to compute the `categorical_logit`

likelihood. This runs fast and smoothly.

```
m6 <- "
data {
int<lower=2> J; // of alternatives/outcomes
int<lower=1> N; // of observations
int<lower=1> K; // of covariates
int<lower=0,upper=J> Y[N];
matrix[J,K] X[N];
}
parameters {
vector[J-1] alpha_raw; // unconstrained UPC intercepts
vector[K] beta;
}
transformed parameters{
vector[J] alpha;
alpha = append_row(-sum(alpha_raw), alpha_raw); // sum to zero constraint
}
model {
for (i in 1:N)
Y[i] ~ categorical_logit(alpha + X[i]*beta);
}"
p6 <- stan(model_code = m6, data=d2, iter=1000, chains=2, seed=19730715)
```

And we can see the true parameters are recovered (after we shift them to account for the sum-to-zero constraint).

```
shifted_alpha = d2$alpha - (1/d2$J)*sum(d2$alpha)
mcmc_recover_hist(As.mcmc.list(p6, pars="alpha"), true=shifted_alpha)
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

Surprisingly, this specification does not treat the first alternative as special. Instead, the parameters of `alpha`

are the difference in utility between the alternative and the average of all alternatives. (You can see that in how I defined `shifted_alpha`

above.) So, when you specify a prior on the elements of `alpha`

, you can think about how much of a difference you expect to see between any alternative and the average alternative; this works very well in hierarchical models.

So, that’s how to identify intercepts in multinomial logit models. Along the way we learned that:

Before using any estimation procedure on real data, you should run it on synthetic data (simulated from the same model) to make sure it is recovering the parameters correctly. Better yet, run it on many different synthetic data sets. (See Betancourt.)

A slow running Stan sampler with lots of errors is a sign that your model is poorly-specified and may have an indentification issue.

Indentification can be achieved by specifying tighter priors or by applying constraints to the model.

Before I go, I should mention that way we have modeled the intercepts (alpha) is exactly how we would model *any multivariate attribute* in a choice model. So, if you have an attribute with `K`

levels, you will have `K-1`

parameters for it and you can accomplish the identification using the strategy above. Or, you could use effects coding, which achieves the same sum-to-zero constraint by way of the coding of multivariate attributes in the X matrix.