Help the Stat Consulting Group by giving a gift

Zero-Inflated Negative Binomial Regression

This page shows an example of zero-inflated negative binomial regression analysis with footnotes
explaining the output in Stata. The data collected were academic information on
316 students at two different schools. The response variable is days absent during the school year (**daysabs**).
We explore its relationship with math standardized test scores (**mathnce**),
language standardized test scores (**langnce**), and gender (**female**).

As assumed for a negative binomial model, our response variable is a count
variable and the variance of the response variable is greater than the mean of
the response variable. Sometimes when analyzing a response variable that is a
count variable, the number of zeroes may seem excessive. With the example dataset in mind, consider the
processes that could lead to a response variable value of zero. A student might
be absent zero days during the school year if he never gets sick and never skips
school. Another student might be absent zero days during the school year because
her parents insist she go to school every day, regardless of illness or desire
to skip school. These two students will look identical in the response variable,
but they have arrived at the same outcome through two different processes.
The first student *could* have been absent during the school year (had he
become ill or opted to skip school), but was not. The second student was
certain to be absent zero days. The second student will be referred to
from this point forward as a "certain zero".
Thus, the number of zeroes may be inflated and the number of students absent for
zero days cannot be explained in the same manner as the number of students that
were absent for more than zero days. Some students were absent zero days
for the same reasons other students were absent one, two, or three days (health
and truancy) and while some students were absent zero days for a different set
of reasons.

A standard negative binomial model would not distinguish between these two processes, but a zero-inflated model allows for and accommodates this complication. When analyzing a dataset with an excessive number of outcome zeros and two possible processes that arrive at a zero outcome, a zero-inflated model should be considered. We can look at summary statistics to try to gauge if the data are over-dispersed and a histogram of the response variable to see if the number of zeros is excessive. (If two processes generated the zeroes in the response variable, but there is not an excessive number of zeroes, a zero-inflated model may or may not be used.)

use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear

generate female = (gender == 1)

summarize daysabs

Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- daysabs | 316 5.810127 7.449003 0 45

From here, we can see that the variance of our outcome (the standard deviation squared) is larger than the mean.

histogram daysabs, discrete freq

While zero is the most common number of days absent, it is difficult to see from this histogram if the number of zeroes is in excess of what we would expect from a negative binomial model. Thus, we can run a zero-inflated negative binomial model and test whether it better predicts our response variable than a standard negative binomial model.

The zero-inflated negative binomial regression generates two separate models and then
combines them. First, a logit model is generated for the "certain zero" cases
described above, predicting whether or not a student would be in this group. Then, a
negative binomial model is generated predicting the counts for those students who
are not certain zeros. Finally, the two models are combined. When
running zero-inflated negative binomial in Stata, you must specify both models: first the
count model, then the model predicting the certain zeros. In this example,
we are predicting count with **mathnce, langnce **and** female**, and
predicting the certain zeros with **mathnce **and** langnce**. The**
vuong** test will compare the zero-inflated negative binomial to the standard
negative binomial
model.

zinb daysabs mathnce langnce female, inflate(mathnce langnce) vuong

Fitting constant-only model: Iteration 0: log likelihood = -989.38831 Iteration 1: log likelihood = -898.75216 Iteration 2: log likelihood = -890.9426 Iteration 3: log likelihood = -890.11078 Iteration 4: log likelihood = -890.0709 Iteration 5: log likelihood = -890.07088 Fitting full model: Iteration 0: log likelihood = -890.07088 Iteration 1: log likelihood = -881.59732 Iteration 2: log likelihood = -880.80421 Iteration 3: log likelihood = -880.77785 Iteration 4: log likelihood = -880.77656 Iteration 5: log likelihood = -880.77656 Zero-inflated negative binomial regression Number of obs = 316 Nonzero obs = 254 Zero obs = 62 Inflation model = logit LR chi2(3) = 18.59 Log likelihood = -880.7766 Prob > chi2 = 0.0003 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- daysabs | mathnce | -.0011483 .0050248 -0.23 0.819 -.0109967 .0087001 langnce | -.014174 .0058023 -2.44 0.015 -.0255463 -.0028018 female | .423556 .1403317 3.02 0.003 .1485109 .698601 _cons | 2.274443 .2113109 10.76 0.000 1.860281 2.688604 -------------+---------------------------------------------------------------- inflate | mathnce | .0371789 .0598117 0.62 0.534 -.0800499 .1544076 langnce | .0078224 .0900147 0.09 0.931 -.1686031 .1842479 _cons | -6.588474 4.472095 -1.47 0.141 -15.35362 2.176671 -------------+---------------------------------------------------------------- /lnalpha | .2132175 .1359407 1.57 0.117 -.0532214 .4796564 -------------+---------------------------------------------------------------- alpha | 1.237654 .1682475 .94817 1.615519 ------------------------------------------------------------------------------ Vuong test of zinb vs. standard negative binomial: z = 0.21 Pr>z = 0.4156

Fitting constant-only model:Iteration 0: log likelihood = -989.38831 Iteration 1: log likelihood = -898.75216 Iteration 2: log likelihood = -890.9426 Iteration 3: log likelihood = -890.11078 Iteration 4: log likelihood = -890.0709 Iteration 5: log likelihood = -890.07088 Fitting full model:^{a}Iteration 0: log likelihood = -890.07088 Iteration 1: log likelihood = -881.59732 Iteration 2: log likelihood = -880.80421 Iteration 3: log likelihood = -880.77785 Iteration 4: log likelihood = -880.77656 Iteration 5: log likelihood = -880.77656^{b}

a.** Fitting constant-only model** - This is a listing of the log
likelihoods at each iteration for the logistic model predicting whether or not a
student is a certain zero. Remember that logistic regression uses maximum
likelihood estimation, which is an iterative procedure. The first iteration
(called Iteration 0) is the log likelihood of the "null" or "empty" model; that
is, a model with no predictors. At the next iteration (called Iteration
1), the variables specified for predicting certain zeroes are included in the
model. In this example, the predictors
for the constant-only model are **mathnce **and** langnce. **At each
iteration, the log likelihood increases because the goal is to maximize the log
likelihood. When the
difference between successive iterations is very small, the model is said to
have "converged" and the iterating stops. For more
information on this process for binary outcomes, see
Regression Models for Categorical and Limited Dependent Variables by J.
Scott Long (page 52-61).

b.** Fitting full model** - This is a listing of the log likelihoods at each iteration
for the full model, combining the constant-only model with the count model.
Again, the fitting of this model is an iterative procedure. Note that the log
likelihood of Iteration 0 for the full model is equal to the log likelihood at
which the constant-only model had converged. This illustrates that the
full model begins with the fitted constant-only model stopped and improves on it
with the count model.

Zero-inflated negative binomial regression Number of obs= 316 Nonzero obs^{e}= 254 Zero obs^{f}= 62 Inflation model^{g}= logit LR chi2(3)^{c}^{h}= 18.59 Log likelihood= -880.7766 Prob > chi2^{d}^{i}= 0.0003

c. **Inflation model** - This indicates that the inflated model is a logit
model, predicting a latent binary outcome: whether or not a student is a certain zero. This also informs the interpretation of the parameter estimates.

d. ** Log Likelihood** - This is the log likelihood of the fitted full model. It
is used in the Likelihood Ratio Chi-Square test of whether all predictors'
regression coefficients in the model are simultaneously zero.

e. **Number of obs** - This is the number of observations in the dataset
for which all of the response and predictor variables are non-missing.

f. **Nonzero obs** - This is the number of observations in the dataset for
which the response variable is not equal to zero.

g. **Zero obs** - This is the number of observations in the dataset for
which the response variable is equal to zero.

h. **LR chi2(3)** - This is the Likelihood Ratio (LR) Chi-Square test that at least one of the predictors' regression
coefficient is not equal to zero. The number in the parentheses indicates the
degrees of freedom of the Chi-Square distribution used to test the LR Chi-Square
statistic and is defined by the number
of predictors in the model (3). The LR Chi-Square statistic can be calculated by
-2*( L(null model of full model) - L(fitted model of full model)) = -2*((-890.07088) - (-880.77656)) =
18.59.

i.** Prob > chi2** - This is the probability of getting a LR test
statistic as extreme as, or more so, than the observed statistic under the null
hypothesis; the null hypothesis is that all of the regression coefficients
across both models are simultaneously equal to zero. In other words, this is the
probability of obtaining this chi-square statistic (18.59) or one more extreme if there is in fact
no effect of the predictor variables. This p-value is compared to a specified
alpha level, our willingness to accept a type I error, which is typically set at
0.05 or 0.01. The small p-value from the LR test, 0.0003, would lead us to
conclude that at least one of the regression coefficients in the model is not
equal to zero. The parameter of the chi-square distribution used to test the
null hypothesis is defined by the degrees of freedom in the prior line, **
chi2(3)**.

------------------------------------------------------------------------------ | Coef.Std. Err.^{l}z^{m}P>|z|^{n}[95% Conf. Interval]^{o}-------------+---------------------------------------------------------------- daysabs^{p}| mathnce | -.0011483 .0050248 -0.23 0.819 -.0109967 .0087001 langnce | -.014174 .0058023 -2.44 0.015 -.0255463 -.0028018 female | .423556 .1403317 3.02 0.003 .1485109 .698601 _cons | 2.274443 .2113109 10.76 0.000 1.860281 2.688604 -------------+---------------------------------------------------------------- inflate^{j}| mathnce | .0371789 .0598117 0.62 0.534 -.0800499 .1544076 langnce | .0078224 .0900147 0.09 0.931 -.1686031 .1842479 _cons | -6.588474 4.472095 -1.47 0.141 -15.35362 2.176671 -------------+---------------------------------------------------------------- /lnalpha^{k}| .2132175 .1359407 1.57 0.117 -.0532214 .4796564 -------------+---------------------------------------------------------------- alpha^{q}| 1.237654 .1682475 .94817 1.615519 ------------------------------------------------------------------------------ Vuong test of zinb vs. standard negative binomial:^{r}^{s}z = 0.21 Pr>z = 0.4156

j. **daysabs** - This is the response variable predicted by the full model.

k. **inflate** - This portion of the output refers to the logistic model predicting whether or not a student is a certain zero.

l. **Coef.** - These are the regression coefficients.
The coefficients in the **daysabs **section of the output are interpreted as
you would interpret coefficients from a standard negative binomial model: the expected
number of days absent changes by exp(**Coef.**) for each unit increase in the
corresponding predictor.

** Predicting Days Absent for Students Not in the
"Certain Zero" Group**

** mathnce - **If a subject
were to increase his **mathnce** score by one point, the expected number of
days absent in a year would decrease by a factor of exp(-.0011483) = 0.99885236 while holding all other variables in the model constant.
Thus, the higher a student's **mathnce** score, the fewer predicted days
absent.

** langnce - **If a subject were to increase his **
langnce** score by one point, the expected number of days absent in a year
would decrease by a factor of exp(-.014174) = 0.98592598 while holding all other
variables in the model constant. Thus, the higher a student's **langnce**
score, the fewer predicted days absent.

** female - **The expected number of days absent in a
year for a female student is exp(0.423556) = 1.5273833 times the expected number
of days in a year for a male student while holding all other variables in the
model constant. If a female student and male student are not certain zeros
and have identical **mathnce** and **langnce** scores, the expected number
of days absent for the female student would be 1.5273833 times the expected
number of days absent for the male student.

** _cons** - If all of the predictor variables in the model are
evaluated at zero, the predicted number of days absent would be calculated as
exp(**_cons**) = exp(2.274443). For males (the variable **female** evaluated at zero) with
zero **mathnce** and **langnce** scores, the predicted number of days
absent would be 9.7225021. This may seem very high, considering the mean number
of days absent is less than 6, but note that evaluating **mathnce** and
**langnce** at zero is out of the range of plausible scores.

** Predicting Membership in the "Certain Zero" Group**

** mathnce - **If a subject
were to increase his **mathnce** score by one point, the odds that he would
be in the "Certain Zero" group would increase by a factor of exp(0.0371789) =
1.0378787. In other words, the higher a student's **mathnce** score, the more
likely the student is a certain zero.

** langnce - **If a subject were to increase his **
langnce** score by one point, the odds that he would be in the "Certain Zero"
group would increase by a factor of exp(0.0078224) = 1.0078531. In other words,
the higher a student's **langnce** score, the more likely the student is a
certain zero.

** _cons** - If all of the predictor variables in the model are
evaluated at zero, the odds of being in the "Certain Zero" group is
exp(-6.588474) = .00137614. This means that the predicted odds of a
student with **mathnce** and **langnce** scores of zero being a certain
zero are .00137614 (though remember that evaluating **mathnce** and
**langnce** at zero is out of the range of plausible scores).

m. **Std. Err.** - These are the standard errors of the individual
regression coefficients for the two models. They are used
in both the calculation of the **z **test statistic, superscript n, and the
confidence interval of the regression coefficient, superscript p.

n. **z** - This is the test statistic **z** is the ratio of the **Coef.** to the **Std. Err.** of the respective predictor. The z value follows a standard normal distribution which is used to test against a two-sided alternative hypothesis that the
**Coef.** is not equal to zero.

o. **P>|z|** - This is the probability the **z** test statistic (or a more extreme test statistic) would be observed under the null hypothesis
that a particular predictor's regression coefficient is zero, given that the
rest of the predictors are in the model. For a given alpha level, **P>|z|** determine whether or not the null hypothesis
can be rejected. If **P>|z| **
is less than alpha, then the null hypothesis can be rejected and the parameter
estimate is considered significant at that alpha level.

** Predicting Days Absent for Students Not in the
"Certain Zero" Group**

**mathnce** - The **z** test
statistic for the predictor **mathnce** is (-0.0011483/0.0050248) = -0.23 with an
associated p-value of 0.819. If we set our alpha level to 0.05, we would fail to
reject the null hypothesis and conclude that the regression coefficient for **
mathnce** has not been
found to be statistically different from zero given **langnce** and **female**
are in the model.

** langnce -**The **z** test
statistic for the predictor **langnce** is (-0.014174/0.0058023) = -2.44 with an
associated p-value of 0.015. If we set our alpha level to 0.05, we would
reject the null hypothesis and conclude that the regression coefficient for **
langnce** has been
found to be statistically different from zero given **mathnce** and **female**
are in the model.

** female - **
The **z** test
statistic for the predictor **female** is (0.423556/0.1403317) = 3.02 with an
associated p-value of 0.003. If we again set our alpha level to 0.05, we would
reject the null hypothesis and conclude that the difference between males and
females has been found to be statistically different given that **mathnce** and
**langnce** are in the
model.

** _cons** - The **z** test
statistic for the intercept, **_cons,** is (2.274443/0.2113109) = 10.76 with
an associated p-value of < 0.001. If we set our alpha level at 0.05, we would
reject the null hypothesis and conclude that **_cons** has been found to be
statistically different from zero given **mathnce**, **langnce **and **
female** are in the model and evaluated at zero.

** Predicting Membership in the "Certain Zero" Group**

** mathnce - **
The **z** test
statistic for the predictor **mathnce** is (0.0371789/0.0598117) = 0.62 with an
associated p-value of 0.534. If we again set our alpha level to 0.05, we would
fail to reject the null hypothesis and conclude that the regression coefficient for
**mathnce** has not been
found to be statistically different from zero given **langnce** is in the
model.

** langnce - **The **z** test
statistic for the predictor **langnce** is (0.0078224/0.0900147) = 0.09 with an
associated p-value of 0.931. If we set our alpha level to 0.05, we would fail to reject
the null hypothesis and conclude that the regression coefficient for **langnce** has
not been
found to be statistically different from zero given **mathnce** is in the model.

**_cons** -The **z** test
statistic for the intercept, **_cons,** is (-6.588474/4.472095) = -1.47 with an
associated p-value of 0.141. With an alpha level of 0.05, we would fail to reject the
null hypothesis and conclude that **_cons** has not been found to be
statistically different from zero given **mathnce** and **langnce **are in
the model and evaluated at zero.

p. **[95% Conf. Interval]** - This is the Confidence Interval (CI) for an
individual coefficient given that the other predictors are in the model. For a
given predictor with a level of 95% confidence, we'd say that we are 95%
confident that the "true" coefficient lies between the lower and upper limit of
the interval. It is calculated as the **Coef.** ± (z_{α/2})*(**Std.Err.**),
where z_{α/2} is a critical value on the standard normal distribution.
The CI is equivalent to the **z** test statistic: if the CI includes zero,
we'd fail to reject the null hypothesis that a particular regression coefficient
is zero given the other predictors are in the model. An advantage of a CI is
that it is illustrative; it provides a range where the "true" parameter may
lie.

q. **/lnalpha** - This is the natural log of alpha (the dispersion parameter). If the dispersion parameter is zero,
log(dispersion parameter) = -infinity. If this is true, then a Poisson
model would be appropriate. We can see the 95% confidence interval for **/lnalpha**
that the value is not -infinity. This is confirmed by the 95% confidence
interval about the estimate of **alpha** that does not contain zero.

r. **alpha** - This is the dispersion parameter of the count model.

s. **Vuong test** - This test compares the zero-inflated negative binomial model to a standard
negative binomial model. Because the z-value not significant, the Vuong test shows that the zero-inflated
negative binomial is not a
better fit than the standard negative binomial. In cases where there is a question as to
which count model to use, the **countfit** command is helpful for comparing the range of
count models. You can download **countfit** from within Stata by typing **findit countfit**
(see
How can I used the findit command to search for programs and get additional
help? for more information about using **findit**).

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.