
Logistic Regression Troubleshooting and Ologit Interpretation

This Library page comes from an answer posted to the StataList newsgroup, and is courtesy of William Gould of Stata Corporation. We are grateful for permission to reproduce this answer at our site.

Mario Nosvelli <nosvelli@i...> showed results from running -ologit- and asked two questions,

> 1) How to consider such a LR chi2(9) = 12603.75 with Prob > chi2 = 0.0000:
> it is too good to be true....?
>
> 2) How to interpret correctly coefficient in explaining my dependent
> variable?

I suspect Mario will not get too many answers to his first question because not many will be willing to go out on a limb. One can, without fear of contradiction, look at results that exhibit an obvious problem and say "no good", but when the results look fine, one hesitates to say "all is fine". Who knows what is lurking behind the numbers?

Nevertheless, I will go out on the limb and reassure Mario. I think all is fine, assuming Mario has checked out his data in all the obvious ways. In this case, what would most concern me (and still, I'm not concerned much) is outliers. I would like to be reassured that it is not the case that, say, edtime mostly takes on the values 0 to 8 but for one observation in the data takes on the value 10,000. Basically, I have only the concerns I would have whenever I looked at an estimated model: I want to convince myself that these results are not being determined by just a handful of observations in the data for which another explanation (for instance, data error) is more likely.

Mario's concerns were raised by the reported LR chi2 for his model. The results Mario showed were

==============================================================================
ologit profribalt edtime edtime2 lingua10 info10 forma10r ptr10r
>      espe10 tenure10 compo10

Iteration 0:   log likelihood = -23409.295
  [...]
Iteration 5:   log likelihood = -17107.421

Ordered logit estimates                           Number of obs   =      13341
                                                  LR chi2(9)      =   12603.75
                                                  Prob > chi2     =     0.0000
Log likelihood = -17107.421                       Pseudo R2       =     0.2692

------------------------------------------------------------------------------
  profribalt |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      edtime |    .645283   .0223021    28.93   0.000     .6015717    .6889944
     edtime2 |  -.0057201   .0019815    -2.89   0.004    -.0096037   -.0018365
    lingua10 |   .1485893   .0535689     2.77   0.006     .0435962    .2535823
      info10 |   .1644362   .0621983     2.64   0.008     .0425298    .2863425
    forma10r |  -.2904943   .0387972    -7.49   0.000    -.3665354   -.2144532
      ptr10r |  -.2938893   .0395879    -7.42   0.000    -.3714801   -.2162986
      espe10 |   .4393347   .0340193    12.91   0.000     .3726581    .5060114
    tenure10 |   .0870511   .0343935     2.53   0.011     .0196411    .1544612
     compo10 |   .7923367   .0910826     8.70   0.000     .6138181    .9708553
-------------+----------------------------------------------------------------
       _cut1 |  -1.931222   .0479324           (Ancillary parameters)
       _cut2 |    .729953   .0381873
       _cut3 |    2.42457   .0465623
       _cut4 |   2.525509   .0472749
       _cut5 |   3.572923   .0541766
       _cut6 |   6.480454    .071594
       _cut7 |   8.203291   .0884657
------------------------------------------------------------------------------
==============================================================================

The LR chi2 is a test that all the coefficients (with the exception of the cutpoints) are zero. The value LR chi2(9) = 12,604 is admittedly whopping and Mario is right to raise red flags. It is worth some thought. Nevertheless, such unbelievable values do arise in large datasets.

What reassured me that there was no problem was the reported log likelihood value of -17,107 for 13,341 observations. I said to myself, "Mario has 8 outcome categories, so if I had no idea to which category an observation belonged, I would use a probability of 1/8 = .125 for each. On average, Mario's model says that for an observation, the probability of observing what was observed conditional on the estimates is exp(-17107/13341) = .277. That number is substantially larger than .125, and that's good news because it means that Mario's model has explanatory power."
That number is far enough from 1 that Mario does not have to explain to me why his model does such a good job.

--------------------------------------------------------------------------
ASIDE:  The exp(-17107/13341) arises like this:  Ordered logit is a
discrete model, meaning likelihoods are probabilities.  The overall
likelihood of the data is

    L(Data|estimates) = L(data_1|estimates) * L(data_2|estimates) * ...
                      = p(o_1|X_1,estimates) * p(o_2|X_2,estimates) * ...

where o_j and X_j are the outcome and explanatory variables for
observation j, and p(o_j|X_j,estimates) means the probability that
outcome o_j is observed conditional on the values of X_j observed along
with the estimated coefficients.  The log likelihood reported by Stata
is just the log of the above.  Thus, the geometric average of
p(o_j|X_j,estimates) is just

    exp(overall_log_likelihood / number_of_observations)

The above works for ordered logit, logit, and other discrete models,
but you cannot interpret likelihoods of continuous models as
probabilities; they are densities.  Make the calculation above for some
continuous models and data combinations, and you will get a result
greater than 1.  There is nothing wrong with making an average-density
calculation; you just have to know how to interpret what you get.
--------------------------------------------------------------------------

Okay, so that was the first thing I did to reassure myself that Mario's model was okay. Actually, that was the second thing. The first thing I did was look at the output, at the reported standard errors and z statistics, to be sure that no standard error was reported to be absurdly small (and no z absurdly large, because z = coef/se). At that point, I was just looking for calculations gone awry, such as a standard error equal to . or 1e-300, or significance levels of . or 1e+300.
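[Editor's note: the geometric-average calculation in the aside above can be checked numerically. Here is a minimal sketch in Python, using the log likelihood and number of observations reported in Mario's output.]

```python
import math

log_likelihood = -17107.421   # reported by -ologit-
n_obs = 13341                 # number of observations
n_categories = 8              # outcome categories in Mario's data

# Geometric average of the per-observation probabilities
# p(o_j | X_j, estimates) = exp(log likelihood / N)
geo_avg_prob = math.exp(log_likelihood / n_obs)

# Baseline: guessing each of the 8 categories with equal probability
baseline = 1 / n_categories

print(round(geo_avg_prob, 3))   # 0.277
print(baseline)                 # 0.125
```

The average probability of .277 beats the know-nothing baseline of .125, which is the reassurance described above.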
Like Mario, the only bothersome result I saw was the LR chi2, and the calculation I did above reassured me that this is just one of those cases where the LR chi2 produces unbelievable values. Concerning that, let me remind Mario:

    As a first approximation, STATISTICAL ESTIMATES PROVIDE A
    THEORETICAL LOWER BOUND ON THE LEVEL OF UNCERTAINTY.

The statistical results are exactly right if all the assumptions are met, but you know you do not believe that. Is your specification correct? Is it really education squared and not, say, education to the 2.1 power? Is the distribution of the outcome really the logistic and not something else? Uncertainty is lurking everywhere you look, and our model-summary statistics measure only the role of chance conditional on our assumptions. For small sample sizes, one can quite reasonably argue that the uncertainty we do measure is the most important. As sample sizes get larger, the relative role of these other uncertainties becomes more important.

Concerning Mario's second question,

> 2) How to interpret correctly coefficient in explaining my dependent
> variable?

the first thing to say is obvious: positive coefficients increase the chances that the subject will be observed in a higher category, and negative coefficients increase the chances that the subject will be observed in a lower category.

Actually, ordered logit can be interpreted much like logit and, on that score, it is unfortunate that Stata does not output the exponentiated coefficients. They are, however, easy enough to calculate yourself, but here's a trick for getting Stata to calculate them for you:

    ologit ...
    mat b = e(b)
    mat v = e(V)
    est post b v
    est di, eform(OR)

If you do that, ignore the output for the cutpoints.

Anyway, let's begin with logit. The exponentiated coefficients in logit can be interpreted as odds ratios for a 1-unit change in the corresponding variable. The emphasis here is on ratio: exp(b) is the odds conditional on x+1 divided by the odds conditional on x.
For example, exp(b) = 1.5 means the odds increase 50 percent if x increases by 1.

The ordered-logit model is also known as the proportional-odds model. Let's call the outcome variable Y. In this model, if one considers odds(k) = P(Y<=k)/P(Y>k), then odds(k_1)/odds(k_2) is a constant for all values of k_1 and k_2. An implication of this is that exponentiated coefficients can be thought of as the odds ratio of being in a higher category for a one-unit change in the variable.

You may find this logic transparent, but I admit I find it confusing. I have to think really hard, think I get it, and then get confused again. So let me tell you another way to think about it.

Let's put aside the ordered-logit model for a minute. We have, let us assume, eight outcomes. We could analyze these data by looking at the probability of being in outcome Y==1 versus outcomes Y==2, Y==3, and so on. We could just use ordinary logistic regression to do that:

    gen outcome = Y>1
    logistic outcome ...

That would be an inefficient way of analyzing our data, but we could do it. Similarly, we could analyze our data by looking at the probability of being in outcomes Y==1 or Y==2 versus Y==3, Y==4, and so on:

    gen outcome2 = Y>2
    logistic outcome2 ...

And similarly, we could look at the five other binary-outcome analyses: Y<=3 versus Y>3, Y<=4 versus Y>4, Y<=5 versus Y>5, Y<=6 versus Y>6, and Y<=7 versus Y>7.

Ordered logit amounts to doing just that, but adds the constraint that the coefficients from each individual analysis are equal while leaving the intercepts free to vary. Thus, ordered logit is logistic regression, and I can interpret ordered logit in exactly the same way as I interpret ordinary binary-outcome logistic regression. And thus, exponentiated coefficients are odds ratios of the odds of being Y==2, 3, ..., 8 vs. Y==1, and they are odds ratios of being in Y==3, 4, ..., 8 vs. Y==1 or 2, and so on.

-- Bill
wgould@s...

*
*   Help is available at
*   http://www.stata.com/support/statalist/faq
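[Editor's note: the exponentiation discussed above is easy to check by hand. Here is a sketch in Python using the coefficients and cutpoints from Mario's output; in Stata itself one would use the -eform- trick shown earlier. It also verifies numerically that the ratio of cumulative odds at two cutpoints does not depend on x, which is the "proportional odds" property.]

```python
import math

# Coefficients from Mario's -ologit- output
coefs = {
    "edtime":    .645283,
    "edtime2":  -.0057201,
    "lingua10":  .1485893,
    "info10":    .1644362,
    "forma10r": -.2904943,
    "ptr10r":   -.2938893,
    "espe10":    .4393347,
    "tenure10":  .0870511,
    "compo10":   .7923367,
}

# exp(b) is the odds ratio for a one-unit change in the variable:
# the odds of a higher category at x+1 divided by the odds at x
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
print(f"edtime odds ratio: {odds_ratios['edtime']:.3f}")   # about 1.91

# Cutpoints from the same output; under proportional odds the
# cumulative odds are odds(k) = P(Y<=k)/P(Y>k) = exp(cut_k - xb),
# so odds(k1)/odds(k2) = exp(cut_k1 - cut_k2) regardless of x.
cuts = [-1.931222, .729953, 2.42457, 2.525509, 3.572923, 6.480454, 8.203291]

def cumulative_odds(k, xb):
    """Odds of Y <= k versus Y > k at linear index xb (k = 1..7)."""
    return math.exp(cuts[k - 1] - xb)

ratio_at_low_x = cumulative_odds(1, 0.0) / cumulative_odds(2, 0.0)
ratio_at_high_x = cumulative_odds(1, 2.0) / cumulative_odds(2, 2.0)
print(ratio_at_low_x, ratio_at_high_x)   # equal up to rounding
```

For example, exp(.645283) is roughly 1.91: each additional unit of edtime nearly doubles the odds of being observed in a higher category (strictly, the squared term edtime2 would change along with edtime, so this holds variable-by-variable, not for education as a whole).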

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.