### Stata Library Analyzing Correlated (Clustered) Data

Analyzing Correlated Data

Correlated data are fairly common in social science research.  Husbands' responses to marital satisfaction questions are related to (correlated with) wives' responses.  Parents' assessment of their child's achievement is correlated with the child's assessment of his or her achievement.  Members of the same household are likely to be more similar on a wide variety of measures than to nonmembers.  Sometimes the correlated nature of the data is obvious and is considered as the data are being collected.  Other times, the correlated nature is less obvious and was not considered as the data were collected.  Either way, to correctly analyze the data, the correlation needs to be taken into account.  If it is not, the standard errors of the estimates will be off (usually underestimated), rendering significance tests invalid.  This happens because the standard errors that are normally reported with an analysis assume that each observation is independent of all other observations in the data set.  To the extent that this is not true (i.e., as the correlation becomes larger), each observation contain less unique information.  (Another consequence of this is that the effective sample size is diminished.)  This kind of correlation (between observations) is called an intraclass correlation.  It is different from a Pearson correlation, which is between two variables.

So, how bad could ignoring the intraclass correlation be?  We have reproduced Table 1.1 from Introduction to Multilevel Modeling by Ita Kreft and Jan de Leeuw which shows what happens to the alpha values when you thought that they were 0.05.  The rows of the table show different values of N, the number of subjects in the experiment or survey.  The columns show different values of rho, the intraclass correlation coefficient.  As you can see, if you have only 10 subjects and an intraclass correlation coefficient of 0.01, your true alpha value is 0.06, which is not much different from 0.05.  However, if you have 100 subjects and an intraclass correlation coefficient of 0.20, your real alpha level is 0.70!

Alpha = 0.05, Table 1.1, page 10 from Introduction to Multilevel Modeling by Ita Kreft and Jan de Leeuw

 rho N 0.01 0.05 0.20 10 0.06 0.11 0.28 25 0.08 0.19 0.46 50 0.11 0.30 0.59 100 0.17 0.43 0.70

When people realize that they are analyzing correlated data, they often wonder about the possible strategies that are available to account for the correlation.  Three possibilities come to mind:  using survey commands with the psu (primary sampling unit) set, using clustered robust standard errors in a standard analysis (for example, a regression analysis), or using a multilevel (AKA  hierarchical) model.  These three methods yield slightly different results, and they are intended for use in different situations.  Let's look at when you would use each of these methods and how they are different from each other.  Because both SAS and Stata have commands for accomplishing these three types of analyses, we will focus on those packages.  (A note on the difference between robust standard errors and clustered robust standard errors can be found at the bottom of this page.)

#### What is an intraclass correlation and how big is it?

An intraclass correlation tells you about the correlation of the observations (cases) within a cluster.  For example, if you were measuring political attitudes of people within households, households would be the cluster variable.  An intraclass correlation would tell you the average correlation of people within household.  As you can see, the higher the intraclass correlation, the less unique information each additional household member provides.  Hence, in cases of high intraclass correlations, most researchers would prefer to have more clusters with fewer cases per cluster than to have more cases within a small number of clusters.

Before trying to correct for the intraclass correlation, you might ask "How large is the intraclass correlation?"  This is a reasonable question.  If the correlation is shown to be relatively small (however "relatively small" is defined), then one might choose to ignore the correlation and analyze the data in a standard way, knowing that the standard errors would be off by some, but not be much.  In other words, the effort to correct the standard errors might outweigh the benefit.  Remember that an intraclass correlation is much different from a Pearson correlation, so the standards that apply to a Pearson correlation do not apply to an intraclass correlation.  For example, while a Pearson correlation of 0.3 may be considered small, an intraclass correlation of 0.3 is quite large.  Also, while a Pearson correlation coefficient can be negative, an intraclass correlation coefficient cannot.  (In rare cases, you can get a negative intraclass correlation coefficient, but that usually means that there is trouble with the data.  Also, remember that an intraclass correlation coefficient describes the relationship of cases within a cluster, so if cases within a particular cluster are more similar to cases within a different cluster than to cases within its cluster, the coefficient would still be zero.)

There are several definitions of an intraclass correlation, and many different versions of the formula for its calculation.  Many texts will show simplified versions of the formula that apply only to specific situations.  We have opted to give a formula that uses elements of standard ANOVA output.

F is the F statistic from the ANOVA (see page 19 of Snijders and Bosker for formula for weighted average).

is the weighted average number of elements (cases) per cluster

is the mean sample size

N is the number of clusters

M is the total sample size s-squared (put in real symbol here!) is the variance of the sample sizes

is the intraclass correlation

#### The issues and options

Perhaps the first question that needs to be addressed is how the data were collected.  If the data were collected as part of a survey, and by survey we mean a survey with an explicit sampling plan, then using the survey commands in standard statistical software or using specialty survey software (such as SUDAAN or WesVar) is usually the most appropriate choice.  (The only time that this can be problematic is when you want to conduct a true multilevel analysis and still include the features of sampling design.  As of this writing, gllamm in Stata can handle most features of the survey design, as can Mplus 5.)  The point here is that when survey data are collected, the correlational nature of the data is explicitly taken into account.  In this framework, the intraclass correlation is seen as a nuisance that merely needs to be accounted for.

If the data were not collected as part of a survey, then you have the option of using clustered robust standard errors or using a multilevel model.  First, let's discuss clustered robust standard errors, as they are, mathematically speaking, very similar to using survey techniques.  Simply stated, it is method of inflating the standard errors.  Most commonly, Huber-White (also called Sandwich or robust) standard errors are used.  In the SAS documentation, this type of standard error is called an empirical standard error.  As you will see below, the standard errors produced using this method are very close to those produced using survey methods.  In fact, Stata's survey routine calls the same routine used to create clustered robust standard errors.  The slight difference between the two is caused by a small difference in a constant multiplier (similar to an finite population correction or finite sample correction).  (For more information regarding the difference between robust and clustered standard errors in Stata, please see section 20.14 (pages 275-280) of the User's Guide for Stata version 9).  This method of correcting the standard errors to account for the intraclass correlation is a "weaker" form of correction than using a multilevel model, which not only accounts for the intraclass correlation, but also corrects the denominator degrees of freedom for the number of clusters.  When you use clustered robust standard errors, the denominator degrees of freedom is based on the number of observations, not the number of clusters.  Because variability in the predictors and residuals is removed when a multilevel model is used, the point estimates from this method are different than those obtained using the previous two methods.  There are several helpful references for Huber-White standard errors, which we have listed at the end of this page.

Choosing between using clustered robust standard errors and multilevel modeling is not always easy.  (Please note that while we are presenting these as two options, you can clustered robust standard errors with multilevel model estimates of fixed effects.  We present these as two different options only in the context of starting with a ordinary least squares regression.)  One factor to be considered is how many clusters you have.  If you have relatively few, then using a multilevel model becomes less of an option because you need to have a fair number of clusters just to get the procedure to run.  However, clustered robust standard errors also need a fair number of clusters in order to be reliably computed (please see the references at the end of this page for more on this).  Also, if you are working with longitudinal data and the design is severely unbalanced, then clustered robust standard errors may not be a good option.  Another consideration is audience of the research.  If the audience is not familiar with multilevel modeling techniques or is not statistically sophisticated, then perhaps robust standard errors are a preferable way to proceed, since the type of analysis itself is not changed.  A third consideration has to do with the researcher's comfort with the various techniques.  If the researcher does not feel comfortable conducting or presenting a particular type of analysis, other choices may be preferable.  We know of no references on statistical considerations for making the decision of what method to use to account for the intraclass correlation.

#### The data set

We will use the api data set, which contains the api scores for schools in California in the year 2000.  The variable aip00 is the score, growth indicates the percent of growth experienced by the school in the last year, emer is the percent of teachers at that school with emergency credentials, and yr_rnd is a dummy (0/1) variable indicating if the school is on a year-round schedule.  In our example, we will do a regression analysis with api00 as the dependent variable and growth, emer and yr_rnd as independent variables.  The data are correlated; schools are nested within districts.  The variable dnum contains the number of each school district.  You can download the api00 data set in SAS format here.  The data set in Stata format is available online from within Stata, as shown below.

For each method described, we will present two analyses.  The first will ignore the correlation within the data, and the second will account for it.  In this way, you can see how the results differ.  We do not mean to suggest that in all cases the magnitude of the difference between the two analyses will be same.  For some data sets, the difference will be much larger or smaller than what was obtained here.

#### Survey method

In "survey speak", the clustering variable is called a primary sampling unit, or PSU.  When doing the sampling for a survey, the PSU is the first unit of sampling.  For example, when taking a national sample, one might sample states, and then counties within states and then cities within counties.  In this example, states would be the primary sampling unit, since that was the first thing that was sampled.  If the population was defined as counties in the United States, then counties would be the first thing sampled and they would constitute the PSU.  In each case, it is easy to see that observations with a cluster may be more similar than observations between clusters.  We have online seminars with more information on the various features of sampling designs Introduction to Survey Data Analysis and Survey Data Analysis with Stata .  Also, for more information regarding the analysis of survey data and how the various elements of the sampling design are used by survey commands, please see pages 5 - 13 of the Stata 9 Survey manual.

#### Survey in Stata

First, let's ignore the cluster variable and conduct a regular regression.

use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs, clear
regress api00 growth emer yr_rnd

Source |       SS       df       MS              Number of obs =     309
-------------+------------------------------           F(  3,   305) =   38.94
Model |  1453469.16     3   484489.72           Prob > F      =  0.0000
Residual |   3794617.4   305  12441.3685           R-squared     =  0.2770
-------------+------------------------------           Adj R-squared =  0.2698
Total |  5248086.56   308  17039.2421           Root MSE      =  111.54

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.1027121   .2111831    -0.49   0.627    -.5182723    .3128481
emer |  -5.444932   .5395432   -10.09   0.000    -6.506631   -4.383234
yr_rnd |  -51.07569   19.91364    -2.56   0.011     -90.2612   -11.89018
_cons |   740.3981   11.55215    64.09   0.000     717.6661    763.1301
------------------------------------------------------------------------------

Now we will use the survey commands and account for the clustering by using the svyset command to tell Stata that the PSU is a variable called dnum.

svyset dnum
svy: regress api00 growth emer yr_rnd

Survey linear regression

pweight:  <none>                                  Number of obs    =       309
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =       186
Population size  =       309
F(   3,    183)  =     19.69
Prob > F         =    0.0000
R-squared        =    0.2770

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.1027121   .2280515    -0.45   0.653     -.552628    .3472038
emer |  -5.444932    .725836    -7.50   0.000    -6.876912   -4.012952
yr_rnd |  -51.07569   22.72466    -2.25   0.026    -95.90849   -6.242884
_cons |   740.3981   13.39504    55.27   0.000     713.9714    766.8248
------------------------------------------------------------------------------

As you can see from the results above, while the coefficients are the same in the two analyses, the standard errors are smaller in the analysis that does not account for the clustering in the data.

#### Survey in SAS

Now we will run the same two analyses in SAS.

proc reg data = "D:/temp/api2000";
model api00= growth emer yr_rnd;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: API00

Number of Observations Read                        310
Number of Observations Used                        309
Number of Observations with Missing Values           1

Analysis of Variance

Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3        1453469         484490      38.94    <.0001
Error                   305        3794617          12441
Corrected Total         308        5248087

Root MSE            111.54088    R-Square     0.2770
Dependent Mean      663.17476    Adj R-Sq     0.2698
Coeff Var            16.81923
Parameter Estimates

Parameter       Standard
Variable     Label                  DF       Estimate          Error    t Value    Pr > |t|

Intercept    Intercept               1      740.39808       11.55215      64.09      <.0001
GROWTH                               1       -0.10271        0.21118      -0.49      0.6271
EMER         pct emer credential     1       -5.44493        0.53954     -10.09      <.0001
YR_RND                               1      -51.07569       19.91364      -2.56      0.0108
proc surveyreg data = "D:/temp/api2000";
cluster dnum;
model api00= growth emer yr_rnd;
run;
Regression Analysis for Dependent Variable API00

Data Summary

Number of Observations           309
Mean of API00              663.17476
Sum of API00                204921.0

Design Summary

Number of Clusters           186

Fit Statistics

R-square            0.2770
Root MSE            111.54
Denominator DF         185

Tests of Model Effects

Effect       Num DF    F Value    Pr > F

Model             3      19.72    <.0001
Intercept         1    3025.46    <.0001
GROWTH            1       0.20    0.6545
EMER              1      55.73    <.0001
YR_RND            1       5.00    0.0265

NOTE: The denominator degrees of freedom for the F tests is 185.

Estimated Regression Coefficients

Standard
Parameter      Estimate         Error    t Value    Pr > |t|

Intercept    740.398084    13.4607592      55.00      <.0001
GROWTH        -0.102712     0.2291703      -0.45      0.6545
EMER          -5.444932     0.7293969      -7.46      <.0001
YR_RND       -51.075689    22.8361501      -2.24      0.0265

NOTE: The denominator degrees of freedom for the t tests is 185.

#### Clustered robust standard errors method

As previously stated, this method is very similar to the survey method.  The difference is that when you select this method, your data were not collected using a sampling plan.  Rather, the correlations between observations are present despite the use of simple random sampling (or a convenience sample).

#### Clustered robust standard errors in Stata

In the first regression, we will analyze the data as if there was no correlation between schools within districts.  In the second analysis, we will use the cluster option to obtain robust standard errors.  As you can see from the output below, the point estimates given in the two outputs are the same, but the standard errors are not.  The each of the robust standard errors are larger than the standard error for that variable in the first analysis.  Also notice that while the R-squared and Root MSE are the same in the two analyses, the value of the F-test is different.  This is because the denominator degrees of freedom is different.  In the first analysis, it is 305 (there are 310 observations in the data set), while in the second analysis it is 185.  (As shown in the output, there are 186 clusters.)  Also note that there is no Adjusted R-squared given in the second analysis.

use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs, clear
regress api00 growth emer yr_rnd

Source |       SS       df       MS              Number of obs =     309
-------------+------------------------------           F(  3,   305) =   38.94
Model |  1453469.16     3   484489.72           Prob > F      =  0.0000
Residual |   3794617.4   305  12441.3685           R-squared     =  0.2770
-------------+------------------------------           Adj R-squared =  0.2698
Total |  5248086.56   308  17039.2421           Root MSE      =  111.54

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.1027121   .2111831    -0.49   0.627    -.5182723    .3128481
emer |  -5.444932   .5395432   -10.09   0.000    -6.506631   -4.383234
yr_rnd |  -51.07569   19.91364    -2.56   0.011     -90.2612   -11.89018
_cons |   740.3981   11.55215    64.09   0.000     717.6661    763.1301
------------------------------------------------------------------------------
regress api00 growth emer yr_rnd, cluster(dnum)

Regression with robust standard errors                 Number of obs =     309
F(  3,   185) =   19.72
Prob > F      =  0.0000
R-squared     =  0.2770
Number of clusters (dnum) = 186                        Root MSE      =  111.54

------------------------------------------------------------------------------
|               Robust
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.1027121   .2291703    -0.45   0.655    -.5548352    .3494111
emer |  -5.444932   .7293969    -7.46   0.000    -6.883938   -4.005927
yr_rnd |  -51.07569   22.83615    -2.24   0.027    -96.12844   -6.022935
_cons |   740.3981   13.46076    55.00   0.000     713.8418    766.9544
------------------------------------------------------------------------------

#### Clustered robust standard errors in SAS

In SAS, you can use proc genmod to account for clustering in data.

proc genmod data = "D:/temp/srs";
model api00= growth emer yr_rnd;
run;
The GENMOD Procedure

Model Information

Data Set              D:/temp/srs    Written by SAS

Distribution               Normal
Dependent Variable          API00

Number of Observations Read         310
Number of Observations Used         309
Missing Values                        1

Criteria For Assessing Goodness Of Fit

Criterion                 DF           Value        Value/DF

Deviance                 305    3794617.4031      12441.3685
Scaled Deviance          305        309.0000          1.0131
Pearson Chi-Square       305    3794617.4031      12441.3685
Scaled Pearson X2        305        309.0000          1.0131
Log Likelihood                    -1893.1858

Algorithm converged.

Analysis Of Parameter Estimates

Standard     Wald 95% Confidence       Chi-
Parameter    DF    Estimate       Error           Limits            Square    Pr > ChiSq

Intercept     1    740.3981     11.4771    717.9033    762.8929    4161.63        <.0001
GROWTH        1     -0.1027      0.2098     -0.5139      0.3085       0.24        0.6245
EMER          1     -5.4449      0.5360     -6.4956     -4.3943     103.18        <.0001
YR_RND        1    -51.0757     19.7843    -89.8523    -12.2991       6.66        0.0098
Scale         1    110.8166      4.4577    102.4152    119.9072

NOTE: The scale parameter was estimated by maximum likelihood.

Note that you have to have the class statement before the repeated statement, or you will get an error message.

proc genmod data = "D:/temp/srs";
class dnum;
model api00= growth emer yr_rnd;
repeated subject = dnum;
run;

<some output omitted to save space>

             Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Standard   95% Confidence
Parameter Estimate    Error       Limits            Z Pr > |Z|

Intercept 740.3981  13.3590 714.2150 766.5812   55.42   <.0001
GROWTH     -0.1027   0.2274  -0.5485   0.3431   -0.45   0.6516
EMER       -5.4449   0.7239  -6.8637  -4.0261   -7.52   <.0001
YR_RND    -51.0757  22.6635 -95.4953  -6.6561   -2.25   0.0242

You will notice that the point estimates in this second analysis match those obtained using Stata, but the standard error are slightly different.  That is because Stata uses a constant similar to a finite population correction (fpc) called a finite sample correction (page 351-352) when calculating robust standard errors, while SAS does not.  To compare the formulas used by Stata and SAS for calculating the standard errors, please see Stata 8 Reference manual N - R, pages 336-341 and the online SAS documentation http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/genmod_sect39.htm.  To apply the fpc or to see the formula used, please see our SAS Code Fragment on Getting Robust Standard Errors for Clustered Data.

#### Multilevel modeling method

When using a multilevel modeling technique to account for the intraclass correlation, you need to make sure that you have random intercepts.  This is the default in Stata's xtreg command.  This is necessary because our cluster variable is a random variable.  If your cluster variable is not a random variable, you can still use this method, but you will have to do some extra work to get the correct denominator.  We understand that the explanation provided here is not exhaustive.  Please see our Statistics Books for Loan for suggested reading for more information.

#### Multilevel modeling in Stata

xtset dnum
xtreg api00 growth emer yr_rnd, mle

Fitting constant-only model:
Iteration 0:   log likelihood = -1931.1472
Iteration 1:   log likelihood = -1925.0996
Iteration 2:   log likelihood =  -1924.698
Iteration 3:   log likelihood = -1924.6953

Fitting full model:
Iteration 0:   log likelihood = -1878.6289
Iteration 1:   log likelihood = -1877.6236
Iteration 2:   log likelihood = -1877.6093
Iteration 3:   log likelihood = -1877.6093

Random-effects ML regression                    Number of obs      =       309
Group variable (i): dnum                        Number of groups   =       186

Random effects u_i ~ Gaussian                   Obs per group: min =         1
avg =       1.7
max =        28

LR chi2(3)         =     94.17
Log likelihood  = -1877.6093                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.0980205   .2016164    -0.49   0.627    -.4931814    .2971403
emer |  -5.639125   .5695138    -9.90   0.000    -6.755351   -4.522898
yr_rnd |  -39.64472   18.43406    -2.15   0.032    -75.77481   -3.514637
_cons |   748.1934   11.97179    62.50   0.000     724.7291    771.6577
-------------+----------------------------------------------------------------
/sigma_u |   71.88487   8.258147     8.70   0.000      55.6992    88.07054
/sigma_e |   85.36123   5.101821    16.73   0.000     75.36184    95.36062
-------------+----------------------------------------------------------------
rho |   .4149226   .0737081                      .2791125    .5618572
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=   31.15 Prob>=chibar2 = 0.000

The addition of the i() option in the syntax below is used to account for the clustering of observations within dnum.

xtreg api00 growth emer yr_rnd, i(dnum) mle

Fitting constant-only model:
Iteration 0:   log likelihood = -1931.1472
Iteration 1:   log likelihood = -1925.0996
Iteration 2:   log likelihood =  -1924.698
Iteration 3:   log likelihood = -1924.6953

Fitting full model:
Iteration 0:   log likelihood = -1878.6289
Iteration 1:   log likelihood = -1877.6236
Iteration 2:   log likelihood = -1877.6093
Iteration 3:   log likelihood = -1877.6093

Random-effects ML regression                    Number of obs      =       309
Group variable (i): dnum                        Number of groups   =       186

Random effects u_i ~ Gaussian                   Obs per group: min =         1
avg =       1.7
max =        28

LR chi2(3)         =     94.17
Log likelihood  = -1877.6093                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.0980205   .2016164    -0.49   0.627    -.4931814    .2971403
emer |  -5.639125   .5695138    -9.90   0.000    -6.755351   -4.522898
yr_rnd |  -39.64472   18.43406    -2.15   0.032    -75.77481   -3.514637
_cons |   748.1934   11.97179    62.50   0.000     724.7291    771.6577
-------------+----------------------------------------------------------------
/sigma_u |   71.88487   8.258147     8.70   0.000      55.6992    88.07054
/sigma_e |   85.36123   5.101821    16.73   0.000     75.36184    95.36062
-------------+----------------------------------------------------------------
rho |   .4149226   .0737081                      .2791125    .5618572
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=   31.15 Prob>=chibar2 = 0.000

Stata 9 has a new command called xtmixed, which is very similar to SAS's proc mixed.  As with proc mixed, you can use either restricted maximum likelihood (reml, the default in xtmixed) or the full maximum likelihood (mle, an option in xtmixed).  Below, we will show both analyses.  Note that when the mle option is used, the results exactly match those above.

xtmixed api00 growth emer yr_rnd || dnum:, cov(id)

Performing EM optimization:

Iteration 0:   log restricted-likelihood =  -1871.185
Iteration 1:   log restricted-likelihood = -1871.1661
Iteration 2:   log restricted-likelihood = -1871.1661

Computing standard errors:

Mixed-effects REML regression                   Number of obs      =       309
Group variable: dnum                            Number of groups   =       186

Obs per group: min =         1
avg =       1.7
max =        28

Wald chi2(3)       =    109.28
Log restricted-likelihood = -1871.1661          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |   -.097907   .2028458    -0.48   0.629    -.4954775    .2996635
emer |  -5.641348   .5647032    -9.99   0.000    -6.748146    -4.53455
yr_rnd |  -39.62703   18.53256    -2.14   0.032    -75.95019   -3.303876
_cons |   748.2184   12.00168    62.34   0.000     724.6955    771.7413
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
dnum: Identity               |
sd(_cons) |   72.53047   8.324643      57.91943    90.82735
-----------------------------+------------------------------------------------
sd(Residual) |   85.83346    5.14642       76.3168    96.53685
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =    31.40 Prob >= chibar2 = 0.0000
xtmixed api00 growth emer yr_rnd || dnum: , mle cov(id)

Performing EM optimization:

Iteration 0:   log likelihood = -1877.6283
Iteration 1:   log likelihood = -1877.6093
Iteration 2:   log likelihood = -1877.6093

Computing standard errors:

Mixed-effects ML regression                     Number of obs      =       309
Group variable: dnum                            Number of groups   =       186

Obs per group: min =         1
avg =       1.7
max =        28

Wald chi2(3)       =    110.67
Log likelihood = -1877.6093                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth |  -.0980206   .2015541    -0.49   0.627    -.4930593    .2970181
emer |  -5.639124   .5609501   -10.05   0.000    -6.738566   -4.539682
yr_rnd |  -39.64473   18.41733    -2.15   0.031    -75.74204   -3.547423
_cons |   748.1934   11.92048    62.77   0.000     724.8297    771.5571
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
dnum: Identity               |
sd(_cons) |   71.88479   8.258158      57.39189     90.0375
-----------------------------+------------------------------------------------
sd(Residual) |   85.36127   5.101829      75.92533     95.9699
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =    31.15 Prob >= chibar2 = 0.0000

#### Multilevel modeling in SAS

In SAS, there are at least two different ways that you can specify the random statement.  We have shown both in the code below, and just commented out one of them.  As before, the first analysis does not account for the clustering.

proc mixed data = "D:/temp/api2000";
model api00= growth emer yr_rnd / solution;
run;
The Mixed Procedure

Model Information

Data Set                     WC000001.API2000
Dependent Variable           API00
Covariance Structure         Diagonal
Estimation Method            REML
Residual Variance Method     Profile
Fixed Effects SE Method      Model-Based
Degrees of Freedom Method    Residual

Dimensions

Covariance Parameters             1
Columns in X                      4
Columns in Z                      0
Subjects                          1
Max Obs Per Subject             310

Number of Observations

Number of Observations Read             310
Number of Observations Used             309
Number of Observations Not Used           1

Covariance Parameter
Estimates

Cov Parm     Estimate

Residual        12441

Fit Statistics

-2 Res Log Likelihood          3773.7
AIC (smaller is better)        3775.7
AICC (smaller is better)       3775.7
BIC (smaller is better)        3779.4

Solution for Fixed Effects

Standard
Effect       Estimate       Error      DF    t Value    Pr > |t|

Intercept      740.40     11.5522     305      64.09      <.0001
GROWTH        -0.1027      0.2112     305      -0.49      0.6271
EMER          -5.4449      0.5395     305     -10.09      <.0001
YR_RND       -51.0757     19.9136     305      -2.56      0.0108

Type 3 Tests of Fixed Effects

Num     Den
Effect         DF      DF    F Value    Pr > F

GROWTH          1     305       0.24    0.6271
EMER            1     305     101.84    <.0001
YR_RND          1     305       6.58    0.0108

In this analysis, we will account for the clustering in the data.  We will need two statements to do this:  the class statement and the random statement.  As there are several different ways that you can specify the random statement, we have used one and commented out another.  Either will work and will give identical results.

proc mixed data = "D:/temp/api2000";
class dnum;
model api00= growth emer yr_rnd / solution;
random intercept / sub = dnum type = cs;
* random dnum / type = cs;
run;
The Mixed Procedure

Model Information

Data Set                     WORK.API2000
Dependent Variable           API00
Covariance Structure         Compound Symmetry
Subject Effect               DNUM
Estimation Method            REML
Residual Variance Method     Profile
Fixed Effects SE Method      Model-Based
Degrees of Freedom Method    Containment

Class Level Information
Class    Levels    Values
DNUM        186    1 6 10 27 30 41 42 48 56 58 59
62 63 76 80 90 92 93 96 98 103
108 111 114 126 130 131 137
138 140 148 153 158 159 160
166 171 172 177 180 187 189
196 197 198 200 204 209 211
215 216 219 229 230 231 232
233 236 237 238 248 253 259
261 266 270 272 273 275 284
294 298 300 301 306 315 328
333 339 351 364 365 386 390
395 397 401 403 411 416 418
427 439 448 449 453 463 469
470 471 473 476 479 480 484
487 491 498 499 501 507 510
511 512 521 529 532 536 537
541 542 547 558 560 564 570
572 575 584 586 590 596 600
605 608 611 614 615 619 620
621 627 630 632 635 638 639
643 644 650 653 654 660 661
663 668 671 684 686 688 689
691 702 705 706 711 738 745
750 754 756 763 766 774 777
782 784 787 796 803 806 811
813 815 825 832

Dimensions
Covariance Parameters             3
Columns in X                      4
Columns in Z Per Subject          1
Subjects                        186
Max Obs Per Subject              28

Number of Observations
Number of Observations Read             310
Number of Observations Used             309
Number of Observations Not Used           1

Iteration History
Iteration    Evaluations    -2 Res Log Like       Criterion
0              1      3773.72780489
1              2      3742.87774751      0.03842610
2              1      3742.34943339      0.00097140
3              1      3742.33218877      0.00000117
4              1      3742.33216695      0.00000000

Convergence criteria met but final hessian is not positive
definite.

Covariance Parameter Estimates
Cov Parm     Subject    Estimate
Variance     DNUM        4017.49
CS           DNUM        1243.18
Residual                 7367.38

Fit Statistics
-2 Res Log Likelihood          3742.3
AIC (smaller is better)        3748.3
AICC (smaller is better)       3748.4
BIC (smaller is better)        3758.0

Null Model Likelihood Ratio Test
DF    Chi-Square      Pr > ChiSq
2         31.40          <.0001

Solution for Fixed Effects
Standard
Effect       Estimate       Error      DF    t Value    Pr > |t|
Intercept      748.22     12.0017     185      62.34      <.0001
GROWTH       -0.09791      0.2028     120      -0.48      0.6302
EMER          -5.6413      0.5647     120      -9.99      <.0001
YR_RND       -39.6270     18.5326     120      -2.14      0.0345

Type 3 Tests of Fixed Effects
Num     Den
Effect         DF      DF    F Value    Pr > F
GROWTH          1     120       0.23    0.6302
EMER            1     120      99.80    <.0001
YR_RND          1     120       4.57    0.0345


#### Comparing the results from the different methods

The table below compares the results obtained from SAS and Stata using OLS regression (with no correction for the intraclass correlation), clustered robust standard errors, survey methods and multilevel modeling.  As you can see, the point estimates are the same for the OLS regression, the survey method and the clustered robust standard errors method.  The standard errors, however, are different.

 OLS Regression Survey Method Clustered Robust Standard Errors Multilevel Model coefficient standard error coefficient standard error coefficient standard error coefficient standard error Stata constant 740.3981 11.5521 740.3981 13.3950 740.3981 13.4608 748.1934 11.9205 growth -0.1027 0.2111 -0.1027 0.2281 -0.1027 0.2292 -0.0980 0.2016 emer -5.4449 0.5395 -5.4449 0.7258 -5.4449 0.7294 -5.6391 0.5610 yr_rnd -51.0756 19.9136 -51.0756 22.7247 -51.0757 22.8361 -39.6447 18.4173 SAS constant 740.3980 11.5521 740.3980 13.4607 740.3981 13.3590 748.22 12.0017 growth -0.1027 0.2111 -0.1027 0.2292 -0.1027 0.2274 -0.0979 0.2028 emer -5.4449 0.5395 -5.4449 0.7294 -5.4449 0.7239 -5.6413 0.5647 yr_rnd -51.0756 19.9136 -51.0756 22.8362 -51.0757 22.6635 -39.6270 18.5326

#### References

References for intraclass correlations (definition and calculation of):

Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence by Judith D. Singer and John B. Willett (page 96)
Stata Reference Manual G - M, pages 340-341
Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling by Tom Snijders and Roel Bosker (pages 16 - 22)

References for Huber-White standard errors

Analysis of Longitudinal Data, Second Edition by Peter J. Diggle, Patrick Heagerty, Kug-Yee Liang, and Scott L. Zeger (pages 70 - 80)
Applied Longitudinal Analysis by Garrett M. Fitzmaurice, Nan M. Laird and James H. Ware (pages 303 - 305)
Hierarchical Linear Models, Second Edition by Stephen Raudenbush and Anthony Bryk (pages 272 - 280)

References for survey data analysis
Stata 11 Survey manual
Combined Survey Sampling Inference by Ken Brewer
The Ideas of Sampling by Alan Stuart
Sampling: Design and Analysis by Sharon L. Lohr
Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard
Processing Data - The Survey Example by Linda B. Bourque and Virginia A. Clark
Sampling of Populations: Methods and Applications, Third Edition by Paul Levy and Stanley Lemeshow
Survey Research Methods, Third Edition by Floyd Fowler Jr.
Practical Methods for Design and Analysis of Complex Surveys, Second Edition by Risto Lehtonen and Erkki Pahkinen

References for multilevel modeling

Introduction to Multilevel Modeling by Ita Kreft and Jan de Leeuw Hierarchical Linear Models, Second Edition by Stephen Raudenbush and Anthony Bryk
HLM 6 - Hierarchical and Nonlinear Modeling by Raudenbush, et. al.
Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling by Tom Snijders and Roel Bosker
Multilevel Modeling of Health Statistics Edited by A. H. Leyland and H. Goldstein
Multilevel Analysis: Techniques and Applications by Joop Hox
An Introduction to Multilevel Modeling Techniques by Ronald Heck and Scott Thomas
Multilevel Modeling by Douglas A. Luke
Multilevel Statistical Models by Harvey Goldstein (PDF, free download)
Multilevel Modeling: Methodological Advances, Issues, and Applications by Steven P. Reise and Naihua Duan
Topics in Modeling of Clustered Data by Marc Aerts Geert Molenberghs Helena Geys Louise Ryan  Read it Online! (UC Only)
Multilevel Statistical Models, Fourth Edition by Harvey Goldstein
Mixed Models:  Theory and Applications by Eugene Demidenko
Design and Analysis of Cluster Randomization Trials In Health Research by Allan Donner and Neil S. Klar

#### A note on the difference between robust standard errors and clustered robust standard errors

There are two differences between robust standard errors and clustered robust standard errors.  The first is that for robust standard errors, the unit is the observation, whereas for the clustered robust standard errors, the unit is the cluster.  The other difference is the calculation of the constant that is multiplied with the sandwich estimator:  for the robust standard error, it is n / (n - 1) and for the clustered robust standard errors, it is the number of clustered divided by the number of clusters minus 1.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.