Correlated data are fairly common in social science research. Husbands' responses to marital satisfaction questions are related to (correlated with) wives' responses. Parents' assessment of their child's achievement is correlated with the child's assessment of his or her achievement. Members of the same household are likely to be more similar on a wide variety of measures than to nonmembers. Sometimes the correlated nature of the data is obvious and is considered as the data are being collected. Other times, the correlated nature is less obvious and was not considered as the data were collected. Either way, to correctly analyze the data, the correlation needs to be taken into account. If it is not, the standard errors of the estimates will be off (usually underestimated), rendering significance tests invalid. This happens because the standard errors that are normally reported with an analysis assume that each observation is independent of all other observations in the data set. To the extent that this is not true (i.e., as the correlation becomes larger), each observation contain less unique information. (Another consequence of this is that the effective sample size is diminished.) This kind of correlation (between observations) is called an intraclass correlation. It is different from a Pearson correlation, which is between two variables.
So, how bad could ignoring the intraclass correlation be? We have reproduced Table 1.1 from Introduction to Multilevel Modeling by Ita Kreft and Jan de Leeuw which shows what happens to the alpha values when you thought that they were 0.05. The rows of the table show different values of N, the number of subjects in the experiment or survey. The columns show different values of rho, the intraclass correlation coefficient. As you can see, if you have only 10 subjects and an intraclass correlation coefficient of 0.01, your true alpha value is 0.06, which is not much different from 0.05. However, if you have 100 subjects and an intraclass correlation coefficient of 0.20, your real alpha level is 0.70!
Alpha = 0.05, Table 1.1, page 10 from Introduction to Multilevel Modeling by Ita Kreft and Jan de Leeuw
|
rho |
|||
| N | 0.01 | 0.05 | 0.20 |
| 10 | 0.06 | 0.11 | 0.28 |
| 25 | 0.08 | 0.19 | 0.46 |
| 50 | 0.11 | 0.30 | 0.59 |
| 100 | 0.17 | 0.43 | 0.70 |
When people realize that they are analyzing correlated data, they often wonder about the possible strategies that are available to account for the correlation. Three possibilities come to mind: using survey commands with the psu (primary sampling unit) set, using clustered robust standard errors in a standard analysis (for example, a regression analysis), or using a multilevel (AKA hierarchical) model. These three methods yield slightly different results, and they are intended for use in different situations. Let's look at when you would use each of these methods and how they are different from each other. Because both SAS and Stata have commands for accomplishing these three types of analyses, we will focus on those packages. (A note on the difference between robust standard errors and clustered robust standard errors can be found at the bottom of this page.)
Before trying to correct for the intraclass correlation, you might ask "How large is the intraclass correlation?" This is a reasonable question. If the correlation is shown to be relatively small (however "relatively small" is defined), then one might choose to ignore the correlation and analyze the data in a standard way, knowing that the standard errors would be off by some, but not be much. In other words, the effort to correct the standard errors might outweigh the benefit. Remember that an intraclass correlation is much different from a Pearson correlation, so the standards that apply to a Pearson correlation do not apply to an intraclass correlation. For example, while a Pearson correlation of 0.3 may be considered small, an intraclass correlation of 0.3 is quite large. Also, while a Pearson correlation coefficient can be negative, an intraclass correlation coefficient cannot. (In rare cases, you can get a negative intraclass correlation coefficient, but that usually means that there is trouble with the data. Also, remember that an intraclass correlation coefficient describes the relationship of cases within a cluster, so if cases within a particular cluster are more similar to cases within a different cluster than to cases within its cluster, the coefficient would still be zero.)
There are several definitions of an intraclass correlation, and many different versions of the formula for its calculation. Many texts will show simplified versions of the formula that apply only to specific situations. We have opted to give a formula that uses elements of standard ANOVA output.

F is the F statistic from the ANOVA (see page 19 of Snijders and Bosker for formula for weighted average).
is the mean sample size
N is the number of clusters
M is the total sample size s-squared (put in real symbol here!) is the variance of the sample sizes
is the intraclass correlation
Perhaps the first question that needs to be addressed is how the data were collected. If the data were collected as part of a survey, and by survey we mean a survey with an explicit sampling plan, then using the survey commands in standard statistical software or using specialty survey software (such as SUDAAN or WesVar) is usually the most appropriate choice. (The only time that this can be problematic is when you want to conduct a true multilevel analysis and still include the features of sampling design. As of this writing, gllamm in Stata can handle most features of the survey design, as can Mplus 5.) The point here is that when survey data are collected, the correlational nature of the data is explicitly taken into account. In this framework, the intraclass correlation is seen as a nuisance that merely needs to be accounted for.
If the data were not collected as part of a survey, then you have the option of using clustered robust standard errors or using a multilevel model. First, let's discuss clustered robust standard errors, as they are, mathematically speaking, very similar to using survey techniques. Simply stated, it is method of inflating the standard errors. Most commonly, Huber-White (also called Sandwich or robust) standard errors are used. In the SAS documentation, this type of standard error is called an empirical standard error. As you will see below, the standard errors produced using this method are very close to those produced using survey methods. In fact, Stata's survey routine calls the same routine used to create clustered robust standard errors. The slight difference between the two is caused by a small difference in a constant multiplier (similar to an finite population correction or finite sample correction). (For more information regarding the difference between robust and clustered standard errors in Stata, please see section 20.14 (pages 275-280) of the User's Guide for Stata version 9). This method of correcting the standard errors to account for the intraclass correlation is a "weaker" form of correction than using a multilevel model, which not only accounts for the intraclass correlation, but also corrects the denominator degrees of freedom for the number of clusters. When you use clustered robust standard errors, the denominator degrees of freedom is based on the number of observations, not the number of clusters. Because variability in the predictors and residuals is removed when a multilevel model is used, the point estimates from this method are different than those obtained using the previous two methods. There are several helpful references for Huber-White standard errors, which we have listed at the end of this page.
Choosing between using clustered robust standard errors and multilevel modeling is not always easy. (Please note that while we are presenting these as two options, you can clustered robust standard errors with multilevel model estimates of fixed effects. We present these as two different options only in the context of starting with a ordinary least squares regression.) One factor to be considered is how many clusters you have. If you have relatively few, then using a multilevel model becomes less of an option because you need to have a fair number of clusters just to get the procedure to run. However, clustered robust standard errors also need a fair number of clusters in order to be reliably computed (please see the references at the end of this page for more on this). Also, if you are working with longitudinal data and the design is severely unbalanced, then clustered robust standard errors may not be a good option. Another consideration is audience of the research. If the audience is not familiar with multilevel modeling techniques or is not statistically sophisticated, then perhaps robust standard errors are a preferable way to proceed, since the type of analysis itself is not changed. A third consideration has to do with the researcher's comfort with the various techniques. If the researcher does not feel comfortable conducting or presenting a particular type of analysis, other choices may be preferable. We know of no references on statistical considerations for making the decision of what method to use to account for the intraclass correlation.
We will use the api data set, which contains the api scores for schools in California in the year 2000. The variable aip00 is the score, growth indicates the percent of growth experienced by the school in the last year, emer is the percent of teachers at that school with emergency credentials, and yr_rnd is a dummy (0/1) variable indicating if the school is on a year-round schedule. In our example, we will do a regression analysis with api00 as the dependent variable and growth, emer and yr_rnd as independent variables. The data are correlated; schools are nested within districts. The variable dnum contains the number of each school district. You can download the api00 data set in SAS format here. The data set in Stata format is available online from within Stata, as shown below.
For each method described, we will present two analyses. The first will ignore the correlation within the data, and the second will account for it. In this way, you can see how the results differ. We do not mean to suggest that in all cases the magnitude of the difference between the two analyses will be same. For some data sets, the difference will be much larger or smaller than what was obtained here.
In "survey speak", the clustering variable is called a primary sampling unit, or PSU. When doing the sampling for a survey, the PSU is the first unit of sampling. For example, when taking a national sample, one might sample states, and then counties within states and then cities within counties. In this example, states would be the primary sampling unit, since that was the first thing that was sampled. If the population was defined as counties in the United States, then counties would be the first thing sampled and they would constitute the PSU. In each case, it is easy to see that observations with a cluster may be more similar than observations between clusters. We have online seminars with more information on the various features of sampling designs Introduction to Survey Data Analysis and Survey Data Analysis with Stata . Also, for more information regarding the analysis of survey data and how the various elements of the sampling design are used by survey commands, please see pages 5 - 13 of the Stata 9 Survey manual.
First, let's ignore the cluster variable and conduct a regular regression.
use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs, clear
regress api00 growth emer yr_rnd
Source | SS df MS Number of obs = 309
-------------+------------------------------ F( 3, 305) = 38.94
Model | 1453469.16 3 484489.72 Prob > F = 0.0000
Residual | 3794617.4 305 12441.3685 R-squared = 0.2770
-------------+------------------------------ Adj R-squared = 0.2698
Total | 5248086.56 308 17039.2421 Root MSE = 111.54
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth | -.1027121 .2111831 -0.49 0.627 -.5182723 .3128481
emer | -5.444932 .5395432 -10.09 0.000 -6.506631 -4.383234
yr_rnd | -51.07569 19.91364 -2.56 0.011 -90.2612 -11.89018
_cons | 740.3981 11.55215 64.09 0.000 717.6661 763.1301
------------------------------------------------------------------------------
Now we will use the survey commands and account for the clustering by using the svyset command to tell Stata that the PSU is a variable called dnum.
svyset dnum
svy: regress api00 growth emer yr_rnd
Survey linear regression
pweight: <none> Number of obs = 309
Strata: <one> Number of strata = 1
PSU: dnum Number of PSUs = 186
Population size = 309
F( 3, 183) = 19.69
Prob > F = 0.0000
R-squared = 0.2770
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth | -.1027121 .2280515 -0.45 0.653 -.552628 .3472038
emer | -5.444932 .725836 -7.50 0.000 -6.876912 -4.012952
yr_rnd | -51.07569 22.72466 -2.25 0.026 -95.90849 -6.242884
_cons | 740.3981 13.39504 55.27 0.000 713.9714 766.8248
------------------------------------------------------------------------------
As you can see from the results above, while the coefficients are the same in the two analyses, the standard errors are smaller in the analysis that does not account for the clustering in the data.
Now we will run the same two analyses in SAS.
proc reg data = "D:/temp/api2000"; model api00= growth emer yr_rnd; run;The REG Procedure Model: MODEL1 Dependent Variable: API00 Number of Observations Read 310 Number of Observations Used 309 Number of Observations with Missing Values 1 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 1453469 484490 38.94 <.0001 Error 305 3794617 12441 Corrected Total 308 5248087 Root MSE 111.54088 R-Square 0.2770 Dependent Mean 663.17476 Adj R-Sq 0.2698 Coeff Var 16.81923 Parameter Estimates Parameter Standard Variable Label DF Estimate Error t Value Pr > |t| Intercept Intercept 1 740.39808 11.55215 64.09 <.0001 GROWTH 1 -0.10271 0.21118 -0.49 0.6271 EMER pct emer credential 1 -5.44493 0.53954 -10.09 <.0001 YR_RND 1 -51.07569 19.91364 -2.56 0.0108proc surveyreg data = "D:/temp/api2000"; cluster dnum; model api00= growth emer yr_rnd; run;Regression Analysis for Dependent Variable API00 Data Summary Number of Observations 309 Mean of API00 663.17476 Sum of API00 204921.0 Design Summary Number of Clusters 186 Fit Statistics R-square 0.2770 Root MSE 111.54 Denominator DF 185 Tests of Model Effects Effect Num DF F Value Pr > F Model 3 19.72 <.0001 Intercept 1 3025.46 <.0001 GROWTH 1 0.20 0.6545 EMER 1 55.73 <.0001 YR_RND 1 5.00 0.0265 NOTE: The denominator degrees of freedom for the F tests is 185. Estimated Regression Coefficients Standard Parameter Estimate Error t Value Pr > |t| Intercept 740.398084 13.4607592 55.00 <.0001 GROWTH -0.102712 0.2291703 -0.45 0.6545 EMER -5.444932 0.7293969 -7.46 <.0001 YR_RND -51.075689 22.8361501 -2.24 0.0265 NOTE: The denominator degrees of freedom for the t tests is 185.
As previously stated, this method is very similar to the survey method. The difference is that when you select this method, your data were not collected using a sampling plan. Rather, the correlations between observations are present despite the use of simple random sampling (or a convenience sample).
In the first regression, we will analyze the data as if there was no correlation between schools within districts. In the second analysis, we will use the cluster option to obtain robust standard errors. As you can see from the output below, the point estimates given in the two outputs are the same, but the standard errors are not. The each of the robust standard errors are larger than the standard error for that variable in the first analysis. Also notice that while the R-squared and Root MSE are the same in the two analyses, the value of the F-test is different. This is because the denominator degrees of freedom is different. In the first analysis, it is 305 (there are 310 observations in the data set), while in the second analysis it is 185. (As shown in the output, there are 186 clusters.) Also note that there is no Adjusted R-squared given in the second analysis.
use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/srs, clear regress api00 growth emer yr_rnd Source | SS df MS Number of obs = 309 -------------+------------------------------ F( 3, 305) = 38.94 Model | 1453469.16 3 484489.72 Prob > F = 0.0000 Residual | 3794617.4 305 12441.3685 R-squared = 0.2770 -------------+------------------------------ Adj R-squared = 0.2698 Total | 5248086.56 308 17039.2421 Root MSE = 111.54 ------------------------------------------------------------------------------ api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- growth | -.1027121 .2111831 -0.49 0.627 -.5182723 .3128481 emer | -5.444932 .5395432 -10.09 0.000 -6.506631 -4.383234 yr_rnd | -51.07569 19.91364 -2.56 0.011 -90.2612 -11.89018 _cons | 740.3981 11.55215 64.09 0.000 717.6661 763.1301 ------------------------------------------------------------------------------regress api00 growth emer yr_rnd, cluster(dnum) Regression with robust standard errors Number of obs = 309 F( 3, 185) = 19.72 Prob > F = 0.0000 R-squared = 0.2770 Number of clusters (dnum) = 186 Root MSE = 111.54 ------------------------------------------------------------------------------ | Robust api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- growth | -.1027121 .2291703 -0.45 0.655 -.5548352 .3494111 emer | -5.444932 .7293969 -7.46 0.000 -6.883938 -4.005927 yr_rnd | -51.07569 22.83615 -2.24 0.027 -96.12844 -6.022935 _cons | 740.3981 13.46076 55.00 0.000 713.8418 766.9544 ------------------------------------------------------------------------------
In SAS, you can use proc genmod to account for clustering in data.
proc genmod data = "D:/temp/srs"; model api00= growth emer yr_rnd; run;The GENMOD Procedure Model Information Data Set D:/temp/srs Written by SAS Distribution Normal Link Function Identity Dependent Variable API00 Number of Observations Read 310 Number of Observations Used 309 Missing Values 1 Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 305 3794617.4031 12441.3685 Scaled Deviance 305 309.0000 1.0131 Pearson Chi-Square 305 3794617.4031 12441.3685 Scaled Pearson X2 305 309.0000 1.0131 Log Likelihood -1893.1858 Algorithm converged. Analysis Of Parameter Estimates Standard Wald 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 740.3981 11.4771 717.9033 762.8929 4161.63 <.0001 GROWTH 1 -0.1027 0.2098 -0.5139 0.3085 0.24 0.6245 EMER 1 -5.4449 0.5360 -6.4956 -4.3943 103.18 <.0001 YR_RND 1 -51.0757 19.7843 -89.8523 -12.2991 6.66 0.0098 Scale 1 110.8166 4.4577 102.4152 119.9072 NOTE: The scale parameter was estimated by maximum likelihood.
Note that you have to have the class statement before the repeated statement, or you will get an error message.
proc genmod data = "D:/temp/srs"; class dnum; model api00= growth emer yr_rnd; repeated subject = dnum; run;<some output omitted to save space>
Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| Intercept 740.3981 13.3590 714.2150 766.5812 55.42 <.0001 GROWTH -0.1027 0.2274 -0.5485 0.3431 -0.45 0.6516 EMER -5.4449 0.7239 -6.8637 -4.0261 -7.52 <.0001 YR_RND -51.0757 22.6635 -95.4953 -6.6561 -2.25 0.0242
You will notice that the point estimates in this second analysis match those obtained using Stata, but the standard error are slightly different. That is because Stata uses a constant similar to a finite population correction (fpc) called a finite sample correction (page 351-352) when calculating robust standard errors, while SAS does not. To compare the formulas used by Stata and SAS for calculating the standard errors, please see Stata 8 Reference manual N - R, pages 336-341 and the online SAS documentation http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/genmod_sect39.htm. To apply the fsc or to see the formula used, please see our SAS Code Fragment on Getting Robust Standard Errors for Clustered Data.
When using a multilevel modeling technique to account for the intraclass correlation, you need to make sure that you have random intercepts. This is the default in Stata's xtreg command. This is necessary because our cluster variable is a random variable. If your cluster variable is not a random variable, you can still use this method, but you will have to do some extra work to get the correct denominator. We understand that the explanation provided here is not exhaustive. Please see our Statistics Books for Loan for suggested reading for more information.
xtset dnum
xtreg api00 growth emer yr_rnd, mle
Fitting constant-only model:
Iteration 0: log likelihood = -1931.1472
Iteration 1: log likelihood = -1925.0996
Iteration 2: log likelihood = -1924.698
Iteration 3: log likelihood = -1924.6953
Fitting full model:
Iteration 0: log likelihood = -1878.6289
Iteration 1: log likelihood = -1877.6236
Iteration 2: log likelihood = -1877.6093
Iteration 3: log likelihood = -1877.6093
Random-effects ML regression Number of obs = 309
Group variable (i): dnum Number of groups = 186
Random effects u_i ~ Gaussian Obs per group: min = 1
avg = 1.7
max = 28
LR chi2(3) = 94.17
Log likelihood = -1877.6093 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api00 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth | -.0980205 .2016164 -0.49 0.627 -.4931814 .2971403
emer | -5.639125 .5695138 -9.90 0.000 -6.755351 -4.522898
yr_rnd | -39.64472 18.43406 -2.15 0.032 -75.77481 -3.514637
_cons | 748.1934 11.97179 62.50 0.000 724.7291 771.6577
-------------+----------------------------------------------------------------
/sigma_u | 71.88487 8.258147 8.70 0.000 55.6992 88.07054
/sigma_e | 85.36123 5.101821 16.73 0.000 75.36184 95.36062
-------------+----------------------------------------------------------------
rho | .4149226 .0737081 .2791125 .5618572
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)= 31.15 Prob>=chibar2 = 0.000
The addition of the i() option in the syntax below is used to account for the clustering of observations within dnum.
xtreg api00 growth emer yr_rnd, i(dnum) mle
Fitting constant-only model:
Iteration 0: log likelihood = -1931.1472
Iteration 1: log likelihood = -1925.0996
Iteration 2: log likelihood = -1924.698
Iteration 3: log likelihood = -1924.6953
Fitting full model:
Iteration 0: log likelihood = -1878.6289
Iteration 1: log likelihood = -1877.6236
Iteration 2: log likelihood = -1877.6093
Iteration 3: log likelihood = -1877.6093
Random-effects ML regression Number of obs = 309
Group variable (i): dnum Number of groups = 186
Random effects u_i ~ Gaussian Obs per group: min = 1
avg = 1.7
max = 28
LR chi2(3) = 94.17
Log likelihood = -1877.6093 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api00 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
growth | -.0980205 .2016164 -0.49 0.627 -.4931814 .2971403
emer | -5.639125 .5695138 -9.90 0.000 -6.755351 -4.522898
yr_rnd | -39.64472 18.43406 -2.15 0.032 -75.77481 -3.514637
_cons | 748.1934 11.97179 62.50 0.000 724.7291 771.6577
-------------+----------------------------------------------------------------
/sigma_u | 71.88487 8.258147 8.70 0.000 55.6992 88.07054
/sigma_e | 85.36123 5.101821 16.73 0.000 75.36184 95.36062
-------------+----------------------------------------------------------------
rho | .4149226 .0737081 .2791125 .5618572
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)= 31.15 Prob>=chibar2 = 0.000
Stata 9 has a new command called xtmixed, which is very similar to SAS's proc mixed. As with proc mixed, you can use either restricted maximum likelihood (reml, the default in xtmixed) or the full maximum likelihood (mle, an option in xtmixed). Below, we will show both analyses. Note that when the mle option is used, the results exactly match those above.
xtmixed api00 growth emer yr_rnd || dnum:, cov(id) Performing EM optimization: Performing gradient-based optimization: Iteration 0: log restricted-likelihood = -1871.185 Iteration 1: log restricted-likelihood = -1871.1661 Iteration 2: log restricted-likelihood = -1871.1661 Computing standard errors: Mixed-effects REML regression Number of obs = 309 Group variable: dnum Number of groups = 186 Obs per group: min = 1 avg = 1.7 max = 28 Wald chi2(3) = 109.28 Log restricted-likelihood = -1871.1661 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ api00 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- growth | -.097907 .2028458 -0.48 0.629 -.4954775 .2996635 emer | -5.641348 .5647032 -9.99 0.000 -6.748146 -4.53455 yr_rnd | -39.62703 18.53256 -2.14 0.032 -75.95019 -3.303876 _cons | 748.2184 12.00168 62.34 0.000 724.6955 771.7413 ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+------------------------------------------------ dnum: Identity | sd(_cons) | 72.53047 8.324643 57.91943 90.82735 -----------------------------+------------------------------------------------ sd(Residual) | 85.83346 5.14642 76.3168 96.53685 ------------------------------------------------------------------------------ LR test vs. linear regression: chibar2(01) = 31.40 Prob >= chibar2 = 0.0000xtmixed api00 growth emer yr_rnd || dnum: , mle cov(id) Performing EM optimization: Performing gradient-based optimization: Iteration 0: log likelihood = -1877.6283 Iteration 1: log likelihood = -1877.6093 Iteration 2: log likelihood = -1877.6093 Computing standard errors: Mixed-effects ML regression Number of obs = 309 Group variable: dnum Number of groups = 186 Obs per group: min = 1 avg = 1.7 max = 28 Wald chi2(3) = 110.67 Log likelihood = -1877.6093 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ api00 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- growth | -.0980206 .2015541 -0.49 0.627 -.4930593 .2970181 emer | -5.639124 .5609501 -10.05 0.000 -6.738566 -4.539682 yr_rnd | -39.64473 18.41733 -2.15 0.031 -75.74204 -3.547423 _cons | 748.1934 11.92048 62.77 0.000 724.8297 771.5571 ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+------------------------------------------------ dnum: Identity | sd(_cons) | 71.88479 8.258158 57.39189 90.0375 -----------------------------+------------------------------------------------ sd(Residual) | 85.36127 5.101829 75.92533 95.9699 ------------------------------------------------------------------------------ LR test vs. linear regression: chibar2(01) = 31.15 Prob >= chibar2 = 0.0000
In SAS, there are at least two different ways that you can specify the random statement. We have shown both in the code below, and just commented out one of them. As before, the first analysis does not account for the clustering.
proc mixed data = "D:/temp/api2000"; model api00= growth emer yr_rnd / solution; run;The Mixed Procedure Model Information Data Set WC000001.API2000 Dependent Variable API00 Covariance Structure Diagonal Estimation Method REML Residual Variance Method Profile Fixed Effects SE Method Model-Based Degrees of Freedom Method Residual Dimensions Covariance Parameters 1 Columns in X 4 Columns in Z 0 Subjects 1 Max Obs Per Subject 310 Number of Observations Number of Observations Read 310 Number of Observations Used 309 Number of Observations Not Used 1 Covariance Parameter Estimates Cov Parm Estimate Residual 12441 Fit Statistics -2 Res Log Likelihood 3773.7 AIC (smaller is better) 3775.7 AICC (smaller is better) 3775.7 BIC (smaller is better) 3779.4 Solution for Fixed Effects Standard Effect Estimate Error DF t Value Pr > |t| Intercept 740.40 11.5522 305 64.09 <.0001 GROWTH -0.1027 0.2112 305 -0.49 0.6271 EMER -5.4449 0.5395 305 -10.09 <.0001 YR_RND -51.0757 19.9136 305 -2.56 0.0108 Type 3 Tests of Fixed Effects Num Den Effect DF DF F Value Pr > F GROWTH 1 305 0.24 0.6271 EMER 1 305 101.84 <.0001 YR_RND 1 305 6.58 0.0108
In this analysis, we will account for the clustering in the data. We will need two statements to do this: the class statement and the random statement. As there are several different ways that you can specify the random statement, we have used one and commented out another. Either will work and will give identical results.
proc mixed data = "D:/temp/api2000"; class dnum; model api00= growth emer yr_rnd / solution; random intercept / sub = dnum type = cs; * random dnum / type = cs; run;The Mixed Procedure Model Information Data Set WORK.API2000 Dependent Variable API00 Covariance Structure Compound Symmetry Subject Effect DNUM Estimation Method REML Residual Variance Method Profile Fixed Effects SE Method Model-Based Degrees of Freedom Method Containment Class Level Information Class Levels Values DNUM 186 1 6 10 27 30 41 42 48 56 58 59 62 63 76 80 90 92 93 96 98 103 108 111 114 126 130 131 137 138 140 148 153 158 159 160 166 171 172 177 180 187 189 196 197 198 200 204 209 211 215 216 219 229 230 231 232 233 236 237 238 248 253 259 261 266 270 272 273 275 284 294 298 300 301 306 315 328 333 339 351 364 365 386 390 395 397 401 403 411 416 418 427 439 448 449 453 463 469 470 471 473 476 479 480 484 487 491 498 499 501 507 510 511 512 521 529 532 536 537 541 542 547 558 560 564 570 572 575 584 586 590 596 600 605 608 611 614 615 619 620 621 627 630 632 635 638 639 643 644 650 653 654 660 661 663 668 671 684 686 688 689 691 702 705 706 711 738 745 750 754 756 763 766 774 777 782 784 787 796 803 806 811 813 815 825 832 Dimensions Covariance Parameters 3 Columns in X 4 Columns in Z Per Subject 1 Subjects 186 Max Obs Per Subject 28 Number of Observations Number of Observations Read 310 Number of Observations Used 309 Number of Observations Not Used 1 Iteration History Iteration Evaluations -2 Res Log Like Criterion 0 1 3773.72780489 1 2 3742.87774751 0.03842610 2 1 3742.34943339 0.00097140 3 1 3742.33218877 0.00000117 4 1 3742.33216695 0.00000000 Convergence criteria met but final hessian is not positive definite. Covariance Parameter Estimates Cov Parm Subject Estimate Variance DNUM 4017.49 CS DNUM 1243.18 Residual 7367.38 Fit Statistics -2 Res Log Likelihood 3742.3 AIC (smaller is better) 3748.3 AICC (smaller is better) 3748.4 BIC (smaller is better) 3758.0 Null Model Likelihood Ratio Test DF Chi-Square Pr > ChiSq 2 31.40 <.0001 Solution for Fixed Effects Standard Effect Estimate Error DF t Value Pr > |t| Intercept 748.22 12.0017 185 62.34 <.0001 GROWTH -0.09791 0.2028 120 -0.48 0.6302 EMER -5.6413 0.5647 120 -9.99 <.0001 YR_RND -39.6270 18.5326 120 -2.14 0.0345 Type 3 Tests of Fixed Effects Num Den Effect DF DF F Value Pr > F GROWTH 1 120 0.23 0.6302 EMER 1 120 99.80 <.0001 YR_RND 1 120 4.57 0.0345
The table below compares the results obtained from SAS and Stata using OLS regression (with no correction for the intraclass correlation), clustered robust standard errors, survey methods and multilevel modeling. As you can see, the point estimates are the same for the OLS regression, the survey method and the clustered robust standard errors method. The standard errors, however, are different.
| OLS Regression | Survey Method | Clustered Robust Standard Errors | Multilevel Model | ||||||
| coefficient | standard error | coefficient | standard error | coefficient | standard error | coefficient | standard error | ||
| Stata | constant | 740.3981 | 11.5521 | 740.3981 | 13.3950 | 740.3981 | 13.4608 | 748.1934 | 11.9205 |
| growth | -0.1027 | 0.2111 | -0.1027 | 0.2281 | -0.1027 | 0.2292 | -0.0980 | 0.2016 | |
| emer | -5.4449 | 0.5395 | -5.4449 | 0.7258 | -5.4449 | 0.7294 | -5.6391 | 0.5610 | |
| yr_rnd | -51.0756 | 19.9136 | -51.0756 | 22.7247 | -51.0757 | 22.8361 | -39.6447 | 18.4173 | |
| SAS | constant | 740.3980 | 11.5521 | 740.3980 | 13.4607 | 740.3981 | 13.3590 | 748.22 | 12.0017 |
| growth | -0.1027 | 0.2111 | -0.1027 | 0.2292 | -0.1027 | 0.2274 | -0.0979 | 0.2028 | |
| emer | -5.4449 | 0.5395 | -5.4449 | 0.7294 | -5.4449 | 0.7239 | -5.6413 | 0.5647 | |
| yr_rnd | -51.0756 | 19.9136 | -51.0756 | 22.8362 | -51.0757 | 22.6635 | -39.6270 | 18.5326 | |
References for intraclass correlations (definition and calculation of):
Applied Longitudinal Data Analysis:
Modeling Change and Event Occurrence by Judith D. Singer and John B. Willett
(page 96)
Stata Reference Manual G - M, pages 340-341
Multilevel Analysis:
An Introduction to Basic and Advanced Multilevel Modeling by Tom Snijders
and Roel Bosker (pages 16 - 22)
References for Huber-White standard errors
Analysis of Longitudinal Data, Second Edition by Peter J. Diggle, Patrick
Heagerty, Kug-Yee Liang, and Scott L. Zeger (pages 70 - 80)
Applied
Longitudinal Analysis by Garrett M. Fitzmaurice, Nan M. Laird and James H.
Ware (pages 303 - 305)
Hierarchical Linear Models, Second Edition by Stephen Raudenbush and
Anthony Bryk (pages 272 - 280)
References for survey data analysis
Stata 11 Survey manual
Combined Survey Sampling Inference by Ken Brewer
The Ideas of Sampling by Alan Stuart
Sampling: Design and Analysis by
Sharon L. Lohr
Analysis of Health Surveys by
Edward L. Korn and Barry I. Graubard
Processing Data - The Survey Example by Linda B. Bourque and Virginia A.
Clark
Sampling of Populations: Methods and
Applications, Third Edition by Paul Levy and Stanley Lemeshow
Survey Research Methods, Third Edition by Floyd Fowler Jr.
Practical Methods for Design and Analysis of Complex Surveys, Second Edition
by Risto Lehtonen and Erkki Pahkinen
References for multilevel modeling
Introduction to Multilevel
Modeling by Ita Kreft and Jan de Leeuw
Hierarchical Linear Models, Second Edition by Stephen Raudenbush and
Anthony Bryk
HLM 5 - Hierarchical and Nonlinear Modeling by Raudenbush, et. al.
Multilevel Analysis: An
Introduction to Basic and Advanced Multilevel Modeling by Tom Snijders
and Roel Bosker
Multilevel Modeling of Health Statistics Edited by A. H. Leyland and H.
Goldstein
Multilevel Analysis:
Techniques and Applications by Joop Hox
An Introduction to
Multilevel Modeling Techniques by Ronald Heck and Scott Thomas
Multilevel
Modeling by Douglas A. Luke
Applied
Multilevel Analysis by J.J. Hox (PDF, free download)
Multilevel
Statistical Models by Harvey Goldstein (PDF, free download)
Multilevel Modeling: Methodological Advances, Issues, and Applications
by Steven P. Reise and Naihua Duan
Topics in Modeling of Clustered Data by Marc Aerts Geert Molenberghs
Helena Geys Louise Ryan Read it Online! (UC Only)
Multilevel
Statistical Models, Third Edition by Harvey Goldstein
Mixed Models: Theory and Applications by Eugene Demidenko
Design and Analysis of Cluster Randomization Trials In Health Research
by Allan Donner and Neil S. Klar
There are two differences between robust standard errors and clustered robust standard errors. The first is that for robust standard errors, the unit is the observation, whereas for the clustered robust standard errors, the unit is the cluster. The other difference is the calculation of the constant that is multiplied with the sandwich estimator: for the robust standard error, it is n / (n - 1) and for the clustered robust standard errors, it is the number of clustered divided by the number of clusters minus 1.
The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.