### Stata Library: Survey Sampling Examples Using Stata 9

#### Introduction

Survey data generally have one or more of these three characteristics:
• sampling or probability weights
• clustering
• stratification

Stata takes these characteristics into account through the use of survey procedures. Before issuing any survey commands it is necessary to set one or more of the following items:

• [pweight=varname1] following svyset -- sets the sampling weights
• strata(varname2) option of svyset -- sets the strata
• varname3 as the first argument of svyset -- sets the primary sampling unit (cluster)
• fpc(varname4) option of svyset -- sets the finite population correction
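
In Stata 9 syntax the four settings are declared together in one svyset command. The sketch below is an illustration on made-up data, not part of the original examples: the variable names stratum, cluster, weight, npsu and y, and all of the values, are hypothetical.

```stata
* Hypothetical toy data: 2 strata, 2 sampled clusters per stratum
clear
input stratum cluster weight npsu y
1 1 15 30 10
1 1 15 30 12
1 2 15 30 14
1 2 15 30 11
2 3 10 20 20
2 3 10 20 22
2 4 10 20 19
2 4 10 20 21
end

* PSU first, weights in brackets, strata and fpc as options
svyset cluster [pweight=weight], strata(stratum) fpc(npsu)

* Display the stored settings, then run a survey estimator
svyset
svy: mean y
```

Here the weights are the reciprocals of the cluster sampling fractions (2 of 30 clusters in the first stratum, 2 of 20 in the second), and npsu holds the number of clusters in each stratum of the made-up population.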

Analyzing survey data without taking these characteristics into account can result in inaccurate point estimates and/or inaccurate estimates of standard errors.

In this unit we will be using data from the book Sampling of Populations by Levy and Lemeshow (1999) with permission of the authors.

#### Some Definitions

sampling fraction
The proportion of the population being sampled. If a district has 30 elementary schools and you sample four of them, then your population size is 30 and your sampling fraction is 4/30.
pweight
The sampling weight which is the reciprocal of the sampling fraction. From our previous example, if the sampling fraction is 4/30 then the pweight is 30/4 = 7.5. Thus, each school in our sample represents 7.5 schools.
cluster
Groups, such as counties, city blocks, schools, or households, that are sampled as a group.
psu
Many sampling designs involve multistage sampling, i.e., multiple levels of clusters. The psu indicates the first or primary level of clusters that are sampled.
fpc
The finite population correction, which is used in simple random sampling without replacement. fpc indicates the total number of PSUs in the population.
strata
The division of a population into parts known as strata for the purposes of drawing samples.
sampling frame
A list of all the elements in the population with some chance of being selected.
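
The arithmetic in these definitions can be checked directly in Stata. This is only a sketch: the scalar names are hypothetical, and the factor 1 - n/N is the usual finite population correction applied to a variance under simple random sampling without replacement, which the definitions above do not spell out.

```stata
* Worked example from the definitions: 4 of 30 elementary schools sampled
* (scalar names are hypothetical, chosen for this illustration)
scalar popsize  = 30
scalar sampsize = 4

* Sampling fraction: 4/30
scalar sampfrac = sampsize/popsize

* Sampling weight: reciprocal of the sampling fraction, 30/4 = 7.5
scalar sampwt = 1/sampfrac

* Finite population correction factor for the variance: 1 - n/N
scalar fpcfactor = 1 - sampfrac

display "sampling fraction = " sampfrac
display "pweight           = " sampwt
display "fpc factor        = " fpcfactor
```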

#### The Population

California requires that all students in public schools be tested each year. The State Department of Education then puts together the annual Academic Performance Index (API) which rates how a school is doing overall, in terms of the test scores. The file, apipop.dta, contains api ratings and demographic information on 6,194 schools in 757 school districts. To be included in the file schools must have at least 100 students.

Of course, in the normal course of events you wouldn't actually have access to data from the whole population. We were lucky in this instance that California collects and releases these data.

Let's try several computations on the population data.

use http://www.ats.ucla.edu/stat/stata/library/apipop, clear

tabulate stype

stype |      Freq.     Percent        Cum.
------------+-----------------------------------
E |       4421       71.38       71.38
H |        755       12.19       83.56
M |       1018       16.44      100.00
------------+-----------------------------------
Total |       6194      100.00

summarize api00

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
api00 |    6194    664.7126   128.2441        346        969

quietly summarize enroll

display %10.0fc r(sum)
3,811,472

regress api00 meals ell avg_ed

Source |       SS       df       MS                  Number of obs =    6016
---------+------------------------------               F(  3,  6012) = 5837.12
Model |  73775065.7     3  24591688.6               Prob > F      =  0.0000
Residual |  25328472.8  6012  4212.98616               R-squared     =  0.7444
Total |  99103538.5  6015  16476.0662               Root MSE      =  64.908

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
meals |  -1.672069   .0568866    -29.393   0.000      -1.783587   -1.560551
ell |  -.6775632   .0616073    -10.998   0.000      -.7983355   -.5567908
avg_ed |   72.30502    2.09055     34.587   0.000       68.20679    76.40325
_cons |    558.443   7.969069     70.076   0.000       542.8207    574.0652
------------------------------------------------------------------------------


#### Simple Random Sampling Example

Let's take a simple random sample of 200 schools from the population file. This can be accomplished with the commands:
generate i = uniform()
sort i
keep in 1/200

In this example, the sampling frame contains the 6,194 schools, so fpc = 6194 and the sampling weights (pw) = 6194/200 = 30.97.
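
One way the sample and its design variables might be built from the population file is sketched below. The seed value and the variable names pw_srs and fpc_srs are assumptions made for this illustration (the distributed apisrs.dta stores its weights and correction as pw and fpc). A different seed gives a different sample, so the estimates will not match the apisrs.dta results exactly.

```stata
* Draw a simple random sample of 200 schools from the population file
use http://www.ats.ucla.edu/stat/stata/library/apipop, clear

* Arbitrary seed, set only so that the draw can be repeated
set seed 12345
generate i = uniform()
sort i
keep in 1/200

* Design variables (hypothetical names): weight = N/n, fpc = N
generate pw_srs  = 6194/200
generate fpc_srs = 6194

* Each observation is its own PSU, so _n is given as the sampling unit
svyset _n [pweight=pw_srs], fpc(fpc_srs)
svy: mean api00
```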

Of course, in the real world you probably wouldn't take a sample of 200 schools from a computer file of 6,194; you would just analyze the entire dataset. But suppose you had to go out to each school to collect the data that you needed. Then it would take much less time and cost much less money to go to 200 schools than to over 6,000 schools.

The file apisrs.dta has a simple random sample of 200 cases.
use http://www.ats.ucla.edu/stat/stata/library/apisrs, clear

tabulate stype

stype |      Freq.     Percent        Cum.
------------+-----------------------------------
E |        145       72.50       72.50
H |         25       12.50       85.00
M |         30       15.00      100.00
------------+-----------------------------------
Total |        200      100.00

tabulate dnum

district |
number |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          1        0.50        0.50
40 |          1        0.50        1.00
41 |          1        0.50        1.50
43 |          1        0.50        2.00
46 |          3        1.50        3.50
48 |          1        0.50        4.00
55 |          1        0.50        4.50
56 |          2        1.00        5.50
57 |          1        0.50        6.00
60 |          1        0.50        6.50
67 |          1        0.50        7.00
80 |          1        0.50        7.50
90 |          2        1.00        8.50
98 |          1        0.50        9.00
103 |          1        0.50        9.50
105 |          1        0.50       10.00
108 |          2        1.00       11.00
124 |          1        0.50       11.50
131 |          1        0.50       12.00
135 |          2        1.00       13.00
148 |          2        1.00       14.00
154 |          1        0.50       14.50
159 |          1        0.50       15.00
162 |          1        0.50       15.50
166 |          3        1.50       17.00
175 |          1        0.50       17.50
176 |          1        0.50       18.00
184 |          1        0.50       18.50
190 |          1        0.50       19.00
209 |          1        0.50       19.50
217 |          1        0.50       20.00
222 |          1        0.50       20.50
229 |          1        0.50       21.00
231 |          1        0.50       21.50
238 |          1        0.50       22.00
248 |          2        1.00       23.00
253 |          3        1.50       24.50
255 |          1        0.50       25.00
259 |          1        0.50       25.50
266 |          1        0.50       26.00
272 |          1        0.50       26.50
274 |          1        0.50       27.00
278 |          2        1.00       28.00
293 |          1        0.50       28.50
301 |          1        0.50       29.00
304 |          1        0.50       29.50
335 |          1        0.50       30.00
351 |          1        0.50       30.50
352 |          1        0.50       31.00
353 |          1        0.50       31.50
358 |          1        0.50       32.00
360 |          1        0.50       32.50
379 |          1        0.50       33.00
390 |          1        0.50       33.50
393 |          1        0.50       34.00
395 |          2        1.00       35.00
401 |         18        9.00       44.00
416 |          1        0.50       44.50
418 |          2        1.00       45.50
436 |          1        0.50       46.00
444 |          1        0.50       46.50
445 |          1        0.50       47.00
451 |          1        0.50       47.50
457 |          2        1.00       48.50
459 |          1        0.50       49.00
460 |          1        0.50       49.50
470 |          1        0.50       50.00
473 |          1        0.50       50.50
479 |          1        0.50       51.00
491 |          1        0.50       51.50
495 |          1        0.50       52.00
498 |          1        0.50       52.50
503 |          2        1.00       53.50
507 |          5        2.50       56.00
509 |          1        0.50       56.50
513 |          2        1.00       57.50
529 |          2        1.00       58.50
532 |          1        0.50       59.00
533 |          1        0.50       59.50
536 |          1        0.50       60.00
537 |          2        1.00       61.00
539 |          3        1.50       62.50
541 |          1        0.50       63.00
542 |          1        0.50       63.50
547 |          1        0.50       64.00
556 |          2        1.00       65.00
564 |          1        0.50       65.50
570 |          1        0.50       66.00
579 |          1        0.50       66.50
590 |          1        0.50       67.00
600 |          1        0.50       67.50
602 |          1        0.50       68.00
605 |          1        0.50       68.50
614 |          2        1.00       69.50
620 |          3        1.50       71.00
623 |          1        0.50       71.50
627 |          3        1.50       73.00
629 |          1        0.50       73.50
630 |          2        1.00       74.50
632 |          5        2.50       77.00
633 |          1        0.50       77.50
635 |          1        0.50       78.00
636 |          2        1.00       79.00
637 |          1        0.50       79.50
640 |          1        0.50       80.00
642 |          1        0.50       80.50
643 |          1        0.50       81.00
644 |          1        0.50       81.50
645 |          1        0.50       82.00
648 |          1        0.50       82.50
651 |          1        0.50       83.00
653 |          1        0.50       83.50
658 |          1        0.50       84.00
665 |          1        0.50       84.50
688 |          1        0.50       85.00
689 |          1        0.50       85.50
702 |          1        0.50       86.00
711 |          1        0.50       86.50
716 |          1        0.50       87.00
720 |          1        0.50       87.50
731 |          1        0.50       88.00
739 |          1        0.50       88.50
744 |          3        1.50       90.00
745 |          1        0.50       90.50
750 |          1        0.50       91.00
751 |          1        0.50       91.50
754 |          1        0.50       92.00
756 |          1        0.50       92.50
761 |          1        0.50       93.00
779 |          2        1.00       94.00
780 |          1        0.50       94.50
782 |          1        0.50       95.00
788 |          1        0.50       95.50
796 |          4        2.00       97.50
797 |          1        0.50       98.00
803 |          1        0.50       98.50
815 |          1        0.50       99.00
830 |          1        0.50       99.50
834 |          1        0.50      100.00
------------+-----------------------------------
Total |        200      100.00

svyset

pweight: pw
VCE: linearized
Strata 1: <one>
SU 1: <observations>
FPC 1: fpc

svy: mean api00

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =     200
Number of PSUs   =     200          Population size  =    6194
Design df        =     199

--------------------------------------------------------------
|             Linearized
|       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
api00 |    660.165   9.186887      642.0489    678.2811
--------------------------------------------------------------
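
For a simple random sample with equal weights, the linearized standard error of the mean reduces to the textbook formula s/sqrt(n) multiplied by sqrt(1 - n/N). The sketch below assumes that api00 has no missing values in apisrs.dta; under that assumption it should reproduce the standard error reported above, up to rounding.

```stata
use http://www.ats.ucla.edu/stat/stata/library/apisrs, clear

* summarize leaves the sample SD in r(sd) and the sample size in r(N)
quietly summarize api00

* s/sqrt(n), shrunk by the finite population correction sqrt(1 - n/N)
display r(sd)/sqrt(r(N)) * sqrt(1 - r(N)/6194)
```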

svy: total enroll

(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =     200
Number of PSUs   =     200          Population size  =    6194
Design df        =     199

--------------------------------------------------------------
|             Linearized
|      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
enroll |    3924828   220705.4       3489607     4360049
--------------------------------------------------------------

svy: regress api00 meals ell avg_ed

(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       200
Number of PSUs     =       200                  Population size    = 6193.9999
Design df          =       199
F(   3,    197)    =    217.11
Prob > F           =    0.0000
R-squared          =    0.7640

------------------------------------------------------------------------------
|             Linearized
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals |  -1.367668   .3544273    -3.86   0.000    -2.066583   -.6687524
ell |  -1.266818   .3895673    -3.25   0.001    -2.035028   -.4986079
avg_ed |   75.49145   14.28649     5.28   0.000     47.31912    103.6638
_cons |   544.7082   56.15402     9.70   0.000     433.9749    655.4414
------------------------------------------------------------------------------

#### Stratified Random Sampling Example

This time, instead of taking a simple random sample of the whole population, we will take separate simple random samples of elementary schools, high schools and middle schools. This is known as stratified random sampling. We will sample 100 elementary schools, 50 high schools and 50 middle schools.

In this example, there are three sampling frames: 4,421 elementary schools, 755 high schools, and 1,018 middle schools.
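
Because each stratum has its own sampling fraction, the sampling weights differ by school type. The sketch below only works through the arithmetic implied by the frame and sample sizes given above. It assumes, without checking the file, that apistrat.dta stores these same values in pw.

```stata
* Stratum weight = schools in the stratum frame / schools sampled
* elementary: 44.21, high: 15.1, middle: 20.36
display "E: " 4421/100
display "H: " 755/50
display "M: " 1018/50

* The weights summed over the 200 sampled schools recover the 6,194 schools
display 100*(4421/100) + 50*(755/50) + 50*(1018/50)
```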

The file apistrat.dta contains the data for the stratified random sample.
use http://www.ats.ucla.edu/stat/stata/library/apistrat, clear

tabulate stype

stype |      Freq.     Percent        Cum.
------------+-----------------------------------
E |        100       50.00       50.00
H |         50       25.00       75.00
M |         50       25.00      100.00
------------+-----------------------------------
Total |        200      100.00

tabulate dnum

district |
number |      Freq.     Percent        Cum.
------------+-----------------------------------
19 |          1        0.50        0.50
20 |          1        0.50        1.00
25 |          1        0.50        1.50
27 |          1        0.50        2.00
40 |          1        0.50        2.50
41 |          1        0.50        3.00
64 |          1        0.50        3.50
69 |          1        0.50        4.00
105 |          1        0.50        4.50
108 |          1        0.50        5.00
114 |          1        0.50        5.50
135 |          1        0.50        6.00
140 |          1        0.50        6.50
148 |          2        1.00        7.50
153 |          5        2.50       10.00
155 |          1        0.50       10.50
158 |          2        1.00       11.50
160 |          1        0.50       12.00
162 |          1        0.50       12.50
176 |          1        0.50       13.00
182 |          1        0.50       13.50
185 |          2        1.00       14.50
196 |          1        0.50       15.00
202 |          1        0.50       15.50
208 |          1        0.50       16.00
214 |          1        0.50       16.50
215 |          2        1.00       17.50
216 |          1        0.50       18.00
223 |          1        0.50       18.50
225 |          1        0.50       19.00
226 |          1        0.50       19.50
233 |          1        0.50       20.00
238 |          2        1.00       21.00
247 |          1        0.50       21.50
253 |          4        2.00       23.50
259 |          4        2.00       25.50
266 |          2        1.00       26.50
270 |          2        1.00       27.50
273 |          1        0.50       28.00
275 |          1        0.50       28.50
279 |          1        0.50       29.00
284 |          1        0.50       29.50
294 |          1        0.50       30.00
308 |          1        0.50       30.50
316 |          1        0.50       31.00
324 |          1        0.50       31.50
333 |          1        0.50       32.00
339 |          1        0.50       32.50
348 |          1        0.50       33.00
349 |          1        0.50       33.50
351 |          1        0.50       34.00
358 |          1        0.50       34.50
364 |          1        0.50       35.00
376 |          1        0.50       35.50
382 |          2        1.00       36.50
390 |          1        0.50       37.00
394 |          1        0.50       37.50
395 |          3        1.50       39.00
401 |         16        8.00       47.00
419 |          1        0.50       47.50
423 |          1        0.50       48.00
432 |          1        0.50       48.50
439 |          1        0.50       49.00
448 |          1        0.50       49.50
450 |          1        0.50       50.00
457 |          1        0.50       50.50
459 |          1        0.50       51.00
460 |          1        0.50       51.50
465 |          1        0.50       52.00
473 |          3        1.50       53.50
475 |          1        0.50       54.00
478 |          1        0.50       54.50
484 |          1        0.50       55.00
492 |          1        0.50       55.50
495 |          1        0.50       56.00
497 |          1        0.50       56.50
498 |          1        0.50       57.00
499 |          1        0.50       57.50
501 |          1        0.50       58.00
507 |          4        2.00       60.00
509 |          1        0.50       60.50
512 |          1        0.50       61.00
513 |          2        1.00       62.00
514 |          1        0.50       62.50
515 |          1        0.50       63.00
531 |          2        1.00       64.00
532 |          1        0.50       64.50
537 |          1        0.50       65.00
541 |          3        1.50       66.50
550 |          1        0.50       67.00
554 |          1        0.50       67.50
569 |          1        0.50       68.00
575 |          2        1.00       69.00
590 |          2        1.00       70.00
596 |          1        0.50       70.50
602 |          2        1.00       71.50
605 |          1        0.50       72.00
620 |          2        1.00       73.00
621 |          3        1.50       74.50
627 |          1        0.50       75.00
630 |          2        1.00       76.00
632 |          4        2.00       78.00
635 |          2        1.00       79.00
636 |          2        1.00       80.00
639 |          2        1.00       81.00
650 |          1        0.50       81.50
653 |          2        1.00       82.50
655 |          1        0.50       83.00
656 |          1        0.50       83.50
662 |          1        0.50       84.00
685 |          1        0.50       84.50
689 |          5        2.50       87.00
702 |          1        0.50       87.50
706 |          1        0.50       88.00
722 |          1        0.50       88.50
725 |          2        1.00       89.50
735 |          1        0.50       90.00
738 |          1        0.50       90.50
751 |          1        0.50       91.00
756 |          1        0.50       91.50
760 |          1        0.50       92.00
766 |          1        0.50       92.50
767 |          2        1.00       93.50
774 |          1        0.50       94.00
780 |          2        1.00       95.00
781 |          1        0.50       95.50
784 |          1        0.50       96.00
787 |          1        0.50       96.50
796 |          1        0.50       97.00
797 |          1        0.50       97.50
802 |          1        0.50       98.00
806 |          1        0.50       98.50
813 |          1        0.50       99.00
819 |          1        0.50       99.50
825 |          1        0.50      100.00
------------+-----------------------------------
Total |        200      100.00

svyset

pweight: pw
VCE: linearized
Strata 1: stype
SU 1: <observations>
FPC 1: fpc
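
Typing svyset with no arguments only displays the settings already stored in the file. A declaration that would produce the same settings is sketched below. It is an assumption about how the file was prepared, using the variable names shown in the output above.

```stata
use http://www.ats.ucla.edu/stat/stata/library/apistrat, clear

* Schools are the sampling units (_n), sampled separately within each stype
svyset _n [pweight=pw], strata(stype) fpc(fpc)

* Design df = number of PSUs - number of strata = 200 - 3 = 197
svy: mean api00
```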

svy: mean api00

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       3          Number of obs    =     200
Number of PSUs   =     200          Population size  =    6194
Design df        =     197

--------------------------------------------------------------
|             Linearized
|       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
api00 |   662.2874   9.408941      643.7322    680.8425
--------------------------------------------------------------

svy: total enroll

(running total on estimation sample)

Survey: Total estimation

Number of strata =       3          Number of obs    =     200
Number of PSUs   =     200          Population size  =    6194
Design df        =     197

--------------------------------------------------------------
|             Linearized
|      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
enroll |    3687178   114641.7       3461095     3913260
--------------------------------------------------------------

svy: regress api00 meals ell avg_ed

(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         3                  Number of obs      =       200
Number of PSUs     =       200                  Population size    =      6194
Design df          =       197
F(   3,    195)    =    190.97
Prob > F           =    0.0000
R-squared          =    0.7125

------------------------------------------------------------------------------
|             Linearized
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals |  -1.818234   .4076227    -4.46   0.000    -2.622098    -1.01437
ell |  -.0191524   .3890413    -0.05   0.961    -.7863727    .7480679
avg_ed |   77.47879   16.93665     4.57   0.000     44.07838    110.8792
_cons |   534.4453   65.57342     8.15   0.000     405.1294    663.7613
------------------------------------------------------------------------------

#### One-Stage Cluster Sampling

Another approach to sampling from the population is cluster sampling. In this example we will use school districts as the cluster or primary sampling units. We will take a random sample of 15 school districts and look at all of the schools in each one.

In this example, the sampling frame contains the 757 school districts.

The file apiclus1.dta contains the data for the one-stage cluster sampling design.
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear

tabulate stype

stype |      Freq.     Percent        Cum.
------------+-----------------------------------
E |        144       78.69       78.69
H |         14        7.65       86.34
M |         25       13.66      100.00
------------+-----------------------------------
Total |        183      100.00

tabulate dnum

district |
number |      Freq.     Percent        Cum.
------------+-----------------------------------
61 |         13        7.10        7.10
135 |         34       18.58       25.68
178 |          4        2.19       27.87
197 |         13        7.10       34.97
255 |         16        8.74       43.72
406 |          2        1.09       44.81
413 |          1        0.55       45.36
437 |          4        2.19       47.54
448 |         12        6.56       54.10
510 |         21       11.48       65.57
568 |          9        4.92       70.49
637 |         11        6.01       76.50
716 |         37       20.22       96.72
778 |          2        1.09       97.81
815 |          4        2.19      100.00
------------+-----------------------------------
Total |        183      100.00

svyset dnum [pw=pw], fpc(fpc)

pweight: pw
VCE: linearized
Strata 1: <one>
SU 1: dnum
FPC 1: fpc

/* list fpc pw dnum -- to see the values for these items */

svy: mean api00

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =     183
Number of PSUs   =      15          Population size  =  9235.4
Design df        =      14

--------------------------------------------------------------
|             Linearized
|       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
api00 |   644.1694   23.54224      593.6763    694.6625
--------------------------------------------------------------
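
The reported Population size of 9235.4 is the sum of the sampling weights, not the true count of 6,194 schools. The sketch below assumes that every school in a sampled district carries the weight 757/15, which is consistent with the reported figure. With cluster sampling the number of schools captured depends on which districts happen to be drawn, so the estimated population size varies from sample to sample.

```stata
* Weight for each sampled district: districts in frame / districts sampled
display 757/15

* Sum of the weights over the 183 sampled schools: 9235.4
display 183 * (757/15)
```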

svy: total enroll

(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =     183
Number of PSUs   =      15          Population size  =  9235.4
Design df        =      14

--------------------------------------------------------------
|             Linearized
|      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
enroll |    5076846    1389984       2095626     8058066
--------------------------------------------------------------

svy: regress api00 meals ell avg_ed

(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       157
Number of PSUs     =        15                  Population size    = 9235.4001
Design df          =        14
F(   3,     12)    =     54.36
Prob > F           =    0.0000
R-squared          =    0.6978

------------------------------------------------------------------------------
|             Linearized
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals |  -2.948702   .3266161    -9.03   0.000    -3.649224    -2.24818
ell |  -.2227005   .3938377    -0.57   0.581    -1.067398    .6219974
avg_ed |   16.42832   15.32151     1.07   0.302    -16.43304    49.28968
_cons |   755.4386   55.61202    13.58   0.000     636.1626    874.7145
------------------------------------------------------------------------------
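
The regression uses 157 of the 183 schools because Stata drops any observation that is missing a value on a variable in the model. A quick check, sketched here under the assumption that meals, ell and avg_ed are the only variables responsible for the dropped cases, is to count the incomplete observations. If that assumption holds, the count should be 183 - 157 = 26.

```stata
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear

* Numeric missing values sort above every number, so ">= ." catches them all
count if meals >= . | ell >= . | avg_ed >= .
```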
