Statistical Computing Seminars Arrays in SAS

You can view movies of this seminar via the links below.

This seminar is designed to help you improve your SAS data management skills via the use of arrays.

Here is the program on which the seminar is based.

In this seminar we will cover the following ten topics:

Recoding variables
Applying math computations to many variables simultaneously
Computing new variables
Collapsing over variables
Identify patterns across variables using arrays
Reshaping wide to long
Understanding the functions first., last. and the retain statement
Reshaping long to wide using arrays
Comparisons across observations using arrays

First we run the SAS options so that we can get rid of the date, page number, centering and page break in the output.

options nodate nonumber nocenter formdlim="-";

Recoding variables

Inputting the faminc data set.

data faminc;
input famid faminc1-faminc12 ;
cards;
1 3281 3413 3114 2500 2700 3500 3114 3319 3514 1282 2434 2818
2 4042 3084 3108 3150 3800 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 6100 6100 6200 6186 6132 3123 4231 6039 6215
;
run;

Recoding manually using if-then.

data recode_manual;
set faminc;
if faminc1 < 3000 then faminc1=.;
if faminc2 < 3000 then faminc2=.;
if faminc3 < 3000 then faminc3=.;
if faminc4 < 3000 then faminc4=.;
if faminc5 < 3000 then faminc5=.;
if faminc6 < 3000 then faminc6=.;
if faminc7 < 3000 then faminc7=.;
if faminc8 < 3000 then faminc8=.;
if faminc9 < 3000 then faminc9=.;
if faminc10 < 3000 then faminc10=.;
if faminc11 < 3000 then faminc11=.;
if faminc12 < 3000 then faminc12=.;
run;
/*heading option specifies horizontal (H) column headings/*
proc print data=recode_manual noobs heading=H;
var famid faminc1-faminc6;
run;

famid    faminc1    faminc2    faminc3    faminc4    faminc5    faminc6
1        3281       3413       3114          .          .       3500
2        4042       3084       3108       3150       3800       3100
3        6015       6123       6113       6100       6100       6200

Recoding with arrays using if-then.
Note: In the code we use the square brackets around the subscript variable i. The choice between square brackets, curly brackets or parenthesis is completely arbitrary. We have decided to use the square brackets as a visual reminder that i is a subscript and not a part of a mathematical computation.

data recode_array;
set faminc;
array Afaminc(12) faminc1-faminc12;
do i = 1 to 12;
if Afaminc[i] < 3000 then Afaminc[i] = . ;
end;
drop i;
run;
proc print data=recode_array noobs heading=H;
var famid faminc1-faminc6;
run;

famid    faminc1    faminc2    faminc3    faminc4    faminc5    faminc6
1        3281       3413       3114          .          .       3500
2        4042       3084       3108       3150       3800       3100
3        6015       6123       6113       6100       6100       6200

Applying the same math computation to many variables at a time

Reverse items on a -3 to +3 scale using array.

data score;
input item1 item2 item3 item4;
cards;
-2   1   -3   0
-1   2   -2   1
0  -1   -3  -1
;
run;
data score_array1;
set score;
array item(4) item1-item4;
do i=1 to 4;
item[i] = -1*item[i];
end;
run;
proc print data=score_array1;
run;

Obs    item1    item2    item3    item4    i
1       2        -1       3         0     5
2       1        -2       2        -1     5
3       0         1       3         1     5

Computing new variables

Computing the tax income variables manually.

data tax_manual;
set faminc;
taxinc1 = faminc1 * .10 ;
taxinc2 = faminc2 * .10 ;
taxinc3 = faminc3 * .10 ;
taxinc4 = faminc4 * .10 ;
taxinc5 = faminc5 * .10 ;
taxinc6 = faminc6 * .10 ;
taxinc7 = faminc7 * .10 ;
taxinc8 = faminc8 * .10 ;
taxinc9 = faminc9 * .10 ;
taxinc10= faminc10 * .10 ;
taxinc11= faminc11 * .10 ;
taxinc12= faminc12 * .10 ;
run;
proc print data=tax_manual noobs;
var famid faminc1-faminc3 taxinc1-taxinc3;
run;

famid    faminc1    faminc2    faminc3    taxinc1    taxinc2    taxinc3
1        3281       3413       3114      328.1      341.3      311.4
2        4042       3084       3108      404.2      308.4      310.8
3        6015       6123       6113      601.5      612.3      611.

Computing the same tax income variables using an array. We have to use two arrays because the first array, Afaminc, is the array for the existing variables (faminc1-faminc12); the second array, Ataxinc, is created as a "placeholder" where we will store the new variables (taxinc1-taxinc12).

data tax_array;
set faminc;
array Afaminc(12) faminc1-faminc12; /* existing variables */
array Ataxinc(12) taxinc1-taxinc12; /* new variables */
do month = 1 to 12;
Ataxinc[month] = Afaminc[month]*0.1;
end;
run;
proc print data=tax_array noobs;
var famid faminc1-faminc3 taxinc1-taxinc3;
run;

famid    faminc1    faminc2    faminc3    taxinc1    taxinc2    taxinc3
1        3281       3413       3114      328.1      341.3      311.4
2        4042       3084       3108      404.2      308.4      310.8
3        6015       6123       6113      601.5      612.3      611.

Collapsing over variables

Creating the total income per quarter variables manually.

data quarter_manual;
set faminc;
incq1 = faminc1 + faminc2 + faminc3;
incq2 = faminc4 + faminc5 + faminc6;
incq3 = faminc7 + faminc8 + faminc9;
incq4 = faminc10 + faminc11 + faminc12;
run;
proc print data=quarter_manual;
var incq1 faminc1-faminc3;
run;

Obs    incq1    faminc1    faminc2    faminc3
1      9808      3281       3413       3114
2     10234      4042       3084       3108
3     18251      6015       6123       6113

Creating the total income per quarter variables using arrays.

data quarter_array;
set faminc;
array Afaminc(12) faminc1-faminc12; /*existing vars*/
array Aquarter(4) incq1-incq4; /* new vars */
do q = 1 to 4;
Aquarter[q] = Afaminc[3*q-2] + Afaminc[3*q-1] + Afaminc[3*q];
end;
run;
/* For q=1:  Aquarter[1] = Afaminc[3*1-2] + Afaminc[3*1-1] + Afaminc[3*1]
= Afaminc[1] + Afaminc[2] + Afaminc[3]
For q=2:  Aquarter[2] = Afaminc[3*2-2] + Afaminc[3*2-1] + Afaminc[3*2]
= Afaminc[4] + Afaminc[5] + Afaminc[6] */
proc print data=quarter_array nobs;
var famid incq1 faminc1-faminc3;
run;

famid    incq1    faminc1    faminc2    faminc3
1      9808      3281       3413       3114
2     10234      4042       3084       3108
3     18251      6015       6123       6113

Identify patterns across variables using arrays

In this section the objective is to identify the months in which income was less than half of previous month and store information in the dummy variables lowinc2-lowinc12 looping over months 2-12. Note that month 1 has no previous month! The variable ever indicates if income has ever been less than half of a previous month for any month.
Note: The array "size" specified in the parenthesis is usually one number and it is understood by SAS that it is supposed to create an array where the index ranges from one to the number in the parenthesis. But we can specify any range for the index which suits our program. We are only interested in lowincome variables corresponding to months 2-12 and thus we indicate that the range for the index of array Alowinc should be 2 to 12.

data pattern;
set faminc;
length ever $4; array Afaminc(12) faminc1-faminc12; /* existing vars */ array Alowinc(2:12) lowinc2-lowinc12; /* new vars */ do m = 2 to 12; if Afaminc[m] < (Afaminc[m-1] / 2) then Alowinc[m] = 1; else Alowinc[m] = 0; end; sum_low = sum(of lowinc:); /*sums over all vars with lowinc as part of name*/ if sum_low > 0 then ever='Yes'; if sum_low = 0 then ever='No'; drop m sum_low; run; proc print data=pattern noobs heading=H; var famid faminc1-faminc6 lowinc2-lowinc6 ever; run; famid faminc1 faminc2 faminc3 faminc4 faminc5 faminc6 1 3281 3413 3114 2500 2700 3500 2 4042 3084 3108 3150 3800 3100 3 6015 6123 6113 6100 6100 6200 lowinc2 lowinc3 lowinc4 lowinc5 lowinc6 ever 0 0 0 0 0 Yes 0 0 0 0 0 Yes 0 0 0 0 0 No Reshaping wide to long Reshaping wide to long creating only one variable--manually. In the problem data set we show what happens when we forget to include the appropriate output statements in the data step. data wide; input famid faminc96 faminc97 faminc98 ; cards; 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000 ; run; data long_manual; set wide; year=96; faminc=faminc96; output; year=97; faminc=faminc97; output; year=98; faminc=faminc98; output; run; proc print data=long_manual; var famid year faminc; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000 data problem; set wide; year=96; faminc=faminc96; *output; year=97; faminc=faminc97; *output; year=98; faminc=faminc98; output; run; proc print data=problem; var famid year faminc; run; Obs famid year faminc 1 1 98 41000 2 2 98 45800 3 3 98 77000 Reshaping wide to long creating only one variable using arrays. data long_array; set wide; array Afaminc(96:98) faminc96 - faminc98; do year = 96 to 98; faminc = Afaminc[year]; output; end; drop faminc96-faminc98; run; proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000 Reshaping wide to long creating multiple variables (including string variables) using arrays. data multi_wide; input famid faminc96 faminc97 faminc98 spend96 spend97 spend98 debt96$ debt97 $debt98$ ;
cards;
1 40000 40500 41000 38000 39000 40000 yes yes no
2 45000 45400 45800 42000 43000 44000 yes no  no
3 75000 76000 77000 70000 71000 72000 no  no  no
;
run;
data multi_long;
set multi_wide;
length debt $3; array Afaminc(96:98) faminc96-faminc98; array Aspend(96:98) spend96-spend98; array Adebt(96:98) debt96-debt98; do year = 96 to 98; faminc = Afaminc[year]; spend = Aspend[year]; debt = Adebt[year]; output; end; drop faminc96-faminc98 spend96-spend98; run; proc print data=multi_long; var famid year faminc spend debt; run; Obs famid year faminc spend debt 1 1 96 40000 38000 yes 2 1 97 40500 39000 yes 3 1 98 41000 40000 no 4 2 96 45000 42000 yes 5 2 97 45400 43000 no 6 2 98 45800 44000 no 7 3 96 75000 70000 no 8 3 97 76000 71000 no 9 3 98 77000 72000 no Reshaping wide to long in presence of character suffixes. In the above example we had numeric suffixes (96, 97 and 98). We can reshape even if we have character suffixes such as old, now and future. data character; length name_old$ 24 name_now $24 name_future$ 24;
input id name_old $name_now$ name_future $inc_old inc_now inc_future; cards; 1 Ramon Martin Martin_Sheen 23000 50000 700000 2 John Johnnie J_boy 10000 20000 600000 3 Mary_Cathleen Bo Bo_Derek 15000 40000 250000 ; run; proc print data=character; run; data character_array; set character; length name$ 24;
array Aname(3) $name_old name_now name_future; array Aincome(3) inc_old inc_now inc_future; do time = 1 to 3; name = Aname[time]; income = Aincome[time]; output; end; run; proc format; value t_format 1='old' 2='now' 3='future'; run; proc print data=character_array ; format time t_format.; var id time name income; run; Obs id time name income 1 1 old Ramon 23000 2 1 now Martin 50000 3 1 future Martin_Sheen 700000 4 2 old John 10000 5 2 now Johnnie 20000 6 2 future J_boy 600000 7 3 old Mary_Cathleen 15000 8 3 now Bo 40000 9 3 future Bo_Derek 250000 Understanding the functions first., last. and the retain statement The previous section demonstrated how to reshape data sets from wide to long. Unfortunately, reshaping data sets from long to wide is more complex. In order to better understand how to use arrays to reshape from long to wide we will need to understand how the first. and last. functions work as well as understand how the retain statement works. The following are examples of the retain statement. We would like to create a new variable called new_meas which contains the same values as measurement but with the missing values filled in. The new_meas variable should have a starting value of 0 and then change values every time measurement has a non-missing value. data missings; input id measurement; cards; 1 . 1 2 3 . 2 3 3 4 2 . 3 . 1 . 3 5 3 6 ; run; data ex_retain; set missings; retain new_meas 0; if measurement ne . then new_meas = measurement; run; proc print data=ex_retain; run; Obs id measurement new_meas 1 1 . 0 2 1 2 2 3 3 . 2 4 2 3 3 5 3 4 4 6 2 . 4 7 3 . 4 8 1 . 4 9 3 5 5 10 3 6 6 Omitting the retain statement gives us the wrong new_meas, now it is just a copy of measurement. data ex_retain; set missings; *retain new_meas 0; if measurement ne . then new_meas = measurement; run; proc print data=ex_retain; run; Obs id measurement new_meas 1 1 . . 2 1 2 2 3 3 . . 4 2 3 3 5 3 4 4 6 2 . . 7 3 . . 8 1 . . 9 3 5 5 10 3 6 6 In the next example we want to create a variable called new1 which contains the cumulative sum of the values in the variable measurement. Note that when measurement is missing the sum should remain unchanged. data ex2_retain; set missings; retain new1 0; if measurement ne . then new1 = new1 + measurement; run; proc print data=ex2_retain; run; Obs id measurement new1 1 1 . 0 2 1 2 2 3 3 . 2 4 2 3 5 5 3 4 9 6 2 . 9 7 3 . 9 8 1 . 9 9 3 5 14 10 3 6 20 Omitting the retain statement gives us the wrong new1. data ex2_retain; set missings; *retain new1 0; if measurement ne . then new1 = new1 + measurement; run; proc print data=ex2_retain; run; Obs id measurement new1 1 1 . . 2 1 2 . 3 3 . . 4 2 3 . 5 3 4 . 6 2 . . 7 3 . . 8 1 . . 9 3 5 . 10 3 6 . Looking at the first. and last. functions. In the first example we create indicator variables, first and last. The variable first indicates the first observation for each person as indicated by id; the variable last indicates the last observation for each person. Note: When using first.var_name or last.var_name we must first sort the data set on the variable var_name. Moreover, in the data step we must always precede first.var_name or last.var_name with a by var_name statement. proc sort data=missings out=sort_miss; by id; run; data ex1; set sort_miss; by id; if first.id then first=1; else first=0; if last.id then last=1; else last=0; run; proc print data=ex1; run; Obs id measurement first last 1 1 . 1 0 2 1 2 0 0 3 1 . 0 1 4 2 3 1 0 5 2 . 0 1 6 3 . 1 0 7 3 4 0 0 8 3 . 0 0 9 3 5 0 0 10 3 6 0 1 Combining the first. function with a retain statement to get a cumulative sum and count. data kids; length kidname$ 4;
input famid kidname birth_order wt;
cards;
1 Beth 1  60
1 Barb 3  20
4 Sam  1 100
4 Stu  2  90
1 Bob  2  40
3 Pete 1  60
3 Phil 3  20
2 Andy 1  80
3 Pam  2  40
2 Al   2  50
2 Ann  3  20
;
run;

We will be using first.famid so we must sort the data set on famid.

proc sort data=kids out=sort_kids;
by famid;
run;
data retain1;
set sort_kids;
retain sumwt count; /*carry over the value from previous obs to next obs*/
by famid;
if first.famid then do; /*at 1st obs of each family set sumwt and count = 0*/
sumwt=0;
count=0;
end;
sumwt = sumwt + wt;
count = count + 1;
meanwt = sumwt/count;
run;
proc print data=retain1;
var famid kidname wt sumwt count meanwt;
run;

famid    kidname     wt    sumwt    count    meanwt
1       Beth       60      60       1         60
1       Barb       20      80       2         40
1       Bob        40     120       3         40
2       Andy       80      80       1         80
2       Al         50     130       2         65
2       Ann        20     150       3         50
3       Pete       60      60       1         60
3       Phil       20      80       2         40
3       Pam        40     120       3         40
4       Sam       100     100       1        100
4       Stu        90     190       2         95

By adding an if last.famid statement to the program we output only the last observation per family which shows the final sumwt, count and meanwt for each family.
Note: We do not need to resort the data since it is already sorted on famid.

data retain2;
set retain1;
by famid;
if last.famid then output; /*output only the last obs for each family*/
run;
proc print data=retain2;
var famid sumwt meanwt;
run;

famid    sumwt    count    meanwt
1       120       3        40
2       150       3        50
3       120       3        40
4       190       2        95

Reshaping long to wide using arrays

We will use the long_array data set created from the wide data set and we will reshape it back to the original wide format.

proc print data=long_array;
run;

Obs    famid    year    faminc
1       1       96      40000
2       1       97      40500
3       1       98      41000
4       2       96      45000
5       2       97      45400
6       2       98      45800
7       3       96      75000
8       3       97      76000
9       3       98      77000

We will be using first.famid so we must sort the data set on famid.

proc sort data=long_array out=long_sort;
by famid;
run;
data wide_array;
set long_sort;
by famid;
retain faminc96-faminc98;
array Afaminc(96:98) faminc96-faminc98;
if first.famid then do;
do i = 96 to 98;
Afaminc[i] = .; /*initializing to missing*/
end;
end;
Afaminc(year) = faminc; /*looping across values in the variable year*/
*if last.famid then output; /* outputs only the last obs in a family*/
drop year faminc i;
run;
proc print data=wide_array noobs;
run;

famid    faminc96    faminc97    faminc98
1        40000           .           .
1        40000       40500           .
1        40000       40500       41000
2        45000           .           .
2        45000       45400           .
2        45000       45400       45800
3        75000           .           .
3        75000       76000           .
3        75000       76000       77000

data wide_array;
set long_sort;
by famid;
retain faminc96-faminc98;
array Afaminc(96:98) faminc96-faminc98;
if first.famid then do;
do i = 96 to 98;
Afaminc[i] = .;
end;
end;
Afaminc(year) = faminc; /*looping across values in the variable year*/
if last.famid then output; /* outputs only the last obs in a family*/
drop year faminc i;
run;
proc print data=wide_array noobs;
run;

famid    faminc96    faminc97    faminc98
1        40000       40500       41000
2        45000       45400       45800
3        75000       76000       77000

Comparisons across observations using arrays

A more subtle usage of arrays. One issue in SAS data management is that we cannot do comparisons across observations. One solution to this problem is to transpose the data from long to wide; then we can use the array to do the comparisons very easily.
The goal is to compare each observation with the previous and the next observation. If they are the same then flag the observation.

data real_life;
input person topicA;
cards;
1   0
1   1
3  -1
1   0
2   0
1   1
2  -1
2  -1
3   0
3   1
4   0
1   1
4   1
4   0
2  -1
4   0
4   0
1  -1
;
run;

We need to number the observations within each person. We will be using first.person in the process of doing this, so we must first sort the data on person. Then we will create the count variable which will enumerates the observations within each person.

proc sort data=real_life out=sort_real;
by person;
run;
data count_real;
set sort_real;
retain count;
by person;
if first.person then count = 0;
count = count + 1;
run;
proc print data=count_real noobs;
run;
topic
person      A      count
1         0       1
1         1       2
1         0       3
1         1       4
1         1       5
1        -1       6
2         0       1
2        -1       2
2        -1       3
2        -1       4
3        -1       1
3         0       2
3         1       3
4         0       1
4         1       2
4         0       3
4         0       4
4         0       5

We now convert the data set from long to wide.
Note: We are using first.person and last.person but we do not need to resort the data since it is already sorted on person.

data wide_real;
set count_real;
array AtopicA(6) topicA_1-topicA_6;
retain topicA_1-topicA_6;
by person;
if first.person then do;
do i = 1 to 6;
AtopicA[i] = .;
end;
end;
AtopicA(count) = topicA; /*looping across values in the variable count*/
if last.person then output; /* outputs only the last obs per person */
run;
proc print data=wide_real noobs;
var person topicA_1-topicA_6;
run;
topic    topic    topic    topic    topic    topic    flag
person     A_1      A_2      A_3      A_4      A_5      A_6       A
1         0        1        0        1       1        -1       0
2         0       -1       -1       -1       .         .       1
3        -1        0        1        .       .         .       0
4         0        1        0        0       0         .       1

Now, let's find the people who have the same value for 3 observations in a row.

data three;
set wide_real;
array topic(6) topicA_1-topicA_6;
do i = 2 to 5;
if topic[i-1] ne . & topic[i] ne . & topic[i+1] ne . &
topic[i]=topic[i-1] & topic[i]=topic[i+1] then flagA=1;
end;
if flagA=. then flagA=0;
run;
proc print data=three noobs;
var person topicA_1-topicA_6 flagA;
run;

topic    topic    topic    topic    topic    topic    flag
person     A_1      A_2      A_3      A_4      A_5      A_6       A

1         0        1        0        1       1        -1       0
2         0       -1       -1       -1       .         .       1
3        -1        0        1        .       .         .       0
4         0        1        0        0       0         .       1

This seminar is based on examples from the following SAS Learning Modules shown below.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.