
Given a knot location, how can I fit a piecewise model using Proc Transreg?

Imagine you wish to predict temperature using time of day. You believe
that the temperature increases until a certain time **t** and decreases after
that point, and you wish to estimate two functions that meet at **t**. Given
this location at which you believe the relationship between your outcome and predictor
changes, you can fit two functions that meet at
that given point in SAS using **proc transreg**.

For this example, we will use a fictional dataset where our outcome variable
is **y**, our predictor variable is **x**, and the relationship
between **x** and **y** is clearly not a single continuous function. We have
borrowed this example data from SAS examples.

```sas
data a;
  x = -0.000001;
  do i = 0 to 199;
    if mod(i,50) = 0 then do;
      c = ((x/2) - 5)**2;
      if i = 150 then c = c + 5;
      y = c;
    end;
    x = x + 0.1;
    y = y - sin(x - c);
    output;
  end;
run;

proc print data = a (obs = 5);
run;
```

```
Obs       X       I       C          Y

  1    0.10000    0    25.0000    24.7694
  2    0.20000    1    25.0000    24.4427
  3    0.30000    2    25.0000    24.0234
  4    0.40000    3    25.0000    23.5155
  5    0.50000    4    25.0000    22.9241
```

```sas
proc gplot data = a;
  plot y*x;
run;
```
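For readers who want to follow along outside SAS, here is a rough Python translation of the data step above (not part of the original example; variable names mirror the SAS code):

```python
import math

# Replicate the SAS data step: the baseline c is reset every 50
# observations, and y accumulates a sinusoidal perturbation around it.
rows = []
x = -0.000001
c = y = 0.0
for i in range(200):
    if i % 50 == 0:
        c = (x / 2 - 5) ** 2      # reset the baseline
        if i == 150:
            c += 5                # extra jump at i = 150
        y = c
    x += 0.1
    y -= math.sin(x - c)
    rows.append((x, i, c, y))

print(rows[0])  # first observation; y matches Obs 1 above (24.7694)
```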

We may have some reason to believe that there is a "knot" located at **x** = 10
and that the relationship between **x** and **y** is linear. By
saying there is a knot here, we're suggesting that the relationship between **x**
to **y** for **x** < 10 is different from the relationship for **x** >
10. Rather than fit one line or curve where we believe there are actually
two, we can indicate the location of the knot and fit a piecewise model.
We will do this using proc transreg. We will indicate the location of the
knot with the **knots** option and specify that we wish to fit one-degree
polynomials (lines) with the **degree** option. We will output a
dataset, **a2**, that includes the coefficients for these lines.

```sas
proc transreg data = a;
  model identity(y) = pspline(x / knots = 10 degree = 1);
  output out = a2 coefficients;
run;
```

```
        TRANSREG Univariate Algorithm Iteration History for Identity(Y)

Iteration      Average      Maximum                  Criterion
   Number       Change       Change     R-Square       Change    Note
-------------------------------------------------------------------------
        1      0.00000      0.00000      0.48219                 Converged
```

By fitting this model, we are effectively estimating two connected line
segments: an intercept, a slope, and a change in slope at the knot. The
output from **proc transreg** does not display these
estimates as we would see them after fitting a regression with **proc reg**,
but the estimates have been added to the output dataset **a2**. The output
dataset also includes the transformed outcome variable, one power of the
predictor **x** for each degree of the indicated polynomial, and one modified
version of **x** for each knot. We
can look at a few records in our output dataset to see this.

```sas
proc print data = a2 (obs = 5);
  where x > 9.8;
run;
```

```
Obs    _TYPE_    _NAME_       Y          TY      Intercept     X_1       X_2     TIntercept    TX_1      TX_2        X

 99    SCORE     ROW99     -5.85961   -5.85961       1        9.9000   0.00000       1        9.9000   0.00000    9.9000
100    SCORE     ROW100    -5.28805   -5.28805       1       10.0000   0.00000       1       10.0000   0.00000   10.0000
101    SCORE     ROW101     0.62507    0.62507       1       10.1000   0.10000       1       10.1000   0.10000   10.1000
102    SCORE     ROW102     1.32494    1.32494       1       10.2000   0.20000       1       10.2000   0.20000   10.2000
103    SCORE     ROW103     2.09263    2.09263       1       10.3000   0.30000       1       10.3000   0.30000   10.3000
```

In this example, we used the "identity" transformation of **y**, so the
transformed variable **ty** is equal to **y**. We indicated
one-degree polynomials, so the variables **intercept** = **x**^0 and **x_1**
= **x**^1 = **x** were created. No higher powers of **x** were generated. We
indicated one knot placed at **x** = 10, so a variable **x_2** was
generated. **x_2** is zero when **x** < 10 and is **x** - 10 when
**x** > 10. For example, we can see that when **x** = 10.1,
**x_2** = 0.1. Transformations of **intercept**, **x_1**, and **x_2**
also appear in the dataset. In this example, the transformed values are
equal to the untransformed values.
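The basis columns that **proc transreg** builds here can be reproduced by hand. A minimal Python sketch (my own illustration, with names chosen to mirror the output dataset) of the degree-1, one-knot basis:

```python
def pspline_basis_deg1(x, knot):
    """Degree-1 piecewise polynomial basis with a single knot."""
    intercept = 1.0               # x**0
    x_1 = x                       # x**1
    x_2 = max(x - knot, 0.0)      # 0 below the knot, x - knot above it
    return intercept, x_1, x_2

print(pspline_basis_deg1(10.1, 10))  # x_2 = 0.1, as in row 101 above
```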

The last row of the generated dataset contains the coefficients from the
model. In this row, the value for **_TYPE_** is "M COEFFI". We can
print this line to see the coefficients.

```sas
proc print data = a2;
  where _TYPE_ eq "M COEFFI";
run;
```

```
Obs    _TYPE_      _NAME_    Y    TY    Intercept    X_1    X_2    TIntercept      TX_1       TX_2     X

201    M COEFFI    TY        .    .         .         .      .        16.9614    -1.47976    3.94202   .
```

Here we see the intercept (16.9614), the slope of the first line (-1.47976), and the
change in slope at the knot (3.94202). Because **x_2** enters the model as
max(**x** - 10, 0), its coefficient is the difference between the two slopes,
so the slope of the second line is -1.47976 + 3.94202 = 2.46226. We will see
the same estimates if we use **proc reg**:

```sas
proc reg data = a2;
  model y = x_1 x_2;
  output out = a3 predicted = pred;
run;
```

```
                             Analysis of Variance

                                    Sum of           Mean
Source                  DF         Squares         Square    F Value    Pr > F

Model                    2      8181.68207     4090.84103      91.72    <.0001
Error                  197      8786.23476       44.60018
Corrected Total        199           16968

Root MSE              6.67834    R-Square    0.4822
Dependent Mean       12.04335    Adj R-Sq    0.4769
Coeff Var            55.45246

                             Parameter Estimates

                               Parameter     Standard
Variable     Label       DF     Estimate        Error    t Value    Pr > |t|

Intercept    Intercept    1     16.96137      1.26019      13.46      <.0001
X_1          X 1          1     -1.47976      0.18399      -8.04      <.0001
X_2          X 2          1      3.94202      0.32717      12.05      <.0001
```
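The same estimates can be recovered outside SAS by ordinary least squares on the basis columns (intercept, **x**, max(**x** - 10, 0)). A numpy sketch, not part of the original example, that regenerates the data and solves with `lstsq`:

```python
import math
import numpy as np

# Regenerate the example data (translation of the SAS data step).
xs, ys = [], []
x = -0.000001
c = y = 0.0
for i in range(200):
    if i % 50 == 0:
        c = (x / 2 - 5) ** 2
        if i == 150:
            c += 5
        y = c
    x += 0.1
    y -= math.sin(x - c)
    xs.append(x)
    ys.append(y)

xa = np.array(xs)
# Design matrix: intercept, x, and the knotted term max(x - 10, 0).
X = np.column_stack([np.ones_like(xa), xa, np.maximum(xa - 10, 0.0)])
coef, *_ = np.linalg.lstsq(X, np.array(ys), rcond=None)
print(coef)  # approximately [16.9614, -1.47976, 3.94202]
```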

We can also see that the R-square value in the above regression is the same as
the one that appeared in the **proc transreg** output. We can plot the predicted values
and see that they consist of two lines that meet at **x** = 10.

```sas
legend label=none value=('y' 'predicted y')
       position=(bottom left inside) mode=share down=2;

proc gplot data = a3;
  plot (y pred)*x / overlay legend = legend;
run;
quit;
```

If we have reason to believe that the relationship between **x** and **y**
is quadratic on each side of the knot, we can indicate this in **proc transreg** and fit two
second-degree polynomials to the data. We will again run **proc transreg**, this time
with **degree** = 2, and output a
dataset, **aquad**.

```sas
proc transreg data = a;
  model identity(y) = pspline(x / knots = 10 degree = 2);
  output out = aquad coefficients;
run;
```

```
        TRANSREG Univariate Algorithm Iteration History for Identity(Y)

Iteration      Average      Maximum                  Criterion
   Number       Change       Change     R-Square       Change    Note
-------------------------------------------------------------------------
        1      0.00000      0.00000      0.46274                 Converged
```

In this model, **proc transreg** is finding the combination of two
quadratic functions that meet at **x** = 10. The dataset **aquad**
has several added transformed variables: **x_2** is the square of **x** and **x_3**
is the square of (**x**-10) for **x** > 10 and 0 for **x** < 10. We
can look at a few records in our output dataset to see this.

```sas
proc print data = aquad (obs = 5);
  where x > 9.8;
run;
```

```
Obs    _TYPE_    _NAME_       Y          TY      Intercept     X_1        X_2        X_3      TIntercept    TX_1       TX_2       TX_3        X

 99    SCORE     ROW99     -5.85961   -5.85961       1        9.9000     98.010   0.000000       1        9.9000     98.010   0.000000    9.9000
100    SCORE     ROW100    -5.28805   -5.28805       1       10.0000    100.000   0.000000       1       10.0000    100.000   0.000000   10.0000
101    SCORE     ROW101     0.62507    0.62507       1       10.1000    102.010   0.010000       1       10.1000    102.010   0.010000   10.1000
102    SCORE     ROW102     1.32494    1.32494       1       10.2000    104.040   0.040000       1       10.2000    104.040   0.040000   10.2000
103    SCORE     ROW103     2.09263    2.09263       1       10.3000    106.090   0.089999       1       10.3000    106.090   0.089999   10.3000
```
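As before, these basis columns can be reproduced by hand; a small Python sketch (my own illustration) of the degree-2, one-knot basis:

```python
def pspline_basis_deg2(x, knot):
    """Degree-2 piecewise polynomial basis with a single knot."""
    intercept = 1.0
    x_1 = x
    x_2 = x ** 2                      # square of x
    x_3 = max(x - knot, 0.0) ** 2     # squared excess over the knot
    return intercept, x_1, x_2, x_3

print(pspline_basis_deg2(10.3, 10))  # x_2 = 106.09, x_3 = 0.09, as in row 103 above
```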

Again, the last row of the generated dataset contains the coefficients from the
model. In this row, the value for **_TYPE_** is "M COEFFI". We can
print this line to see the coefficients.

```sas
proc print data = aquad;
  where _TYPE_ eq "M COEFFI";
run;
```

```
Obs    _TYPE_      _NAME_    Y    TY    Intercept    X_1    X_2    X_3    TIntercept      TX_1       TX_2        TX_3     X

201    M COEFFI    TY        .    .         .         .      .      .        23.5360    -5.39639    0.36708    -0.38836   .
```

We will see the same estimates if we
use **proc reg**:

```sas
proc reg data = aquad;
  model y = x_1 x_2 x_3;
  output out = aquad_reg predicted = pred;
run;
```

```
                             Analysis of Variance

                                    Sum of           Mean
Source                  DF         Squares         Square    F Value    Pr > F

Model                    3      7851.67021     2617.22340      56.27    <.0001
Error                  196      9116.24661       46.51146
Corrected Total        199           16968

Root MSE              6.81993    R-Square    0.4627
Dependent Mean       12.04335    Adj R-Sq    0.4545
Coeff Var            56.62817

                             Parameter Estimates

                               Parameter     Standard
Variable     Label       DF     Estimate        Error    t Value    Pr > |t|

Intercept    Intercept    1     23.53600      1.83580      12.82      <.0001
X_1          X 1          1     -5.39639      0.64211      -8.40      <.0001
X_2          X 2          1      0.36708      0.04645       7.90      <.0001
X_3          X 3          1     -0.38836      0.08628      -4.50      <.0001
```
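The quadratic fit can likewise be checked by least squares on the columns intercept, **x**, **x**^2, and max(**x** - 10, 0)^2. A numpy sketch under the same assumptions as the data step translation:

```python
import math
import numpy as np

# Regenerate the example data (translation of the SAS data step).
xs, ys = [], []
x = -0.000001
c = y = 0.0
for i in range(200):
    if i % 50 == 0:
        c = (x / 2 - 5) ** 2
        if i == 150:
            c += 5
        y = c
    x += 0.1
    y -= math.sin(x - c)
    xs.append(x)
    ys.append(y)

xa = np.array(xs)
# Design matrix: intercept, x, x**2, and the squared knotted term.
X = np.column_stack([np.ones_like(xa), xa, xa ** 2,
                     np.maximum(xa - 10, 0.0) ** 2])
coef, *_ = np.linalg.lstsq(X, np.array(ys), rcond=None)
print(coef)  # approximately [23.5360, -5.39639, 0.36708, -0.38836]
```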

We can also see that the R-square value in the above regression is the same as
the one that appeared in the **proc transreg** output. We can plot the predicted values
and see that they consist of two curves that meet at **x** = 10.

```sas
legend label=none value=('y' 'predicted y')
       position=(bottom left inside) mode=share down=2;

proc gplot data = aquad_reg;
  plot (y pred)*x / overlay legend = legend;
run;
quit;
```

We have assumed up to this point that there is a known value at which the relationship between y and x changes. If you know that such a point exists, but do not know its location, see SAS FAQ: How can I find where to split a piecewise regression?.
