### Frequently Asked Questions about R: How can I subset a data set?


Subsetting is a very important component of data management and there are several ways that one can subset data in R. This page aims to give a fairly exhaustive list of the ways in which it is possible to subset a data set in R.

First we create the data frame used in all the examples. We call this data frame x.df. It has 6 rows and is composed of 5 variables (V1 - V5), whose values are drawn from a normal distribution with a mean of 1 and a standard deviation of 1 (the call rnorm(30, 1) sets the mean to 1), and one variable (y) containing the integers from 1 to 5, with the value 1 appearing twice.

x <- matrix(rnorm(30, 1), ncol = 5)
y <- c(1, seq(5))

#combining x and y into one matrix
x <- cbind(x, y)

#converting x into a data frame called x.df
x.df <- data.frame(x)
x.df
V1         V2          V3        V4          V5 y
1 -1.6862356  1.3950211  1.35898920 1.8492410  1.75368860 1
2  0.8610318 -0.5698281 -0.01984841 0.3570547 -0.93262483 1
3 -1.3736436  0.1280908  0.17866428 1.6930332  0.42080132 2
4  0.7557265  1.8622043 -0.29684582 1.0555782  0.09372863 3
5  0.6296957  1.7943359  2.16226397 0.1604166  0.37218504 4
6  0.4694073  1.3096533  1.90324318 1.9372227  1.43930020 5
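
Because rnorm draws random numbers, rerunning the code above produces values that differ from those printed here. A minimal sketch of one way to make the data frame reproducible, assuming an arbitrary seed of 123 (the seed and the name x.df2 are not part of the original example):

```r
# Fix the random number generator state so the draws can be repeated.
# The seed value 123 is an arbitrary assumption, not from the original page.
set.seed(123)
x <- matrix(rnorm(30, 1), ncol = 5)
y <- c(1, seq(5))
x.df <- data.frame(cbind(x, y))

# Rebuilding with the same seed yields the same data frame.
set.seed(123)
x.df2 <- data.frame(cbind(matrix(rnorm(30, 1), ncol = 5), y))

identical(x.df, x.df2)
dim(x.df)
```

With a fixed seed the values will still not match the output printed on this page, since the original seed is unknown, but they will be the same on every run.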

In order to verify which names are used for the variables in the data frame we use the names function.

names(x.df)
[1] "V1" "V2" "V3" "V4" "V5" "y"

Subsetting rows using the subset function
The subset function with a logical statement lets you subset the data frame by observations (rows). In the following example the x.sub data frame contains only the observations for which the value of the variable y is greater than 2.

x.sub <- subset(x.df, y > 2)
x.sub
V1       V2         V3        V4         V5 y
4 0.7557265 1.862204 -0.2968458 1.0555782 0.09372863 3
5 0.6296957 1.794336  2.1622640 0.1604166 0.37218504 4
6 0.4694073 1.309653  1.9032432 1.9372227 1.43930020 5

Subsetting rows using multiple conditional statements
There is no limit to how many logical statements may be combined to achieve the desired subsetting. The data frame x.sub1 contains only the observations for which the value of the variable y is greater than 2 and the value of the variable V1 is greater than 0.6.

x.sub1 <- subset(x.df, y > 2 & V1 > 0.6)
x.sub1
V1       V2         V3        V4         V5 y
4 0.7557265 1.862204 -0.2968458 1.0555782 0.09372863 3
5 0.6296957 1.794336  2.1622640 0.1604166 0.37218504 4
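
Conditions can also be combined with | (or) and negated with !. A small sketch, assuming a hypothetical data frame d whose values are rounded copies of V1 and y from the output above:

```r
# Hypothetical data frame: rounded copies of V1 and y from the output above.
d <- data.frame(V1 = c(-1.69, 0.86, -1.37, 0.76, 0.63, 0.47),
                y  = c(1, 1, 2, 3, 4, 5))

# Keep rows where y is greater than 3 OR V1 is negative.
d.or <- subset(d, y > 3 | V1 < 0)

# Keep rows where y is NOT greater than 2.
d.not <- subset(d, !(y > 2))

rownames(d.or)   # rows 1, 3, 5 and 6
rownames(d.not)  # rows 1, 2 and 3
```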

Subsetting both rows and columns
It is possible to subset both rows and columns using the subset function. The select argument lets you subset variables (columns). The data frame x.sub2 contains only the variables V1 and V4, and only the observations for which the value of variable y is greater than 2 and the value of variable V2 is greater than 0.4.

x.sub2 <- subset(x.df, y > 2 & V2 > 0.4, select = c(V1, V4))
x.sub2
V1        V4
4 0.7557265 1.0555782
5 0.6296957 0.1604166
6 0.4694073 1.9372227

The data frame x.sub3 contains only the variables V2 through V5, and only the observations for which the value of variable y is greater than 3.

x.sub3 <- subset(x.df, y > 3, select = V2:V5)
x.sub3
V2       V3        V4       V5
5 1.794336 2.162264 0.1604166 0.372185
6 1.309653 1.903243 1.9372227 1.439300
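
The select argument also accepts negative selections, which drop the named columns instead of keeping them. A brief sketch, assuming a hypothetical data frame d built from rounded copies of some of the values above:

```r
# Hypothetical data frame: rounded copies of V1, V2, V3 and y from the output above.
d <- data.frame(V1 = c(-1.69, 0.86, -1.37, 0.76, 0.63, 0.47),
                V2 = c(1.40, -0.57, 0.13, 1.86, 1.79, 1.31),
                V3 = c(1.36, -0.02, 0.18, -0.30, 2.16, 1.90),
                y  = c(1, 1, 2, 3, 4, 5))

# Keep rows where y > 3 and drop the y column itself.
d.drop <- subset(d, y > 3, select = -y)

# Keep all rows and drop V1 and V2.
d.drop2 <- subset(d, select = -c(V1, V2))

names(d.drop)   # "V1" "V2" "V3"
names(d.drop2)  # "V3" "y"
```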

Subsetting rows using indices
Another method for subsetting data sets is by using the bracket notation which designates the indices of the data set. The first index is for the rows and the second for the columns. The x.sub4 data frame contains only the observations for which the values of variable y are equal to 1. Note that leaving the index for the columns blank indicates that we want x.sub4 to contain all the variables (columns) of the original data frame.

x.sub4 <- x.df[x.df$y == 1, ]
x.sub4
V1        V2          V3        V4         V5 y
1 -1.6862356  1.395021  1.35898920 1.8492410  1.7536886 1
2  0.8610318 -0.569828 -0.01984841 0.3570547 -0.9326248 1

Subsetting rows selecting on more than one value
We use the %in% notation when we want to subset on multiple values of y. The x.sub5 data frame contains only the observations for which the values of variable y are equal to either 1 or 4.

x.sub5 <- x.df[x.df$y %in% c(1, 4), ]
x.sub5
V1        V2          V3        V4         V5 y
1 -1.6862356  1.395021  1.35898920 1.8492410  1.7536886 1
2  0.8610318 -0.569828 -0.01984841 0.3570547 -0.9326248 1
5  0.6296957  1.794336  2.16226397 0.1604166  0.3721850 4
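
It may help to see why %in% is used here instead of ==. With a vector on the right-hand side, == compares element by element and recycles the shorter vector, while %in% tests each value of y against the whole set. A short sketch, assuming a hypothetical data frame d with rounded copies of V1 and y:

```r
# Hypothetical data frame: rounded copies of V1 and y from the output above.
d <- data.frame(V1 = c(-1.69, 0.86, -1.37, 0.76, 0.63, 0.47),
                y  = c(1, 1, 2, 3, 4, 5))

# %in% asks, for each y, "is this value one of 1 or 4?"
in.set <- d$y %in% c(1, 4)   # TRUE TRUE FALSE FALSE TRUE FALSE

# == recycles c(1, 4): y[1] is compared to 1, y[2] to 4, y[3] to 1, and so on.
recycled <- d$y == c(1, 4)   # TRUE FALSE FALSE FALSE FALSE FALSE

# Negating %in% keeps the rows whose y is neither 1 nor 4.
d.not <- d[!(d$y %in% c(1, 4)), ]
rownames(d.not)              # rows 3, 4 and 6
```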

Subsetting columns using indices
We can also use the indices to subset the variables (columns) of the data set. The x.sub6 data frame contains only the first two variables of the x.df data frame. Note that leaving the index for the rows blank indicates that we want x.sub6 to contain all the rows of the original data frame.

x.sub6 <- x.df[, 1:2]
x.sub6
V1         V2
1 -1.6862356  1.3950211
2  0.8610318 -0.5698281
3 -1.3736436  0.1280908
4  0.7557265  1.8622043
5  0.6296957  1.7943359
6  0.4694073  1.3096533

The x.sub7 data frame contains all the rows but only the 1st, 3rd and 5th variables (columns) of the x.df data set.

x.sub7 <- x.df[, c(1, 3, 5)]
x.sub7
V1          V3          V5
1 -1.6862356  1.35898920  1.75368860
2  0.8610318 -0.01984841 -0.93262483
3 -1.3736436  0.17866428  0.42080132
4  0.7557265 -0.29684582  0.09372863
5  0.6296957  2.16226397  0.37218504
6  0.4694073  1.90324318  1.43930020
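
Columns can also be selected by name instead of position, which is less fragile if the column order changes. One detail worth knowing: selecting a single column with brackets returns a plain vector unless drop = FALSE is given. A small sketch, assuming a hypothetical data frame d with rounded copies of some of the values above:

```r
# Hypothetical data frame: rounded copies of V1, V3 and y from the output above.
d <- data.frame(V1 = c(-1.69, 0.86, -1.37, 0.76, 0.63, 0.47),
                V3 = c(1.36, -0.02, 0.18, -0.30, 2.16, 1.90),
                y  = c(1, 1, 2, 3, 4, 5))

# Select columns by name.
d.names <- d[, c("V1", "y")]

# A single column comes back as a numeric vector by default ...
one.vec <- d[, "V1"]

# ... and as a one-column data frame when drop = FALSE is supplied.
one.df <- d[, "V1", drop = FALSE]

is.data.frame(one.vec)  # FALSE
is.data.frame(one.df)   # TRUE
```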

Subsetting both rows and columns using indices
The x.sub8 data frame contains the 3rd-6th variables of x.df and only observations number 1 and 3.

x.sub8 <- x.df[c(1, 3), 3:6]
x.sub8
V3       V4        V5 y
1 1.3589892 1.849241 1.7536886 1
3 0.1786643 1.693033 0.4208013 2
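
Bracket notation and the subset function can express the same selection. The sketch below repeats the x.sub2 selection both ways on a hypothetical data frame d (rounded copies of values above), and then shows one difference that appears when the condition involves missing values: subset treats a missing condition as FALSE, while a logical index containing NA produces a row of NAs. The missing value is introduced here purely for illustration.

```r
# Hypothetical data frame: rounded copies of V1, V2, V4 and y from the output above.
d <- data.frame(V1 = c(-1.69, 0.86, -1.37, 0.76, 0.63, 0.47),
                V2 = c(1.40, -0.57, 0.13, 1.86, 1.79, 1.31),
                V4 = c(1.85, 0.36, 1.69, 1.06, 0.16, 1.94),
                y  = c(1, 1, 2, 3, 4, 5))

# The same rows and columns, selected two ways.
by.subset <- subset(d, y > 2 & V2 > 0.4, select = c(V1, V4))
by.index  <- d[d$y > 2 & d$V2 > 0.4, c("V1", "V4")]
all.equal(by.subset, by.index)

# Introduce a missing value in y (for illustration only).
d.na <- d
d.na$y[5] <- NA

n.subset <- nrow(subset(d.na, y > 2))         # 2: the NA row is dropped
n.index  <- nrow(d.na[d.na$y > 2, ])          # 3: includes a row of NAs
n.which  <- nrow(d.na[which(d.na$y > 2), ])   # 2: which() discards NA
```

Wrapping the condition in which() is one common way to make bracket indexing ignore missing values.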
