7 Manipulating Data
Once you have data loaded, it's time to begin what is variously called "wrangling", "munging", "cleaning", or "pre-processing". These terms all refer to the activities that are almost always necessary between the point at which you acquire your "raw" data and the point at which you are ready to begin your statistical analyses. For example, you might wish to add together several different variables to get a new variable that is the sum. Or you might need to normalize a variable's measured values so that it has a mean of zero and a standard deviation of one.
7.1 What is dplyr
In the tidyverse, the package responsible for such activities is called dplyr. This package contains a set of functions, each of which manipulates data in a particular way. Each of these functions is very useful on its own. However, the real power of dplyr comes from the fact that you can repeatedly apply dplyr functions, each operating on the result of the previous. These "chains" of functions allow you to compose complex data manipulation operations that ultimately transform your data into whatever you need. For this reason, dplyr is sometimes said to instantiate a "grammar" of data manipulation, a grammar made up of several "verbs" (functions). These functions include:
- filter(): select a subset of rows from a data frame
- arrange(): sort a data frame's rows
- mutate(): create new columns/variables based on existing columns/variables
- summarize(): aggregate one or more columns/variables with a summary statistic (e.g., the mean)
- group_by(): assign rows to groups, such that each group shares some values in common
Because dplyr is part of the tidyverse, these all work similarly. Specifically, each dplyr function:
- Takes a dataframe as its first argument
- Takes additional arguments that often indicate which columns are to be operated on
- Returns a modified dataframe
Let's see some of these characteristics in action. To do so, we'll first load the data we downloaded back in Section 4.2. Run the following:
nba <- read_csv("./data/nba_all_seasons.csv", na = c("Undrafted"))
Note that we have utilized the na argument here. This tells read_csv that we would like any values found in the file that match "Undrafted" to be treated as "missing". R uses NA to represent missing values, and so any "Undrafted" values will be converted into NA.
7.2 filter
The filter() function allows you to specify criteria about the values of a variable in your data set and then keeps only the rows that match those criteria.
The team_abbreviation
for the New York Knicks is "NYK"
. Run the following:
filter(nba, team_abbreviation == "NYK")
# A tibble: 410 × 22
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 81 Herb Willi… NYK 39 211. 118. Ohio S…
2 146 Allan Hous… NYK 26 198. 90.7 Tennes…
3 195 Buck Willi… NYK 37 203. 102. Maryla…
4 204 Charles Oa… NYK 33 206. 111. Virgin…
5 209 Chris Chil… NYK 29 190. 88.5 Boise …
6 212 Chris Jent NYK 27 201. 99.8 Ohio S…
7 218 Charlie Wa… NYK 26 188. 86.2 Florid…
8 230 Scott Broo… NYK 31 180. 74.8 Califo…
9 293 Walter McC… NYK 23 208. 104. Kentuc…
10 347 Larry John… NYK 28 201. 119. Nevada…
# ℹ 400 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>
If you prefer a more thorough inspection, you can look at the results in RStudio's spreadsheet viewer to ensure that only players on the Knicks are chosen:
View(filter(nba, team_abbreviation == "NYK"))
In either case, we are asking for a test of equality, keeping any rows where team_abbreviation=="NYK"
is true and removing (filtering) any rows where team_abbreviation=="NYK"
is false. To do so, we use the double equal sign ==
(not the single equal sign =
). Trying filter(nba, team_abbreviation = "NYK")
will yield an error. This is a convention across many programming languages. If you are new to coding, you’ll probably forget to use the double equal sign ==
a few times before you get the hang of it.
The equality operator is not the only operator available to us. Others include:
- > corresponds to "greater than"
- < corresponds to "less than"
- >= corresponds to "greater than or equal to"
- <= corresponds to "less than or equal to"
- != corresponds to "not equal to". The ! is used in many programming languages to indicate "not".
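These operators work on numeric columns too. For example, a quick sketch that keeps only the rows where a player averaged at least 20 points per game:

```r
filter(nba, pts >= 20)
```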
Furthermore, you can combine multiple criteria using logical operators:
- | corresponds to "or"
- & corresponds to "and"
Let's look at an example that uses the ! "not" operator to pick rows that don't match a criterion. As mentioned earlier, the ! can be read as "not". Here we are filtering for rows corresponding to players that are not from the United States:
filter(nba, country != "USA")
Let’s combine two different requirements using the |
:
filter(nba, country == "Jamaica" | college == "Michigan")
This will select players that are either from Jamaica or went to Michigan (or both). We can also request the inverse of this.
filter(nba,!(country == "Jamaica" | college == "Michigan"))
This will select players that are not from Jamaica and did not go to Michigan. Note the use of parentheses around (country == "Jamaica" | college == "Michigan"). If we used the ! without the parentheses, we would only be applying the "not" to the first test (country == "Jamaica"), not to the combination (country == "Jamaica" | college == "Michigan"). You can try it and compare the results:
filter(nba,!country == "Jamaica" | college == "Michigan")
This request, in contrast to the one above, is for players that are not from Jamaica or went to Michigan (or both). So be very careful about the order of operations and use parentheses liberally. It helps to minimize errors and makes your code more explicit and therefore more readable.
Let’s see a slightly more complicated request that combines several different operators:
filter(nba,
       team_abbreviation == "NYK" &
       (country != "USA" | college == "Michigan") & age >= 25)
Here we have filtered the data to retain all rows corresponding to players from the NY Knicks, who are age 25 or older, and either went to Michigan or are not from the United States. You can try this yourself and verify that the output matches these requirements (remember that you can use View() if you wish to inspect the entire result).
Let’s request players that went to either Michigan, Duke, or Georgetown.
filter(nba,
       college == "Michigan" | college == "Duke" | college == "Georgetown")
This works, but as we progressively include more colleges, it will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Section 2.5 that the c() function "combines" or "concatenates" values into a single vector of values.
filter(nba,
       college %in% c("Michigan", "Duke", "Georgetown", "UCLA", "Kentucky"))
What this code is doing is filtering for all rows where college is in the vector of colleges c("Michigan", "Duke", "Georgetown", "UCLA", "Kentucky"). This approach produces the same results as a sequence of | operators, but takes much less energy to write (and read). The %in% operator is useful for checking whether the values in one vector/variable match any of the values in another.
As a final note, we recommend that filter() should often be among the first tidyverse "verbs" you consider applying to your data. It narrows the scope of your data to just the observations you care about.
7.3 pipe
Before we go any further, let’s first introduce a nifty tool that gets loaded with the dplyr
package: the pipe operator %>%
. The pipe operator allows us to combine multiple tidyverse operations into a single sequential chain of actions.
Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x
using hypothetical functions f()
, g()
, and h()
:
- Take x, then
- Use x as an input to a function f(), then
- Use the output of f(x) as an input to a function g(), then
- Use the output of g(f(x)) as an input to a function h()
One way to achieve this sequence of operations is by using nesting parentheses as follows:
h(g(f(x)))
This code isn’t so hard to read since we are applying only three functions: f()
, then g()
, then h()
and each of the functions has a short name. Further, each of these functions also only has one argument. However, you can imagine that this will get progressively more difficult to read as the number of functions applied in your sequence increases and the arguments in each function grow more numerous. This is where the pipe operator %>%
comes in handy. The pipe takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read %>%
as “then” or “and then”. For example, you can obtain the same output as the hypothetical sequence of functions as follows:
x %>%
  f() %>%
  g() %>%
  h()
You would read this sequence as:
- Take x, then
- Use this output as the input to the next function f(), then
- Use this output as the input to the next function g(), then
- Use this output as the input to the next function h()
Though both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical x
, f()
, g()
, and h()
? Throughout this chapter on data wrangling:
- The starting value, x, will be a data frame. For example, the nba data frame we have been exploring so far.
- The sequence of functions, here f(), g(), and h(), will mostly be a sequence of any number of the data wrangling verb-named functions we listed above. For example, the filter(college == "Michigan") function and argument we previewed earlier.
- The result will be the transformed/modified data frame that you want.
So instead of
filter(nba, team_abbreviation == "NYK")
we can write:
nba %>%
  filter(team_abbreviation == "NYK")
The benefits of this may not be immediately obvious. But the ability to form a chain of data wrangling operations by combining tidyverse functions (verbs) into a single sequence will be utilized extensively and is made possible by the pipe operator %>%
.
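As a preview of such chains, pipes can be stacked: each verb receives the data frame produced by the previous one. A small sketch applying two filters in sequence, which is equivalent to combining both criteria with &:

```r
nba %>%
  filter(team_abbreviation == "NYK") %>%
  filter(age >= 30)
```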
7.4 summarize
The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (i.e., the "average"), the median, and the mode. Other examples that might not immediately come to mind include the sum, the smallest value (also called the minimum), the largest value (also called the maximum), and the standard deviation.
Let’s calculate two summary statistics of the draft_round
variable in the nba
data frame: the mean and standard deviation. To compute these summary statistics, we need the mean()
and sd()
summary functions in R. Summary functions in R take in many values and return a single value. More precisely, we’ll use the mean()
and sd()
summary functions within the summarize()
function from the dplyr
package. The summarize()
function takes in a data frame and returns a data frame with only one row corresponding to the value of the summary statistic(s).
We’ll save the results in a new data frame called summary_round
that will have two columns/variables: the mean
and the std_dev
:
summary_round <- nba %>%
  summarize(mean = mean(draft_round),
            std_dev = sd(draft_round))
summary_round
# A tibble: 1 × 2
mean std_dev
<dbl> <dbl>
1 NA NA
Why are the values returned NA
? NA
is how R encodes missing values where NA
indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA
is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it. Perhaps the data was not collected at all because it was too difficult to do so. Perhaps there was an erroneous value that someone entered that has been corrected to read as missing. You’ll often encounter issues with missing values when working with real data.
Going back to our summary_round
output, by default any time you try to calculate a summary statistic of a variable that has one or more NA
missing values in R, NA
is returned. If what you wish to calculate is the summary of all the valid, non-missing values, you can set the na.rm
argument to TRUE
, where rm
is short for "remove". The code that follows computes the mean and standard deviation of all non-missing values of draft_round:
summary_round <- nba %>%
  summarize(
    mean = mean(draft_round, na.rm = TRUE),
    std_dev = sd(draft_round, na.rm = TRUE)
  )
summary_round
# A tibble: 1 × 2
mean std_dev
<dbl> <dbl>
1 1.30 0.507
Notice how na.rm = TRUE is used as an argument to the mean() and sd() summary functions individually, and not to the summarize() function.
However, one needs to be cautious whenever ignoring missing values as we’ve just done. We will consider the possible ramifications of blindly sweeping rows with missing values “under the rug”. This is in fact why the na.rm
argument to any summary statistic function in R is set to FALSE
by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis.
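Before setting na.rm = TRUE, it can be worth checking how much missingness you are about to ignore. A minimal sketch: sum() of is.na() counts the TRUE values, and mean() of is.na() gives the proportion:

```r
nba %>%
  summarize(
    n_missing = sum(is.na(draft_round)),     # number of missing values
    prop_missing = mean(is.na(draft_round))  # proportion of rows that are missing
  )
```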
What are other summary functions we can use inside the summarize()
verb to compute summary statistics? You can use any function in R that takes many values and returns just one. Here are just a few:
- mean(): the average
- sd(): the standard deviation, which is a measure of spread
- min() and max(): the minimum and maximum values, respectively
- IQR(): the interquartile range
- sum(): the total amount when adding multiple numbers
- n(): a count of the number of rows in each group. This particular summary function will make more sense when group_by() is covered below.
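Several of these can be combined in a single summarize() call. A quick sketch on the pts variable (the output column names here are our own choices):

```r
nba %>%
  summarize(
    min_pts = min(pts, na.rm = TRUE),
    max_pts = max(pts, na.rm = TRUE),
    total_pts = sum(pts, na.rm = TRUE),
    n_rows = n()
  )
```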
Imagine a public health researcher is studying the effect of smoking on lung cancer for a large number of patients who are observed every five years. She notices that a large number of patients have missing data points, particularly in later observations. The researcher takes the approach outlined above, choosing to ignore these patients in her analysis. How might this be misleading?
Let's see just a couple more examples of summaries. Here we ask for the number of unique colleges that appear in the data set:
nba %>%
  summarise(n_colleges = n_distinct(college))
# A tibble: 1 × 1
n_colleges
<int>
1 345
Here we combine a filter operation with a summarize operation to calculate the average number of points scored by players in the 2000-2001 NBA season:
nba %>%
  filter(season == "2000-01") %>%
  summarize(total_points = mean(pts))
# A tibble: 1 × 1
total_points
<dbl>
1 7.81
7.5 select
Sometimes, as in the nba
data frame, you have many, many columns. Often many more than you need or want. The select
function (verb) allows you to select columns from a data frame using a flexible, but concise mini-language. Let’s see a couple of simple examples. Let’s first select the pts
column so that we can isolate it and inspect it without having to deal with the (many) other columns:
nba %>%
  select(pts)
# A tibble: 12,305 × 1
pts
<dbl>
1 5.7
2 2.3
3 0.8
4 3.7
5 2.4
6 8.2
7 17.2
8 14.9
9 5.7
10 6.9
# ℹ 12,295 more rows
What if we wish to inspect a small number of columns?
nba %>%
  select(player_name, team_abbreviation, pts, ast)
# A tibble: 12,305 × 4
player_name team_abbreviation pts ast
<chr> <chr> <dbl> <dbl>
1 Dennis Rodman CHI 5.7 3.1
2 Dwayne Schintzius LAC 2.3 0.3
3 Earl Cureton TOR 0.8 0.4
4 Ed O'Bannon DAL 3.7 0.6
5 Ed Pinckney MIA 2.4 0.2
6 Eddie Johnson HOU 8.2 1
7 Eddie Jones LAL 17.2 3.4
8 Elden Campbell LAL 14.9 1.6
9 Eldridge Recasner ATL 5.7 1.3
10 Elliot Perry MIL 6.9 3
# ℹ 12,295 more rows
We can also use select to remove individual columns:
nba %>%
  select(-player_name)
# A tibble: 12,305 × 21
...1 team_abbreviation age player_height player_weight college country
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 0 CHI 36 198. 99.8 Southeaste… USA
2 1 LAC 28 216. 118. Florida USA
3 2 TOR 39 206. 95.3 Detroit Me… USA
4 3 DAL 24 203. 101. UCLA USA
5 4 MIA 34 206. 109. Villanova USA
6 5 HOU 38 201. 97.5 Illinois USA
7 6 LAL 25 198. 86.2 Temple USA
8 7 LAL 28 213. 113. Clemson USA
9 8 ATL 29 193. 86.2 Washington USA
10 9 MIL 28 183. 72.6 Memphis USA
# ℹ 12,295 more rows
# ℹ 14 more variables: draft_year <dbl>, draft_round <dbl>, draft_number <dbl>,
# gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>, net_rating <dbl>,
# oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>, ts_pct <dbl>, ast_pct <dbl>,
# season <chr>
But select is far more powerful than these simple examples suggest. Let’s again select several columns, but group the column names using the c()
function we have seen before:
nba %>%
  select(c(player_name, team_abbreviation, pts, ast))
# A tibble: 12,305 × 4
player_name team_abbreviation pts ast
<chr> <chr> <dbl> <dbl>
1 Dennis Rodman CHI 5.7 3.1
2 Dwayne Schintzius LAC 2.3 0.3
3 Earl Cureton TOR 0.8 0.4
4 Ed O'Bannon DAL 3.7 0.6
5 Ed Pinckney MIA 2.4 0.2
6 Eddie Johnson HOU 8.2 1
7 Eddie Jones LAL 17.2 3.4
8 Elden Campbell LAL 14.9 1.6
9 Eldridge Recasner ATL 5.7 1.3
10 Elliot Perry MIL 6.9 3
# ℹ 12,295 more rows
Now, we can select all the remaining columns, simply by inverting the selection criteria:
nba %>%
  select(!c(player_name, team_abbreviation, pts, ast))
# A tibble: 12,305 × 18
...1 age player_height player_weight college country draft_year
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 0 36 198. 99.8 Southeastern Okla… USA 1986
2 1 28 216. 118. Florida USA 1990
3 2 39 206. 95.3 Detroit Mercy USA 1979
4 3 24 203. 101. UCLA USA 1995
5 4 34 206. 109. Villanova USA 1985
6 5 38 201. 97.5 Illinois USA 1981
7 6 25 198. 86.2 Temple USA 1994
8 7 28 213. 113. Clemson USA 1990
9 8 29 193. 86.2 Washington USA 1992
10 9 28 183. 72.6 Memphis USA 1991
# ℹ 12,295 more rows
# ℹ 11 more variables: draft_round <dbl>, draft_number <dbl>, gp <dbl>,
# reb <dbl>, net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>
Or we can select columns based on pieces of the column names, rather than the entire name. Some of the columns in our nba
data represent raw numbers (e.g., ast
) and some represent percentages (e.g., ast_pct
). Let’s select any non-percentage columns:
nba %>%
  select(!ends_with("_pct"))
# A tibble: 12,305 × 17
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0 Dennis Rod… CHI 36 198. 99.8 Southe…
2 1 Dwayne Sch… LAC 28 216. 118. Florida
3 2 Earl Curet… TOR 39 206. 95.3 Detroi…
4 3 Ed O'Bannon DAL 24 203. 101. UCLA
5 4 Ed Pinckney MIA 34 206. 109. Villan…
6 5 Eddie John… HOU 38 201. 97.5 Illino…
7 6 Eddie Jones LAL 25 198. 86.2 Temple
8 7 Elden Camp… LAL 28 213. 113. Clemson
9 8 Eldridge R… ATL 29 193. 86.2 Washin…
10 9 Elliot Per… MIL 28 183. 72.6 Memphis
# ℹ 12,295 more rows
# ℹ 10 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, season <chr>
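ends_with() is one of a family of "tidyselect" helpers; starts_with() and contains() work the same way but match the beginning or any part of a column name. A small sketch:

```r
# columns whose names start with "player_"
nba %>%
  select(starts_with("player_"))

# columns whose names contain "draft"
nba %>%
  select(contains("draft"))
```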
Some dplyr functions allow users to select columns based on the order in which they are stored in the data frame. For example, in the nba data set the column player_weight is stored four positions after player_name, so calling select(nba, player_name:player_weight) will select those five consecutive columns. I strongly discourage you from relying on this sort of functionality. It is extremely easy to reorder the columns, and there is no loss of information when you do so. The columns are named, so reordering means that you can always find the information in a column again, regardless of the column order. For these reasons, someone (e.g., whoever is collecting your data) may arbitrarily reorder columns and think that doing so is harmless. If you avoid relying on the column order, it will be harmless!
7.6 group_by
Say instead of a single mean number of points for the entire NBA, you would like a mean number of points separately for each college players attended. In other words, we would like to compute the mean number of points split by college. We can do this by “grouping” the pts
measurements by the values of another variable, in this case by the values of the variable college
. Run the following code:
nba %>%
  group_by(college) %>%
  summarize(mean = mean(pts),
            std_dev = sd(pts))
# A tibble: 345 × 3
college mean std_dev
<chr> <dbl> <dbl>
1 "" 9.9 3.79
2 "Alabama" 9.21 5.83
3 "Alabama A&M" 2.2 NA
4 "Alabama Huntsville" 2.05 0.778
5 "Alabama-Birmingham" 2.4 1.42
6 "Albany State (GA)" 0.45 0.212
7 "American International" 8.34 2.71
8 "American University" 6 8.49
9 "Arizona" 8.93 5.83
10 "Arizona St." 0 NA
# ℹ 335 more rows
This code is similar to the previous code that created summary_round
, but with an extra group_by(college)
added before the summarize()
. Grouping the nba
dataset by college
and then applying the summarize()
function yields a data frame that displays the mean and standard deviation of pts
split by the different colleges that appear in the data set.
It is important to note that the group_by()
function doesn’t change data frames by itself. Rather it changes the meta-data, or data about the data, specifically the grouping structure. It is only after we apply the summarize()
function that the data frame changes.
For example, when we run this code:
nba
# A tibble: 12,305 × 22
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0 Dennis Rod… CHI 36 198. 99.8 Southe…
2 1 Dwayne Sch… LAC 28 216. 118. Florida
3 2 Earl Curet… TOR 39 206. 95.3 Detroi…
4 3 Ed O'Bannon DAL 24 203. 101. UCLA
5 4 Ed Pinckney MIA 34 206. 109. Villan…
6 5 Eddie John… HOU 38 201. 97.5 Illino…
7 6 Eddie Jones LAL 25 198. 86.2 Temple
8 7 Elden Camp… LAL 28 213. 113. Clemson
9 8 Eldridge R… ATL 29 193. 86.2 Washin…
10 9 Elliot Per… MIL 28 183. 72.6 Memphis
# ℹ 12,295 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>
Observe that the first line of the output reads # A tibble: 12305 x 22
. This is an example of meta-data, in this case the number of observations/rows and variables/columns in nba
. The actual data itself are the subsequent table of values. Now let’s pipe the nba
data frame into group_by(college)
:
nba %>%
  group_by(college)
# A tibble: 12,305 × 22
# Groups: college [345]
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0 Dennis Rod… CHI 36 198. 99.8 Southe…
2 1 Dwayne Sch… LAC 28 216. 118. Florida
3 2 Earl Curet… TOR 39 206. 95.3 Detroi…
4 3 Ed O'Bannon DAL 24 203. 101. UCLA
5 4 Ed Pinckney MIA 34 206. 109. Villan…
6 5 Eddie John… HOU 38 201. 97.5 Illino…
7 6 Eddie Jones LAL 25 198. 86.2 Temple
8 7 Elden Camp… LAL 28 213. 113. Clemson
9 8 Eldridge R… ATL 29 193. 86.2 Washin…
10 9 Elliot Per… MIL 28 183. 72.6 Memphis
# ℹ 12,295 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>
Observe that now there is additional meta-data: # Groups: college [345]
indicating that the grouping structure meta-data has been set based on the unique values of the variable college
. On the other hand, observe that the data has not changed: it is still a table of 12305 \(\times\) 22 values.
Only by combining a group_by()
with another data wrangling operation, in this case summarize()
, will the data actually be transformed.
nba %>%
  group_by(college) %>%
  summarize(mean = mean(pts))
# A tibble: 345 × 2
college mean
<chr> <dbl>
1 "" 9.9
2 "Alabama" 9.21
3 "Alabama A&M" 2.2
4 "Alabama Huntsville" 2.05
5 "Alabama-Birmingham" 2.4
6 "Albany State (GA)" 0.45
7 "American International" 8.34
8 "American University" 6
9 "Arizona" 8.93
10 "Arizona St." 0
# ℹ 335 more rows
If you would like to remove this grouping structure meta-data, we can pipe the resulting data frame into the ungroup()
function:
nba %>%
  group_by(college) %>%
  ungroup()
# A tibble: 12,305 × 22
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0 Dennis Rod… CHI 36 198. 99.8 Southe…
2 1 Dwayne Sch… LAC 28 216. 118. Florida
3 2 Earl Curet… TOR 39 206. 95.3 Detroi…
4 3 Ed O'Bannon DAL 24 203. 101. UCLA
5 4 Ed Pinckney MIA 34 206. 109. Villan…
6 5 Eddie John… HOU 38 201. 97.5 Illino…
7 6 Eddie Jones LAL 25 198. 86.2 Temple
8 7 Elden Camp… LAL 28 213. 113. Clemson
9 8 Eldridge R… ATL 29 193. 86.2 Washin…
10 9 Elliot Per… MIL 28 183. 72.6 Memphis
# ℹ 12,295 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>
Observe how the # Groups: college [345]
meta-data is no longer present.
Let’s now revisit the n()
counting summary function we briefly introduced previously. Recall that the n()
function counts rows. This is opposed to the sum()
summary function that returns the sum of a numerical variable. For example, if we'd like to count how many rows in nba are associated with each of the different colleges, we can run this:
by_college <- nba %>%
  group_by(college) %>%
  summarize(count = n())
by_college
# A tibble: 345 × 2
college count
<chr> <int>
1 "" 5
2 "Alabama" 119
3 "Alabama A&M" 1
4 "Alabama Huntsville" 2
5 "Alabama-Birmingham" 7
6 "Albany State (GA)" 2
7 "American International" 5
8 "American University" 2
9 "Arizona" 279
10 "Arizona St." 1
# ℹ 335 more rows
We see that there are 119 rows in which college=="Alabama"
and 279 rows in which college=="Arizona"
. Note there is a subtle but important difference between sum()
and n()
; while sum()
returns the sum of a numerical variable (adding), n()
returns a count of the number of rows/observations (counting).
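To see this contrast concretely, here is a sketch computing both in the same grouped pipeline, using the gp (games played) column:

```r
nba %>%
  group_by(college) %>%
  summarize(
    n_rows = n(),        # counting: number of rows (player-seasons) per college
    total_gp = sum(gp)   # adding: total games played across those rows
  )
```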
7.6.1 Grouping by more than one variable
You are not limited to grouping by one variable. Say you want to know the number of players coming from each college for each NBA season. We can also group by a second variable season
using group_by(college, season)
:
by_college_annually <- nba %>%
  group_by(college, season) %>%
  summarize(count = n())
`summarise()` has grouped output by 'college'. You can override using the
`.groups` argument.
by_college_annually
# A tibble: 3,637 × 3
# Groups: college [345]
college season count
<chr> <chr> <int>
1 "" 2017-18 2
2 "" 2018-19 2
3 "" 2020-21 1
4 "Alabama" 1996-97 9
5 "Alabama" 1997-98 10
6 "Alabama" 1998-99 8
7 "Alabama" 1999-00 6
8 "Alabama" 2000-01 7
9 "Alabama" 2001-02 6
10 "Alabama" 2002-03 4
# ℹ 3,627 more rows
Observe that there are now 3637 rows in by_college_annually because there are 3637 unique combinations of college and season.
Why do we group_by(college, season)
and not group_by(college)
and then group_by(season)
? Let’s investigate:
by_college_annually_incorrect <- nba %>%
  group_by(college) %>%
  group_by(season) %>%
  summarize(count = n())
by_college_annually_incorrect
# A tibble: 26 × 2
season count
<chr> <int>
1 1996-97 441
2 1997-98 439
3 1998-99 439
4 1999-00 438
5 2000-01 441
6 2001-02 440
7 2002-03 428
8 2003-04 442
9 2004-05 464
10 2005-06 458
# ℹ 16 more rows
What happened here is that the second group_by(season)
overwrote the grouping structure meta-data of the earlier group_by(college)
, so that in the end we are only grouping by season
. The lesson here is that if you want to group_by() two or more variables, you should include all of them at the same time in the same group_by() call, separating the variable names with commas.
7.7 mutate
Another common transformation of data is to create/compute new variables based on existing ones. For example, the heights in our nba
data set are measured in units of centimeters. But say you are more comfortable thinking of inches instead of centimeters. The formula to convert centimeters to inches is:
\[ \text{height in inches} = \frac{\text{height in centimeters}}{2.54} \]
We can apply this formula to the player_height
variable using the mutate()
function from the dplyr
package, which takes existing variables and mutates them to create new ones.
nba <- nba %>%
  mutate(player_height_in_inch = player_height / 2.54)
In this code, we mutate()
the nba
data frame by creating a new variable player_height_in_inch = player_height / 2.54
and then overwrite the original nba
data frame. Why did we overwrite the data frame nba
, instead of assigning the result to a new data frame like nba_new
? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable player_height
, but instead created a new variable called player_height_in_inch
? Because if we did this, we would have erased the original information contained in player_height
of heights in centimeters that may still be valuable to us.
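The same mutate() pattern handles the kind of normalization mentioned at the start of this chapter. A sketch that adds a hypothetical pts_z column with mean zero and standard deviation one (the column name is our own):

```r
nba <- nba %>%
  mutate(pts_z = (pts - mean(pts, na.rm = TRUE)) / sd(pts, na.rm = TRUE))
```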
Let's consider another example. We can imagine placing players on a continuum, with one end representing the "star" players and the other end representing "support" players. We can quantify this dimension by comparing a player's average points per game (pts) to his average assists per game (ast). Let's calculate this new variable using the mutate() function:
nba <- nba %>%
  mutate(star_support = pts - ast)
Let’s take a look at only the pts
, ast
, and the resulting star_support
variables for a few rows in our updated nba
data frame.
# A tibble: 5 × 3
pts ast star_support
<dbl> <dbl> <dbl>
1 5.7 3.1 2.6
2 2.3 0.3 2
3 17.2 3.4 13.8
4 1.5 2.9 -1.4
5 3.7 0.6 3.1
The player in the third row scored an average of 17.2 points per game and made an average of 3.4 assists per game, so his star_support value is \(17.2 - 3.4 = 13.8\). On the other hand, the player in the fourth row averaged 1.5 points and 2.9 assists, so his star_support value is \(1.5 - 2.9 = -1.4\).
Let’s look at some summary statistics of the star_support
variable by considering multiple summary functions at once in the same summarize()
code:
star_support_summary <- nba %>%
  summarize(
    min = min(star_support, na.rm = TRUE),
    q1 = quantile(star_support, 0.25, na.rm = TRUE),
    median = quantile(star_support, 0.5, na.rm = TRUE),
    q3 = quantile(star_support, 0.75, na.rm = TRUE),
    max = max(star_support, na.rm = TRUE),
    mean = mean(star_support, na.rm = TRUE),
    sd = sd(star_support, na.rm = TRUE),
    missing = sum(is.na(star_support))
  )
star_support_summary
# A tibble: 1 × 8
min q1 median q3 max mean sd missing
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 -2.5 2.6 5.1 9 30.9 6.36 4.97 0
We see for example that the average value is 6.36 points, whereas the largest is 30.9! This summary contains quite a bit of information. However, it is often easier to visualize data to evaluate how values are distributed. We’ll take a look at that in Chapter 8.
To close out our discussion on the mutate()
function to create new variables, note that we can create multiple new variables at once in the same mutate()
code. Furthermore, within the same mutate()
code we can refer to new variables we just created. Consider the following:
nba <- nba %>%
  mutate(
    player_height_in_inch = player_height / 2.54,
    player_weight_in_lbs = player_weight * 2.2,
    bmi = 703 * (player_weight_in_lbs / (player_height_in_inch ^ 2))
  )
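To see this chaining in action on its own, here is a self-contained sketch using a couple of made-up heights and weights (in centimeters and kilograms, matching the units in nba). The bmi line refers to the two columns defined immediately above it in the same mutate() call:

```r
library(dplyr)

# Made-up example rows (cm and kg, the units used in the nba data)
players <- tibble(
  player_height = c(198.12, 215.9),
  player_weight = c(99.8, 118.0)
)

players <- players %>%
  mutate(
    height_in = player_height / 2.54,           # cm -> inches
    weight_lb = player_weight * 2.20462,        # kg -> pounds
    bmi       = 703 * weight_lb / height_in^2   # uses the two columns just created
  )

players
```

If the bmi line came first, mutate() would fail because height_in and weight_lb would not exist yet; order within a single mutate() matters.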
7.8 arrange
One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in the alphanumeric order of one of the variables. The dplyr package’s arrange() function allows us to sort/reorder a data frame’s rows according to the values of the specified variable.
mean_pts <- nba %>%
  group_by(college) %>%
  summarize(mean = mean(pts),
            std_dev = sd(pts))
mean_pts
# A tibble: 345 × 3
college mean std_dev
<chr> <dbl> <dbl>
1 "" 9.9 3.79
2 "Alabama" 9.21 5.83
3 "Alabama A&M" 2.2 NA
4 "Alabama Huntsville" 2.05 0.778
5 "Alabama-Birmingham" 2.4 1.42
6 "Albany State (GA)" 0.45 0.212
7 "American International" 8.34 2.71
8 "American University" 6 8.49
9 "Arizona" 8.93 5.83
10 "Arizona St." 0 NA
# ℹ 335 more rows
Observe that by default the rows of the resulting mean_pts data frame are sorted in alphabetical order of college. Say instead we would like to see the same data, but sorted from the highest to the lowest average points (mean):
mean_pts %>%
  arrange(mean)
# A tibble: 345 × 3
college mean std_dev
<chr> <dbl> <dbl>
1 Arizona St. 0 NA
2 Chicago St. 0 NA
3 Denver 0 NA
4 Fairfield 0 NA
5 George Mason 0 NA
6 Lebanon Valley 0 NA
7 Lincoln Memorial 0 NA
8 University of Colorado Boulder 0 NA
9 Toledo 0.2 NA
10 Albany State (GA) 0.45 0.212
# ℹ 335 more rows
This is, however, the opposite of what we want: the rows are sorted so that the colleges whose players scored the fewest points are displayed first. This is because arrange() always returns rows sorted in ascending order by default. To switch the ordering to “descending” order instead, we use the desc() function like so:
mean_pts %>%
  arrange(desc(mean))
# A tibble: 345 × 3
college mean std_dev
<chr> <dbl> <dbl>
1 Davidson 19.6 9.95
2 Lehigh 18.4 7.06
3 Western Carolina 16.1 7.50
4 Navy 15.4 4.25
5 Marist 15.4 1.92
6 Butler Community College 13.6 6.57
7 Louisiana Tech 13.5 6.78
8 Trinity Valley Community College 13.5 6.63
9 Weber State 13.3 11.6
10 Central Arkansas 13.1 4.73
# ℹ 335 more rows
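arrange() also accepts multiple columns: it sorts by the first and breaks ties using the later ones, and desc() can wrap any individual column. A minimal sketch with made-up values (note the tie at 15.4):

```r
library(dplyr)

# Made-up summary rows; two colleges tie on mean
df <- tibble(
  college = c("Navy", "Marist", "Lehigh"),
  mean    = c(15.4, 15.4, 18.4)
)

# Sort by mean (descending), breaking ties alphabetically by college
df %>%
  arrange(desc(mean), college)
```

Here Lehigh comes first (highest mean), and the tie between Marist and Navy is resolved alphabetically.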
7.9 slice
Now that you can put the rows of a data frame into some sensible order, we will learn how to select rows based on their position within the data frame (contrast this with the earlier advice about ignoring the position/order of columns within the data frame). Within dplyr, there is a very useful function called slice(). In some ways, this function is similar to filter(). However, filter() subsets rows based on TRUE or FALSE values (e.g., whether values are greater than zero). In contrast, slice() takes row numbers (i.e., integers) and returns the rows in the corresponding positions. Let’s look at a few examples.
nba %>%
  slice(1, 2, 1048)
# A tibble: 3 × 24
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0 Dennis Rodm… CHI 36 198. 99.8 Southe…
2 1 Dwayne Schi… LAC 28 216. 118. Florida
3 1047 Stephon Mar… NJN 22 188. 81.6 Georgi…
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
# star_support <dbl>
This code grabs rows 1, 2, and 1048 and returns a data frame including these three rows. As usual, you can use ranges instead of writing things like slice(1, 2, 3, 4, 5, 6, ...):
# grab the first 10 rows
nba %>%
  slice(1:10)
# A tibble: 10 × 24
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 0 Dennis Rod… CHI 36 198. 99.8 Southe…
2 1 Dwayne Sch… LAC 28 216. 118. Florida
3 2 Earl Curet… TOR 39 206. 95.3 Detroi…
4 3 Ed O'Bannon DAL 24 203. 101. UCLA
5 4 Ed Pinckney MIA 34 206. 109. Villan…
6 5 Eddie John… HOU 38 201. 97.5 Illino…
7 6 Eddie Jones LAL 25 198. 86.2 Temple
8 7 Elden Camp… LAL 28 213. 113. Clemson
9 8 Eldridge R… ATL 29 193. 86.2 Washin…
10 9 Elliot Per… MIL 28 183. 72.6 Memphis
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
# star_support <dbl>
We can also grab the last several rows:
# probably don't do this
nba %>%
  slice(12000:nrow(nba))
# A tibble: 306 × 24
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 11999 D.J. Wilson TOR 26 208. 105. Michig…
2 12000 Jonathan K… GSW 19 201. 102. None
3 12001 Klay Thomp… GSW 32 198. 99.8 Washin…
4 12002 Jordan Cla… UTA 30 193. 88.0 Missou…
5 12003 Santi Alda… MEM 21 211. 97.5 Loyola…
6 12004 Sandro Mam… MIL 23 206. 109. Seton …
7 12005 Sam Merrill MEM 26 193. 93.0 Utah S…
8 12006 Sam Hauser BOS 24 201. 98.4 Virgin…
9 12007 Sam Dekker TOR 28 203. 99.3 Wiscon…
10 12008 Saddiq Bey DET 23 201. 97.5 Villan…
# ℹ 296 more rows
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
# star_support <dbl>
Note, however, that this method uses the number of rows (nrow()) in the full data frame. As a result, this will probably only work if we are applying slice() directly to our nba data frame and do not have any additional verbs in our chain before our call to slice(). How can we grab the last N rows if we don’t (necessarily) know how many rows are in the data frame we are applying slice() to? A more robust approach is to use the closely related slice_tail() function, which is specifically designed to produce the last rows of a data frame:
nba %>%
  slice_tail(n = 10)
# A tibble: 10 × 24
...1 player_name team_abbreviation age player_height player_weight college
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 12295 Mason Jones LAL 23 193. 90.7 Arkans…
2 12296 Marvin Bag… DET 23 211. 107. Duke
3 12297 Marquese C… DAL 24 206. 109. Washin…
4 12298 Markus How… DEN 23 178. 79.4 Marque…
5 12299 Marko Simo… CHI 22 211. 99.8 None
6 12300 Markieff M… MIA 32 206. 111. Kansas
7 12301 Markelle F… ORL 24 193. 94.8 Washin…
8 12302 Marcus Sma… BOS 28 193. 99.8 Oklaho…
9 12303 Marcus Gar… MIA 23 196. 93.0 Kansas
10 12304 Micah Pott… DET 24 208. 112. Wiscon…
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
# draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
# net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
# ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
# star_support <dbl>
Ok. This all seems fine. But unless you know ahead of time what data might live in particular rows, this may not seem all that useful. The real power comes when you combine slice() with something like arrange(). Let’s see how these verbs work together to produce something a bit more useful.
nba %>%
  arrange(desc(pts)) %>%
  slice(1:10) %>%
  select("player_name", "season", "pts")
# A tibble: 10 × 3
player_name season pts
<chr> <chr> <dbl>
1 James Harden 2018-19 36.1
2 Kobe Bryant 2005-06 35.4
3 James Harden 2019-20 34.3
4 Allen Iverson 2005-06 33
5 Tracy McGrady 2002-03 32.1
6 Kevin Durant 2013-14 32
7 Stephen Curry 2020-21 32
8 Kobe Bryant 2006-07 31.6
9 Russell Westbrook 2016-17 31.6
10 Allen Iverson 2001-02 31.4
This code first sorts the data frame by the pts column (highest to lowest because of our application of desc()) and then grabs the first 10 rows. Because we first use arrange(), we know that these 10 rows are associated with the 10 highest season scoring averages in our data set. We then select() a relatively small number of columns so we can view the result easily.
We have seen slice() and slice_tail() in action. But there are more functions in the family. Like slice_tail(), there is also slice_head(), which selects the first rows of a data frame. There are also the functions slice_min() and slice_max(), which combine sorting and slicing operations. So we can redo our code above to find the 10 highest season scoring averages in our data set using slice_max():
nba %>%
  slice_max(pts, n = 10) %>%
  select("player_name", "season", "pts")
# A tibble: 11 × 3
player_name season pts
<chr> <chr> <dbl>
1 James Harden 2018-19 36.1
2 Kobe Bryant 2005-06 35.4
3 James Harden 2019-20 34.3
4 Allen Iverson 2005-06 33
5 Tracy McGrady 2002-03 32.1
6 Kevin Durant 2013-14 32
7 Stephen Curry 2020-21 32
8 Kobe Bryant 2006-07 31.6
9 Russell Westbrook 2016-17 31.6
10 Allen Iverson 2001-02 31.4
11 LeBron James 2005-06 31.4
Notice that slice_max() returned 11 rows even though we asked for 10. By default it keeps ties (here, Allen Iverson and LeBron James both averaged 31.4 points); you can pass with_ties = FALSE if you want exactly n rows. All of these functions (except slice() itself) accept either an n argument (the number of rows to grab) or a prop argument (the proportion of all rows to grab). As with all functions in dplyr and the tidyverse, the real power is ultimately realized by including these functions in extended chains of operations. You can group_by() and then slice(), you can mutate() and then arrange() and then slice_tail(), and so on. The possibilities are endless.
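For instance, a group_by() followed by slice_max() finds the top row within each group. Here is a sketch on a small made-up tibble; applied to nba, the same pattern would return the top scorer of each season.

```r
library(dplyr)

# Made-up per-player season scoring averages
scores <- tibble(
  season = c("2018-19", "2018-19", "2019-20", "2019-20"),
  player = c("A", "B", "A", "B"),
  pts    = c(30, 25, 28, 31)
)

# Highest pts within each season; ungroup() drops the grouping afterwards
scores %>%
  group_by(season) %>%
  slice_max(pts, n = 1) %>%
  ungroup()
```

Because slice_max() operates group-by-group here, it returns one row per season rather than one row overall.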
7.10 Activities
If you haven’t already, load the nba data set we downloaded back in Section 4.2. And let’s pass c("Undrafted") as the na argument:
nba <- read_csv("./data/nba_all_seasons.csv", na = c("Undrafted"))
- Produce a data frame consisting exclusively of players under the age of 25.
- Produce a data frame consisting exclusively of players who attended “Marquette”.
- Produce a data frame consisting exclusively of players who are either under the age of 25 or attended “Marquette”.
- Produce a data frame consisting exclusively of players who are under the age of 25 and attended “Marquette”.
- Determine the average height of all players.
- Determine the minimum weight in the nba data set.
- Determine the range of ages seen in the nba data set.
- Determine the interquartile range of all pts seen in the nba data set.
- Determine the median height of all players who attended “Connecticut”.
- Use select to produce a data frame that only includes columns containing player names and team abbreviations.
- Use select to produce a data frame that includes all columns except those containing player names and team abbreviations.
- Use select to produce a data frame that includes all columns whose column names include the string “ea” (hint: use contains()).
- Use select to produce a data frame that includes all columns whose column names do not include an underscore (i.e., “_”).
- Bonus: Use select to produce a data frame that includes all columns whose column names contain more than 11 characters (hint: you may need to use some combination of names() and nchar()).
- For each college appearing in the data set, what is the average (mean) of pts and ast among players who attended that college?
- Calculate the mean of player_weight and the mean of player_height for each season in the data set.
- Try calculating these per-school averages separately for each season (e.g., means for each combination of college and season).
- Create a new column in the data set called reb_diff and calculate it as the difference between the offensive rebound percentage (oreb_pct) and the defensive rebound percentage (dreb_pct).
- Calculate the mean and standard deviation of reb_diff.
- Find the 10 largest values of reb_diff.
- Create a new column in the data set called pts_over_mean and calculate it as the difference between that row’s pts value and the overall mean of pts in the entire data set.
- Generate a data frame with 3 columns: the player’s name, season, and pts_over_mean, so that you can quickly view your new column along with who and when it is associated with.
- Use filter() to find those rows corresponding to players who went to “UCLA”, and then use arrange() to order the data by pts.
- For each college appearing in the data set, what is the average pts of players who attended that college? Which college has produced the highest average? Which college has produced the lowest average?
- Use filter() to select those data associated with the New York Knicks (abbreviated NYK), and then use select() to remove the column containing the team abbreviation. Next, remove all undrafted players (e.g., those who do not have a valid draft_year). Of these players, find the 6 players with the highest pts. If there are duplicates (i.e., any player appearing multiple times in the top 6), find the 6 unique players with the highest pts (so that there are 6 players but no duplicates).
- Group the data by college and team_abbreviation. Now calculate the mean of pts for each of these combinations. Now add up all these means. You should have a single number (the sum of some large number of means). Does the chain of operations you initially try work as expected? Does the documentation on dplyr’s ungroup() function help?