7  Manipulating Data

Once you have data loaded, it’s time to begin what is variously called “wrangling”, “munging”, “cleaning”, or “pre-processing”. These terms all refer to the activities that are almost always necessary between the point at which you acquire your “raw” data and the point at which you are ready to begin your statistical analyses. For example, you might wish to add together several different variables to get a new variable that is the sum. Or you might need to normalize a variable’s measured values so that it has a mean of zero and a standard deviation of one.

7.1 What is dplyr?

In the tidyverse, the package responsible for such activities is called dplyr. This package contains a set of functions, each of which manipulates data in a particular way. Each of these functions is very useful on its own. However, the real power of dplyr comes from the fact that you can repeatedly apply dplyr functions, each operating on the result from the previous. These “chains” of functions allow you to compose complex data manipulation operations that ultimately transform your data into whatever you need. For this reason, dplyr is sometimes said to instantiate a “grammar” of data manipulation, a grammar which is made up of several “verbs” (functions). These functions include:

  1. filter(): select a subset of rows from a data frame
  2. arrange(): sort a data frame’s rows
  3. mutate(): create new columns/variables based on existing columns/variables
  4. summarize(): aggregate one or more columns/variables with a summary statistic (e.g., the mean)
  5. group_by(): assign rows to groups, such that each group shares some values in common

Because dplyr is part of the tidyverse, these all work similarly. Specifically, each dplyr function:

  1. Takes a data frame as its first argument
  2. Takes additional arguments that often indicate which columns are to be operated on
  3. Returns a modified data frame

Let’s see some of these characteristics in action. To do so, we’ll first load the data we downloaded back in Section 4.2. Run the following:

nba <- read_csv("./data/nba_all_seasons.csv", na = c("Undrafted"))

Note that we have utilized the na argument here. This tells read_csv that we would like any values found in the file that match "Undrafted" to be treated as “missing”. R uses NA to represent missing values and so any "Undrafted" values will be converted into NA.

7.2 filter

The filter() function allows you to specify criteria about the values of a variable in your data set and then keeps only the rows that match those criteria.

The team_abbreviation for the New York Knicks is "NYK". Run the following:

filter(nba, team_abbreviation == "NYK")
# A tibble: 410 × 22
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1    81 Herb Willi… NYK                  39          211.         118.  Ohio S…
 2   146 Allan Hous… NYK                  26          198.          90.7 Tennes…
 3   195 Buck Willi… NYK                  37          203.         102.  Maryla…
 4   204 Charles Oa… NYK                  33          206.         111.  Virgin…
 5   209 Chris Chil… NYK                  29          190.          88.5 Boise …
 6   212 Chris Jent  NYK                  27          201.          99.8 Ohio S…
 7   218 Charlie Wa… NYK                  26          188.          86.2 Florid…
 8   230 Scott Broo… NYK                  31          180.          74.8 Califo…
 9   293 Walter McC… NYK                  23          208.         104.  Kentuc…
10   347 Larry John… NYK                  28          201.         119.  Nevada…
# ℹ 400 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>

If you prefer a more thorough inspection, you can look at the results in RStudio’s spreadsheet viewer to ensure that only players on the Knicks are chosen:

View(filter(nba, team_abbreviation == "NYK"))

In either case, we are asking for a test of equality, keeping any rows where team_abbreviation=="NYK" is true and removing (filtering) any rows where team_abbreviation=="NYK" is false. To do so, we use the double equal sign == (not the single equal sign =). Trying filter(nba, team_abbreviation = "NYK") will yield an error. This is a convention across many programming languages. If you are new to coding, you’ll probably forget to use the double equal sign == a few times before you get the hang of it.

The equality operator is not the only operator available to us. Others include:

  • > corresponds to “greater than”
  • < corresponds to “less than”
  • >= corresponds to “greater than or equal to”
  • <= corresponds to “less than or equal to”
  • != corresponds to “not equal to.” The ! is used in many programming languages to indicate “not.”
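
Each of these operators can be used inside filter() just as == was above. For example, the following sketches (using the pts and draft_round columns from our nba data) keep players who averaged at least 20 points per game, and players who were not drafted in the first round, respectively:

```r
# Keep players who averaged 20 or more points per game
filter(nba, pts >= 20)

# Keep players who were not drafted in the first round
filter(nba, draft_round != 1)
```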

Furthermore, you can combine multiple criteria using the logical operators:

  • | corresponds to “or”
  • & corresponds to “and”

Let’s look at an example that uses the ! “not” operator to pick rows that don’t match a criterion. As mentioned earlier, the ! can be read as “not.” Here we are filtering for rows corresponding to players that are not from the United States:

filter(nba, country != "USA")

Let’s combine two different requirements using the |:

filter(nba, country == "Jamaica" | college == "Michigan")

This will select players that are either from Jamaica or went to Michigan (or both). We can also request the inverse of this.

filter(nba,!(country == "Jamaica" | college == "Michigan"))

This will select players that are not from Jamaica and did not go to Michigan. Note the use of parentheses around (country == "Jamaica" | college == "Michigan"). If we used the ! but did not use the parentheses, we would only be applying the “not” to the first test (country == "Jamaica"), not to the combination of (country == "Jamaica" | college == "Michigan"). You can try it and compare the results:

filter(nba,!country == "Jamaica" | college == "Michigan")

This request, in contrast to the one above, is for players that are not from Jamaica or went to Michigan (or both). So be very careful about the order of operations and use parentheses liberally. It helps to minimize errors and makes your code more explicit and therefore more readable.

Let’s see a slightly more complicated request that combines several different operators:

filter(nba,
       team_abbreviation == "NYK" &
           (country != "USA" | college == "Michigan") & age >= 25)

Here we have filtered the data to retain all rows corresponding to players from the NY Knicks, who are age 25 or older, and either went to Michigan or are not from the United States. You can try this yourself and verify that the output matches these requirements (remember that you can use View() if you wish to inspect the entire result).

Let’s request players that went to either Michigan, Duke, or Georgetown.

filter(nba,
       college == "Michigan" | college == "Duke" | college == "Georgetown")

This works, but as we progressively include more colleges, this will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Section 2.5 that the c() function “combines” or “concatenates” values into a single vector of values.

filter(nba,
       college %in% c("Michigan", "Duke", "Georgetown", "UCLA", "Kentucky"))

What this code is doing is filtering for all rows where college is in the vector of colleges c("Michigan", "Duke", "Georgetown", "UCLA", "Kentucky"). This approach produces the same results as a sequence of | operators, but takes much less energy to write (and read). The %in% operator is useful for checking whether the values in one vector/variable match any of the values in another.

As a final note, we recommend that filter() should often be among the first tidyverse “verbs” you consider applying to your data. It narrows the scope of your data down to just the observations you care about.

7.3 pipe

Before we go any further, let’s first introduce a nifty tool that gets loaded with the dplyr package: the pipe operator %>%. The pipe operator allows us to combine multiple tidyverse operations into a single sequential chain of actions.

Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x using hypothetical functions f(), g(), and h():

  1. Take x then
  2. Use x as an input to a function f() then
  3. Use the output of f(x) as an input to a function g() then
  4. Use the output of g(f(x)) as an input to a function h()

One way to achieve this sequence of operations is by using nesting parentheses as follows:

h(g(f(x)))

This code isn’t so hard to read since we are applying only three functions: f(), then g(), then h() and each of the functions has a short name. Further, each of these functions also only has one argument. However, you can imagine that this will get progressively more difficult to read as the number of functions applied in your sequence increases and the arguments in each function grow more numerous. This is where the pipe operator %>% comes in handy. The pipe takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read %>% as “then” or “and then”. For example, you can obtain the same output as the hypothetical sequence of functions as follows:

x %>%
    f() %>%
    g() %>%
    h()

You would read this sequence as:

  1. Take x then
  2. Use this output as the input to the next function f() then
  3. Use this output as the input to the next function g() then
  4. Use this output as the input to the next function h()

Though both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical x, f(), g(), and h()? Throughout this chapter on data wrangling:

  1. The starting value, x, will be a data frame. For example, the nba data frame we have been exploring so far.
  2. The sequence of functions, here f(), g(), and h(), will mostly be a sequence of any number of the data wrangling verb-named functions we listed above. For example, the filter(college == "Michigan") call we previewed earlier.
  3. The result will be the transformed/modified data frame that you want.

So instead of

filter(nba, team_abbreviation == "NYK")

we can instead write:

nba %>%
    filter(team_abbreviation == "NYK")

The benefits of this may not be immediately obvious. But the ability to form a chain of data wrangling operations by combining tidyverse functions (verbs) into a single sequence will be utilized extensively and is made possible by the pipe operator %>%.

7.4 summarize

The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (i.e., the “average”), the median, and the mode. Other examples of summary statistics that might not immediately come to mind include the sum, the smallest value also called the minimum, the largest value also called the maximum, and the standard deviation.

Let’s calculate two summary statistics of the draft_round variable in the nba data frame: the mean and standard deviation. To compute these summary statistics, we need the mean() and sd() summary functions in R. Summary functions in R take in many values and return a single value. More precisely, we’ll use the mean() and sd() summary functions within the summarize() function from the dplyr package. The summarize() function takes in a data frame and returns a data frame with only one row corresponding to the value of the summary statistic(s).

We’ll save the results in a new data frame called summary_round that will have two columns/variables: the mean and the std_dev:

summary_round <- nba %>%
    summarize(mean = mean(draft_round),
              std_dev = sd(draft_round))
summary_round
# A tibble: 1 × 2
   mean std_dev
  <dbl>   <dbl>
1    NA      NA

Why are the values returned NA? NA is how R encodes missing values where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it. Perhaps the data was not collected at all because it was too difficult to do so. Perhaps there was an erroneous value that someone entered that has been corrected to read as missing. You’ll often encounter issues with missing values when working with real data.

Going back to our summary_round output, by default any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. If what you wish to calculate is the summary of all the valid, non-missing values, you can set the na.rm argument to TRUE, where rm is short for “remove”. The code that follows computes the mean and standard deviation of all non-missing values of draft_round:

summary_round <- nba %>%
    summarize(
        mean = mean(draft_round, na.rm = TRUE),
        std_dev = sd(draft_round, na.rm = TRUE)
    )
summary_round
# A tibble: 1 × 2
   mean std_dev
  <dbl>   <dbl>
1  1.30   0.507

Notice how na.rm = TRUE is used as an argument to the mean() and sd() summary functions individually, and not to the summarize() function.

However, one needs to be cautious whenever ignoring missing values as we’ve just done. We will consider the possible ramifications of blindly sweeping rows with missing values “under the rug”. This is in fact why the na.rm argument to any summary statistic function in R is set to FALSE by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis.

What are other summary functions we can use inside the summarize() verb to compute summary statistics? You can use any function in R that takes many values and returns just one. Here are just a few:

  • mean(): the average
  • sd(): the standard deviation, which is a measure of spread
  • min() and max(): the minimum and maximum values, respectively
  • IQR(): interquartile range
  • sum(): the total amount when adding multiple numbers
  • n(): a count of the number of rows in each group. This particular summary function will make more sense when group_by() is covered below.
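
Most of these summary functions can be combined freely inside a single summarize() call. As a quick sketch, here we summarize the pts column several different ways at once (ignoring missing values, should any exist):

```r
nba %>%
    summarize(min_pts = min(pts, na.rm = TRUE),  # smallest value
              max_pts = max(pts, na.rm = TRUE),  # largest value
              iqr_pts = IQR(pts, na.rm = TRUE))  # interquartile range
```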

Sometimes missingness is data

Imagine a public health researcher is studying the effect of smoking on lung cancer for a large number of patients who are observed every five years. She notices that a large number of patients have missing data points, particularly in later observations. The researcher takes the approach outlined above, choosing to ignore these patients in her analysis. How might this be misleading?

Let’s see just a couple more examples of summaries. Here we ask for the number of unique colleges that appear in the data set:

nba %>%
    summarize(n_colleges = n_distinct(college))
# A tibble: 1 × 1
  n_colleges
       <int>
1        345

Here we combine a filter operation with a summarize operation to calculate the average number of points scored by players in the 2000-01 NBA season:

nba %>%
    filter(season == "2000-01") %>%
    summarize(mean_points = mean(pts))
# A tibble: 1 × 1
  mean_points
        <dbl>
1        7.81

7.5 select

Sometimes, as in the nba data frame, you have many, many columns. Often many more than you need or want. The select function (verb) allows you to select columns from a data frame using a flexible, but concise mini-language. Let’s see a couple of simple examples. Let’s first select the pts column so that we can isolate it and inspect it without having to deal with the (many) other columns:

nba %>%
    select(pts)
# A tibble: 12,305 × 1
     pts
   <dbl>
 1   5.7
 2   2.3
 3   0.8
 4   3.7
 5   2.4
 6   8.2
 7  17.2
 8  14.9
 9   5.7
10   6.9
# ℹ 12,295 more rows

What if we wish to inspect a small number of columns?

nba %>%
    select(player_name, team_abbreviation, pts, ast)
# A tibble: 12,305 × 4
   player_name       team_abbreviation   pts   ast
   <chr>             <chr>             <dbl> <dbl>
 1 Dennis Rodman     CHI                 5.7   3.1
 2 Dwayne Schintzius LAC                 2.3   0.3
 3 Earl Cureton      TOR                 0.8   0.4
 4 Ed O'Bannon       DAL                 3.7   0.6
 5 Ed Pinckney       MIA                 2.4   0.2
 6 Eddie Johnson     HOU                 8.2   1  
 7 Eddie Jones       LAL                17.2   3.4
 8 Elden Campbell    LAL                14.9   1.6
 9 Eldridge Recasner ATL                 5.7   1.3
10 Elliot Perry      MIL                 6.9   3  
# ℹ 12,295 more rows

We can also use select to remove individual columns:

nba %>%
    select(-player_name)
# A tibble: 12,305 × 21
    ...1 team_abbreviation   age player_height player_weight college     country
   <dbl> <chr>             <dbl>         <dbl>         <dbl> <chr>       <chr>  
 1     0 CHI                  36          198.          99.8 Southeaste… USA    
 2     1 LAC                  28          216.         118.  Florida     USA    
 3     2 TOR                  39          206.          95.3 Detroit Me… USA    
 4     3 DAL                  24          203.         101.  UCLA        USA    
 5     4 MIA                  34          206.         109.  Villanova   USA    
 6     5 HOU                  38          201.          97.5 Illinois    USA    
 7     6 LAL                  25          198.          86.2 Temple      USA    
 8     7 LAL                  28          213.         113.  Clemson     USA    
 9     8 ATL                  29          193.          86.2 Washington  USA    
10     9 MIL                  28          183.          72.6 Memphis     USA    
# ℹ 12,295 more rows
# ℹ 14 more variables: draft_year <dbl>, draft_round <dbl>, draft_number <dbl>,
#   gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>, net_rating <dbl>,
#   oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>, ts_pct <dbl>, ast_pct <dbl>,
#   season <chr>

But select is far more powerful than these simple examples suggest. Let’s again select several columns, but group the column names using the c() function we have seen before:

nba %>%
    select(c(player_name, team_abbreviation, pts, ast))
# A tibble: 12,305 × 4
   player_name       team_abbreviation   pts   ast
   <chr>             <chr>             <dbl> <dbl>
 1 Dennis Rodman     CHI                 5.7   3.1
 2 Dwayne Schintzius LAC                 2.3   0.3
 3 Earl Cureton      TOR                 0.8   0.4
 4 Ed O'Bannon       DAL                 3.7   0.6
 5 Ed Pinckney       MIA                 2.4   0.2
 6 Eddie Johnson     HOU                 8.2   1  
 7 Eddie Jones       LAL                17.2   3.4
 8 Elden Campbell    LAL                14.9   1.6
 9 Eldridge Recasner ATL                 5.7   1.3
10 Elliot Perry      MIL                 6.9   3  
# ℹ 12,295 more rows

Now we can select all the remaining columns, simply by inverting the selection criteria:

nba %>%
    select(!c(player_name, team_abbreviation, pts, ast))
# A tibble: 12,305 × 18
    ...1   age player_height player_weight college            country draft_year
   <dbl> <dbl>         <dbl>         <dbl> <chr>              <chr>        <dbl>
 1     0    36          198.          99.8 Southeastern Okla… USA           1986
 2     1    28          216.         118.  Florida            USA           1990
 3     2    39          206.          95.3 Detroit Mercy      USA           1979
 4     3    24          203.         101.  UCLA               USA           1995
 5     4    34          206.         109.  Villanova          USA           1985
 6     5    38          201.          97.5 Illinois           USA           1981
 7     6    25          198.          86.2 Temple             USA           1994
 8     7    28          213.         113.  Clemson            USA           1990
 9     8    29          193.          86.2 Washington         USA           1992
10     9    28          183.          72.6 Memphis            USA           1991
# ℹ 12,295 more rows
# ℹ 11 more variables: draft_round <dbl>, draft_number <dbl>, gp <dbl>,
#   reb <dbl>, net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>

Or we can select columns based on pieces of the column names, rather than the entire name. Some of the columns in our nba data represent raw numbers (e.g., ast) and some represent percentages (e.g., ast_pct). Let’s select any non-percentage columns:

nba %>%
    select(!ends_with("_pct"))
# A tibble: 12,305 × 17
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1     0 Dennis Rod… CHI                  36          198.          99.8 Southe…
 2     1 Dwayne Sch… LAC                  28          216.         118.  Florida
 3     2 Earl Curet… TOR                  39          206.          95.3 Detroi…
 4     3 Ed O'Bannon DAL                  24          203.         101.  UCLA   
 5     4 Ed Pinckney MIA                  34          206.         109.  Villan…
 6     5 Eddie John… HOU                  38          201.          97.5 Illino…
 7     6 Eddie Jones LAL                  25          198.          86.2 Temple 
 8     7 Elden Camp… LAL                  28          213.         113.  Clemson
 9     8 Eldridge R… ATL                  29          193.          86.2 Washin…
10     9 Elliot Per… MIL                  28          183.          72.6 Memphis
# ℹ 12,295 more rows
# ℹ 10 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, season <chr>

Do not assume columns are ordered

Some dplyr functions allow users to select columns based on the order in which they are stored in the data frame. For example, the first column in the nba data set is player_name and the fifth column is player_weight, so calling select(nba, player_name:player_weight) will select the first 5 columns of the data frame. We strongly discourage you from relying on this sort of functionality. It is extremely easy to reorder columns, and there is no loss of information when you do so. The columns are named, so reordering means that you can always find the information in a column again, regardless of the column order. For these reasons, someone (e.g., whoever is collecting your data) may arbitrarily reorder columns and think that doing so is harmless. If you avoid relying on the column order, it will be harmless!

7.6 group_by

Say instead of a single mean number of points for the entire NBA, you would like a mean number of points separately for each college players attended. In other words, we would like to compute the mean number of points split by college. We can do this by “grouping” the pts measurements by the values of another variable, in this case by the values of the variable college. Run the following code:

nba %>%
    group_by(college) %>%
    summarize(mean = mean(pts),
              std_dev = sd(pts))
# A tibble: 345 × 3
   college                   mean std_dev
   <chr>                    <dbl>   <dbl>
 1 ""                        9.9    3.79 
 2 "Alabama"                 9.21   5.83 
 3 "Alabama A&M"             2.2   NA    
 4 "Alabama Huntsville"      2.05   0.778
 5 "Alabama-Birmingham"      2.4    1.42 
 6 "Albany State (GA)"       0.45   0.212
 7 "American International"  8.34   2.71 
 8 "American University"     6      8.49 
 9 "Arizona"                 8.93   5.83 
10 "Arizona St."             0     NA    
# ℹ 335 more rows

This code is similar to the previous code that created summary_round, but with an extra group_by(college) added before the summarize(). Grouping the nba dataset by college and then applying the summarize() functions yields a data frame that displays the mean and standard deviation of pts split by the different colleges that appear in the data set.

It is important to note that the group_by() function doesn’t change data frames by itself. Rather it changes the meta-data, or data about the data, specifically the grouping structure. It is only after we apply the summarize() function that the data frame changes.

For example, when we run this code:

nba
# A tibble: 12,305 × 22
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1     0 Dennis Rod… CHI                  36          198.          99.8 Southe…
 2     1 Dwayne Sch… LAC                  28          216.         118.  Florida
 3     2 Earl Curet… TOR                  39          206.          95.3 Detroi…
 4     3 Ed O'Bannon DAL                  24          203.         101.  UCLA   
 5     4 Ed Pinckney MIA                  34          206.         109.  Villan…
 6     5 Eddie John… HOU                  38          201.          97.5 Illino…
 7     6 Eddie Jones LAL                  25          198.          86.2 Temple 
 8     7 Elden Camp… LAL                  28          213.         113.  Clemson
 9     8 Eldridge R… ATL                  29          193.          86.2 Washin…
10     9 Elliot Per… MIL                  28          183.          72.6 Memphis
# ℹ 12,295 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>

Observe that the first line of the output reads # A tibble: 12,305 × 22. This is an example of meta-data, in this case the number of observations/rows and variables/columns in nba. The actual data itself is the subsequent table of values. Now let’s pipe the nba data frame into group_by(college):

nba %>%
    group_by(college)
# A tibble: 12,305 × 22
# Groups:   college [345]
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1     0 Dennis Rod… CHI                  36          198.          99.8 Southe…
 2     1 Dwayne Sch… LAC                  28          216.         118.  Florida
 3     2 Earl Curet… TOR                  39          206.          95.3 Detroi…
 4     3 Ed O'Bannon DAL                  24          203.         101.  UCLA   
 5     4 Ed Pinckney MIA                  34          206.         109.  Villan…
 6     5 Eddie John… HOU                  38          201.          97.5 Illino…
 7     6 Eddie Jones LAL                  25          198.          86.2 Temple 
 8     7 Elden Camp… LAL                  28          213.         113.  Clemson
 9     8 Eldridge R… ATL                  29          193.          86.2 Washin…
10     9 Elliot Per… MIL                  28          183.          72.6 Memphis
# ℹ 12,295 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>

Observe that now there is additional meta-data: # Groups: college [345], indicating that the grouping structure meta-data has been set based on the unique values of the variable college. On the other hand, observe that the data has not changed: it is still a table of 12,305 × 22 values.

Only by combining a group_by() with another data wrangling operation, in this case summarize(), will the data actually be transformed.

nba %>%
    group_by(college) %>%
    summarize(mean = mean(pts))
# A tibble: 345 × 2
   college                   mean
   <chr>                    <dbl>
 1 ""                        9.9 
 2 "Alabama"                 9.21
 3 "Alabama A&M"             2.2 
 4 "Alabama Huntsville"      2.05
 5 "Alabama-Birmingham"      2.4 
 6 "Albany State (GA)"       0.45
 7 "American International"  8.34
 8 "American University"     6   
 9 "Arizona"                 8.93
10 "Arizona St."             0   
# ℹ 335 more rows

If you would like to remove this grouping structure meta-data, you can pipe the resulting data frame into the ungroup() function:

nba %>%
    group_by(college) %>%
    ungroup()
# A tibble: 12,305 × 22
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1     0 Dennis Rod… CHI                  36          198.          99.8 Southe…
 2     1 Dwayne Sch… LAC                  28          216.         118.  Florida
 3     2 Earl Curet… TOR                  39          206.          95.3 Detroi…
 4     3 Ed O'Bannon DAL                  24          203.         101.  UCLA   
 5     4 Ed Pinckney MIA                  34          206.         109.  Villan…
 6     5 Eddie John… HOU                  38          201.          97.5 Illino…
 7     6 Eddie Jones LAL                  25          198.          86.2 Temple 
 8     7 Elden Camp… LAL                  28          213.         113.  Clemson
 9     8 Eldridge R… ATL                  29          193.          86.2 Washin…
10     9 Elliot Per… MIL                  28          183.          72.6 Memphis
# ℹ 12,295 more rows
# ℹ 15 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>

Observe how the # Groups: college [345] meta-data is no longer present.

Let’s now revisit the n() counting summary function we briefly introduced previously. Recall that the n() function counts rows, as opposed to the sum() summary function, which returns the sum of a numerical variable. For example, suppose we’d like to count how many rows in nba are associated with each of the different colleges. We can run this:

by_college <- nba %>%
    group_by(college) %>%
    summarize(count = n())
by_college
# A tibble: 345 × 2
   college                  count
   <chr>                    <int>
 1 ""                           5
 2 "Alabama"                  119
 3 "Alabama A&M"                1
 4 "Alabama Huntsville"         2
 5 "Alabama-Birmingham"         7
 6 "Albany State (GA)"          2
 7 "American International"     5
 8 "American University"        2
 9 "Arizona"                  279
10 "Arizona St."                1
# ℹ 335 more rows

We see that there are 119 rows in which college=="Alabama" and 279 rows in which college=="Arizona". Note there is a subtle but important difference between sum() and n(); while sum() returns the sum of a numerical variable (adding), n() returns a count of the number of rows/observations (counting).
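
To see the difference concretely, here is a sketch that computes both summaries for each college, using the gp (games played) column for the sum:

```r
nba %>%
    group_by(college) %>%
    summarize(n_rows = n(),           # counts the rows (player-seasons) per group
              total_games = sum(gp))  # adds up the gp values within each group
```

Note that n() takes no arguments: it simply counts the rows in each group, whereas sum() must be told which variable to add up.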

7.6.1 Grouping by more than one variable

You are not limited to grouping by one variable. Say you want to know the number of players coming from each college for each NBA season. We can also group by a second variable season using group_by(college, season):

by_college_annually <- nba %>%
    group_by(college, season) %>%
    summarize(count = n())
`summarise()` has grouped output by 'college'. You can override using the
`.groups` argument.
by_college_annually
# A tibble: 3,637 × 3
# Groups:   college [345]
   college   season  count
   <chr>     <chr>   <int>
 1 ""        2017-18     2
 2 ""        2018-19     2
 3 ""        2020-21     1
 4 "Alabama" 1996-97     9
 5 "Alabama" 1997-98    10
 6 "Alabama" 1998-99     8
 7 "Alabama" 1999-00     6
 8 "Alabama" 2000-01     7
 9 "Alabama" 2001-02     6
10 "Alabama" 2002-03     4
# ℹ 3,627 more rows

Observe that there are now 3,637 rows in by_college_annually because there are 3,637 unique combinations of college and season in the data set.

Why do we group_by(college, season) and not group_by(college) and then group_by(season)? Let’s investigate:

by_college_annually_incorrect <- nba %>%
    group_by(college) %>%
    group_by(season) %>%
    summarize(count = n())
by_college_annually_incorrect
# A tibble: 26 × 2
   season  count
   <chr>   <int>
 1 1996-97   441
 2 1997-98   439
 3 1998-99   439
 4 1999-00   438
 5 2000-01   441
 6 2001-02   440
 7 2002-03   428
 8 2003-04   442
 9 2004-05   464
10 2005-06   458
# ℹ 16 more rows

What happened here is that the second group_by(season) overwrote the grouping meta-data established by the earlier group_by(college), so that in the end we are only grouping by season. The lesson here is that if you want to group_by() two or more variables, you should include all of them in the same group_by() call, separated by commas.
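If you really do need to add a grouping variable to an already-grouped data frame, group_by() has an .add argument that preserves the existing grouping rather than overwriting it. A minimal sketch, equivalent to group_by(college, season):

nba %>%
    group_by(college) %>%
    # .add = TRUE keeps the existing grouping by college and adds season
    group_by(season, .add = TRUE) %>%
    summarize(count = n())

That said, for ordinary use the single group_by(college, season) call above is simpler and clearer.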

7.7 mutate

Another common transformation of data is to create/compute new variables based on existing ones. For example, the heights in our nba data set are measured in units of centimeters. But say you are more comfortable thinking of inches instead of centimeters. The formula to convert centimeters to inches is:

\[ \text{height in inches} = \frac{\text{height in centimeters}}{2.54} \]

We can apply this formula to the player_height variable using the mutate() function from the dplyr package, which takes existing variables and mutates them to create new ones.

nba <- nba %>%
    mutate(player_height_in_inch = player_height / 2.54)

In this code, we mutate() the nba data frame by creating a new variable player_height_in_inch = player_height / 2.54 and then overwrite the original nba data frame. Why did we overwrite the data frame nba instead of assigning the result to a new data frame like nba_new? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable player_height, but instead create a new variable called player_height_in_inch? Because overwriting it would have erased the original information contained in player_height (the heights in centimeters), which may still be valuable to us.

Let’s consider another example. We can imagine placing players on a continuum, with one end representing the “star” players and the other end representing “support” players. We can quantify this dimension by comparing a player’s average points per game (pts) to his average assists per game (ast). Let’s calculate this new variable using the mutate() function:

nba <- nba %>%
    mutate(star_support = pts - ast)

Let’s take a look at only the pts, ast, and the resulting star_support variables for a few rows in our updated nba data frame.

# A tibble: 5 × 3
    pts   ast star_support
  <dbl> <dbl>        <dbl>
1   5.7   3.1          2.6
2   2.3   0.3          2  
3  17.2   3.4         13.8
4   1.5   2.9         -1.4
5   3.7   0.6          3.1

The player in the third row scored an average of 17.2 points per game and made an average of 3.4 assists per game, so his star_support value is \(17.2 - 3.4 = 13.8\). On the other hand, the player in the fourth row averaged 1.5 points and 2.9 assists, so his star_support value is \(1.5 - 2.9 = -1.4\).

Let’s look at some summary statistics of the star_support variable by considering multiple summary functions at once in the same summarize() code:

star_support_summary <- nba %>%
    summarize(
        min = min(star_support, na.rm = TRUE),
        q1 = quantile(star_support, 0.25, na.rm = TRUE),
        median = quantile(star_support, 0.5, na.rm = TRUE),
        q3 = quantile(star_support, 0.75, na.rm = TRUE),
        max = max(star_support, na.rm = TRUE),
        mean = mean(star_support, na.rm = TRUE),
        sd = sd(star_support, na.rm = TRUE),
        missing = sum(is.na(star_support))
    )
star_support_summary
# A tibble: 1 × 8
    min    q1 median    q3   max  mean    sd missing
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <int>
1  -2.5   2.6    5.1     9  30.9  6.36  4.97       0

We see for example that the average value is 6.36 points, whereas the largest is 30.9! This summary contains quite a bit of information. However, it is often easier to visualize data to evaluate how values are distributed. We’ll take a look at that in Chapter 8.

To close out our discussion of the mutate() function, note that we can create multiple new variables at once in the same mutate() code. Furthermore, within the same mutate() call we can refer to new variables we just created. Consider the following:

nba <- nba %>%
    mutate(
        player_height_in_inch = player_height / 2.54,
        player_weight_in_lbs = player_weight / 2.2,
        bmi = 703 * (player_weight_in_lbs / (player_height_in_inch ^ 2))
    )

7.8 arrange

One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in the alphanumeric order of one of the variables. The dplyr package’s arrange() function allows us to sort/reorder a data frame’s rows according to the values of the specified variable.

Suppose we are interested in determining the average number of points per game scored by NBA players from different colleges:

mean_pts <- nba %>%
    group_by(college) %>%
    summarize(mean = mean(pts),
              std_dev = sd(pts))
mean_pts
# A tibble: 345 × 3
   college                   mean std_dev
   <chr>                    <dbl>   <dbl>
 1 ""                        9.9    3.79 
 2 "Alabama"                 9.21   5.83 
 3 "Alabama A&M"             2.2   NA    
 4 "Alabama Huntsville"      2.05   0.778
 5 "Alabama-Birmingham"      2.4    1.42 
 6 "Albany State (GA)"       0.45   0.212
 7 "American International"  8.34   2.71 
 8 "American University"     6      8.49 
 9 "Arizona"                 8.93   5.83 
10 "Arizona St."             0     NA    
# ℹ 335 more rows

Observe that by default the rows of the resulting mean_pts data frame are sorted in alphabetical order of college. Say we would like to see the same data, but sorted from the highest to the lowest average points (mean) instead:

mean_pts %>%
    arrange(mean)
# A tibble: 345 × 3
   college                         mean std_dev
   <chr>                          <dbl>   <dbl>
 1 Arizona St.                     0     NA    
 2 Chicago St.                     0     NA    
 3 Denver                          0     NA    
 4 Fairfield                       0     NA    
 5 George Mason                    0     NA    
 6 Lebanon Valley                  0     NA    
 7 Lincoln Memorial                0     NA    
 8 University of Colorado Boulder  0     NA    
 9 Toledo                          0.2   NA    
10 Albany State (GA)               0.45   0.212
# ℹ 335 more rows

This is, however, the opposite of what we want: the rows are sorted with the colleges whose players score the fewest points displayed first. This is because arrange() sorts rows in ascending order by default. To switch to descending order instead, we wrap the variable in the desc() function, like so:

mean_pts %>%
    arrange(desc(mean))
# A tibble: 345 × 3
   college                           mean std_dev
   <chr>                            <dbl>   <dbl>
 1 Davidson                          19.6    9.95
 2 Lehigh                            18.4    7.06
 3 Western Carolina                  16.1    7.50
 4 Navy                              15.4    4.25
 5 Marist                            15.4    1.92
 6 Butler Community College          13.6    6.57
 7 Louisiana Tech                    13.5    6.78
 8 Trinity Valley Community College  13.5    6.63
 9 Weber State                       13.3   11.6 
10 Central Arkansas                  13.1    4.73
# ℹ 335 more rows
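Note that arrange() also accepts multiple variables, with later variables breaking ties in earlier ones. For instance, this sketch sorts from highest to lowest mean and then, among tied colleges, sorts alphabetically by college:

mean_pts %>%
    # sort by mean (descending); ties are broken alphabetically by college
    arrange(desc(mean), college)

This is handy when a single sorting variable contains many repeated values, as the zero-scoring colleges above do.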

7.9 slice

Now that you can put the rows of a data frame into some sensible order, we will learn how to select rows based on their position within the data frame (contrast this with the earlier advice about ignoring the position/order of columns within the data frame). Within dplyr, there is a very useful function called slice(). In some ways, this function is similar to filter(). However, filter() subsets rows based on TRUE or FALSE values (e.g., whether values are greater than zero), whereas slice() takes row numbers (i.e., integers) and returns the rows in the corresponding positions. Let’s look at a few examples.

nba %>%
    slice(1, 2, 1048)
# A tibble: 3 × 24
   ...1 player_name  team_abbreviation   age player_height player_weight college
  <dbl> <chr>        <chr>             <dbl>         <dbl>         <dbl> <chr>  
1     0 Dennis Rodm… CHI                  36          198.          99.8 Southe…
2     1 Dwayne Schi… LAC                  28          216.         118.  Florida
3  1047 Stephon Mar… NJN                  22          188.          81.6 Georgi…
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
#   star_support <dbl>

This code grabs rows 1, 2, and 1048 and returns a data frame including just these three rows. As usual, you can use ranges rather than writing things like slice(1, 2, 3, 4, 5, 6, ...):

# grab the first 10 rows
nba %>%
    slice(1:10)
# A tibble: 10 × 24
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1     0 Dennis Rod… CHI                  36          198.          99.8 Southe…
 2     1 Dwayne Sch… LAC                  28          216.         118.  Florida
 3     2 Earl Curet… TOR                  39          206.          95.3 Detroi…
 4     3 Ed O'Bannon DAL                  24          203.         101.  UCLA   
 5     4 Ed Pinckney MIA                  34          206.         109.  Villan…
 6     5 Eddie John… HOU                  38          201.          97.5 Illino…
 7     6 Eddie Jones LAL                  25          198.          86.2 Temple 
 8     7 Elden Camp… LAL                  28          213.         113.  Clemson
 9     8 Eldridge R… ATL                  29          193.          86.2 Washin…
10     9 Elliot Per… MIL                  28          183.          72.6 Memphis
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
#   star_support <dbl>

We can also grab the last several rows:

# probably don't do this
nba %>%
    slice(12000:nrow(nba))
# A tibble: 306 × 24
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1 11999 D.J. Wilson TOR                  26          208.         105.  Michig…
 2 12000 Jonathan K… GSW                  19          201.         102.  None   
 3 12001 Klay Thomp… GSW                  32          198.          99.8 Washin…
 4 12002 Jordan Cla… UTA                  30          193.          88.0 Missou…
 5 12003 Santi Alda… MEM                  21          211.          97.5 Loyola…
 6 12004 Sandro Mam… MIL                  23          206.         109.  Seton …
 7 12005 Sam Merrill MEM                  26          193.          93.0 Utah S…
 8 12006 Sam Hauser  BOS                  24          201.          98.4 Virgin…
 9 12007 Sam Dekker  TOR                  28          203.          99.3 Wiscon…
10 12008 Saddiq Bey  DET                  23          201.          97.5 Villan…
# ℹ 296 more rows
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
#   star_support <dbl>

Note, however, that this method uses the number of rows (nrow()) in the full data frame. As a result, it will only work if we apply slice() directly to our nba data frame and do not have any additional verbs in our chain before the call to slice(). How can we grab the last N rows if we don’t (necessarily) know how many rows are in the data frame we are applying slice() to? A more robust approach is to use the closely related slice_tail() function, which is specifically designed to return the last rows of a data frame:

nba %>%
    slice_tail(n = 10)
# A tibble: 10 × 24
    ...1 player_name team_abbreviation   age player_height player_weight college
   <dbl> <chr>       <chr>             <dbl>         <dbl>         <dbl> <chr>  
 1 12295 Mason Jones LAL                  23          193.          90.7 Arkans…
 2 12296 Marvin Bag… DET                  23          211.         107.  Duke   
 3 12297 Marquese C… DAL                  24          206.         109.  Washin…
 4 12298 Markus How… DEN                  23          178.          79.4 Marque…
 5 12299 Marko Simo… CHI                  22          211.          99.8 None   
 6 12300 Markieff M… MIA                  32          206.         111.  Kansas 
 7 12301 Markelle F… ORL                  24          193.          94.8 Washin…
 8 12302 Marcus Sma… BOS                  28          193.          99.8 Oklaho…
 9 12303 Marcus Gar… MIA                  23          196.          93.0 Kansas 
10 12304 Micah Pott… DET                  24          208.         112.  Wiscon…
# ℹ 17 more variables: country <chr>, draft_year <dbl>, draft_round <dbl>,
#   draft_number <dbl>, gp <dbl>, pts <dbl>, reb <dbl>, ast <dbl>,
#   net_rating <dbl>, oreb_pct <dbl>, dreb_pct <dbl>, usg_pct <dbl>,
#   ts_pct <dbl>, ast_pct <dbl>, season <chr>, player_height_in_inch <dbl>,
#   star_support <dbl>

Ok. This all seems fine. But unless you know ahead of time what data might live in particular rows, this may not seem all that useful. The real power comes when you combine slice() with something like arrange(). Let’s see how these verbs work together to produce something a bit more useful.

nba %>%
    arrange(desc(pts)) %>%
    slice(1:10) %>%
    select("player_name", "season", "pts")
# A tibble: 10 × 3
   player_name       season    pts
   <chr>             <chr>   <dbl>
 1 James Harden      2018-19  36.1
 2 Kobe Bryant       2005-06  35.4
 3 James Harden      2019-20  34.3
 4 Allen Iverson     2005-06  33  
 5 Tracy McGrady     2002-03  32.1
 6 Kevin Durant      2013-14  32  
 7 Stephen Curry     2020-21  32  
 8 Kobe Bryant       2006-07  31.6
 9 Russell Westbrook 2016-17  31.6
10 Allen Iverson     2001-02  31.4

This code first sorts the data frame by the pts column (highest to lowest because of our application of desc()) and then grabs the first 10 rows. Because we first use arrange(), we know that these 10 rows are associated with the 10 highest season scoring averages in our data set. We then select() a small number of columns so we can view the result easily.

We have seen slice() and slice_tail() in action, but there are more functions in the family. Like slice_tail(), there is also slice_head(), which selects the first rows of a data frame. There are also the functions slice_min() and slice_max(), which combine the sorting and slicing operations. So we can redo our code above to find the 10 highest season scoring averages in our data set using slice_max():

nba %>%
    slice_max(pts, n = 10) %>%
    select("player_name", "season", "pts")
# A tibble: 11 × 3
   player_name       season    pts
   <chr>             <chr>   <dbl>
 1 James Harden      2018-19  36.1
 2 Kobe Bryant       2005-06  35.4
 3 James Harden      2019-20  34.3
 4 Allen Iverson     2005-06  33  
 5 Tracy McGrady     2002-03  32.1
 6 Kevin Durant      2013-14  32  
 7 Stephen Curry     2020-21  32  
 8 Kobe Bryant       2006-07  31.6
 9 Russell Westbrook 2016-17  31.6
10 Allen Iverson     2001-02  31.4
11 LeBron James      2005-06  31.4

Notice that slice_max() returned 11 rows even though we asked for 10. By default, slice_max() keeps ties (its with_ties argument defaults to TRUE), and two players are tied at 31.4 points per game. All of these functions (except slice() itself) accept either an n argument (the number of rows to grab) or a prop argument (the proportion of all rows to grab). As with all functions in dplyr and the tidyverse, the real power is ultimately realized by including these functions in extended chains of operations. You can group_by() and then slice(), you can mutate() and then arrange() and then slice_tail(), and so on. The possibilities are endless.
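As one illustration of such a chain, the following sketch combines group_by() with slice_max() to find the top scorer in each season (the column names here are those of the nba data frame used throughout this chapter):

nba %>%
    group_by(season) %>%
    # within each season, keep the row(s) with the highest pts
    slice_max(pts, n = 1) %>%
    select(season, player_name, pts)

Because the data frame is grouped by season when slice_max() runs, the slicing is applied separately within each season rather than to the data frame as a whole.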

7.10 Activities

If you haven’t already, load the nba data set we downloaded back in Section 4.2. And let’s pass c("Undrafted") as the na argument:

nba <- read_csv("./data/nba_all_seasons.csv", na = c("Undrafted"))
  1. Produce a data frame consisting exclusively of players under the age of 25.
  2. Produce a data frame consisting exclusively of players who attended “Marquette”.
  3. Produce a data frame consisting exclusively of players who are either under the age of 25 or attended “Marquette”.
  4. Produce a data frame consisting exclusively of players under the age of 25 and attended “Marquette”.
  5. Determine the average height of all players.
  6. Determine the minimum weight in the nba data set.
  7. Determine the range of ages seen in the nba data set.
  8. Determine the interquartile range of all pts seen in the nba data set.
  9. Determine the median height of all players who attended “Connecticut”.
  10. Use select to produce a data frame that only includes columns containing player names and team abbreviations.
  11. Use select to produce a data frame that includes all columns except those containing player names and team abbreviations.
  12. Use select to produce a data frame that includes all columns whose column names include the string “ea” (hint: use contains()).
  13. Use select to produce a data frame that includes all columns whose column names do not include an underscore (i.e., “_“).
  14. Bonus: Use select to produce a data frame that includes all columns whose column names contain more than 11 characters (hint: you may need to use some combination of names() and nchar()).
  15. For each college appearing in the data set, what is the average (mean) of pts and ast among players who attended that college?
  16. Calculate the mean of player_weight and the mean of player_height for each season in the data set.
  17. Try calculating these per-school averages separately for each season (e.g. means for each combination of college and season).
  18. Create a new column in the data set called reb_diff and calculate it as the difference between the offensive rebound percentage (oreb_pct) and the defensive rebound percentage (dreb_pct).
  19. Calculate the mean and standard deviation of reb_diff.
  20. Find the 10 largest values of reb_diff.
  21. Create a new column in the data set called pts_over_mean and calculate it as the difference between that row’s pts value and the overall mean of pts in the entire data set.
  22. Generate a data frame with 3 columns: the player’s name, season, and pts_over_mean so that you can quickly view your new column and who and when it is associated with.
  23. Use filter() to find those rows corresponding to players who went to “UCLA”, and then use arrange() to order the data by pts.
  24. For each college appearing in the data set, what is the average pts of players who attended that college? Which college has produced the highest average? Which college has produced the lowest average?
  25. Use filter() to select those data associated with the New York Knicks (abbreviated NYK), and then use select() to remove the column containing the team abbreviation column. Next, remove all undrafted players (e.g., those who do not have a valid draft_year). Of these players, find the 6 players with the highest pts. If there are duplicates (i.e., any player appearing multiple times in the top 6), find the 6 unique players with the highest pts (so that there are 6 players but no duplicates).
  26. Group the data by college and team_abbreviation. Now calculate the sum of pts for each of these combinations. Now add up all these means. You should have a single number (the sum of some large number of means). Does the chain of operations you initially try work as expected? Does the documentation on dplyr’s ungroup() function help?