6  Dataframes

Throughout the R universe, data is stored in what are referred to as dataframes. A dataframe is a tabular data format that works a bit like a spreadsheet. Within the tidyverse, dataframes are organized in a particular way (“tidy” format) and will tend to be stored in a particular type of dataframe (a “tibble”). In this chapter, we’ll explore these conventions.

Columns within a dataframe store data of a certain type (e.g., decimals, strings, etc.). Let’s take a quick look at a standard R dataframe. We’ll use one of the built-in data sets, iris, which contains data about measurements of various flowers. For now, we’ll just take a look at the first five rows:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

As we can see, each row corresponds to an observation (a flower) and each columns represents a measurement from that flower.

6.1 What is a tibble?

A tibble is a type of dataframe commonly used in the tidyverse. Tibbles contain the same basic information as a corresponding dataframe, but tibbles are slightly different in a variety of minor ways. Let’s take a look at the iris data represented as a tibble. We’ll again confine our inspection to the first five rows:

# A tibble: 5 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 

Note that the data is identical to that seen above. However, the tibble provides some additional useful information. It tells us that this particular tibble has five columns and, because we only asked for the first five rows, also has five rows. In addition, it tells us about the data type of each column. The first four columns are of type dbl, which is double-precision floating point number (a decimal). The last column is of type fct, or factor. In R, factors are used to represent categorical variables, variables that have a fixed set of possible values (in this case, the set of species).

6.2 Tidying your data

For the rest of this book, we will primarily deal with data that is already in “tidy” format as explained Chapter 5. However, many data sets you encounter in the world are in so-called “wide” format. If you wish to use the tidyverse packages, you will first have to convert these date sets to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package.

To illustrated, let’s load some data from the fivethirtyeight package. The fivethirtyeight package provides access to the data sets used in many articles published by the data journalism website, FiveThirtyEight.com. For a complete list of all data sets included in the fivethirtyeight package, check out the package webpage.

Let’s focus our attention on the drinks dataframe:

# A tibble: 5 × 5
  country     beer_servings spirit_servings wine_servings total_litres_of_pure…¹
  <chr>               <int>           <int>         <int>                  <dbl>
1 Afghanistan             0               0             0                    0  
2 Albania                89             132            54                    4.9
3 Algeria                25               0            14                    0.7
4 Andorra               245             138           312                   12.4
5 Angola                217              57            45                    5.9
# ℹ abbreviated name: ¹​total_litres_of_pure_alcohol

After reading the help file (i.e., by running ?drinks), you’ll see that drinks is a dataframe containing results from a survey of beer, spirits, and wine consumption originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.

Let’s narrow down the data a bit. We’ll only consider 4 countries (the United States, China, Italy, and Saudi Arabia), omit the total_litres_of_pure_alcohol variable, and rename the other variables to something a bit more convenient. Don’t worry about the code here. We’ll get into all of these operations more in Chapter 7.

drinks_smaller <- drinks %>% 
  filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>% 
  select(-total_litres_of_pure_alcohol) %>% 
  rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
# A tibble: 4 × 4
  country       beer spirit  wine
  <chr>        <int>  <int> <int>
1 China           79    192     8
2 Italy           85     42   237
3 Saudi Arabia     0      5     0
4 USA            249    158    84

Note that this data is not in “tidy” format. However, we can convert it to tidy format by using the pivot_longer() function from the tidyr package as follows:

drinks_smaller_tidy <- drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = -country)
drinks_smaller_tidy
# A tibble: 12 × 3
   country      type   servings
   <chr>        <chr>     <int>
 1 China        beer         79
 2 China        spirit      192
 3 China        wine          8
 4 Italy        beer         85
 5 Italy        spirit       42
 6 Italy        wine        237
 7 Saudi Arabia beer          0
 8 Saudi Arabia spirit        5
 9 Saudi Arabia wine          0
10 USA          beer        249
11 USA          spirit      158
12 USA          wine         84

We set the arguments to pivot_longer() as follows:

  1. names_to here corresponds to the name of the variable in the new “tidy”/long dataframe that will contain the column names of the original data. Observe how we set names_to = "type". In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here.
  2. values_to here is the name of the variable in the new “tidy” dataframe that will contain the values of the original data. Observe how we set values_to = "servings" since each of the numeric values in each of the beer, wine, and spirit columns of the drinks_smaller data corresponds to a value of servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 \(\times\) 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument.
  3. The third argument cols is the columns in the drinks_smaller dataframe you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it.

The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:

drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = c(beer, spirit, wine))
# A tibble: 12 × 3
   country      type   servings
   <chr>        <chr>     <int>
 1 China        beer         79
 2 China        spirit      192
 3 China        wine          8
 4 Italy        beer         85
 5 Italy        spirit       42
 6 Italy        wine        237
 7 Saudi Arabia beer          0
 8 Saudi Arabia spirit        5
 9 Saudi Arabia wine          0
10 USA          beer        249
11 USA          spirit      158
12 USA          wine         84

Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller dataframe, we could also do the following for the cols argument:

drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = beer:wine)
# A tibble: 12 × 3
   country      type   servings
   <chr>        <chr>     <int>
 1 China        beer         79
 2 China        spirit      192
 3 China        wine          8
 4 Italy        beer         85
 5 Italy        spirit       42
 6 Italy        wine        237
 7 Saudi Arabia beer          0
 8 Saudi Arabia spirit        5
 9 Saudi Arabia wine          0
10 USA          beer        249
11 USA          spirit      158
12 USA          wine         84

Converting “wide” format data to “tidy” format often confuses new R users. The only way to learn to get comfortable with the pivot_longer() function is with practice, practice, and more practice using different data sets. For example, run ?pivot_longer and look at the examples in the bottom of the help file.

If however you want to convert a “tidy” dataframe to “wide” format, you will need to use the pivot_wider() function instead. Run ?pivot_wider and look at the examples in the bottom of the help file for examples.

You can also view examples of both pivot_longer() and pivot_wider() on the tidyverse.org webpage. There’s a nice example to check out the different functions available for data tidying and a case study using data from the World Health Organization on that webpage. Furthermore, each week the R4DS Online Learning Community posts a data set in the weekly #TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.

6.3 Exercises

  • Let’s load data we downloaded back in Section 4.2. Run the following:
nba <- read_csv("./data/nba_all_seasons.csv")
  • Now let’s take a quick peek at what’s in there by running the following:
head(nba)
  • Consider the following questions:
    1. How many columns are there in nba?
    2. How many rows are there in nba?
    3. What is the data type of the player_height column?
    4. What is the data type of the player_weight column?
    5. What is the data type of the pts column?
    6. What is the data type of the draft_year column?
    7. What is the data type of the draft_round column?
    8. What is the data type of the draft_number column?
  • Let’s take a moment to talk about what this data set contains. This data reflects statistics for players in the National Basketball Association (NBA). The data covers seasons from 1996-1997 to 2021-2022 (NBA seasons run from approximately October to June of the following year). Each row represents a player . Columns include each player’s name (player_name), the team the player played for (team_abbreviation), where the player went to college (college), the player’s height (player_height) and weight (player_weight), the year, round, and order in which the player was drafted (draft_year, draft_round, and draft_number) as well as a variety of statistics about the player’s performance (e.g., points scored, pts).
  1. Do any of these data types seem incorrect? Which ones?
  2. Let’s take a closer look. Open the data in RStudio’s spreadsheet viewer by running the following:
View(nba)
  1. Inspect the college column. Do you see any unusual values?
  2. Given what you see in the tibble and the description above, do you think this data is in tidy format?