Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 Dataframes
Throughout the R universe, data is stored in what are referred to as dataframes. A dataframe is a tabular data format that works a bit like a spreadsheet. Within the tidyverse, dataframes are organized in a particular way (“tidy” format) and will tend to be stored in a particular type of dataframe (a “tibble”). In this chapter, we’ll explore these conventions.
Columns within a dataframe store data of a certain type (e.g., decimals, strings, etc.). Let’s take a quick look at a standard R dataframe. We’ll use one of the built-in data sets, iris, which contains data about measurements of various flowers. For now, we’ll just take a look at the first five rows:
As we can see, each row corresponds to an observation (a flower) and each columns represents a measurement from that flower.
6.1 What is a tibble?
A tibble is a type of dataframe commonly used in the tidyverse. Tibbles contain the same basic information as a corresponding dataframe, but tibbles are slightly different in a variety of minor ways. Let’s take a look at the iris data represented as a tibble. We’ll again confine our inspection to the first five rows:
# A tibble: 5 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
Note that the data is identical to that seen above. However, the tibble provides some additional useful information. It tells us that this particular tibble has five columns and, because we only asked for the first five rows, also has five rows. In addition, it tells us about the data type of each column. The first four columns are of type dbl, which is double-precision floating point number (a decimal). The last column is of type fct, or factor. In R, factors are used to represent categorical variables, variables that have a fixed set of possible values (in this case, the set of species).
6.2 Tidying your data
For the rest of this book, we will primarily deal with data that is already in “tidy” format as explained Chapter 5. However, many data sets you encounter in the world are in so-called “wide” format. If you wish to use the tidyverse packages, you will first have to convert these date sets to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package.
To illustrated, let’s load some data from the fivethirtyeight package. The fivethirtyeight package provides access to the data sets used in many articles published by the data journalism website, FiveThirtyEight.com. For a complete list of all data sets included in the fivethirtyeight package, check out the package webpage.
Let’s focus our attention on the drinks dataframe:
# A tibble: 5 × 5
country beer_servings spirit_servings wine_servings total_litres_of_pure…¹
<chr> <int> <int> <int> <dbl>
1 Afghanistan 0 0 0 0
2 Albania 89 132 54 4.9
3 Algeria 25 0 14 0.7
4 Andorra 245 138 312 12.4
5 Angola 217 57 45 5.9
# ℹ abbreviated name: ¹total_litres_of_pure_alcohol
After reading the help file (i.e., by running ?drinks), you’ll see that drinks is a dataframe containing results from a survey of beer, spirits, and wine consumption originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.
Let’s narrow down the data a bit. We’ll only consider 4 countries (the United States, China, Italy, and Saudi Arabia), omit the total_litres_of_pure_alcohol variable, and rename the other variables to something a bit more convenient. Don’t worry about the code here. We’ll get into all of these operations more in Chapter 7.
drinks_smaller <- drinks %>%
filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>%
select(-total_litres_of_pure_alcohol) %>%
rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller# A tibble: 4 × 4
country beer spirit wine
<chr> <int> <int> <int>
1 China 79 192 8
2 Italy 85 42 237
3 Saudi Arabia 0 5 0
4 USA 249 158 84
Note that this data is not in “tidy” format. However, we can convert it to tidy format by using the pivot_longer() function from the tidyr package as follows:
drinks_smaller_tidy <- drinks_smaller %>%
pivot_longer(names_to = "type",
values_to = "servings",
cols = -country)
drinks_smaller_tidy# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
We set the arguments to pivot_longer() as follows:
names_tohere corresponds to the name of the variable in the new “tidy”/long dataframe that will contain the column names of the original data. Observe how we setnames_to = "type". In the resultingdrinks_smaller_tidy, the columntypecontains the three types of alcoholbeer,spirit, andwine. Sincetypeis a variable name that doesn’t appear indrinks_smaller, we use quotation marks around it. You’ll receive an error if you just usenames_to = typehere.values_tohere is the name of the variable in the new “tidy” dataframe that will contain the values of the original data. Observe how we setvalues_to = "servings"since each of the numeric values in each of thebeer,wine, andspiritcolumns of thedrinks_smallerdata corresponds to a value ofservings. In the resultingdrinks_smaller_tidy, the columnservingscontains the 4 \(\times\) 3 = 12 numerical values. Note again thatservingsdoesn’t appear as a variable indrinks_smallerso it again needs quotation marks around it for thevalues_toargument.- The third argument
colsis the columns in thedrinks_smallerdataframe you either want to or don’t want to “tidy.” Observe how we set this to-countryindicating that we don’t want to “tidy” thecountryvariable indrinks_smallerand rather onlybeer,spirit, andwine. Sincecountryis a column that appears indrinks_smallerwe don’t put quotation marks around it.
The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:
drinks_smaller %>%
pivot_longer(names_to = "type",
values_to = "servings",
cols = c(beer, spirit, wine))# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller dataframe, we could also do the following for the cols argument:
drinks_smaller %>%
pivot_longer(names_to = "type",
values_to = "servings",
cols = beer:wine)# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
Converting “wide” format data to “tidy” format often confuses new R users. The only way to learn to get comfortable with the pivot_longer() function is with practice, practice, and more practice using different data sets. For example, run ?pivot_longer and look at the examples in the bottom of the help file.
If however you want to convert a “tidy” dataframe to “wide” format, you will need to use the pivot_wider() function instead. Run ?pivot_wider and look at the examples in the bottom of the help file for examples.
You can also view examples of both pivot_longer() and pivot_wider() on the tidyverse.org webpage. There’s a nice example to check out the different functions available for data tidying and a case study using data from the World Health Organization on that webpage. Furthermore, each week the R4DS Online Learning Community posts a data set in the weekly #TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.
6.3 Exercises
- Let’s load data we downloaded back in Section 4.2. Run the following:
nba <- read_csv("./data/nba_all_seasons.csv")- Now let’s take a quick peek at what’s in there by running the following:
head(nba)- Consider the following questions:
- How many columns are there in
nba? - How many rows are there in
nba? - What is the data type of the
player_heightcolumn? - What is the data type of the
player_weightcolumn? - What is the data type of the
ptscolumn? - What is the data type of the
draft_yearcolumn? - What is the data type of the
draft_roundcolumn? - What is the data type of the
draft_numbercolumn?
- How many columns are there in
- Let’s take a moment to talk about what this data set contains. This data reflects statistics for players in the National Basketball Association (NBA). The data covers seasons from 1996-1997 to 2021-2022 (NBA seasons run from approximately October to June of the following year). Each row represents a player . Columns include each player’s name (
player_name), the team the player played for (team_abbreviation), where the player went to college (college), the player’s height (player_height) and weight (player_weight), the year, round, and order in which the player was drafted (draft_year,draft_round, anddraft_number) as well as a variety of statistics about the player’s performance (e.g., points scored,pts).
- Do any of these data types seem incorrect? Which ones?
- Let’s take a closer look. Open the data in RStudio’s spreadsheet viewer by running the following:
View(nba)- Inspect the
collegecolumn. Do you see any unusual values? - Given what you see in the tibble and the description above, do you think this data is in tidy format?