Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 Dataframes
Throughout the R universe, data is stored in what are referred to as dataframes. A dataframe is a tabular data format that works a bit like a spreadsheet. Within the tidyverse, dataframes are organized in a particular way (“tidy” format) and will tend to be stored in a particular type of dataframe (a “tibble”). In this chapter, we’ll explore these conventions.
Columns within a dataframe store data of a certain type (e.g., decimals, strings, etc.). Let’s take a quick look at a standard R dataframe. We’ll use one of the built-in data sets, iris
, which contains data about measurements of various flowers. For now, we’ll just take a look at the first five rows:
As we can see, each row corresponds to an observation (a flower) and each columns represents a measurement from that flower.
6.1 What is a tibble?
A tibble is a type of dataframe commonly used in the tidyverse. Tibbles contain the same basic information as a corresponding dataframe, but tibbles are slightly different in a variety of minor ways. Let’s take a look at the iris
data represented as a tibble. We’ll again confine our inspection to the first five rows:
# A tibble: 5 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
Note that the data is identical to that seen above. However, the tibble provides some additional useful information. It tells us that this particular tibble has five columns and, because we only asked for the first five rows, also has five rows. In addition, it tells us about the data type of each column. The first four columns are of type dbl
, which is double-precision floating point number (a decimal). The last column is of type fct
, or factor. In R, factors are used to represent categorical variables, variables that have a fixed set of possible values (in this case, the set of species).
6.2 Tidying your data
For the rest of this book, we will primarily deal with data that is already in “tidy” format as explained Chapter 5. However, many data sets you encounter in the world are in so-called “wide” format. If you wish to use the tidyverse packages, you will first have to convert these date sets to “tidy” format. To do so, we recommend using the pivot_longer()
function in the tidyr
package.
To illustrated, let’s load some data from the fivethirtyeight
package. The fivethirtyeight
package provides access to the data sets used in many articles published by the data journalism website, FiveThirtyEight.com. For a complete list of all data sets included in the fivethirtyeight
package, check out the package webpage.
Let’s focus our attention on the drinks
dataframe:
# A tibble: 5 × 5
country beer_servings spirit_servings wine_servings total_litres_of_pure…¹
<chr> <int> <int> <int> <dbl>
1 Afghanistan 0 0 0 0
2 Albania 89 132 54 4.9
3 Algeria 25 0 14 0.7
4 Andorra 245 138 312 12.4
5 Angola 217 57 45 5.9
# ℹ abbreviated name: ¹total_litres_of_pure_alcohol
After reading the help file (i.e., by running ?drinks
), you’ll see that drinks
is a dataframe containing results from a survey of beer, spirits, and wine consumption originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.
Let’s narrow down the data a bit. We’ll only consider 4 countries (the United States, China, Italy, and Saudi Arabia), omit the total_litres_of_pure_alcohol
variable, and rename the other variables to something a bit more convenient. Don’t worry about the code here. We’ll get into all of these operations more in Chapter 7.
<- drinks %>%
drinks_smaller filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>%
select(-total_litres_of_pure_alcohol) %>%
rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
# A tibble: 4 × 4
country beer spirit wine
<chr> <int> <int> <int>
1 China 79 192 8
2 Italy 85 42 237
3 Saudi Arabia 0 5 0
4 USA 249 158 84
Note that this data is not in “tidy” format. However, we can convert it to tidy format by using the pivot_longer()
function from the tidyr
package as follows:
<- drinks_smaller %>%
drinks_smaller_tidy pivot_longer(names_to = "type",
values_to = "servings",
cols = -country)
drinks_smaller_tidy
# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
We set the arguments to pivot_longer()
as follows:
names_to
here corresponds to the name of the variable in the new “tidy”/long dataframe that will contain the column names of the original data. Observe how we setnames_to = "type"
. In the resultingdrinks_smaller_tidy
, the columntype
contains the three types of alcoholbeer
,spirit
, andwine
. Sincetype
is a variable name that doesn’t appear indrinks_smaller
, we use quotation marks around it. You’ll receive an error if you just usenames_to = type
here.values_to
here is the name of the variable in the new “tidy” dataframe that will contain the values of the original data. Observe how we setvalues_to = "servings"
since each of the numeric values in each of thebeer
,wine
, andspirit
columns of thedrinks_smaller
data corresponds to a value ofservings
. In the resultingdrinks_smaller_tidy
, the columnservings
contains the 4 \(\times\) 3 = 12 numerical values. Note again thatservings
doesn’t appear as a variable indrinks_smaller
so it again needs quotation marks around it for thevalues_to
argument.- The third argument
cols
is the columns in thedrinks_smaller
dataframe you either want to or don’t want to “tidy.” Observe how we set this to-country
indicating that we don’t want to “tidy” thecountry
variable indrinks_smaller
and rather onlybeer
,spirit
, andwine
. Sincecountry
is a column that appears indrinks_smaller
we don’t put quotation marks around it.
The third argument here of cols
is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:
%>%
drinks_smaller pivot_longer(names_to = "type",
values_to = "servings",
cols = c(beer, spirit, wine))
# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine)
, instead of the columns we don’t want to “tidy” using -country
. We use the c()
function to create a vector of the columns in drinks_smaller
that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller
dataframe, we could also do the following for the cols
argument:
%>%
drinks_smaller pivot_longer(names_to = "type",
values_to = "servings",
cols = beer:wine)
# A tibble: 12 × 3
country type servings
<chr> <chr> <int>
1 China beer 79
2 China spirit 192
3 China wine 8
4 Italy beer 85
5 Italy spirit 42
6 Italy wine 237
7 Saudi Arabia beer 0
8 Saudi Arabia spirit 5
9 Saudi Arabia wine 0
10 USA beer 249
11 USA spirit 158
12 USA wine 84
Converting “wide” format data to “tidy” format often confuses new R users. The only way to learn to get comfortable with the pivot_longer()
function is with practice, practice, and more practice using different data sets. For example, run ?pivot_longer
and look at the examples in the bottom of the help file.
If however you want to convert a “tidy” dataframe to “wide” format, you will need to use the pivot_wider()
function instead. Run ?pivot_wider
and look at the examples in the bottom of the help file for examples.
You can also view examples of both pivot_longer()
and pivot_wider()
on the tidyverse.org webpage. There’s a nice example to check out the different functions available for data tidying and a case study using data from the World Health Organization on that webpage. Furthermore, each week the R4DS Online Learning Community posts a data set in the weekly #
TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.
6.3 Exercises
- Let’s load data we downloaded back in Section 4.2. Run the following:
<- read_csv("./data/nba_all_seasons.csv") nba
- Now let’s take a quick peek at what’s in there by running the following:
head(nba)
- Consider the following questions:
- How many columns are there in
nba
? - How many rows are there in
nba
? - What is the data type of the
player_height
column? - What is the data type of the
player_weight
column? - What is the data type of the
pts
column? - What is the data type of the
draft_year
column? - What is the data type of the
draft_round
column? - What is the data type of the
draft_number
column?
- How many columns are there in
- Let’s take a moment to talk about what this data set contains. This data reflects statistics for players in the National Basketball Association (NBA). The data covers seasons from 1996-1997 to 2021-2022 (NBA seasons run from approximately October to June of the following year). Each row represents a player . Columns include each player’s name (
player_name
), the team the player played for (team_abbreviation
), where the player went to college (college
), the player’s height (player_height
) and weight (player_weight
), the year, round, and order in which the player was drafted (draft_year
,draft_round
, anddraft_number
) as well as a variety of statistics about the player’s performance (e.g., points scored,pts
).
- Do any of these data types seem incorrect? Which ones?
- Let’s take a closer look. Open the data in RStudio’s spreadsheet viewer by running the following:
View(nba)
- Inspect the
college
column. Do you see any unusual values? - Given what you see in the tibble and the description above, do you think this data is in tidy format?