2  R as a Calculator

2.1 Commands at the console

One of the easiest things you can do with R is use it as a simple calculator, so it’s a good place to start. For instance, try typing 10 + 20, and hitting enter. (The simple act of typing it rather than “just reading” makes a big difference. It makes the concepts more concrete, and it ties the abstract ideas (programming and statistics) to the actual context in which you need to use them. Statistics is something you do, not just something you read about in a textbook.) When you do this, you’ve entered a command, and R will “execute” that command. What you see on screen now will be this:

> 10 + 20
[1] 30

Not much of a surprise here. But there’s a few things worth talking about, even with such a simple example. First, it’s important that you understand how to read the code example. In this example, what was typed into the RStudio console was the 10 + 20 part. The > symbol was not typed. That’s just the command prompt and isn’t part of the actual command. The [1] 30 part was also not typed into the console. That’s what R printed out in response to the 10 + 20 code.

Second, it’s important to understand how the output is formatted. Obviously, the correct answer to the sum 10 + 20 is 30, and not surprisingly R has printed that out as part of its output. But it’s also printed out this [1] part, which probably doesn’t make a lot of sense to you right now. You’re going to see that a lot. I’ll talk about what this means in a bit more detail later on, but for now you can think of [1] 30 as if R were saying “the answer to the 1st question you asked is 30”. That’s not quite accurate, but it’s close enough for now. And in any case it’s not really very interesting at the moment: we only asked R to calculate one thing, so obviously there’s only one answer. Later on this will change, and the [1] part will start to make a bit more sense. For now, I just don’t want you to get confused or concerned by it.

2.1.1 An important digression about formatting

Now that I’ve taught you these rules I’m going to change them pretty much immediately. That is because I want you to be able to copy code from the book directly into R if you want to test things or conduct your own analyses. However, if you copy this kind of code (that shows the command prompt and the results) directly into R you will get an error:

> 10 + 20
Error: <text>:1:1: unexpected '>'
1: >
    ^

So instead, I’m going to provide code in a slightly different format so that it looks like this…

10 + 20
[1] 30

There are two main differences.

  • In your console, the “>” is the prompt and you type your code after (to the right of) this prompt. The code blocks in this book leave the prompt out.
  • We’ll often show the output of a bit of code, but the output will be displayed after the block of code itself.

For your purposes, this also means that you can easily copy code from any of these code blocks and paste it into your RStudio console in order to execute.

2.1.2 Be very careful to avoid typos

Before we go on to talk about other types of calculations that we can do with R, there’s a few other things I want to point out. The first thing is that, though R is good software, it’s still software. R, like any programming language, is pretty stupid and because it’s stupid it can’t handle typos. It takes it on faith that you meant to type exactly what you actually typed. For example, suppose that you forgot to hit the shift key when trying to type +, and as a result your command ended up being 10 = 20 rather than 10 + 20. Here’s what happens:

10 = 20
Error in 10 = 20: invalid (do_set) left-hand side to assignment

What’s happened here is that R has attempted to interpret 10 = 20 as a command, and spits out an error message because the command doesn’t make any sense to it. When a human looks at this, and then looks down at his or her keyboard and sees that + and = are on the same key, it’s pretty obvious that the command was a typo. But R doesn’t know this, so it gets upset. And, if you look at it from its perspective, this makes sense. All that R “knows” is that 10 is a legitimate number, 20 is a legitimate number, and = is a legitimate part of the language too. In other words, from its perspective this really does look like the user meant to type 10 = 20, since all the individual parts of that statement are legitimate and it’s too stupid to realize that this is probably a typo. Therefore, R takes it on faith that this is exactly what you meant… it only “discovers” that the command is nonsense when it tries to follow your instructions, typo and all. And then it complains by spitting out an error.

Even more subtle is the fact that some typos won’t produce errors at all, because they happen to correspond to “well-formed” R commands. For instance, suppose that not only did I forget to hit the shift key when trying to type 10 + 20, I also managed to press the key next to the one I meant to. The resulting typo would produce the command 10 - 20. Clearly, R has no way of knowing that you meant to add 20 to 10, not subtract 20 from 10, so what happens this time is this:

10 - 20
[1] -10

In this case, R produces the right answer, but to the wrong question.

To some extent, I’m stating the obvious here, but it’s important. The people who wrote R are smart. You, the user, are smart. But R is a programming language and programming languages are a way to tell computers what to do and computers are dumb. And because they are dumb, they are mindlessly obedient. R does exactly what you tell it to do. R will not try and second-guess what you “actually meant”; there is no “autocorrect”. This is for good reason. When doing advanced stuff – and even the simplest of statistics is pretty advanced in a lot of ways – it’s risky to let a mindless automaton like R try to overrule the human user. So it’s your responsibility to be careful. Always make sure you type exactly what you mean. When dealing with computers, it’s not enough to type “approximately” the right thing. In general, you absolutely must be precise in what you tell R to do … like all machines it is too stupid to be anything other than absurdly literal in its interpretation.

2.1.3 R is (a bit) flexible with spacing

Of course, now that I’ve been so uptight about the importance of always being precise, I should point out that there are some exceptions. Or, more accurately, there are some situations in which R does show a bit more flexibility than my previous description suggests. The first thing R is smart enough to do is ignore redundant spacing. What I mean by this is that, when I typed 10 + 20 before, I could equally have done this

10    + 20
[1] 30

or this

10+20
[1] 30

2.2 Simple calculations

Okay, now that we’ve discussed some of the tedious details associated with typing R commands, let’s get back to learning how to use the most powerful piece of statistical software in the world as a $2 calculator. So far, all we know how to do is addition. Clearly, a calculator that only did addition would be a bit stupid, so we’ll discuss other simple calculations you can perform using R. But first, some more terminology. Addition is an example of an “operation” that you can perform (specifically, an arithmetic operation), and the operator that performs it is +. To people with a programming or mathematics background, this terminology probably feels pretty natural, but to other people it might feel like I’m trying to make something very simple (addition) sound more complicated than it is (by calling it an arithmetic operation). To some extent, that’s true: if addition was the only operation that we were interested in, it’d be a bit silly to introduce all this extra terminology. However, as we go along, we’ll start using more and more different kinds of operations, so it’s probably a good idea to get the language straight now, while we’re still talking about very familiar concepts like addition!

2.2.1 Adding, subtracting, multiplying and dividing

So, now that we have the terminology, let’s learn how to perform some arithmetic operations in R. To that end, Figure 2.1 lists the operators that correspond to the basic arithmetic we learned in primary school: addition, subtraction, multiplication and division.

operation        operator   example input   example output
addition         +          10 + 2          12
subtraction      -          9 - 3           6
multiplication   *          5 * 5           25
division         /          10 / 3          3.333333
power            ^          5 ^ 2           25

Figure 2.1: Basic arithmetic operations in R. These five operators are used very frequently throughout the text, so it’s important to be familiar with them at the outset.

As you can see, R uses fairly standard symbols to denote each of the different operations you might want to perform: addition is done using the + operator, subtraction is performed by the - operator, and so on. So if I wanted to find out what 57 times 61 is (and who wouldn’t?), I can use R instead of a calculator, like so:

57 * 61
[1] 3477

So that’s handy.
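If you’d like to check the rows of Figure 2.1 for yourself, the short block below runs each example input from the figure. It’s only a sketch for trying things out; the values in the comments are the ones listed in the figure.

```r
# One line per operator, using the example inputs from Figure 2.1
10 + 2    # addition:       12
9 - 3     # subtraction:    6
5 * 5     # multiplication: 25
10 / 3    # division:       3.333333 (R prints 7 significant digits by default)
5 ^ 2     # power:          25
```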

2.2.2 Taking powers

The first four operations listed in Figure 2.1 are things we all learned at a young age, but they aren’t the only arithmetic operations built into R. There are three other arithmetic operations that I should probably mention: taking powers, doing integer division, and calculating a modulus. Of the three, the most important is probably taking powers.

For those of you who can still remember your high school math, this should be familiar. And if not, it’s not complicated. As I’m sure everyone will probably remember the moment they read this, the act of multiplying a number \(x\) by itself \(n\) times is called “raising \(x\) to the \(n\)-th power”. Mathematically, this is written as \(x^n\). Some values of \(n\) have special names: in particular \(x^2\) is called \(x\)-squared, and \(x^3\) is called \(x\)-cubed. So, the 4th power of 5 is calculated like this:

\[ 5^4 = 5 \times 5 \times 5 \times 5 \]

One way that we could calculate \(5^4\) in R would be to type in the complete multiplication as it is shown in the equation above. That is, we could do this

5 * 5 * 5 * 5
[1] 625

but it does seem a bit tedious. It would be very annoying indeed if you wanted to calculate \(5^{15}\), since the command would end up being quite long. Therefore, to make our lives easier, we use the power operator instead. When we do that, our command to calculate \(5^4\) goes like this:

5 ^ 4
[1] 625

Much easier.
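The other two operations mentioned at the start of this section, integer division and modulus, aren’t demonstrated here, so the following is a minimal sketch rather than part of the main story. It relies on base R’s %/% (integer division) and %% (modulus) operators; the numbers 17 and 5 are arbitrary choices of mine for illustration.

```r
5 ^ 15     # the power operator saves typing fifteen 5s: 30517578125
17 %/% 5   # integer division: 5 goes into 17 three whole times, so 3
17 %% 5    # modulus: the remainder left over after that division, so 2
```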

2.2.3 Doing calculations in the right order

Okay. At this point, you know how to take one of the most powerful pieces of statistical software in the world, and use it as a $2 calculator. And as a bonus, you’ve learned a few very basic programming concepts. That’s not nothing (you could argue that you’ve just saved yourself $2) but on the other hand, it’s not very much either. In order to use R more effectively, we need to introduce more programming concepts.

In most situations where you would want to use a calculator, you might want to do multiple calculations. R lets you do this, just by typing in longer commands. In fact, we’ve already seen an example of this earlier, when I typed in 5 * 5 * 5 * 5. However, let’s try a slightly different example:

1 + 2 * 4
[1] 9

Clearly, this isn’t a problem for R either. However, it’s worth stopping for a second, and thinking about what R just did. Clearly, since it gave us an answer of 9 it must have multiplied 2 * 4 (to get an interim answer of 8) and then added 1 to that. But, suppose it had decided to just go from left to right: if R had decided instead to add 1+2 (to get an interim answer of 3) and then multiplied by 4, it would have come up with an answer of 12.

To understand why R chose the first option rather than the second, you need to know the order of operations that R uses. It’s actually the same order that you were taught at school: the “BEDMAS” order. That is, first calculate things inside Brackets (), then calculate Exponents ^, then Division / and Multiplication *, then Addition + and Subtraction -. So, to continue the example above, if we want to force R to calculate the 1+2 part before the multiplication, all we would have to do is enclose it in brackets:

(1 + 2) * 4 
[1] 12

This is a fairly useful thing to be able to do. The only other thing I should point out about order of operations is what to expect when you have two operations that have the same priority: that is, how does R resolve ties? For instance, multiplication and division are actually the same priority, but what should we expect when we give R a problem like 4 / 2 * 3 to solve? If it evaluates the multiplication first and then the division, it would calculate a value of two-thirds. But if it evaluates the division first it calculates a value of 6. The answer, in this case, is that R goes from left to right, so in this case the division step would come first:

4 / 2 * 3
[1] 6

All of the above being said, it’s helpful to remember that parentheses always come first. So, if you’re ever unsure about what order R will do things in, an easy solution is to enclose the thing you want it to do first in parentheses. In addition, making the order of operations explicit makes your code more readable. By enclosing the division in parentheses (e.g., (4 / 2) * 3) we make it clear which thing happens first.
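To pull the ordering rules from this section together, here is a small sketch that puts the cases side by side. The expression 2 * 3 ^ 2 is an extra example of my own, added to show exponents being handled before multiplication.

```r
1 + 2 * 4      # multiplication before addition: 9
(1 + 2) * 4    # brackets first: 12
2 * 3 ^ 2      # exponent before multiplication: 18
4 / 2 * 3      # same priority, so left to right: 6
4 / (2 * 3)    # brackets force the multiplication first: 0.6666667
```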

2.3 Storing a number as a variable

One of the most important things to be able to do in R (or any programming language, for that matter) is to store information in variables. At a conceptual level you can think of a variable as a label for a certain piece of information, or even several different pieces of information. For example, when using R as a calculator, there may be times when you want to store an intermediate result along the way. For instance, when calculating an average (the sum divided by the count), you might wish to save the sum before dividing that sum by the count. Let’s look at the very basics for how we create variables and work with them.

2.3.1 Variable assignment using <-

Since we’ve been working with numbers so far, let’s start by creating variables to store our numbers. And since most people like concrete examples, let’s invent one. Suppose I’m trying to calculate how much money I’m going to make selling this book. There’s several different numbers I might want to store. Firstly, I need to figure out how many copies I’ll sell. This isn’t exactly Harry Potter, so let’s assume I’m only going to sell one copy per student in my class. Let’s assume there are 30 students, so that’s 30 sales. Let’s create a variable called sales. What I want to do is assign a value to my variable sales, and that value should be 30. We do this by using the assignment operator, which is <-. Here’s how we do it:

sales <- 30

When you hit enter, R doesn’t print out any output. It just gives you another command prompt. However, behind the scenes R has created a variable called sales and assigned the value 30 to it. You can check that this has happened by asking R to print the variable on screen. And the simplest way to do that is to type the name of the variable and hit enter.

sales
[1] 30

So that’s nice to know. Anytime you can’t remember what R has got stored in a particular variable, you can just type the name of the variable and hit enter.
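The averaging example mentioned at the start of Section 2.3 can be sketched with the same tools. The three scores (4, 8 and 6) and the variable names total, count and average are made up purely for illustration.

```r
# Hypothetical scores: 4, 8 and 6 (invented for this sketch)
total <- 4 + 8 + 6         # store the intermediate sum: 18
count <- 3                 # how many scores there are
average <- total / count   # the sum divided by the count
average                    # typing the name prints the value: 6
```

Storing the sum first means you can inspect it (by typing total) before you go on to divide.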

Okay, so now we know how to assign variables. Actually, there’s a bit more you should know. Firstly, one of the curious features of R is that there are several different ways of making assignments. In addition to the <- operator, we can also use -> and =, and it’s pretty important to understand the differences between them. Let’s start by considering ->, since that’s the easy one (we’ll discuss the use of = in Section 2.4.1). As you might expect from just looking at the symbol, it’s almost identical to <-. It’s just that the arrow (i.e., the assignment) goes from left to right. So if I wanted to define my sales variable using ->, I would write it like this:

30 -> sales

This has the same effect. And, just to be confusing, this also has the same effect:

sales = 30

… and so does this:

assign("sales", 30)

Caution

For the simple assignments shown here, these various approaches have exactly the same effect. Despite this equivalence, you are strongly encouraged to use the <- operator. Because the use of <- is so conventional within the R language, those familiar with R will have a much more difficult time reading R code that uses anything else. Soon enough, you too will be familiar with R and will thus come to expect the use of <-.
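If you want to convince yourself that the four forms of assignment really do leave you in the same place, here is a minimal sketch. The names salesA through salesD are hypothetical, and the == operator (which asks whether two values are equal) is a base R operator that hasn’t been introduced yet.

```r
salesA <- 30             # the conventional form
30 -> salesB             # the same arrow, pointing the other way
salesC = 30              # the equals sign
assign("salesD", 30)     # the assign() function, with the name given in quotes

# Each comparison asks "are these two values equal?"
salesA == salesB         # TRUE
salesB == salesC         # TRUE
salesC == salesD         # TRUE
```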

2.3.2 Calculations using variables

Okay, let’s get back to my original story. In my quest to become rich, I’ve written this textbook. To figure out how good a strategy this is, I’ve started creating some variables in R. In addition to defining a sales variable that counts the number of copies I’m going to sell, I can also create a variable called royalty, indicating how much money I get per copy. Let’s say that my royalties are about $7 per book:

sales <- 30
royalty <- 7

The nice thing about variables (in fact, the whole point of having variables) is that we can do anything with a variable that we ought to be able to do with the information that it stores. That is, since R allows me to multiply 30 by 7

30 * 7
[1] 210

it also allows me to multiply sales by royalty

sales * royalty
[1] 210

As far as R is concerned, the sales * royalty command is the same as the 30 * 7 command. Not surprisingly, I can assign the output of this calculation to a new variable, which I’ll call revenue. And when we do this, the new variable revenue gets the value 210. So let’s do that, and then get R to print out the value of revenue so that we can verify that it’s done what we asked:

revenue <- sales * royalty
revenue
[1] 210

That’s fairly straightforward. A slightly more subtle thing we can do is reassign the value of my variable, based on its current value. For instance, suppose that one of my students (no doubt under the influence of psychotropic drugs) loves the book so much that he or she donates an extra $550 to me. The simplest way to capture this is by a command like this:

revenue <- revenue + 550
revenue
[1] 760

In this calculation, R has taken the old value of revenue (i.e., 210) and added 550 to that value, producing a value of 760. This new value is assigned to the revenue variable, overwriting its previous value. In any case, we now know that I’m expecting to make $760 off this. Pretty sweet, I thinks to myself. Or at least, that’s what I thinks until I do a few more calculations and work out what the implied hourly wage I’m making off this looks like.

2.3.3 Rules and conventions for naming variables

In the examples that we’ve seen so far, my variable names (sales and revenue) have just been English-language words written using lowercase letters. However, R allows a lot more flexibility when it comes to naming your variables, as the following list of rules illustrates:

  • Variable names can only use the upper case alphabetic characters A-Z as well as the lower case characters a-z. You can also include numeric characters 0-9 in the variable name, as well as the period . or underscore _ character. In other words, you can use SaL.e_s as a variable name (though I can’t think why you would want to), but you can’t use Sales?.
  • Variable names cannot include spaces: therefore my sales is not a valid name, but my.sales is.
  • Variable names are case sensitive: that is, Sales and sales are different variable names.
  • Variable names must start with a letter or a period. You can’t use something like _sales or 1sales as a variable name. You can use .sales as a variable name if you want, but it’s not usually a good idea. By convention, variables starting with a . are used for special purposes, so you should avoid doing so.
  • Variable names cannot be one of the reserved keywords. These are special names that R needs to keep “safe” from us mere users, so you can’t use them as the names of variables. The keywords are: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, and finally, NA_character_. Don’t feel especially obliged to memorize these: if you make a mistake and try to use one of the keywords as a variable name, R will complain about it like the whiny little automaton it is.
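A few of these rules are easy to try out at the console. The sketch below uses names taken from the list above, plus make.names(), a base R function not otherwise used in this chapter. It converts a piece of text into a valid variable name, so if what comes back differs from what you put in, the original wasn’t a legal name.

```r
SaL.e_s <- 1      # legal (letters, a period and an underscore), though not pretty
my.sales <- 2     # legal: a period where the space would have been
Sales <- 10
sales <- 30       # a different variable from Sales: names are case sensitive
Sales == sales    # FALSE, because 10 is not equal to 30

# make.names() shows how R would repair an illegal name
make.names("my sales")   # "my.sales" (the space is not allowed)
make.names("1sales")     # "X1sales"  (names can't start with a digit)
make.names("if")         # "if."      (if is a reserved keyword)
make.names("sales")      # "sales"    (already legal, so unchanged)
```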

In addition to those rules that R enforces, there are some informal conventions that people tend to follow when naming variables. One of them you’ve already seen: i.e., don’t use variables that start with a period. But there are several others. You aren’t obliged to follow these conventions, and there are many situations in which it’s advisable to ignore them, but it’s generally a good idea to follow them when you can:

  • Use informative variable names. As a general rule, using meaningful names like sales and revenue is preferred over arbitrary ones like variable1 and variable2. Otherwise it’s very hard to remember what the contents of different variables are, and it becomes hard to understand what your commands actually do.
  • Use short variable names. Typing is a pain and no-one likes doing it. So we much prefer to use a name like sales over a name like sales.for.this.book.that.you.are.reading. Obviously there’s a bit of a tension between using informative names (which tend to be long) and using short names (which tend to be meaningless), so use a bit of common sense when trading off these two conventions.
  • Use one of the conventional naming styles for multi-word variable names. Suppose I want to name a variable that stores “my new salary”. Obviously I can’t include spaces in the variable name, so how should I do this? There are two main conventions that you sometimes see R users employing. First, there is “camel case” in which you use capital letters at the beginning of each constituent word (except the first one), which gives you myNewSalary as the variable name. Second, there is “snake case” in which you separate words using underscores, as in my_new_salary. Finally, you may also see some R users separating words using periods, which would give you my.new.salary.

Caution

Though you may sometimes see R users separating words within a variable name using periods (e.g., my.new.salary), this style can be confusing for those who know other programming languages. Many languages use periods to indicate hierarchical relationships (e.g., obj.func will refer to the func that belongs to obj). I thus strongly discourage this practice because it makes your code difficult to read and understand. Camel case is recommended.

2.4 Functions

The symbols +, -, *, etc. are examples of operators. As we’ve seen, you can do quite a lot of calculations just by using these operators. However, in order to do more advanced calculations (and later on, to do actual statistics), you’re going to need to start using functions. We’ll see more detail about functions and how they work later, but for now let’s just dive in and use a few. To get started, suppose I wanted to take the square root of 225. The square root, in case your high school math is a bit rusty, is just the opposite of squaring a number. So, for instance, since “5 squared is 25” I can say that “5 is the square root of 25”. This is the usual notation:

\[ \sqrt{25} = 5 \]

Sometimes you’ll also see it written like this:

\(25^{0.5} = 5\)

This second way of writing it is kind of useful to “remind” you of the mathematical fact that “square root of \(x\)” is actually the same as “raising \(x\) to the power of 0.5”. Personally, I’ve never found this to be terribly meaningful psychologically, though I have to admit it’s quite convenient mathematically. Anyway, it’s not important. What is important is that you remember what a square root is, since we’re going to need it later on.

You may be able to calculate the square root of 25 in your head. But it gets more difficult when the numbers get bigger, and pretty much impossible if they’re not whole numbers. This is where something like R comes in very handy. Let’s say I wanted to calculate \(\sqrt{225}\), the square root of 225. There’s two ways I could do this using R. First, since the square root of 225 is the same thing as raising 225 to the power of 0.5, we could use the power operator ^, just like we did earlier:

225^0.5
[1] 15

However, there’s a second way that we can do this, since R also provides a square root function, sqrt(). To calculate the square root of 225 using this function, what I do is insert the number 225 in the parentheses. That is, the command I type is this:

sqrt(225)
[1] 15

As you might expect from our previous discussion, the spaces in between the parentheses are purely cosmetic. We could have typed sqrt(225) or sqrt( 225 ) and gotten the same result. When we use a function to do something, we generally refer to this as calling the function, and the values that we type into the function (there can be more than one) are referred to as the arguments of that function.
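Here’s a quick sketch that checks the claims made so far about sqrt(). The all.equal() function is a base R function that hasn’t been introduced in this chapter. I’m using it only to ask whether two numbers agree (up to a tiny rounding tolerance).

```r
sqrt(2)                          # works for results that aren't whole numbers: 1.414214
all.equal(sqrt(225), 225^0.5)    # the function and the power operator agree: TRUE
sqrt( 225 ) == sqrt(225)         # spaces inside the parentheses change nothing: TRUE
```

In each of these calls, the number inside the parentheses is the argument being passed to the function.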

Obviously, the sqrt() function doesn’t really give us any new functionality, since we already knew how to do square root calculations by using the power operator ^, though using sqrt() may be more explicit, clearer, and thus easier to read. However, there are lots of other functions in R: in fact, almost everything of interest that I’ll talk about in this book is an R function of some kind. For example, one function that we will need to use in this book is the absolute value function. Compared to the square root function, it’s extremely simple: it just converts negative numbers to positive numbers, and leaves positive numbers alone. Mathematically, the absolute value of \(x\) is written \(|x|\) or sometimes \(\mbox{abs}(x)\). Calculating absolute values in R is pretty easy, since R provides the abs() function that you can use for this purpose. When you feed it a positive number…

abs(21)
[1] 21

the absolute value function does nothing to it at all. But when you feed it a negative number, it spits out the positive version of the same number, like this:

abs(-13)
[1] 13

In all honesty, there’s nothing that the absolute value function does that you couldn’t do just by looking at the number and erasing the minus sign if there is one. However, there’s a few places later in the book where we have to use absolute values, so I thought it might be a good idea to explain the meaning of the term early on.

Before moving on, it’s worth noting that – in the same way that R allows us to put multiple operations together into a longer command, like 1 + (2*4) for instance – it also lets us put functions together and even combine functions with operators if we so desire. For example, the following is a perfectly legitimate command:

sqrt( 1 + abs(-8) )
[1] 3

When R executes this command, it starts out by calculating the value of abs(-8), which produces an intermediate value of 8. Having done so, the command simplifies to sqrt( 1 + 8 ). To solve the square root it first needs to add 1 + 8 to get 9, at which point it evaluates sqrt(9), and so it finally outputs a value of 3.
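If it helps, you can mimic those steps yourself by storing each intermediate value in a variable. This is only an illustration of the order in which things happen, not a description of R’s internals; the names innerValue and insideRoot are made up for the purpose.

```r
innerValue <- abs(-8)          # step 1: the innermost function call gives 8
insideRoot <- 1 + innerValue   # step 2: the addition inside the parentheses gives 9
sqrt(insideRoot)               # step 3: the outer function gives 3

sqrt( 1 + abs(-8) )            # the one-line version gives the same answer: 3
```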

2.4.1 Function arguments, their names and their defaults

There’s two more fairly important things that you need to understand about how functions work in R, and that’s the use of “named” arguments, and “default values” for arguments. Not surprisingly, that’s not to say that this is the last we’ll hear about how functions work, but they are the last things we desperately need to discuss in order to get you started. To understand what these two concepts are all about, I’ll introduce another function. The round() function can be used to round some value to the nearest whole number. For example, I could type this:

round(3.1415)
[1] 3

Pretty straightforward, really. However, suppose I only wanted to round it to two decimal places: that is, I want to get 3.14 as the output. The round() function supports this, by allowing you to input a second argument to the function that specifies the number of decimal places that you want to round the number to. In other words, I could do this:

round(3.1415, 2)
[1] 3.14

What’s happening here is that I’ve specified two arguments: the first argument is the number that needs to be rounded (i.e., 3.1415), the second argument is the number of decimal places that it should be rounded to (i.e., 2), and the two arguments are separated by a comma. In this simple example, it’s quite easy to remember which argument comes first and which one comes second, but for more complicated functions this is not easy. Fortunately, most R functions make use of argument names. For the round() function, for example, the number that needs to be rounded is specified using the x argument, and the number of decimal places that you want it rounded to is specified using the digits argument. Because we have these names available to us, we can specify the arguments to the function by name. We do so like this:

round(x=3.1415, digits=2)
[1] 3.14

Notice that this is kind of similar in spirit to variable assignment, except that = is used here, rather than <-. In both cases we’re specifying specific values to be associated with a label. However, there are some differences between what we were doing earlier on when creating variables, and what we’re doing here when specifying arguments, and so as a consequence it’s important that you use = in this context.

As you can see, specifying the arguments by name involves a lot more typing, but it’s also explicit and thus a lot easier to read. Because of this, the commands in this book will usually specify arguments by name, since that makes it clearer to you what I’m doing. However, one important thing to note is that when specifying the arguments using their names, it doesn’t matter what order you type them in. But if you don’t use the argument names, then you have to input the arguments in the correct order. In other words, these three commands all produce the same output…

round(x=3.1415, 2)
[1] 3.14
round(x=3.1415, digits=2)
[1] 3.14
round(digits=2, x=3.1415)
[1] 3.14

but this one does not…

round(2, 3.14165)
[1] 2

How do you find out what the correct order is? There’s a few different ways, but the easiest one is to look at the help documentation for the function (e.g., ? round). However, if you’re ever unsure, it’s probably best to actually type in the argument name.
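To see why the last command above came out as 2, it can help to write out what R assumed when no names were given. The sketch below spells out the same call with names attached, and then shows the call that was presumably intended.

```r
round(2, 3.14165)                # no names, so R takes x = 2 and digits = 3.14165
round(x = 2, digits = 3.14165)   # the same call with the names written out: 2
round(digits = 2, x = 3.14165)   # with names, the order no longer matters: 3.14
```

The number 2 is already a whole number, so rounding it changes nothing, whatever value digits takes.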

Okay, so that’s the first thing I said you’d need to know: argument names. The second thing you need to know about is default values. Notice that the first time I called the round() function I didn’t actually specify the digits argument at all, and yet R somehow knew that this meant it should round to the nearest whole number. How did that happen? The answer is that the digits argument has a default value of 0, meaning that if you decide not to specify a value for digits then R will act as if you had typed digits = 0. This is quite handy: the vast majority of the time when you want to round a number you want to round it to the nearest whole number, and it would be pretty annoying to have to specify the digits argument every single time. On the other hand, sometimes you actually do want to round to something other than the nearest whole number, and it would be even more annoying if R didn’t allow this! Thus, by having digits = 0 as the default value, we get the best of both worlds.
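A short sketch of the default in action is shown below. The args() function is a base R function not used elsewhere in this chapter. It prints a function’s argument names along with any default values. I haven’t shown its output because the exact form can vary between versions of R.

```r
round(3.1415)               # digits left unspecified, so the default applies: 3
round(3.1415, digits = 0)   # the default written out explicitly: 3
round(3.1415, digits = 2)   # overriding the default: 3.14

args(round)                 # lists the arguments of round(), including digits = 0
```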

2.5 Storing many numbers as a vector

At this point we’ve covered functions in enough detail to get us safely through most of the rest of the book, so let’s return to our discussion of variables. When variables were introduced in Section 2.3 we saw how we can use variables to store a single number. In this section, we’ll extend this idea and look at how to store multiple numbers within the one variable. In R, a variable that stores multiple values is called a vector. So let’s create one.

2.5.1 Creating a vector

Let’s stick to my silly “get rich quick by textbook writing” example. Suppose the textbook company (if there actually was one, that is) sends sales data on a monthly basis. Since my class starts in late February, we might expect most of the sales to occur towards the start of the year. Let’s suppose that I have 100 sales in February, 200 sales in March and 50 sales in April, and no other sales for the rest of the year. What I would like to do is have a variable – let’s call it sales.by.month – that stores all this sales data. The first number stored should be 0 since I had no sales in January, the second should be 100, and so on. The simplest way to do this in R is to use the combine function, c(). To do so, all we have to do is type all the numbers we want to store in a comma separated list, like this:

sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
sales.by.month
 [1]   0 100 200  50   0   0   0   0   0   0   0   0

To use the correct terminology, we have a single variable here called sales.by.month: this variable is a vector that consists of 12 elements.

2.5.2 A handy digression

Now that we’ve learned how to put information into a vector, the next thing to understand is how to pull that information back out again. However, before I do so it’s worth taking a slight detour. If you’ve been following along, typing all the commands into R yourself, it’s possible that the output that you saw when we printed out the sales.by.month vector was slightly different to what I showed above. This would have happened if the window (or the RStudio panel) that contains the R console is really, really narrow. If that were the case, you might have seen output that looks something like this:

sales.by.month
 [1]   0 100 200  50
 [5]   0   0   0   0
 [9]   0   0   0   0

Because there wasn’t much room on the screen, R has printed out the results over three lines. But that’s not the important thing to notice. The important point is that the first line has a [1] in front of it, whereas the second line starts with [5] and the third with [9]. It’s pretty clear what’s happening here. For the first row, R has printed out the 1st element through to the 4th element, so it starts that row with a [1]. For the second row, R has printed out the 5th element of the vector through to the 8th one, and so it begins that row with a [5] so that you can tell where it’s up to at a glance. It might seem a bit odd to you that R does this, but in some ways it’s a kindness, especially when dealing with larger data sets!
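If your console is wide and you would like to see this wrapping for yourself, one option is to ask print() to behave as if the console were narrow. The sketch below assumes a width of 20 characters, which is an arbitrary choice made for illustration; the exact number of elements per row may differ on your system, but each row should still begin with the index of its first element.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)

# Ask print() to act as if the console were only 20 characters wide,
# so that the vector is wrapped over several rows
print(sales.by.month, width = 20)
```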

2.5.3 Getting information out of vectors

To get back to the main story, let’s consider the problem of how to get information out of a vector. At this point, you might have a sneaking suspicion that the answer has something to do with the [1] and [9] things that R has been printing out. And of course you are correct. Suppose I want to pull out the February sales data only. February is the second month of the year, so let’s try this:

sales.by.month[2]
[1] 100

Yep, that’s the February sales all right. But there’s a subtle detail to be aware of here: notice that R outputs [1] 100, not [2] 100. This is because R is being extremely literal. When we typed in sales.by.month[2], we asked R to find exactly one thing, and that one thing happens to be the second element of our sales.by.month vector. So, when it outputs [1] 100 what R is saying is that the first number that we just asked for is 100. This behavior makes more sense when you realize that we can use this trick to create new variables. For example, I could create a february.sales variable like this:

february.sales <- sales.by.month[2]
february.sales
[1] 100

Obviously, the new variable february.sales should only have one element, and so when I print out this new variable, the R output begins with a [1] because 100 is the value of the first (and only) element of february.sales. The fact that this also happens to be the value of the second element of sales.by.month is irrelevant. We’ll pick this topic up again shortly (Section 2.9).

2.5.4 Altering the elements of a vector

Sometimes you’ll want to change the values stored in a vector. Imagine my surprise when the publisher rings me up to tell me that the sales data for May are wrong. There were actually an additional 25 books sold in May, but there was an error or something so they hadn’t told me about it. How can I fix my sales.by.month variable? One possibility would be to assign the whole vector again from the beginning, using c(). But that’s a lot of typing. Also, it’s a little wasteful: why should R have to redefine the sales figures for all 12 months, when only the 5th one is wrong? Fortunately, we can tell R to change only the 5th element, using this trick:

sales.by.month[5] <- 25
sales.by.month
 [1]   0 100 200  50  25   0   0   0   0   0   0   0

Another way to edit variables is to use the edit() or fix() functions. I won’t discuss them in detail right now, but you can check them out on your own.

2.5.5 Useful things to know about vectors

Before moving on, I want to mention a couple of other things about vectors. Firstly, you often find yourself wanting to know how many elements there are in a vector (usually because you’ve forgotten). You can use the length() function to do this. It’s quite straightforward:

length(x = sales.by.month)
[1] 12

Secondly, you often want to alter all of the elements of a vector at once. For instance, suppose I wanted to figure out how much money I made in each month. Since I’m earning an exciting $7 per book (no seriously, that’s actually pretty close to what authors get on the very expensive textbooks that you’re expected to purchase), what I want to do is multiply each element in the sales.by.month vector by 7. R makes this pretty easy, as the following example shows:

sales.by.month * 7
 [1]    0  700 1400  350  175    0    0    0    0    0    0    0

In other words, when you multiply a vector by a single number, all elements in the vector get multiplied. The same is true for addition, subtraction, division and taking powers. So that’s neat. On the other hand, suppose I wanted to know how much money I was making per day, rather than per month. Since not every month has the same number of days, I need to do something slightly different. Firstly, I’ll create two new vectors:

days.per.month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
profit <- sales.by.month * 7

Obviously, the profit variable holds the same values we calculated a moment ago, and the days.per.month variable is pretty straightforward. What I want to do is divide every element of profit by the corresponding element of days.per.month. Again, R makes this pretty easy:

profit / days.per.month
 [1]  0.000000 25.000000 45.161290 11.666667  5.645161  0.000000  0.000000
 [8]  0.000000  0.000000  0.000000  0.000000  0.000000

I still don’t like all those zeros, but that’s not what matters here. Notice that the second element of the output is 25, because R has divided the second element of profit (i.e. 700) by the second element of days.per.month (i.e. 28). Similarly, the third element of the output is equal to 1400 divided by 31, and so on. We’ll talk more about calculations involving vectors later on, but that’s enough detail for now.
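To pull these two ideas together, here is a short sketch that repeats the calculations with a few other operators. The variable name profit.per.day is one I’ve made up for this illustration, and I’ve used round() only to make the output a little easier on the eye.

```r
sales.by.month <- c(0, 100, 200, 50, 25, 0, 0, 0, 0, 0, 0, 0)
days.per.month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# A single number is applied to every element of the vector
sales.by.month + 10
sales.by.month / 2
sales.by.month ^ 2

# Two vectors of the same length are combined element by element
profit.per.day <- (sales.by.month * 7) / days.per.month
round(profit.per.day, digits = 2)
```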

2.6 Storing text data

A lot of the time your data will be numeric in nature, but not always. Sometimes your data really needs to be described using text, not using numbers. To address this, we need to consider the situation where our variables store text. To create a variable that stores the word “hello”, we can type this:

greeting <- "hello"
greeting
[1] "hello"

When interpreting this, it’s important to recognise that the quote marks here aren’t part of the string itself. They’re just something that we use to make sure that R knows to treat the characters that they enclose as a piece of text data, known as a character string. In other words, R treats "hello" as a string containing the word “hello”; but if I had typed hello instead, R would go looking for a variable by that name! You can also use 'hello' to specify a character string.
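As a quick check on that last point, the sketch below stores the same word using single quotes. The variable name greeting.single is just something I’ve invented for this example. When you print it, R should display the string using double quotes, regardless of which kind you typed.

```r
# Single quotes create exactly the same kind of character string
greeting.single <- 'hello'
greeting.single
```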

Okay, so that’s how we store the text. Next, it’s important to recognise that when we do this, R stores the entire word "hello" as a single element: our greeting variable is not a vector of five different letters. Rather, it has only the one element, and that element corresponds to the entire character string "hello". To illustrate this, if I actually ask R to find the first element of greeting, it prints the whole string:

greeting[1]
[1] "hello"

Of course, there’s no reason why I can’t create a vector of character strings. For instance, if we were to continue with the example of my attempts to look at the monthly sales data for my book, one variable I might want would include the names of all 12 months. To do so, I could type in a command like this:

months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", 
            "December")

This is a character vector containing 12 elements, each of which is the name of a month. So if I wanted R to tell me the name of the fourth month, all I would do is this:

months[4]
[1] "April"

2.6.1 Working with text

Working with text data is somewhat more complicated than working with numeric data. There is much to discuss here, but for purposes of the current chapter we only need this bare bones sketch. The only other thing I want to do before moving on is show you an example of a function that can be applied to text data. So far, most of the functions that we have seen (i.e., sqrt(), abs() and round()) only make sense when applied to numeric data (e.g., you can’t calculate the square root of “hello”), and we’ve seen one function that can be applied to pretty much any variable or vector (i.e., length()). So it might be nice to see an example of a function that can be applied to text.

The function I’m going to introduce you to is called nchar(), and what it does is count the number of individual characters that make up a string. Recall from earlier that our greeting variable has only the one element: it contains a single string, which happens to be "hello". But what if I want to know how many letters there are in the word? Sure, I could count them, but that’s boring, and more to the point it’s a terrible strategy if what I wanted to know was the number of letters in War and Peace. That’s where the nchar() function is helpful:

nchar( x = greeting )
[1] 5

That makes sense, since there are in fact 5 letters in the string "hello". Better yet, you can apply nchar() to whole vectors. So, for instance, if I want R to tell me how many letters there are in the names of each of the 12 months, I can do this:

nchar( x = months )
 [1] 7 8 5 5 3 4 4 6 9 7 8 8

So that’s nice to know. The nchar() function can do a bit more than this, and there are a lot of other functions that you can use to extract more information from text or do all sorts of fancy things. However, the goal here is not to teach any of that! The goal right now is just to see an example of a function that actually does work when applied to text.
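Because it’s easy to mix up length() and nchar(), here is a side by side sketch using the same greeting and months variables as above. The first function counts elements in the vector, the second counts characters within each element.

```r
greeting <- "hello"
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")

length(greeting)   # how many elements the vector has
nchar(greeting)    # how many characters the one string has

length(months)     # one element per month
nchar(months)      # one character count per month name
```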

2.7 Storing “true or false” data

Time to move onto a third kind of data. A key concept that a lot of R relies on is the idea of a logical value (or Boolean value). A logical value is an assertion about whether something is true or false. This is implemented in R in a pretty straightforward way. There are two logical values, namely TRUE and FALSE. Despite the simplicity, logical values are very useful things. Let’s see how they work.

2.7.1 Assessing mathematical truths

In George Orwell’s classic book 1984, one of the slogans used by the totalitarian Party was “two plus two equals five”, the idea being that the political domination of human freedom becomes complete when it is possible to subvert even the most basic of truths. It’s a terrifying thought, especially when the protagonist Winston Smith finally breaks down under torture and agrees to the proposition. “Man is infinitely malleable”, the book says. I’m pretty sure that this isn’t true of humans but it’s definitely not true of R. R is not infinitely malleable. It has rather firm opinions on the topic of what is and isn’t true, at least as regards basic mathematics. If I ask it to calculate 2 + 2, it always gives the same answer, and it’s not bloody 5:

2 + 2
[1] 4

Of course, so far R is just doing the calculations. I haven’t asked it to explicitly assert that \(2+2 = 4\) is a true statement. If I want R to make an explicit judgement, I can use a command like this:

2 + 2 == 4
[1] TRUE

What I’ve done here is use the equality operator, ==, to force R to make a “true or false” judgement. Okay, let’s see what R thinks of the Party slogan:

2+2 == 5
[1] FALSE

Booyah! Freedom and ponies for all! Or something like that. Anyway, it’s worth having a look at what happens if I try to force R to believe that two plus two is five by making an assignment statement like 2 + 2 = 5 or 2 + 2 <- 5. When I do this, here’s what happens:

2 + 2 = 5
Error in 2 + 2 = 5: target of assignment expands to non-language object

R doesn’t like this very much. It recognizes that 2 + 2 is not a variable (that’s what the “non-language object” part is saying), and it won’t let you try to “reassign” it. While R is pretty flexible, and actually does let you do some quite remarkable things to redefine parts of R itself, there are just some basic, primitive truths that it refuses to give up. It won’t change the laws of addition, and it won’t change the definition of the number 2.

That’s probably for the best.

2.7.2 Logical operations

So now we’ve seen logical operations at work, but so far we’ve only seen the simplest possible example. You probably won’t be surprised to discover that we can combine logical operations with other operations and functions in a more complicated way, like this:

3*3 + 4*4 == 5*5
[1] TRUE

or this

sqrt( 25 ) == 5
[1] TRUE

Not only that, but as Table 2.1 illustrates, there are several other logical operators that you can use, corresponding to some basic mathematical concepts.

Table 2.1: Some logical operators. Technically I should be calling these “binary relational operators”, but quite frankly I don’t want to. It’s my book so no-one can make me.

operation                   operator   example input   answer
less than                   <          2 < 3           TRUE
less than or equal to       <=         2 <= 2          TRUE
greater than                >          2 > 3           FALSE
greater than or equal to    >=         2 >= 2          TRUE
equal to                    ==         2 == 3          FALSE
not equal to                !=         2 != 3          TRUE

Hopefully these are all pretty self-explanatory: for example, the less than operator < checks to see if the number on the left is less than the number on the right. If it’s less, then R returns an answer of TRUE:

99 < 100
[1] TRUE

but if the two numbers are equal, or if the one on the right is larger, then R returns an answer of FALSE, as the following two examples illustrate:

100 < 100
[1] FALSE
100 < 99
[1] FALSE

In contrast, the less than or equal to operator <= will do exactly what it says. It returns a value of TRUE if the number on the left hand side is less than or equal to the number on the right hand side. So if we repeat the previous two examples using <=, here’s what we get:

100 <= 100
[1] TRUE
100 <= 99
[1] FALSE

And at this point I hope it’s pretty obvious what the greater than operator > and the greater than or equal to operator >= do! Next on the list of logical operators is the not equal to operator != which – as with all the others – does what it says it does. It returns a value of TRUE when things on either side are not identical to each other. Therefore, since \(2+2\) isn’t equal to \(5\), we get:

2 + 2 != 5
[1] TRUE
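For completeness, here is a short sketch of the two “greater than” operators that I skipped over, using the same numbers as the earlier examples. The comments describe what each comparison is asking.

```r
100 > 99     # is 100 strictly greater than 99?
100 > 100    # a number is not strictly greater than itself
100 >= 100   # but it is greater than or equal to itself
99 >= 100    # 99 is neither greater than nor equal to 100
```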

We’re not quite done yet. There are three more logical operations that are worth knowing about, listed in Table 2.2.

Table 2.2: Some more logical operators.

operation   operator   example input      answer
not         !          !(1==1)            FALSE
or          |          (1==1) | (2==3)    TRUE
and         &          (1==1) & (2==3)    FALSE

These are the not operator !, the and operator &, and the or operator |. Like the other logical operators, their behavior is more or less exactly what you’d expect given their names. For instance, if I ask you to assess the claim that either \(2+2 = 4\) or \(2+2 = 5\), then you’d say that claim is true. Since it’s an “either-or” statement, all we need is for one of the two parts to be true. That’s what the | operator does:

(2+2 == 4) | (2+2 == 5)
[1] TRUE

On the other hand, if I ask you to assess the claim that both \(2+2 = 4\) and \(2+2 = 5\), then you’d say that claim is false. Since this is an and statement we need both parts to be true. And that’s what the & operator does:

(2+2 == 4) & (2+2 == 5)
[1] FALSE

Finally, there’s the not operator, which is simple but annoying to describe in English. If I ask you to assess my claim that “it is not true that \(2+2 = 5\)”, then you would say that claim is true; because my claim is that “\(2+2 = 5\) is false”. And I’m right. If we write this as an R command we get this:

! (2+2 == 5)
[1] TRUE

In other words, since 2+2 == 5 is a FALSE statement, it must be the case that !(2+2 == 5) is a TRUE one. Essentially, what we’ve really done is claim that “not false” is the same thing as “true”. Obviously, this isn’t really quite right in real life. But logical values encode a black and white world: any given logical statement is either true or false. No shades of gray are allowed. We can actually see this much more explicitly, like this:

! FALSE
[1] TRUE

Of course, in our \(2+2 = 5\) example, we didn’t really need to use “not” ! and “equals to” == as two separate operators. We could have just used the “not equals to” operator != like this:

2+2 != 5
[1] TRUE

But there are many situations where you really do need to use the ! operator. We’ll see some later on.

2.7.3 Storing and using logical data

Up to this point, I’ve introduced numeric data (Section 2.3 and Section 2.5) and character data (Section 2.6). So you might not be surprised to discover that these TRUE and FALSE values that R has been producing are actually a third kind of data, called logical data. That is, when I asked R if 2 + 2 == 5 and it said [1] FALSE in reply, it was actually producing information that we can store in variables. For instance, I could create a variable called is.the.Party.correct, which would store R’s opinion:

is.the.Party.correct <- 2 + 2 == 5
is.the.Party.correct
[1] FALSE

Alternatively, you can assign the value directly, by typing TRUE or FALSE in your command. Like this:

is.the.Party.correct <- FALSE
is.the.Party.correct
[1] FALSE

Better yet, because it’s kind of tedious to type TRUE or FALSE over and over again, R provides you with a shortcut: you can use T and F instead (but it’s case sensitive: t and f won’t work).

2.8 TRUE and FALSE

TRUE and FALSE are reserved keywords in R, so you can trust that they always mean what they say they do. Unfortunately, the shortcut versions T and F do not have this property. It’s even possible to create variables that set up the reverse meanings, by typing commands like T <- FALSE and F <- TRUE. This is kind of insane, and something that is generally thought to be a design flaw in R. Anyway, the long and short of it is that it’s safer to use TRUE and FALSE.

So this works:

is.the.Party.correct <- F
is.the.Party.correct
[1] FALSE

but this doesn’t:

is.the.Party.correct <- f
Error in eval(expr, envir, enclos): object 'f' not found
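To see why the T and F shortcuts can be risky, here is a small sketch of what happens when T is accidentally overwritten. The variable names shortcut.overridden and shortcut.restored are ones I’ve made up for this illustration. Note that the sketch cleans up after itself with rm(), so that T goes back to meaning TRUE afterwards.

```r
# T is an ordinary variable name, so R allows this assignment
T <- FALSE

# While our variable exists, the shortcut no longer means TRUE
shortcut.overridden <- (T == TRUE)
shortcut.overridden

# Remove our variable so that T refers to the built-in value again
rm(T)
shortcut.restored <- (T == TRUE)
shortcut.restored
```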

2.8.1 Vectors of logicals

The next thing to mention is that you can store vectors of logical values in exactly the same way that you can store vectors of numbers (Section 2.5) and vectors of text data (Section 2.6). Again, we can define them directly via the c() function, like this:

x <- c(TRUE, TRUE, FALSE)
x
[1]  TRUE  TRUE FALSE

or you can produce a vector of logicals by applying a logical operator to a vector. This might not make a lot of sense to you, so let’s unpack it slowly. First, let’s suppose we have a vector of numbers (i.e., a “non-logical vector”). For instance, we could use the sales.by.month vector that we were using in Section 2.5. Suppose I wanted R to tell me, for each month of the year, whether I actually sold a book in that month. I can do that by typing this:

sales.by.month > 0
 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

and again, I can store this in a vector if I want, as the example below illustrates:

any.sales.this.month <- sales.by.month > 0
any.sales.this.month
 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

In other words, any.sales.this.month is a logical vector whose elements are TRUE only if the corresponding element of sales.by.month is greater than zero. For instance, since I sold zero books in January, the first element is FALSE.
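The not operator ! works on whole logical vectors too, flipping every element. As a sketch, here is how I might build the opposite vector, which is TRUE only for the months without any sales. The name no.sales.this.month is one I’ve invented for this example.

```r
sales.by.month <- c(0, 100, 200, 50, 25, 0, 0, 0, 0, 0, 0, 0)
any.sales.this.month <- sales.by.month > 0

# Flip every TRUE to FALSE and every FALSE to TRUE
no.sales.this.month <- !any.sales.this.month
no.sales.this.month
```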

2.8.2 Applying logical operations to text

In a moment (Section 2.9) I’ll show you why these logical operations and logical vectors are so handy, but before I do so I want to very briefly point out that you can apply them to text as well as to logical data. It’s just that we need to be a bit more careful in understanding how R interprets the different operations. In this section I’ll talk about how the equal to operator == applies to text, since this is the most important one. Obviously, the not equal to operator != gives the exact opposite answers to == so I’m implicitly talking about that one too, but I won’t give specific commands showing the use of !=. There are a variety of other operators, but those will do for now.

Okay, let’s see how it works. In one sense, it’s very simple. For instance, I can ask R if the word "cat" is the same as the word "dog", like this:

"cat" == "dog"
[1] FALSE

That’s pretty obvious, and it’s good to know that even R can figure that out. Similarly, R does recognize that a "cat" is a "cat":

"cat" == "cat"
[1] TRUE

Again, that’s exactly what we’d expect. However, what you need to keep in mind is that R is not at all tolerant when it comes to grammar and spacing. If two strings differ in any way whatsoever, R will say that they’re not equal to each other, as the following examples indicate:

" cat" == "cat"
[1] FALSE
"cat" == "CAT"
[1] FALSE
"cat" == "c a t"
[1] FALSE

2.9 Indexing vectors

One last thing to add before finishing up this chapter. So far, whenever I’ve had to get information out of a vector, all I’ve done is type something like months[4]; and when I do this R prints out the fourth element of the months vector. In this section, I’ll show you two additional tricks for getting information out of the vector.

2.9.1 Extracting multiple elements

One very useful thing we can do is pull out more than one element at a time. Earlier, when we extracted the February sales, we only used a single number (i.e., 2) to indicate which element we wanted. Alternatively, we can use a vector. So, suppose I wanted the data for February, March and April. What I could do is use the vector c(2,3,4) to indicate which elements I want R to pull out. That is, I’d type this:

sales.by.month[ c(2,3,4) ]
[1] 100 200  50

Notice that the order matters here. If I asked for the data in the reverse order (i.e., April first, then March, then February) by using the vector c(4,3,2), then R outputs the data in the reverse order:

sales.by.month[ c(4,3,2) ]
[1]  50 200 100

A second thing to be aware of is that R provides you with handy shortcuts for very common situations. For instance, suppose that I wanted to extract everything from the 2nd month through to the 8th month. One way to do this is to do the same thing I did above, and use the vector c(2,3,4,5,6,7,8) to indicate the elements that I want. That works just fine

sales.by.month[ c(2,3,4,5,6,7,8) ]
[1] 100 200  50  25   0   0   0

but it’s kind of a lot of typing. To help make this easier, R lets you use 2:8 as shorthand for c(2,3,4,5,6,7,8), which makes things a lot simpler. First, let’s just check that this is true:

2:8
[1] 2 3 4 5 6 7 8

Next, let’s check that we can use the 2:8 shorthand as a way to pull out the 2nd through 8th elements of sales.by.month:

sales.by.month[2:8]
[1] 100 200  50  25   0   0   0

So that’s kind of neat.
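The same shorthand works on any vector, and it can count downwards as well as upwards. Here’s a brief sketch using the months vector from Section 2.6:

```r
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")

# The 2nd through 4th month names
months[2:4]

# Putting the larger number first counts down instead of up
8:2

# So this returns the same three names in reverse order
months[4:2]
```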

2.9.2 Logical indexing

At this point, I can introduce an extremely useful tool called logical indexing. In the last section, I created a logical vector any.sales.this.month, whose elements are TRUE for any month in which I sold at least one book, and FALSE for all the others. However, that big long list of TRUEs and FALSEs is a little bit hard to read, so what I’d like to do is to have R select the names of the months for which I sold any books. Earlier on, I created a vector months that contains the names of each of the months. This is where logical indexing is handy. What I need to do is this:

months[ sales.by.month > 0 ]
[1] "February" "March"    "April"    "May"     

To understand what’s happening here, it’s helpful to notice that sales.by.month > 0 is the same logical expression that we used to create the any.sales.this.month vector in the last section. In fact, I could have just done this:

months[ any.sales.this.month ]
[1] "February" "March"    "April"    "May"     

and gotten exactly the same result. In order to figure out which elements of months to include in the output, what R does is look to see if the corresponding element in any.sales.this.month is TRUE. Thus, since element 1 of any.sales.this.month is FALSE, R does not include "January" as part of the output; but since element 2 of any.sales.this.month is TRUE, R does include "February" in the output. Note that there’s no reason why I can’t use the same trick to find the actual sales numbers for those months. The command to do that would just be this:

sales.by.month [ sales.by.month > 0 ]
[1] 100 200  50  25

In fact, we can do the same thing with text. Here’s an example. Suppose that – to continue the saga of the textbook sales – I later find out that the bookshop only had sufficient stocks for a few months of the year. They tell me that early in the year they had "high" stocks, which then dropped to "low" levels, and in fact for a couple of months they were "out" of copies of the book before they were able to replenish them. Thus I might have a variable called stock.levels which looks like this:

stock.levels <- c("high", "high", "low", "out", "out", "high",
                  "high", "high", "high", "high", "high", "high")

stock.levels
 [1] "high" "high" "low"  "out"  "out"  "high" "high" "high" "high" "high"
[11] "high" "high"

Thus, if I want to know the months for which the bookshop was out of my book, I could apply the logical indexing trick, but with the character vector stock.levels, like this:

months[stock.levels == "out"]
[1] "April" "May"  

Alternatively, if I want to know when the bookshop was either low on copies or out of copies, I could do this:

months[stock.levels == "out" | stock.levels == "low"]
[1] "March" "April" "May"  

or this

months[stock.levels != "high" ]
[1] "March" "April" "May"  

Either way, I get the answer I want.
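Logical indexing also lets you combine conditions about different vectors, as long as they all have the same length. As a sketch, suppose I wanted the months in which I sold at least one book and the bookshop still had high stocks. Using the & operator from Section 2.7.2, that might look like this:

```r
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November",
            "December")
sales.by.month <- c(0, 100, 200, 50, 25, 0, 0, 0, 0, 0, 0, 0)
stock.levels <- c("high", "high", "low", "out", "out", "high",
                  "high", "high", "high", "high", "high", "high")

# Keep only the months where both conditions are TRUE
months[sales.by.month > 0 & stock.levels == "high"]
```

With the data above, February is the only month that satisfies both conditions.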

At this point, I hope you can see why logical indexing is such a useful thing. It’s a very basic, yet very powerful way to manipulate data. Subsequent chapters will talk a lot more about how to manipulate data, since it’s a critical skill for real world research that is often overlooked in introductory statistics courses. It does take a bit of practice to become completely comfortable using logical indexing, so it’s a good idea to play around with these sorts of commands. Try creating a few different variables of your own, and then ask yourself questions like “how do I get R to spit out all the elements that are [blah]”. Practice makes perfect, and it’s only by practicing logical indexing that you’ll perfect the art of yelling frustrated insults at your computer.

2.10 Exercises

  1. Compute \(42+17\)
  2. Compute \(8-3\)
  3. Compute \((8-3)^2\)
  4. Compute \(\frac{42+17}{(8-3)^2}\)
  5. Define a vector containing the numbers 29, 63, 7, 23, 84, 10 and 9.
  6. Imagine this vector contains counts in units of months (29 months, 63 months, etc.). Compute a new vector that contains the same measurement but now in units of years. That is, divide all the entries in the previous vector by 12. Print these new measurements to the console.
  7. Create two strings (character vectors). One should be "R rules!" and the other should be "r rules!". Determine whether these two vectors are equal.
  8. Create a vector that consists of 6 values: 3 even and 3 odd.
  9. Modify the third value in this vector so that it is now double its original value.
  10. Modify this vector so that all the odd numbers are removed.
  11. Calculate the sum of the values currently in the vector.
  12. Imagine a study in which participants must weigh less than 90 kg and be between 18 and 60 years of age. Define a vector of weights as weight <- c(80, 75, 92, 105, 60) and a vector of ages age <- c(50, 17, 39, 27, 90). Now calculate a vector of logical values, each of which indicates whether the corresponding participant is eligible for the study.
  13. Calculate the sum of 0.1 and 0.2.
  14. Calculate 10 times the sum of 0.1 and 0.2.
  15. Determine whether 10 times the sum of 0.1 and 0.2 is equal to 3.