Intro

The first contact with R might feel a little overwhelming, especially if you haven’t done any programming before. Some things are fairly easy to understand but others do not come naturally to people. To help you with the latter group, here are some tips on how to approach talking to computers.

 

Jargon warning

It is unavoidable to include some technical language. At first, this may feel a little obscure but the vocab needed to understand this document is not large. Besides, learning the jargon will make it possible for you to put into words what it is you need to do and, if you’re struggling, ask for help in places such as Stack Overflow.

If you come across an unfamiliar term, feel free to consult the dictionary at the bottom of this document.

We’re smarter than them (for now)

Talking to a computer can sometimes feel a little difficult. Computers are highly proficient at formal languages: they understand the vocab, grammar, and syntax perfectly well. They can do many complex things you ask them to do, so long as you are formal, logical, precise, and unambiguous. If you are not, they won’t be able to help you or, worse, will do something wrong. When it comes to pragmatics, they are lost: they don’t understand context, metaphor, shorthand, or sarcasm. The truth is that they just aren’t all that smart, so they need to be told what to do in an unambiguous way.

Keep this in mind whenever you’re working with R. Don’t take anything for granted, don’t rely on R to make any assumptions about what you are telling it. Certainly don’t try to be funny and expect R to appreciate it! It takes a little practice to get into the habit of translating what you want to do into the dry, formal language of logic and algorithms but, on the flip side, once you get the knack of it, there will never be any misunderstanding on the part of the machine. With people, there will always be miscommunication, no matter how clear you try to be, because people are very sensitive to context, paralinguistic and non-verbal cues, and the pragmatics of a language. Computers don’t care about any of that and thus, if they’re not doing what you want them to, the problem is in the sender, not the receiver.1 That is exactly why they’re built like that!

Instead of thinking about computers as scary, abstruse, or boring, learn to appreciate this quality of precision as beautiful. That way, you’ll get much more enjoyment out of working with them.

Think algorithmically

By algorithmically, we mean in a precise step-by-step fashion where each action is equivalent to the smallest possible increment. So, if you want a machine to make you a cup of tea, instead of saying: “Oi R! Fetch me a cuppa!”, you need to break everything down. Something like this:

  1. Go to kitchen
  2. If no kitchen, jump to 53
  3. Else get kettle
  4. If no kettle, jump to 53
  5. Check how much water in kettle
  6. If there is 500 ml or more of water in kettle, jump to 15
  7. Place kettle under cold tap
  8. Open kettle lid
  9. Turn cold tap on
  10. Check how much water in kettle
  11. If there is more than 500 ml of water in kettle, jump to 13
  12. Jump to 10
  13. Turn cold tap off
  14. Close kettle lid
  15. If kettle is on base, jump to 17
  16. Place kettle on base
  17. If base is plugged in mains, jump to 19
  18. Plug base in mains
  19. Turn kettle on
  20. Get cup
  21. If no cup, jump to 53
  22. Get tea bag # yeah, we’re not posh
  23. If no tea bag, jump to 53
  24. Put tea bag in cup
  25. If kettle has boiled, jump to 27
  26. Jump to 25
  27. Start pouring water from kettle to cup
  28. Check how much water in cup
  29. If there is more than 225 ml of water in cup, jump to 31
  30. Jump to 28
  31. Stop pouring water from kettle
  32. Return kettle on base
  33. Wait 4 minutes
  34. Remove tea bag from cup
  35. Place tea bag by edge of sink # this is obviously your partner’s/flatmate’s program
  36. Open fridge
  37. Search fridge
  38. If full fat milk is not in fridge, jump to 41
  39. Get full fat milk # full fat should be any sensible person’s top choice. True fact!
  40. Jump to 43
  41. If semi-skimmed milk is not in fridge, jump to 51 # skimmed is not even an option!
  42. Get semi-skimmed milk
  43. Take lid off milk
  44. Start pouring milk in cup
  45. Check how much liquid in cup
  46. If there is 250 ml or more of liquid in cup, jump to 48
  47. Jump to 45
  48. Stop pouring milk
  49. Put lid on milk
  50. Return milk to fridge
  51. Close fridge # yes, we kept it open! Fewer steps (I’m not lazy, I’m optimised!)
  52. Bring cup # no need to stir
  53. END

… so there you go, a nice cuppa in 53 easy steps!

Obviously, this is not real code but something called pseudocode: a program-like set of instructions expressed in a more-or-less natural language. Surely, we could get even more detailed, e.g., by specifying what kind of tea we want, but 53 steps is plenty. Notice, however, that, just like in actual code, the entire process is expressed in a sequential manner. This sometimes means that you need to think about how to say things that would be very easy to communicate to a person (e.g., “Wait till the kettle boils”) in terms of well-defined repeatable steps. In our example, this is done with a loop:

  25. If kettle has boiled, jump to 27
  26. Jump to 25
  27. Start pouring water from kettle to cup

The program will check if the kettle has boiled and, if it has, it will start pouring water into the cup. If it hasn’t boiled, it will jump back to step 25 and check again if it has boiled. This is equivalent to saying “wait till the kettle boils”.

Note that in actual R code, expressions such as jump to (or goto, as it is more commonly called in some older programming languages) are not used, so you would have to find other ways around the problem, were you to write a tea-making program in R. However, this example illustrates the sort of thinking mode you need to get into when talking to computers. Especially when it comes to data processing, no matter what it is you want your computer to do, you need to be able to convey it in a sequence of simple steps.
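To give a flavour of one such way around the problem, here is a minimal sketch of the “wait till the kettle boils” loop written with R’s while construct instead of a jump. The names water_temp, boiling_point, and n_checks are made up for illustration, and the heating is only simulated by adding 10 degrees on each pass. This is not a recipe for a real tea-making program, just one way the idea could be expressed.

```r
# hypothetical starting values, chosen only for illustration
water_temp <- 20      # temperature of the water in degrees Celsius
boiling_point <- 100  # temperature at which we stop waiting
n_checks <- 0         # how many times we have checked the kettle

# keep repeating the steps in the curly brackets
# for as long as the condition in the round brackets is TRUE
while (water_temp < boiling_point) {
  water_temp <- water_temp + 10  # simulate the kettle heating up
  n_checks <- n_checks + 1       # count this check
}

# the loop has finished, so the water must have boiled
water_temp
# [1] 100
n_checks
# [1] 8
```

The condition is checked before every pass through the loop, which plays the role of steps 25 and 26 in the pseudocode above.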

Don’t be scared

It is quite normal to be a little over-cautious when you first start working with a programming language. Rest assured that, when it comes to R, there is very little chance you’ll break your computer or damage your data, unless you set out to do so and know what you’re doing. When you read in your data, R will copy it to the computer’s memory and only work with this copy. The original data file will remain untouched. So please, be adventurous, try things out, experiment, break things. It’s the best way to learn programming and understand the principles and the logic behind the code.

 

RStudio

RStudio is what is called an Integrated Development Environment for R. It is dependent on R but separate from it. There are several ways of using R but RStudio is arguably the most popular and convenient. Let’s have a look at it.

 

The “heart of R” is the Console window. This is where instructions are sent to R, and its responses are given. The console is, almost exclusively, the way of talking to R in RStudio.

The Information area (all of the right-hand side of RStudio) shows you useful information about the state of your project. At the moment, you can see some relevant files in the bottom pane, and an empty “Global Environment” at the top. The global environment is a virtual storage of all objects you create in R. So, for example, if you read in some data into R, this is where they will be put and where R will look for them if you tell it to manipulate or analyse the data.

Finally, the Editor is where you write more complicated scripts without having to run each command. Each script, when saved, is just a text file with some added bells and whistles. There’s nothing special about it. Indeed, if you wanted to, you could write your script in any plain text editor, save it, change its extension from .txt to .R and open it in RStudio. There is no need to do this but you could. When you run such a script file, it gets interpreted by R in a line-by-line fashion. This means that your data cleaning, processing, visualisation, and analysis need to be written up in sequence, otherwise R won’t be able to make sense of your script. Since the script is basically a plain text file (with some nice colours added by RStudio to improve readability), the commands you type in can be edited, added, or deleted just as if you were writing an ordinary document. You can run them again later, or build up complex commands and functions over several lines of text.
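As a small illustration of why the order of lines in a script matters, consider the sketch below. The object names scores and mean_score are invented for this example. The script works because the object scores is created on an earlier line than the one that uses it.

```r
# first create the data...
scores <- c(4, 8, 15)

# ...and only then use it
mean_score <- mean(scores)
mean_score
# [1] 9
```

If the two commands were swapped in a fresh R session, R would reach mean(scores) before scores existed and would stop with an error saying that the object was not found.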

There is an important practical distinction between the Console and the Editor: In the Console, the Enter key runs the command. In the Editor, it just adds a new line. The purpose of this is to facilitate writing scripts without running each line of code. It also enables you to break down your commands over multiple lines so that you don’t end up with a line that’s hundreds of characters long. For example:

poisson_model <- glm( # R knows that an open bracket can't be the end of command...
  n_events ~ gender + scale(age) + scale(n_children) + # ...nor can a plus...
    I(SES - min(SES)) * scale(years_emp, , F), # ...or a comma
  df, family = "poisson") # closing bracket CAN be the end

The hash (#) marks everything to the right of it as a comment. Comments are useful for annotating the code so that you can remember what it means when you return to your code months later (it will happen!). They also improve code readability if you’re working on a script in collaboration with others. Comments should be clear but also concise. There is no point in paragraphs of verbose commentary.

Writing (and saving) scripts has just too many advantages over coding in the console to list and it is crucial that you learn how to do it. It will enable you to write reproducible code you can rerun whenever needed, reuse chunks of code you created for a previous project in your analysis, and, when you realise you made a mistake somewhere (when, not if, because this too will happen!), you’ll be able to edit the code and recreate your analysis in a small fraction of the time it would take you to analyse your data anew without a script. This way, if you write a command and it doesn’t do exactly what you wanted it to do, you can quickly tweak it in the editor and run it again. Also, if you accidentally modify an object and mess everything up, you can just re-run your entire script up to that point and pretend nothing ever happened. Or different still, let’s say you analysed your data and wrote your report and then you realised you made a mistake, for instance forgot to exclude data you should have excluded or excluded some you shouldn’t have. Without a “paper-trail” of your analysis, this is a very unpleasant (but, sadly, not unheard of) experience. But with a saved analysis script, you can just insert an additional piece of code in the file, re-run it and laugh it off. Or do the first two, and then take a long hard look at yourself. Whatever the case, using the script editor is just very, very useful!

However, the aim is to keep the script tidy. You don’t need to put every single line of code you ever run into it. Sometimes, you just want to look at your data, using, for example View(df). This kind of command really doesn’t need to be in your script. As a general rule of thumb, use the editor for code that adds something of value to the sequence of the script (data cleaning, analysis, code generating plots, tables, etc.) and the console for one-off commands (when you want to just check something).

Here is an example of what a neat script looks like (part of the Reproducibility Project: Psychology analysis2). Compare it to your own scripts and try to find ways of improving your coding.

For useful “good practice” guidance on how to write legible code, see the Google style guide.

 

Basic principles of R programming

The devil’s in the detail

It takes a little practice to develop good coding habits. As a result, you will likely get a lot of errors when you first try to do things in R. That’s perfectly normal and following the tips below will make it a lot better rather quickly. Promise!

When you do encounter an error or when R does something other than what you wanted, it means that, somewhere along the way, there has been a mistake in at least one of the main components of the R language:

Vocabulary

Simply put, you used the wrong word and so R understood something other than what you intended. Translated into more programming-like language, you used the incorrect function to perform some operation or performed the right operation on the wrong input. Maybe you wanted to calculate the median of a variable but instead, you calculated the mean. Or maybe you calculated the median as you wanted but of the wrong variable. Different still, you might have used the right function on the right object but R does not know the function because you haven’t made it accessible (more on this in the section on packages). The first case is usually a matter of knowing the names of your functions, which comes with time. The latter two are more a matter of attention to detail.

Grammar

Grammatical mistakes basically consist of using the right vocabulary but using it wrong. This can be prevented by learning how to use the commands you want to use or at least knowing how to find out about them. For a more in-depth discussion, see the section on functions.

Syntax

The third pitfall consists in using the right words in the right way but stringing them together wrong. Since programming languages are formal, things like order and placement of commas and brackets matter a great deal. This is usually the source of most of the frustration caused by people’s early experience with R. To avoid running into syntactic problems, try to always follow these principles:

  • Every open bracket ((, [, {) has to be closed at some point.
  • Commas are functional and have their place. They are used to separate arguments in functions and dimensions in subsetting. As such, they can only be used inside ()s and []s.
  • White spaces are optional. You are free (and indeed encouraged) to use them but, if you do, bear in mind they must not be inserted inside a name of a variable or function.
    • For instance, there is a function called as.numeric. There may not be a white space anywhere within this name.
  • Any unquoted (not surrounded by 's or "s) string of letters is interpreted as a name of some variable, dataset, or function. Conversely, any quoted string is interpreted literally as a meaningless string of characters.
    • mean is a name of the function that computes the arithmetic mean (e.g., mean(c(1, 3, 4, 8, 100)) gives the mean of the numbers 1, 3, 4, 8, and 100) but "mean" is just a string of letters and performs no function in R.
  • R is case sensitive: a is not the same as A.
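To make these principles a little more concrete, here is a short sketch that uses only base R functions. The objects a and A are created purely for illustration.

```r
# the comma separates the two arguments inside the round brackets
round(3.14159, 2)

# white space around names and commas is fine
# (but not inside a name such as as.numeric)
round( 3.14159 , 2 )

# unquoted: R looks for an object called mean and finds a function
class(mean)
# [1] "function"

# quoted: just a string of characters that performs no function
class("mean")
# [1] "character"

# names are case sensitive: a and A are two different objects
a <- 1
A <- 2
a == A
# [1] FALSE
```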

Naturally, these guidelines won’t mean all that much to you if you are completely new to programming. That’s OK. Come back to them once you’ve finished reading this document. They will appear much more useful then.

If you want to keep it, put it in a box

Everything in life is merely transient; we ourselves are pretty ephemeral beings. (#sodeepbro) However, R takes this quality and runs with it. If you ask R to perform any operation, it will spew it out into the console and immediately forget it ever happened. Let’s show you what that means:

# create an object (variable) a and assign it the value of 1
a <- 1

# increment a by 1
a + 1
[1] 2

# OK, now see what the value of a is
a
[1] 1

 

So, it is as if R forgot we asked it to do a + 1: the value of a didn’t change. The only way to keep this new value is to put it in an object.

b <- a + 1

# now let's see
b
[1] 2

 

Think of objects as boxes. The names of the objects are only labels. Just like with boxes, it is convenient to label boxes in a way that is indicative of their contents, but the label itself does not determine the content. Sure, you can create an R object called one and store the value of 2 in it, if you wish. But you might want to think about whether or not it is a helpful name. And what kind of person that makes you… Objects can contain anything at all: values, vectors, matrices, data, graphs, tables, even code. In fact, every time you call a function, e.g., mean(), you are running the code that’s inside the object mean with whatever values you pass to the arguments of the function.

Let’s demonstrate this last point:

# let's create a vector of numbers the mean of which we want to calculate
vec <- c(103, 1, 1, 6, 3, 43, 2, 23, 7, 1)

# see what's inside
vec
 [1] 103   1   1   6   3  43   2  23   7   1

# let's get the mean
# mean is the sum of all values divided by the number of values
sum(vec)/length(vec)
[1] 19

# good, now let's create a function that calculates
# the mean of whatever we ask it to
function(x) {sum(x)/length(x)}
function(x) {sum(x)/length(x)}
<environment: 0x000001dc84292348>

# but as we discussed above, R immediately forgot about the function
# so we need to store it in a box (object) to keep it for later!
calc.mean <- function(x) {sum(x)/length(x)}

# OK, all ready now
calc.mean(x = vec)
[1] 19

# the code inside the object calc.mean is reusable
calc.mean(x = c(3, 5, 53, 111))
[1] 43

# to show that calc.mean is just an object with some code in it,
# you can look inside, just like with any other object
calc.mean
function(x) {sum(x)/length(x)}
<environment: 0x000001dc84292348>

 

Let this be your mantra: “If I want to keep it for later, I need to put it in an object so that it doesn’t go off.”

 

You can’t really change an object

Unlike in the physical world, objects in R cannot truly change. The reason is that, sticking to our analogy, these objects are kind of like boxes. You can put stuff in, take stuff out and that’s pretty much it. However, unlike with boxes, when you take stuff out of an object, you only take out a copy of its contents. The original contents of the box remain intact. Of course you can do whatever you want (within limits) to the stuff once you’ve taken it out of the box but you are only modifying the copy. And unless you put that modified stuff into a box, R will forget about it as soon as it’s done with it. Now, as you probably know, you can call the boxes whatever you want (again, within certain limits). What might not have occurred to you though, is that you can call the new box the same as the old one. When that happens, R basically takes the label off the old box, pastes it on the new one and burns the old box. So even though some operations in R may look like they change objects, under the hood R copies their content, modifies it, stores the result in a different object, puts the same label on it, and discards the original object. Understanding this mechanism will make things much easier!

Putting the above into practice, this is how you “change” an R object:

# put 1 into an object (box) called a
a <- 1

# copy the content of a, add 1 to it and store it in an object b
b <- a + 1

# copy what's inside b and put it in a new object called a
# discarding the old object a
a <- b

# now see what's inside of a
# (by copying its content and pasting it in the console)
a
[1] 2

 

Of course, you can just cut out the middleman (object b). So to increment a by another 1, we can do:

a <- a + 1

a
[1] 3
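The same logic holds for bigger boxes. As a small sketch (the object names original and copy are made up for this illustration), copying a vector into a new object and then modifying the copy leaves the first object untouched:

```r
original <- c(1, 2, 3)

# put a copy of the contents of original into a new box
copy <- original

# "change" the first element of the copy
copy[1] <- 100

copy
# [1] 100   2   3

# the original box still holds what it did before
original
# [1] 1 2 3
```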

 

It’s elementary, my dear Watson

When it comes to data, every vector, matrix, list, data frame - in other words, every structure - is composed of elements. An element is a single number, boolean (TRUE/FALSE), or a character string (anything in “quotes”). Elements come in several classes:

  • "numeric", as the name suggests, a numeric element is a single number: 1, 2, -725, 3.14159265, etc. A numeric element is never in ‘single’ or “double” quotes. Numbers are cool because you can do a lot of maths (and stats!) with them.

     

  • "character", a string of characters, no matter how long. It can be a single letter, 'g', but it can equally well be a sentence, “Elen síla lumenn’ omentielvo.” (if you want the string to contain any single quotes, use double quotes to surround the string with and vice versa). Notice that character strings in R are always in ‘single’ or “double” quotes. Conversely anything in quotes is a character string:

    class(3)
    [1] "numeric"
    class("3") # in quotes, therefore character!
    [1] "character"

    It stands to reason that you can’t do any maths with character strings, not even if it’s a number that’s inside the quotes!

    "3" + "2"
    Error in "3" + "2": non-numeric argument to binary operator

     

  • "logical", a logical element can take one of two values, TRUE or FALSE. Logicals are usually the output of logical operations (anything that can be phrased as a yes/no question, e.g., is x equal to y?). In formal logic, TRUE is represented as 1 and FALSE as 0. This is also the case in R:

    # recall that c() is used to bind elements into a vector
    # (that's just a fancy term for an ordered group of elements)
    class(c(TRUE, FALSE))
    [1] "logical"
    # we can force ('coerce', in R jargon) the vector to be numeric
    as.numeric(c(TRUE, FALSE))
    [1] 1 0

    This has interesting implications. First, if you have a logical vector of many TRUEs and FALSEs, you can quickly count the number of TRUEs by just taking the sum of the vector:

    # consider vector of 50 logicals
    x
     [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
     [ reached getOption("max.print") -- omitted 40 entries ]
    # number of TRUEs
    sum(x)
    [1] 30
    # number of FALSEs is 50 minus number of TRUEs
    length(x) - sum(x)
    [1] 20

    Second, you can perform all sorts of arithmetic operations on logicals:

    # TRUE/FALSE can be shortened to T/F
    T + T
    [1] 2
    F - T
    [1] -1
    (T * T) + F
    [1] 1

    Third, you can coerce numeric elements to valid logicals:

    # zero is FALSE
    as.logical(0)
    [1] FALSE
    # everything else is TRUE
    as.logical(c(-1, 1, 12, -231.3525))
    [1] TRUE TRUE TRUE TRUE

    Now, you may wonder what use this can possibly be. Well, this way you can perform basic logical operations, such as AND, OR, and XOR (see section “Handy functions that return logicals” below):

    # x * y is equivalent to x AND y
    as.logical(T * T)
    [1] TRUE
    as.logical(T * F)
    [1] FALSE
    as.logical(F * T)
    [1] FALSE
    as.logical(F * F)
    [1] FALSE
    # x + y is equivalent to x OR y
    as.logical(T + T)
    [1] TRUE
    as.logical(T + F)
    [1] TRUE
    as.logical(F + T)
    [1] TRUE
    as.logical(F + F)
    [1] FALSE
    # x - y is equivalent to x XOR y (eXclusive OR, either-or)
    as.logical(T - T)
    [1] FALSE
    as.logical(T - F)
    [1] TRUE
    as.logical(F - T)
    [1] TRUE
    as.logical(F - F)
    [1] FALSE

     

  • "factor", factors are a bit weird. They are used mainly for telling R that a vector represents a categorical variable. For instance, you can be comparing two groups, treatment and control.

    # create a vector of 15 "control"s and 15 "treatment"s
    # rep stands for 'repeat', which is exactly what the function does
    x <- rep(c("control", "treatment"), each = 15)
    x
     [1] "control" "control" "control" "control" "control" "control" "control" "control" "control" "control"
     [ reached getOption("max.print") -- omitted 20 entries ]
    # turn x into a factor
    x <- as.factor(x)
    x
     [1] control control control control control control control control control control
     [ reached getOption("max.print") -- omitted 20 entries ]
    Levels: control treatment

    The first thing to notice is the line under the last printout that says “Levels: control treatment”. This informs you that x is now a factor with two levels (or, a categorical variable with two categories).

    The second thing you should take note of is that the words control and treatment don’t have quotes around them. This is another way R tells you that this is a factor.

    With factors, it is important to understand how they are represented in R. Despite what they look like, under the hood, they are numbers. A one-level factor is a vector of 1s, a two-level factor is a vector of 1s and 2s, an n-level factor is a vector of 1s, 2s, 3s … ns. The levels, in our case control and treatment, are just labels attached to the 1s and 2s. Let’s demonstrate this:

    typeof(x)
    [1] "integer"
    # integer is fancy for "whole number"
    
    # we can coerce factors to numeric, thus stripping the labels
    as.numeric(x)
     [1] 1 1 1 1 1 1 1 1 1 1
     [ reached getOption("max.print") -- omitted 20 entries ]
    # see the labels
    levels(x)
    [1] "control"   "treatment"

    The labels attached to the numbers in a factor can be whatever. Let’s say that in your raw data file, treatment group is coded as 1 and control group is coded as 0.

    # create a vector of 15 zeros and 15 ones
    x <- rep(0:1, each = 15)
    x
     [1] 0 0 0 0 0 0 0 0 0 0
     [ reached getOption("max.print") -- omitted 20 entries ]
    # turn x into a factor
    x <- as.factor(x)
    x
     [1] 0 0 0 0 0 0 0 0 0 0
     [ reached getOption("max.print") -- omitted 20 entries ]
    Levels: 0 1

    Since x is now a factor with levels 0 and 1, we know that it is stored in R as a vector of 1s and 2s and the zeros and ones, representing the groups, are only labels:

    as.numeric(x)
     [1] 1 1 1 1 1 1 1 1 1 1
     [ reached getOption("max.print") -- omitted 20 entries ]
    levels(x)
    [1] "0" "1"

    The fact that factors in R are represented as labelled integers has interesting implications some of you have already come across. First, certain functions will coerce factors into numeric vectors which can shake things up. This happened when you used cbind() on a factor with levels 0 and 1:

    x
     [1] 0 0 0 0 0 0 0 0 0 0
     [ reached getOption("max.print") -- omitted 20 entries ]
    Levels: 0 1
    # let's bind the first 15 elements and the last 15 elements together as columns
    cbind(x[1:15], x[16:30])
          [,1] [,2]
     [1,]    1    2
     [2,]    1    2
     [3,]    1    2
     [4,]    1    2
     [5,]    1    2
     [ reached getOption("max.print") -- omitted 10 rows ]
    # printout truncated to first 5 rows to save space

    cbind() binds the vectors you provide into the columns of a matrix. Since matrices (yep, that’s the plural of ‘matrix’; also, more on matrices later) can only contain logical, numeric, and character elements, the cbind() function coerces the elements of the x factor (haha, the X-factor) into numeric, stripping the labels and leaving only 1s and 2s.

    The other two consequences of this labelled numbers system stem from the way the labels are stored. Every R object can come with a list of so-called attributes attached to it. These are basically information about the object. For objects of class factor, the attributes include its levels (or the labels attached to the numbers) and class:

    attributes(x)
    $levels
    [1] "0" "1"
    
    $class
    [1] "factor"

    So the labels are stored separately from the actual elements. This means that even if you delete some of the numbers, the labels stay the same. Let’s demonstrate this implication on the plot() function. This function is smart enough to know that if you give it a factor it should plot it using a bar chart, and not a histogram or a scatter plot:

    plot(x)

    Now, let’s take the first 15 elements of x, which are all 0s and plot them:

    y <- x[1:15]
    y
     [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    Levels: 0 1
    plot(y)

    Even though our new object y only includes 0s, the levels attribute still tells R that this is a factor of (at least potentially) two levels, "0" and "1", and so plot() leaves room for the 1s.

    The last consequence is directly related to this. Since the levels of an object of class factor are stored as its attributes, any additional values put inside the objects will be invalid and turned into NAs (R will warn us of this). In other words, you can only add those values that are among the ones produced by levels() to an object of class factor:

    # try adding invalid values -4 and 3 to the end of vector x
    x[31:32] <- c(-4, 3)
    x
     [1] 0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    1    1    1    1    1    1   
    [23] 1    1    1    1    1    1    1    1    <NA> <NA>
    Levels: 0 1

    One way to add these values to a factor is to first coerce it to numeric, then add the values, and then turn it back into a factor:

    # coerce x to numeric
    x <- as.numeric(x[1:30])
    class(x)
    [1] "numeric"
    # but remember that 0s and 1s are now 1s and 2s!
    x
     [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    # so subtract 1 to make the values 0s and 1s again
    x <- x - 1
    # add the new values
    x <- c(x, -4, 3)
    # back into factor
    x <- as.factor(x)
    x
     [1] 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  -4 3 
    Levels: -4 0 1 3
    # SUCCESS!
    
    # reset
    x <- as.factor(rep(0:1, each = 15))
    # one-liner
    x <- as.factor(c(as.numeric(x[1:30]) - 1, -4, 3))
    x
     [1] 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  -4 3 
    Levels: -4 0 1 3

    Told you factors were weird…

     

  • "ordered", finally, these are the same as factors but, in addition to having levels, these levels are ordered and thus allow comparison (notice the Levels: 0 < 1 below):

    # create an ordered factor of 15 zeros and 15 ones
    x <- as.ordered(rep(0:1, each = 15))
    x
     [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    Levels: 0 < 1
    
    # we can now compare the levels
    x[1] < x[30]
    [1] TRUE
    # this is not the case with factors
    y <- as.factor(rep(0:1, each = 15))
    y[1] < y[30]
    [1] NA

    Objects of class ordered are useful for storing ordinal variables, e.g., age group.
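Coming back to factors for a moment: the subtract-one trick shown above only works because the labels happen to be 0 and 1. A more general sketch, using only base R functions, is to go through the labels themselves with as.character(), to extend the levels before adding new values, and to use droplevels() to get rid of labels that are no longer in use. The object names f and f_small are made up for this illustration.

```r
# a factor with labels 0 and 1, as in the example above
f <- as.factor(rep(0:1, each = 15))

# recover the labels as numbers: factor -> character -> numeric
as.numeric(as.character(f))[c(1, 30)]
# [1] 0 1

# allow two new values by extending the levels first...
levels(f) <- c(levels(f), "-4", "3")
# ...and only then add them (no NAs this time)
f[31:32] <- c("-4", "3")
levels(f)
# [1] "0"  "1"  "-4" "3"

# drop labels that no longer occur in the data
f_small <- droplevels(f[1:15])
levels(f_small)
# [1] "0"
```

Note that, with this approach, the new levels are appended at the end rather than sorted, so the order of the levels differs from the one produced by converting to numeric and back.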

     

In addition to these five sorts of elements, there are three special wee snowflakes:

  • NA, stands for “not available” and is used for missing data. Unlike other kinds of elements, it can be bound into a vector along with elements of any class.

  • NaN, stands for “not a number”. It is technically of class numeric but only occurs as the output of invalid mathematical operations, such as dividing zero by zero or taking a square root of a negative number:

    0 / 0
    [1] NaN
    sqrt(-12)
    [1] NaN
  • Inf (or -Inf), infinity. Reserved for division of a non-zero number by zero (no, it’s not technically right):

    235/0
    [1] Inf
    -85.123/0
    [1] -Inf
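Because these special values behave differently from ordinary elements, base R provides dedicated functions for spotting them. The sketch below uses is.na(), is.nan(), and is.infinite() on a made-up vector called vals. It also shows that comparing something to NA with == is not a useful check, because the result of the comparison is itself NA.

```r
# a made-up vector containing all three special values
vals <- c(1, NA, 0 / 0, 1 / 0)

# which elements are missing? (NaN counts as missing, too)
is.na(vals)
# [1] FALSE  TRUE  TRUE FALSE

# which elements are NaN?
is.nan(vals)
# [1] FALSE FALSE  TRUE FALSE

# which elements are infinite?
is.infinite(vals)
# [1] FALSE FALSE FALSE  TRUE

# asking whether something equals NA just returns NA
NA == NA
# [1] NA

# many functions can be told to skip missing values
mean(c(1, 3, NA), na.rm = TRUE)
# [1] 2
```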

 

Data structures

So that’s most of what you need to know about elements. Let’s talk about putting elements together. As mentioned above, elements can be grouped in various data structures. These differ in the ways in which they arrange elements:

  • vectors arrange elements in a line. They don’t have dimensions and can only contain elements of the same class (e.g., "numeric", "character", "logical").

    # a vector
    letters[5:15]
     [1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"

    If you try to force elements of different classes to a single vector, they will all be converted to the most complex class. The order of complexity, from least to most complex, is: logical, numeric, and character. Elements of class factor and ordered cannot be meaningfully bound in a vector with other classes (nor with each other): they either get converted to numeric, character - if you’re lucky - or to NA.

    # c(logical, numeric) results in numeric
    x <- c(T, F, 1:6)
    x
    [1] 1 0 1 2 3 4 5 6
    class(x)
    [1] "integer"
    # integer is like numeric but only for whole numbers to save computer memory
    
    # adding character results in character
    x <- c(x, "foo")