In this practice session, we will focus on using functions, creating datasets, and querying lists. Again, if you do not manage to complete all the tasks, don’t worry, there are quite a few.

## Data sets

Either open a new R Script file or continue the one from the previous practice session.

Type in/copy the following code in your script and run it.

set.seed(12345)
N <- 40
my_data <- data.frame(
  id = replicate(n = N,
                 expr = paste(sample(c(LETTERS, 0:9), 6, replace = T),
                              collapse = "")),
  age = round(rnorm(N, 45, 13)),
  group = rep(c("control", "experimental"), N/2),
  score = rpois(N, 5)
)

Look at the my_data object to see what’s inside.

Hint Simply type its name into the console and press ↵ Enter

Printing out datasets into the console is a little cumbersome, especially for larger datasets. Luckily, there are two handy functions we can use for a quick look at the data: head() to print out only the first few rows and View() to display the data in a new spreadsheet-like RStudio tab.

Use both of these functions to see what they do.

Hint Attention to detail is essential when it comes to coding. For instance, notice that the “V” in View() is upper case.

Solution

head(my_data)
      id age        group score
1 NPZ1X2  26      control     3
2 K5XBVK  37 experimental     3
3 3JQ53A  48      control     4
4 LTHLCI  59 experimental     5
5 NMTPP5  56      control     4
6 75Y9KI  46 experimental     6
View(my_data)
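If six rows are too many or too few, head() can be told how many rows to show through its n= argument. The sketch below uses a small made-up data frame (toy is a hypothetical name, not part of the practical) so that it runs on its own:

```r
# Hypothetical toy data frame, used only to illustrate head()
toy <- data.frame(
  number = 1:10,
  square = (1:10)^2
)

first_six   <- head(toy)         # default: the first 6 rows
first_three <- head(toy, n = 3)  # n= sets how many rows are returned

nrow(first_six)    # 6
nrow(first_three)  # 3
```

The same call works on my_data, for example head(my_data, n = 3).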

OK, now that we know what the data look like, let’s focus on the code that created them. Figuring out code by contrasting it with its output and systematically breaking it down is an invaluable skill for learning a programming language.

First of all, the set.seed() command is not necessary for generating data but it’s important for reproducibility. The code that creates the dataset uses random sampling. Now, because computers are deterministic, nothing they do is truly random. They can generate numbers that appear random but the process that generates them is deterministic. By setting the random seed, we can make sure that we will always get the same results for our “randomly” generated data.
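A quick way to convince yourself of this is to draw the same “random” numbers twice. This is only a minimal sketch; the object names are illustrative and the seed value is simply the one used above:

```r
set.seed(12345)
first_draw <- sample(1:100, 5)

set.seed(12345)                  # same seed again...
second_draw <- sample(1:100, 5)  # ...so we get the same five numbers

third_draw <- sample(1:100, 5)   # seed not reset: the sequence just carries on

identical(first_draw, second_draw)  # TRUE
first_draw
third_draw                          # compare this with first_draw
```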

Next, we create an object N that is a numeric vector of length = 1 with the number 40 as its only element. We will soon see what this was good for.

Let’s move on to the meat of the code. We know that the code gives us a data set (data frame) with four variables: id, age, group, and score. Naturally, this can be read from the code itself:

... data.frame(
id = ...,
age = ...,
group = ...,
score = ...
)

## Learning by deconstructing code

A great thing about code is its modularity. Here we are creating four variables or, in other words, four separate vectors. That means that there are (at least) four separate commands that we can look at independently.

Let’s start with id.

Type up and run only the code that creates the content of the id variable.

Hint That’s everything in between id = and age =, except for the final comma.

Solution

replicate(n = N, expr = paste(sample(c(LETTERS, 0:9), 6, replace = T), collapse=""))
[1] "EG85SC" "UIJWNY" "UGCO95" "0T727A" "HGV78I" "PJVGWA" "JD8VEY" "CAME6H" "I3TU5A" "LZZ1QW" "IYZHLD" "67XTF1" "J4LHL6" "837YA4"
[15] "FR3BGK" "PHNVC0" "H81EV8" "RTAEOS" "LKSGAV" "PTYE96" "OWPWQE" "PUY1LM" "KRS161" "DOUCR8" "Z26SZG" "163NHV" "BVZAYY" "8G2PSF"
[29] "P9CSLE" "8BLRXE" "4PUO80" "YUXMPH" "GZ36JX" "DDXTVS" "CSFZ92" "H5FLSF" "ECD44U" "WGDH3S" "DBASGG" "G45I97"

So now we know that this line somehow creates a vector of alphanumeric strings much like (but not exactly like, because “random”) our id variable. Let’s dig a bit deeper to find out what exactly the code is doing. As you can see, there are a few sets of nested brackets. Since brackets always belong to a function, what we have here is a command within a command within…

Identify, type, and run the innermost command.

Hint The one with the c().

Solution

c(LETTERS, 0:9)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "0" "1" "2" "3" "4"
[32] "5" "6" "7" "8" "9"

OK, you already know what this one does. It takes elements or vectors and combines them into a single vector. From the output, it is apparent that LETTERS is an object that contains the 26 upper-case letters of the English alphabet and 0:9 produces a vector containing the sequence of numbers from 0 to 9.
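You may have noticed that the digits in the output are wrapped in quotes. A vector can only hold one type of data, so when c() combines characters with numbers, the numbers are converted to characters. The sketch below checks the length and type of each piece; combined is just an illustrative name:

```r
combined <- c(LETTERS, 0:9)

length(LETTERS)   # 26
length(0:9)       # 10
length(combined)  # 36

class(0:9)        # "integer"
class(combined)   # "character": the digits were converted to text
```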

It’s good to build up coding intuition. Can you guess how to get all 26 lower-case letters of the English alphabet?

Hint If LETTERS gives you the upper-case characters…

Solution

letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

What command produces a sequence of integers from 24 to 12?

Solution

24:12
[1] 24 23 22 21 20 19 18 17 16 15 14 13 12

Now that you know what the most deeply nested command does, let’s move one level up (out?). Run the entire sample(...) command to see what it does.

Solution

sample(c(LETTERS, 0:9), 6, replace = T)
[1] "E" "Y" "7" "L" "O" "G"

Right, so this command sampled six elements from the vector we gave it (c(LETTERS, 0:9)). A good hypothesis would be that the 6 in the position of the second argument to the sample() function has something to do with the size of the sample.

Let’s test the hypothesis by changing the 6 to, let’s say, 15.

Solution

sample(c(LETTERS, 0:9), 15, replace = T)
[1] "3" "R" "M" "I" "H" "8" "W" "5" "6" "5" "Z" "2" "D" "Y" "1"

Guess the hypothesis was correct. Now, chances are that some of the characters in the vector you just generated appear more than once. This is what the replace= argument governs. The default value of the argument is FALSE so if it’s omitted, the function performs sampling without replacement. If we change it to TRUE (or just T), the sampling will be with replacement.
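The difference between the two kinds of sampling is easiest to see with a very small pool of elements. This is a sketch only; pool and the other object names are made up for the illustration:

```r
pool <- c("A", "B", "C")

# Without replacement (the default): each element can be drawn at most once
no_repeats <- sample(pool, size = 3)
anyDuplicated(no_repeats)  # 0 means there are no duplicates

# With replacement: the sample can be larger than the pool
with_repeats <- sample(pool, size = 10, replace = TRUE)
length(with_repeats)       # 10

# Without replacement we cannot draw more elements than the pool holds.
# Uncomment the next line to see the error message:
# sample(pool, size = 10)
```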

Don’t take our word for it though! Look at the documentation for the function by running the ?sample command.

As you can see under “Usage”, sample() takes four arguments. replace= and prob= both have pre-specified default values and so are not required. The first argument, x=, must always be supplied. No default is shown for size=, but if you leave it out, sample() returns all the elements of x in random order. Reading on, you can see under “Arguments” that the second argument (size=; the n= argument listed there belongs to sample.int(), not to sample()) does indeed govern the size of the sample and the third one decides whether or not sampling is done with replacement.

In our code, we didn’t use the names of the first two arguments (x= and size=). Instead, we relied on argument matching.

Remember, as long as you enter arguments in the correct order, you do not need to use their names. Conversely, if you provide the names, the order in which you enter the arguments does not matter!

Including the names can improve readability though so it’s a good idea to type out the names anyway, especially early in your R journey.
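To check that positional and named arguments really are interchangeable, we can run the same call three ways, resetting the seed each time so that the “random” results are comparable. The object names here are illustrative:

```r
set.seed(1)
by_position <- sample(c("a", "b", "c"), 5, TRUE)

set.seed(1)
by_name <- sample(x = c("a", "b", "c"), size = 5, replace = TRUE)

set.seed(1)
names_reordered <- sample(replace = TRUE, size = 5, x = c("a", "b", "c"))

identical(by_position, by_name)      # TRUE
identical(by_name, names_reordered)  # TRUE
```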

OK, let’s move one level up and expand our command.

Type out the paste(...) command to see what it does.

Solution

paste(sample(c(LETTERS, 0:9), 6, replace = T), collapse = "")
[1] "BJJWK9"

It appears that the command collapsed the vector of six alphanumeric characters into a single string.

An aside on paste()

The paste() function is quite powerful and can do several things. In its basic form, it pastes together its arguments into a single string, separating the individual arguments by a single space:

paste("paste()", "is", "cool")
[1] "paste() is cool"

We can modify the separator using the sep= argument:

paste("paste()", "is", "cool", sep = "-")
[1] "paste()-is-cool"
# empty string means no separator
paste("paste()", "is", "cool", sep = "")
[1] "paste()iscool"

You can also provide vectors as arguments to the function:

paste(c("I", "You", "We all"), "love R!")
[1] "I love R!"      "You love R!"    "We all love R!"
paste(c("I", "You", "We all"), c("love", "hate", "feel rather ambivalent about"), "R!")
[1] "I love R!"                              "You hate R!"                            "We all feel rather ambivalent about R!"

In our case, however, we only specified a single argument, because the whole sample(c(LETTERS, 0:9), 6, replace = T) only returns one thing (vector). Therefore, there is nothing to paste together:

paste(sample(c(LETTERS, 0:9), 6, replace = T))
[1] "X" "D" "7" "L" "T" "F"

This is where the collapse= argument comes into play. As its name suggests, it collapses the entire output of the paste() operation into a single string separated by the character given to the argument:

paste(1:5, collapse = "-")
[1] "1-2-3-4-5"
paste(c("I", "You", "We all"), "love R!", sep="*", collapse="///")
[1] "I*love R!///You*love R!///We all*love R!"

So, given our command, there is no pasting to be done but the sole argument gets collapsed into a single string.

Finally, let’s add the outermost command.

Re-run the entire command that created the content of the id variable.

Solution

replicate(n = N, expr = paste(sample(c(LETTERS, 0:9), 6, replace = T), collapse=""))
[1] "93AVFM" "YXY1DW" "8A7RO1" "ZV7KDF" "59KAFF" "K82QO8" "3EVC2M" "3PLQDV" "YRSLRZ" "QWME7Z" "T18G73" "EU0AP8" "4L8EKD" "VPJGDA"
[15] "857ZEQ" "NLY0P3" "1IZ0JN" "8E1FTC" "1URZUV" "U3IHPE" "682GX2" "T7IUSI" "U9PZE5" "BCS72E" "P3IXFJ" "ZL5L9G" "IZCYQT" "XKFWGF"
[29] "3Y5WSZ" "2U0E02" "S6KGF4" "FH929B" "8PSTMP" "QWWNT0" "JGOQNB" "PS493Y" "VUYVAU" "1CVCG6" "FR2ZS5" "CCHXY8"

It will likely be evident that replicate(), well… replicates the expression (command) in expr= a number (n=) of times. The value we provided to the n= argument is N which we created right after setting the random seed (it contains a single value: the number 40).
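The important detail is that the expression is evaluated afresh on every repetition, which is why each ID comes out different. A smaller sketch of the same idea, using a made-up example of rolling two dice and adding them up:

```r
# Roll two six-sided dice and sum them, five times over.
# The expression in expr= is re-run for each of the n= repetitions.
dice_sums <- replicate(n = 5, expr = sum(sample(1:6, 2, replace = TRUE)))

dice_sums
length(dice_sums)  # 5
```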

If you’re running out of time, skip the next few tasks and go straight to the Lists section. You can always come back and finish this part in your own time.

OK, let’s do one more variable, age.

Again, start with the innermost command. Type it up and run it to see what it does. Then pull up the documentation for the function and skim through it.

Hint Documentation for some function foo() can be accessed by running the ?foo command.

Solution

rnorm(N, 45, 13)
[1] 39.82603 46.35157 71.83970 44.14375 43.38146 43.85869 33.94375 24.13278 30.85007 26.01664 39.29448 55.63783 33.47973 35.18182
[15] 66.16844 60.56862 22.90030 52.73578 64.83490 38.00295 60.60787 71.01762 53.83161 54.16078 56.45309 31.99094 68.11052 57.22906
[29] 44.16226 48.11771 57.28223 59.89508 52.95656 50.21191 42.23236 38.23178 35.29185 37.22630 42.94053 60.56896

Hopefully, you will have understood that the command generates N Random observations from the NORMal distribution with a mean = 45 and SD = 13.

Since we want all the variables in our dataset to have the same number of observations, we can avoid having to specify 40 over and over again by using our object N. This has the added benefit of being able to change N to any other number, thereby changing the size of our dataset.
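As with sample(), the arguments of rnorm() can be named (n=, mean=, and sd=), which makes the call easier to read. The sketch below uses a separate, hypothetical object n_obs rather than N, so that running it does not overwrite the N used for our dataset. With many observations, the sample mean and SD should land close to the values we asked for:

```r
n_obs <- 10000  # hypothetical size, much larger than our N = 40

draws <- rnorm(n = n_obs, mean = 45, sd = 13)

length(draws)  # 10000
mean(draws)    # close to 45
sd(draws)      # close to 13
```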

Type up the complete command that created the age variable.

Solution

round(rnorm(N, 45, 13))
[1] 51 67 46 25 44 39 59 35 43 45 76 75 37 41 34 44 45 49 53 29 41 45 26 32 38 46 68 44 25 36 44 15 72 50 36 42 43 52 35 68

As you can see the numbers are now rounded to whole numbers.

Can you figure out how to round the numbers to 2 decimal places?

Hint You can always refer to the function’s documentation…

Solution

round(rnorm(N, 45, 13), 2)
[1] 36.44 51.21 66.00 20.93 37.56 48.96 51.98 57.16 62.66 63.27 50.26 63.79 42.78 62.64 24.93 53.49 38.07 38.32 31.20 59.30 58.20
[22] 52.41 42.27 42.75 14.47 68.48 65.37 38.92 39.07 26.76 41.34 49.36 38.26 31.87 56.75 49.73 55.37 60.20 20.49 48.44

At this stage you should have a much better understanding of the code that created the dataset:

my_data <- data.frame(
  id = replicate(n = N,
                 expr = paste(sample(c(LETTERS, 0:9), 6, replace = T),
                              collapse = "")),
  age = round(rnorm(N, 45, 13)),
  group = rep(c("control", "experimental"), N/2),
  score = rpois(N, 5)
)

You can figure out the code behind the group and score variables for homework.

## Lists

Finally, we can use the dataset we have to demonstrate how to query lists for results of statistical analysis.

Unless you are doing advanced programming or data processing, you will probably never need to create lists yourself. Vectors and data frames will get most of the jobs done. However, many of R’s functions that perform statistical tests return the results in the form of lists so it’s essential to know how to query them for the information you’re after.

The code below performs Welch’s t-test comparing our two groups ("control" and "experimental") on the score variable. We will explain the code in another session so don’t worry about it at this stage.

Run the following code:

t_test <- t.test(score ~ group, my_data)

Print out the t_test object to see what’s inside.

Solution

t_test

Welch Two Sample t-test

data:  score by group
t = -1.815, df = 33.969, p-value = 0.07836
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4376707  0.1376707
sample estimates:
mean in group control mean in group experimental
4.85                       6.00 

This printout gives you all the information you might need to interpret the results of the t-test. However, you might have noticed a strange thing: the reason why we’re doing this is to show you how to query lists but the output doesn’t really look like a list. That’s because t.test() returns an object of class "htest", and R prints objects of this class as a readable summary of the test rather than as a raw list.

However, the list is still stored in our t_test object. You can clearly see that in your Global Environment, where it says “List of 10” next to t_test.
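If you prefer to check this from the console rather than the Global Environment, is.list() and class() will tell you what kind of object you are dealing with. The sketch below runs a t-test on two made-up vectors (group_a, group_b, and toy_test are hypothetical names) so that it works on its own:

```r
# Hypothetical toy scores, used only so that the example is self-contained
group_a <- c(4, 6, 5, 7, 3, 5)
group_b <- c(6, 8, 5, 9, 7, 6)

toy_test <- t.test(group_a, group_b)

is.list(toy_test)  # TRUE: underneath, the result is a list
class(toy_test)    # "htest": this class is why it prints as a summary
```

Running the same two functions on t_test should give the same answers.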

Use the str() function on the object to see the structure of the list.

Solution

str(t_test)
List of 10
 $ statistic  : Named num -1.82
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 34
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.0784
 $ conf.int   : num [1:2] -2.438 0.138
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 4.85 6
  ..- attr(*, "names")= chr [1:2] "mean in group control" "mean in group experimental"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ stderr     : num 0.634
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "score by group"
 - attr(*, "class")= chr "htest"

Next to each $, you can see the name of one of the elements of the t_test list. By calling t_test$name_of_element you can access the information stored in that particular element.
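The same $ syntax works on any named list. Here is a minimal sketch with a hand-made list (toy_list and its element names are invented for the illustration), which also shows that you can index into an element once you have extracted it:

```r
# A small, made-up list with three named elements
toy_list <- list(
  estimate = 1.25,
  label    = "toy example",
  bounds   = c(-0.5, 3)
)

names(toy_list)         # "estimate" "label" "bounds"
toy_list$estimate       # 1.25
toy_list$bounds         # -0.5  3.0
toy_list$bounds[2]      # 3: the second value stored in the bounds element
toy_list[["estimate"]]  # 1.25: double square brackets do the same job as $
```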

Get only the p-value of our t-test.

Solution

t_test$p.value
[1] 0.07835775

That’s a bit too many decimal places. Round the value to 3 dp as per APA guidelines¹.

Hint Check out the round() function with its digits= argument.

Solution

round(t_test$p.value, digits = 3)
[1] 0.078
# name of the argument is optional
# round(t_test$p.value, 3) # works just as well

Get the confidence interval bounds out of t_test.

Solution

t_test$conf.int
[1] -2.4376707  0.1376707
attr(,"conf.level")
[1] 0.95

Final task. This one is a little bit of a thinker. Can you figure out code to produce the following output?

[1] "-2.44, 0.14"

Hint Both round() and paste() are involved.

Solution

paste(round(t_test$conf.int, 2), collapse = ", ")

## Reflect

Nicely done! Let’s stop to think what you achieved in this practice session.

• You had a go at deconstructing unfamiliar code in a systematic way and learning from it. This is perhaps the most important skill for working with R. Remember, every time you see some daunting looking piece of code, you can always break it down into more easily digestible bits in order to understand it.
• You practised using functions, specifying function arguments, and reading function documentation. Understanding function calls is crucial since in R everything that exists is an object and everything that happens is a function call. For a more detailed treatment of functions, please see this section of the “Getting into R” document.
• You learnt how to get results from a statistical test and how to do some basic formatting using round() and paste().

1. All hail APA 6th!