In this practice session, we will focus on using functions, creating datasets, and querying lists. Again, if you do not manage to complete all the tasks, don’t worry, there are quite a few.
Either open a new R Script file or continue the one from the previous practice session.
Type in/copy the following code in your script and run it.
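The code block this step refers to is not reproduced here. As a stand-in, the sketch below builds a data frame consistent with the walkthrough that follows. The value of N and the id and age lines are taken from this worksheet; the seed value and the group and score lines are assumptions made purely for illustration, so the values you get will not match the output printed later on.

```r
# ASSUMPTION: the seed value is arbitrary; any fixed number makes the result reproducible
set.seed(1234)
N <- 40

my_data <- data.frame(
  # six-character alphanumeric ID, generated N times
  id = replicate(n = N, expr = paste(sample(c(LETTERS, 0:9), 6, replace = T), collapse = "")),
  # ages drawn from a normal distribution (mean 45, SD 13), rounded to whole numbers
  age = round(rnorm(N, 45, 13)),
  # ASSUMPTION: two equally sized groups
  group = rep(c("control", "experimental"), each = N / 2),
  # ASSUMPTION: scores drawn from a normal distribution and rounded
  score = round(rnorm(N, 5, 2))
)
```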
Look at the my_data object to see what’s inside.
Printing out datasets into the console is a little cumbersome, especially for larger datasets. Luckily, there are two handy functions we can use for a quick look at the data:
head() to print out only the first few rows and
View() to display the data in a new spreadsheet-like RStudio tab.
Use both of these functions to see what they do.
Note that the V in View() is upper case.
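As a quick illustration on a toy data frame (toy_data and its columns are made up for this example), head() returns the first six rows by default and its n= argument changes that number:

```r
# Toy data frame, purely for illustration
toy_data <- data.frame(x = 1:20, y = letters[1:20])

head(toy_data)         # first 6 rows (the default)
head(toy_data, n = 3)  # first 3 rows only

# View() opens a spreadsheet-like viewer, so only call it in an interactive session
if (interactive()) View(toy_data)
```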
OK, now that we know what the data look like, let’s focus on the code that created them. Figuring out code by contrasting it with its output and systematically breaking it down is an invaluable skill for learning a programming language.
First of all, the set.seed() command is not necessary for generating data but it’s important for reproducibility. The code that creates the dataset uses random sampling. Now, because computers are deterministic, nothing they do is truly random. They can generate numbers that appear random but the process that generates them is deterministic. By setting the random seed, we can make sure that we will always get the same results for our “randomly” generated data.
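A minimal sketch of what the seed buys us (the seed value 42 and the object names are arbitrary choices for this example): resetting the same seed before sampling reproduces exactly the same draw.

```r
set.seed(42)                         # 42 is an arbitrary seed
first_draw <- sample(1:100, 5)

set.seed(42)                         # resetting the same seed...
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)   # TRUE: ...reproduces the same "random" numbers
```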
Next, we create an object N that is a numeric vector of length 1 with the number 40 as its only element. We will soon see what this is good for.
Let’s move on to the meat of the code. We know that the code gives us a data set (data frame) with four variables: id, age, group, and score. Naturally, this can be read from the code itself:
A great thing about code is its modularity. Here we are creating four variables or, in other words, four separate vectors. That means that there are (at least) four separate commands that we can look at independently.
Let’s start with id.
Type up and run only the code that creates the content of the id variable: everything after id = and before age =, except for the final comma.
replicate(n = N, expr = paste(sample(c(LETTERS, 0:9), 6, replace = T), collapse=""))
"EG85SC" "UIJWNY" "UGCO95" "0T727A" "HGV78I" "PJVGWA" "JD8VEY" "CAME6H" "I3TU5A" "LZZ1QW" "IYZHLD" "67XTF1" "J4LHL6" "837YA4"
"FR3BGK" "PHNVC0" "H81EV8" "RTAEOS" "LKSGAV" "PTYE96" "OWPWQE" "PUY1LM" "KRS161" "DOUCR8" "Z26SZG" "163NHV" "BVZAYY" "8G2PSF"
"P9CSLE" "8BLRXE" "4PUO80" "YUXMPH" "GZ36JX" "DDXTVS" "CSFZ92" "H5FLSF" "ECD44U" "WGDH3S" "DBASGG" "G45I97"
So now we know that this line somehow creates a vector of alphanumeric strings much like (but not exactly like, because “random”) our
id variable. Let’s dig a bit deeper to find out what exactly the code is doing. As you can see, there are a few sets of nested brackets. Since brackets always belong to a function, what we have here is a command within a command within…
Identify, type, and run the innermost command.
OK, you already know what this one does. It takes elements or vectors and combines them into a single vector. From the output, it is apparent that LETTERS is an object that contains the 26 upper-case letters of the English alphabet and 0:9 produces a vector containing the sequence of numbers from 0 to 9.
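One detail worth checking for yourself: a vector can only hold one type of data, so c() converts the digits to character strings when it combines them with letters. A short sketch (the object name chars is just for this example):

```r
chars <- c(LETTERS, 0:9)

length(chars)   # 36: 26 letters plus 10 digits
class(chars)    # "character": the digits were converted to strings
tail(chars, 3)  # "7" "8" "9"
```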
It’s good to build up coding intuition. Can you guess how to get all 26 lower-case letters of the English alphabet?
LETTERS gives you the upper-case characters…
What command produces a sequence of integers from 24 to 12?
Now that you know what the most deeply nested command does, let’s move one level up (out?). Run the entire
sample(...) command to see what it does.
Right, so this command sampled six elements from the vector we gave it (
c(LETTERS, 0:9)). A good hypothesis would be that the
6 in the position of the second argument to the
sample() function has something to do with the size of the sample.
Let’s test the hypothesis by changing the
6 to, let’s say,
Guess the hypothesis was correct. Now, chances are that some of the characters in the vector you just generated appear more than once. This is what the replace= argument governs. The default value of the argument is FALSE so if it’s omitted, the function performs sampling without replacement. If we change it to TRUE (or just T), the sampling will be with replacement.
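Both claims are easy to verify. In the sketch below, the seed, the sample sizes (10, 36 and 50) and the object names are arbitrary choices for illustration:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
pool <- c(LETTERS, 0:9)

length(sample(pool, 10))        # 10: the second argument sets the sample size

# Without replacement (the default), no element can be drawn twice
no_repl <- sample(pool, 36)
any(duplicated(no_repl))        # FALSE

# With replacement, 50 draws from 36 elements are bound to contain repeats
with_repl <- sample(pool, 50, replace = TRUE)
any(duplicated(with_repl))      # TRUE

# Asking for 50 elements without replacement is an error
too_many <- try(sample(pool, 50), silent = TRUE)
class(too_many)                 # "try-error"
```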
Don’t take our word for it though! Look at the documentation for the function by running the ?sample command.
As you can see under “Usage”, the function takes four arguments. replace= and prob= both have pre-specified default values and so are not required. The first argument, however, is essential: without it, the function has nothing to sample from. If size= is left out, sample() falls back on the length of the first argument and returns all of its elements in random order. Reading on, you can see under “Arguments” that the second argument, size= (sample() does not use the n= argument at all), does indeed govern the size of the sample and the third one decides whether or not sampling is done with replacement.
In our code, we didn’t use the names of the first two arguments (x= and size=). Instead, we relied on argument matching.
Remember, as long as you enter arguments in the correct order, you do not need to use their names. Conversely, if you provide the names, the order in which you enter the arguments does not matter! Including the names can improve readability though, so it’s a good idea to type out the names anyway, especially early in your
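To see argument matching at work, the sketch below makes the same call three ways: by position, by name, and by name in a shuffled order. The seed is reset before each call so that the results are comparable; the seed value and object names are arbitrary.

```r
pool <- c(LETTERS, 0:9)

set.seed(7); by_position <- sample(pool, 6, TRUE)
set.seed(7); by_name     <- sample(x = pool, size = 6, replace = TRUE)
set.seed(7); reordered   <- sample(replace = TRUE, size = 6, x = pool)

identical(by_position, by_name)   # TRUE
identical(by_name, reordered)     # TRUE
```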
OK, let’s move one level up and expand our command.
Type out the paste(...) command to see what it does.
It appears that the command collapsed the vector of six alphanumeric characters into a single string.
The paste() function is quite powerful and can do several things. In its basic form, it pastes together its arguments into a single string, separating the individual arguments with a space:
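For instance, with three made-up strings (the object name basic is just for this example):

```r
basic <- paste("Practice", "makes", "perfect")
basic
# [1] "Practice makes perfect"
```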
We can modify the separator using the sep= argument:
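A short sketch of the sep= argument of paste(), again with made-up strings and object names:

```r
with_dash <- paste("Practice", "makes", "perfect", sep = "-")
with_dash
# [1] "Practice-makes-perfect"

no_gap <- paste("a", "b", "c", sep = "")   # an empty separator glues the strings together
no_gap
# [1] "abc"
```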
You can also provide vectors as arguments to the function:
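With vector arguments, paste() works element by element and recycles shorter arguments as needed. An illustrative sketch with made-up values:

```r
pairs <- paste(c("a", "b", "c"), 1:3)
pairs
# [1] "a 1" "b 2" "c 3"

items <- paste("item", 1:3, sep = "_")   # the single string "item" is recycled
items
# [1] "item_1" "item_2" "item_3"
```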
In our case, however, we only specified a single argument, because the whole sample(c(LETTERS, 0:9), 6, replace = T) command only returns one thing (a vector). Therefore, there is nothing to paste together:
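To illustrate, the sketch below uses a fixed vector of six characters as a stand-in for one result of the sample() command (six_chars is a made-up name):

```r
six_chars <- c("E", "G", "8", "5", "S", "C")   # stand-in for one sample() result

unchanged <- paste(six_chars)
unchanged
# [1] "E" "G" "8" "5" "S" "C"
length(unchanged)   # 6: still six separate strings
```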
This is where the collapse= argument comes into play. As its name suggests, it collapses the entire output of the paste() operation into a single string separated by the character given to the argument:
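A sketch of collapse= in action, using a fixed vector of six characters as a stand-in for one sample() result (all object names are made up for this example):

```r
six_chars <- c("E", "G", "8", "5", "S", "C")

one_string <- paste(six_chars, collapse = "")
one_string
# [1] "EG85SC"

dashed <- paste(six_chars, collapse = "-")
dashed
# [1] "E-G-8-5-S-C"

# sep= joins the arguments element by element; collapse= then joins the results
both <- paste(c("a", "b"), 1:2, sep = "_", collapse = " + ")
both
# [1] "a_1 + b_2"
```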
So, given our command, there is no pasting to be done but the sole argument gets collapsed into a single string.
Finally, let’s add the outermost command.
Re-run the entire command that created the content of the id variable.
replicate(n = N, expr = paste(sample(c(LETTERS, 0:9), 6, replace = T), collapse=""))
"93AVFM" "YXY1DW" "8A7RO1" "ZV7KDF" "59KAFF" "K82QO8" "3EVC2M" "3PLQDV" "YRSLRZ" "QWME7Z" "T18G73" "EU0AP8" "4L8EKD" "VPJGDA"
"857ZEQ" "NLY0P3" "1IZ0JN" "8E1FTC" "1URZUV" "U3IHPE" "682GX2" "T7IUSI" "U9PZE5" "BCS72E" "P3IXFJ" "ZL5L9G" "IZCYQT" "XKFWGF"
"3Y5WSZ" "2U0E02" "S6KGF4" "FH929B" "8PSTMP" "QWWNT0" "JGOQNB" "PS493Y" "VUYVAU" "1CVCG6" "FR2ZS5" "CCHXY8"
It will likely be evident that replicate(), well… replicates the expression (command) in expr= a number (n=) of times. The value we provided to the n= argument is N, which we created right after setting the random seed (it contains a single value: the number 40).
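The important point is that replicate() re-evaluates the expression every time, so each repetition draws a fresh sample. Contrasting it with rep(), which evaluates its argument once and repeats the result, makes this visible. This is a sketch with an arbitrary seed, a smaller n of 5 and made-up object names:

```r
set.seed(3)  # arbitrary seed

# rep() evaluates the expression once and repeats that one value
one_id_repeated <- rep(paste(sample(c(LETTERS, 0:9), 6, replace = TRUE), collapse = ""), times = 5)
length(unique(one_id_repeated))   # 1: the same ID five times

# replicate() evaluates the expression anew on each of the n repetitions
ids <- replicate(n = 5, expr = paste(sample(c(LETTERS, 0:9), 6, replace = TRUE), collapse = ""))
length(ids)   # 5
nchar(ids)    # 6 6 6 6 6
```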
If you’re running out of time, skip the next few tasks and go straight to the Lists section. You can always come back and finish this part in your own time.
OK, let’s do one more variable, age.
Again, start with the innermost command. Type it up and run it to see what it does. Then pull up the documentation for the function and skim through it.
Documentation for a function foo() can be accessed by running the ?foo command.
rnorm(N, 45, 13)
39.82603 46.35157 71.83970 44.14375 43.38146 43.85869 33.94375 24.13278 30.85007 26.01664 39.29448 55.63783 33.47973 35.18182
66.16844 60.56862 22.90030 52.73578 64.83490 38.00295 60.60787 71.01762 53.83161 54.16078 56.45309 31.99094 68.11052 57.22906
44.16226 48.11771 57.28223 59.89508 52.95656 50.21191 42.23236 38.23178 35.29185 37.22630 42.94053 60.56896
Hopefully, you will have understood that the command generates N Random observations from the NORMal distribution with a mean = 45 and SD = 13.
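One way to convince yourself of what the second and third arguments do is to draw a much larger sample and check its mean and standard deviation. In this sketch, the seed, the sample size of 10,000 and the object name are arbitrary choices:

```r
set.seed(10)  # arbitrary seed
big_sample <- rnorm(n = 10000, mean = 45, sd = 13)

length(big_sample)  # 10000
mean(big_sample)    # close to 45
sd(big_sample)      # close to 13
```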
Since we want all the variables in our dataset to have the same number of observations, we can avoid having to specify 40 over and over again by using our object N. This has the added benefit of being able to change N to any other number, thereby changing the size of our dataset.
Type up the complete command that created the age variable.
As you can see, the numbers are now rounded to whole numbers.
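For illustration, here is round() applied to the first three values from the rnorm() output shown earlier (the object names are made up):

```r
ages <- c(39.82603, 46.35157, 71.83970)

whole_ages <- round(ages)   # with no further arguments, round() returns whole numbers
whole_ages
# [1] 40 46 72
```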
Can you figure out how to round the numbers to 2 decimal places?
At this stage you should have a much better understanding of the code that created the dataset:
You can figure out the code behind the group and score variables for homework.
Finally, we can use the dataset we have to demonstrate how to query lists for results of statistical analysis.
Unless you are doing advanced programming or data processing, you will probably never need to create lists yourself. Vectors and data frames will get most of the jobs done. However, many of R’s functions that perform statistical tests return the results in the form of lists so it’s essential to know how to query them for the information you’re after.
The code below performs Welch’s t-test comparing our two groups ("Control" and "Experimental") on the score variable. We will explain the code in another session so don’t worry about it at this stage.
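The code block this step refers to is not reproduced here. The sketch below shows what such a call might look like on made-up data; the formula score ~ group is inferred from the “score by group” label in the str() output further down, and toy_data, toy_test and the seed are hypothetical. With the worksheet’s dataset, the call would presumably use data = my_data and store the result in t_test.

```r
set.seed(99)  # arbitrary seed

# Made-up data with the same two columns the test needs
toy_data <- data.frame(
  group = rep(c("control", "experimental"), each = 20),
  score = round(rnorm(40, 5, 2))
)

# t.test() performs Welch's test by default (var.equal = FALSE)
toy_test <- t.test(score ~ group, data = toy_data)
toy_test
```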
Print out the
t_test object to see what’s inside.
This print output gives you all the information you might need to interpret the results of the t-test. However, you might have noticed a strange thing: the reason why we’re doing this is to show you how to query lists but the output doesn’t really look like a list. That’s because t.test() returns an object of class “htest”, and R prints objects of this class as a readable summary of the test rather than as a raw list.
However, the list is still stored in our t_test object. You can clearly see that in your Global Environment, where it says “List of 10” next to t_test.
Use the str() function on the object to see the structure of the list.
str(t_test)
List of 10
 $ statistic  : Named num -1.82
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 34
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.0784
 $ conf.int   : num [1:2] -2.438 0.138
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 4.85 6
  ..- attr(*, "names")= chr [1:2] "mean in group control" "mean in group experimental"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ stderr     : num 0.634
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "score by group"
 - attr(*, "class")= chr "htest"
Next to each $, you can see the name of one of the elements of the t_test list. By calling t_test$name_of_element you can access the information stored in that particular element.
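As an illustration of the $ syntax that does not give away the tasks below, the sketch here runs a t-test on R’s built-in sleep dataset (not the worksheet’s data; sleep_test is a made-up name) and pulls out two of its elements:

```r
# Welch's t-test on the built-in sleep data, used here only as an example
sleep_test <- t.test(extra ~ group, data = sleep)

names(sleep_test)            # names of all the elements in the list
sleep_test$method            # "Welch Two Sample t-test"
sleep_test$estimate          # the two group means

# Double square brackets with a quoted name are an equivalent way in
sleep_test[["alternative"]]  # "two.sided"
```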
That’s a bit too many decimal places. Round the value to 3 dp as per APA guidelines [1].
Use the round() function with its digits= argument.
Get the confidence interval bounds out of t_test.
Nicely done! Let’s stop to think about what you achieved in this practice session.
R. Remember, every time you see some daunting looking piece of code, you can always break it down into more easily digestible bits in order to understand it.
In R, everything that exists is an object and everything that happens is a function call. For a more detailed treatment of functions, please see this section of the “Getting into
[1] All hail APA 6th!