In this session we will carry on working with the biling_data dataset and practise data visualisation. To avoid having to redo all the data cleaning (or copy-paste the code), continue working from the script you created in the previous session.
While the base R plot() function is powerful enough to create essentially any plot you might want, we will focus on using the ggplot2 package. However, we’ll start with a few quick and simple base R plots.
R
Use the hist() function to plot a histogram of the age variable.
Use the plot() function to create a scatterplot of wais and BILING.
As you can see, plot() tries to guess what kind of plot you want to create given the classes of the variables you pass to the function.
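For reference, the two tasks above can be sketched as follows. This is only an illustration: toy_data is a made-up data frame that borrows the column names age, wais and BILING, so swap in biling_data when working through the session.

```r
# toy_data is a made-up stand-in for biling_data (assumption, not the real data)
set.seed(1)
toy_data <- data.frame(
  age    = round(rnorm(50, mean = 30, sd = 8)),
  wais   = round(rnorm(50, mean = 100, sd = 15)),
  BILING = round(rnorm(50, mean = 60, sd = 10))
)

# one numeric variable: hist() draws a histogram (and returns it invisibly)
h <- hist(toy_data$age)

# two numeric variables: plot() draws a scatterplot
plot(toy_data$wais, toy_data$BILING)
```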
Use plot() again, this time with the gender variable only.
This doesn’t look great, does it? The reason is that gender has not been converted into a factor.
Convert gender into a factor and plot it again. You can turn gender into a factor permanently by reassigning the variable, or only for the purpose of the plot inside of the plot() function.
Solution
# only converts gender to factor for the purpose of the plot
# plot(factor(biling_data$gender, labels = c("Male", "Female", "Other")))
# permanent change
biling_data$gender <- factor(biling_data$gender, labels = c("Male", "Female", "Other"))
plot(biling_data$gender)
Try plotting yearsFR by gender.
You can still use plot(); it’s really rather flexible. Make sure gender is a factor though!
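One possible way to do this is sketched below on made-up data (toy_data and its values are hypothetical; only the column names gender and yearsFR come from the session). With a factor on the x axis and a numeric variable on the y axis, plot() draws one boxplot per level.

```r
# toy_data is a made-up stand-in for biling_data (assumption, not the real data)
set.seed(2)
toy_data <- data.frame(
  gender  = rep(1:3, times = 20),
  yearsFR = round(runif(60, min = 0, max = 20))
)

# convert the numeric codes into a labelled factor
toy_data$gender <- factor(toy_data$gender, levels = 1:3,
                          labels = c("Male", "Female", "Other"))

# factor x, numeric y: plot() draws boxplots
plot(toy_data$gender, toy_data$yearsFR)

# the same plot using the formula interface
plot(yearsFR ~ gender, data = toy_data)
```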
Finally, let’s see what plot we get when we use two factors. Plot DALF_PASS (as factor!) against gender.
Solution
biling_data$DALF_PASS <- factor(biling_data$DALF_PASS, labels = c("Fail", "Pass"))
plot(biling_data$gender, biling_data$DALF_PASS)
That’s a nice mosaic plot, isn’t it!
There are numerous options and arguments to the plot() function you can use to modify the aesthetics of the plot. Instead of dealing with those, let’s move on to ggplot() and leave base R plots with an example plot. It is not very pretty but it demonstrates some of the capabilities of base R graphics:
plot(biling_data$wais, biling_data$BILING,
xlab = "IQ", ylab = "Bilingualism score", # axis labels
main = "Relationship between intelligence and bilingualism", # plot title
type = "n") # don't plot any points
points(biling_data$wais[biling_data$gender == "Male"], # plot points for gender == "Male"
biling_data$BILING[biling_data$gender == "Male"],
col = "#fac21888", # hex code for colour: #RRGGBBAA - red, green, blue, alpha (opacity)
pch = 17) # "point character" governs the shape of the point
points(biling_data$wais[biling_data$gender == "Female"], # plot points for gender == "Female"
biling_data$BILING[biling_data$gender == "Female"],
col = "#0d5f8a88",
pch = 18)
points(biling_data$wais[biling_data$gender == "Other"], # plot points for gender == "Other"
biling_data$BILING[biling_data$gender == "Other"],
col = "#660a6088",
pch = 19)
abline(h = mean(biling_data$BILING), # h= y intercept of horizontal line
lty = 5) # "line type"
abline(v = mean(biling_data$wais), lty = 5) # v= x intercept of vertical line
abline(lm(BILING ~ wais, biling_data), # abline can take a lm object to draw regression line
col = "orangered", # there are many colour names that R understands
lwd = 2) # "line width"
# add legend
legend(x = 125, y = 100,
c("Male", "Female", "Other"), # legend labels
col = c("#fac21888", "#0d5f8a88", "#660a6088"), # colours of points in legend
pch = 17:19, # shapes of points in legend
bty = "n") # "box type" n for no frame around legend
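The #RRGGBBAA colour codes used in the example can be unpacked with base R. The short sketch below takes one of the codes from the plot and shows how its four channels relate to the 0 to 255 scale.

```r
# split the 8-digit hex colour into red, green, blue and alpha (0 to 255 each)
channels <- col2rgb("#fac21888", alpha = TRUE)
channels[, 1]

# alpha as a proportion: 0 = fully transparent, 1 = fully opaque
opacity <- channels["alpha", 1] / 255

# build the same colour back from its four channels
rebuilt <- rgb(250, 194, 24, alpha = 136, maxColorValue = 255)
```

The alpha value 88 in hexadecimal is 136 out of 255, so these points are roughly half transparent.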
Let’s (mostly) re-create the plot above with ggplot() now.
First of all, create the plotting space mapping the right variables onto the x and y axes.
Use the aes() function.
Add the scatter layer.
Use geom_point().
Make the colour and shape of the points dependent on levels of gender.
Aesthetics can be mapped inside geom_point() (or any other layer).
Solution
biling_data %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender))
Let’s get rid of the NA points by filtering the data so that they don’t include NAs in the gender variable before piping it into ggplot().
is.na(x) returns TRUE if x is NA. To negate an expression, you can put a ! in front of it.
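A quick illustration of is.na() and its negation on a made-up vector (the values are hypothetical):

```r
# a small made-up vector with two missing values
gender <- c("Male", NA, "Female", "Other", NA)

missing     <- is.na(gender)   # TRUE where the value is NA
not_missing <- !is.na(gender)  # ! flips TRUE and FALSE

# keep only the non-missing values
kept <- gender[not_missing]
```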
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender))
Add a little transparency to the points using the alpha= argument (1 = fully opaque; 0 = fully transparent) and make the points slightly bigger.
You are setting fixed values for the alpha= and size= arguments, not mapping them to a variable, so don’t use aes().
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2)
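The difference between mapping an aesthetic to a variable and setting it to a fixed value can be sketched on a small made-up data frame (toy_data, x, y and group are hypothetical names used only for this illustration):

```r
library(ggplot2)

# made-up data for illustration only
toy_data <- data.frame(
  x     = 1:6,
  y     = c(2, 4, 3, 6, 5, 7),
  group = factor(rep(c("a", "b"), times = 3))
)

# mapped: colour depends on a variable, so it goes inside aes() and gets a legend
p_mapped <- ggplot(toy_data, aes(x = x, y = y)) +
  geom_point(aes(colour = group))

# set: every point gets the same fixed value, so it goes outside aes()
p_set <- ggplot(toy_data, aes(x = x, y = y)) +
  geom_point(colour = "steelblue", alpha = .7, size = 2)
```

Putting a fixed value such as "steelblue" inside aes() would instead treat it as a one-level variable and add a legend for it, which is rarely what you want.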
Add the horizontal dashed line with a y intercept at the mean of BILING using geom_hline(). Line geoms in ggplot take the same lty= argument as base R lines.
Because the value of yintercept= is calculated from a variable in the data, it needs to be done within aes().
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5)
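One thing to keep in mind, in case a variable contains missing values: mean() returns NA unless you ask it to drop them, and a line at NA cannot be drawn. The sketch below uses made-up data (toy_data is hypothetical, and whether biling_data has missing scores is not something this example assumes) and also shows an alternative in which the mean is calculated beforehand and passed as a fixed value outside aes().

```r
library(ggplot2)

# made-up scores with one missing value (assumption for illustration)
toy_data <- data.frame(wais   = c(90, 100, 110, 120),
                       BILING = c(55, 60, NA, 70))

mean_default <- mean(toy_data$BILING)               # NA because of the missing value
mean_clean   <- mean(toy_data$BILING, na.rm = TRUE) # mean of the observed values

# the mean is a single pre-computed number here, so yintercept is set outside aes()
p <- ggplot(toy_data, aes(x = wais, y = BILING)) +
  geom_point() +
  geom_hline(yintercept = mean_clean, lty = 5)
```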
Now add the vertical dashed line with an x intercept at the mean of wais.
geom_hline() makes a horizontal line; what geom do you think makes a vertical line?
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5)
One more element to add. Use geom_smooth() to add a trend line. You don’t need to specify any arguments.
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_smooth()
OK, this is a LOESS line with a confidence interval ribbon. Check the function documentation (or the ggplot2 reference website) to find out how to change it to a linear regression line with no ribbon around it.
While you’re at it, you may as well change the colour of the line.
You will need the method= and se= arguments.
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_smooth(method = "lm", se = F, colour = "orangered")
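To see that geom_smooth(method = "lm") draws the same line as the lm() fit used with abline() in the base R example, the values behind the layer can be compared with predictions from lm(). This is a sketch on simulated data (toy_data, x and y are hypothetical); the formula is spelled out explicitly as y ~ x, which is what the layer uses for a straight line.

```r
library(ggplot2)

# simulated data for illustration only
set.seed(3)
toy_data <- data.frame(x = 1:30)
toy_data$y <- 2 + 0.5 * toy_data$x + rnorm(30)

p <- ggplot(toy_data, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "orangered")

# fit the same model by hand
fit <- lm(y ~ x, data = toy_data)

# pull out the points ggplot2 computed for the smooth layer
line_data <- layer_data(p, 1)

# predictions from lm() at the same x positions
from_lm  <- predict(fit, newdata = data.frame(x = line_data$x))
max_diff <- max(abs(line_data$y - from_lm))
```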
Use the labs() layer to change the axis labels and add a title.
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_smooth(method = "lm", se = F, colour = "orangered") +
labs(x = "IQ", y = "Bilingualism score",
title = "Relationship between intelligence and bilingualism")
That’s the plot basically finished. All we need to do now is to change the appearance.
The scale_colour_manual() layer can be used to customise anything to do with the colour aesthetic, including the legend. Use it to get rid of the legend name and change the colours for the gender categories to "#fac218", "#0d5f8a", and "#660a60" for males, females and “other”, respectively.
Hint
You only need the name= and values= arguments.
To get rid of the legend name, set it to "".
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_smooth(method = "lm", se = F, colour = "orangered") +
labs(x = "IQ", y = "Bilingualism score",
title = "Relationship between intelligence and bilingualism") +
scale_colour_manual(name = "", values = c("#fac218", "#0d5f8a", "#660a60"))
The plot now has two legends. The reason for this is that, by changing the name= argument, the legend for colour is now different from the one for shape.
Set scale_shape_manual() in the same fashion as above. Use values 17, 18, and 19.
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_smooth(method = "lm", se = F, colour = "orangered") +
labs(x = "IQ", y = "Bilingualism score",
title = "Relationship between intelligence and bilingualism") +
scale_colour_manual(name = "", values = c("#fac218", "#0d5f8a", "#660a60")) +
scale_shape_manual(name = "", values = 17:19)
As you can see, the legends have now merged into one.
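The solutions above rely on the order of the factor levels to pair each colour and shape with a gender category. As an optional variation, the manual scales also accept named vectors, in which case values are matched to levels by name. The sketch below uses made-up data (toy_data is hypothetical, as are the helper names gender_colours and gender_shapes).

```r
library(ggplot2)

# made-up data for illustration only
toy_data <- data.frame(
  wais   = c(95, 105, 115, 100, 110, 120),
  BILING = c(50, 60, 70, 55, 65, 75),
  gender = factor(rep(c("Male", "Female", "Other"), times = 2),
                  levels = c("Male", "Female", "Other"))
)

# named vectors: each value is tied to a level by name rather than by position
gender_colours <- c(Male = "#fac218", Female = "#0d5f8a", Other = "#660a60")
gender_shapes  <- c(Male = 17, Female = 18, Other = 19)

p <- ggplot(toy_data, aes(x = wais, y = BILING)) +
  geom_point(aes(colour = gender, shape = gender), alpha = .7, size = 2) +
  scale_colour_manual(name = "", values = gender_colours) +
  scale_shape_manual(name = "", values = gender_shapes)
```

Because both scales share the same (empty) name and the same labels, they are still combined into a single legend.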
Let’s give the plot a little more of a classic look by adding a theme layer.
Add the theme_classic() layer.
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_smooth(method = "lm", se = F, colour = "orangered") +
labs(x = "IQ", y = "Bilingualism score",
title = "Relationship between intelligence and bilingualism") +
scale_colour_manual(name = "", values = c("#fac218", "#0d5f8a", "#660a60")) +
scale_shape_manual(name = "", values = 17:19) +
theme_classic()
Looking at the plot, maybe it would be a little better if the dashed lines were behind the points rather than on top of them.
Hide the dashed lines behind the scatter.
Solution
biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_smooth(method = "lm", se = F, colour = "orangered") +
labs(x = "IQ", y = "Bilingualism score",
title = "Relationship between intelligence and bilingualism") +
scale_colour_manual(name = "", values = c("#fac218", "#0d5f8a", "#660a60")) +
scale_shape_manual(name = "", values = 17:19) +
theme_classic()
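The reason moving the lines up in the code works is that ggplot2 draws layers in the order in which they are added, so earlier layers end up underneath later ones. A small sketch on made-up data (toy_data is hypothetical) that inspects the layer order of a plot object:

```r
library(ggplot2)

# made-up data for illustration only
toy_data <- data.frame(x = 1:3, y = c(1, 2, 3))

# the dashed line is added first, so it is drawn underneath the points
p <- ggplot(toy_data, aes(x = x, y = y)) +
  geom_hline(yintercept = 2, lty = 5) +
  geom_point(size = 4)

# list the geoms in drawing order (first = bottom, last = top)
layer_order <- unname(vapply(p$layers,
                             function(layer) class(layer$geom)[1],
                             character(1)))
layer_order
```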
A task for the brave! Can you figure out how to change the position of the legend to make the plot look like this?
Save your lovely plot using ggsave().
Solution
# by default ggsave() saves the last plot
ggsave(filename = "my_rad_scatterplot.png")
# you can assign a plot to an object and save it using the object's name
p <- biling_data %>%
dplyr::filter(!is.na(gender)) %>%
ggplot(aes(x = wais, y = BILING)) +
geom_hline(aes(yintercept = mean(BILING)), lty = 5) +
geom_vline(aes(xintercept = mean(wais)), lty = 5) +
geom_point(aes(colour = gender, shape = gender),
alpha = .7, size = 2) +
geom_smooth(method = "lm", se = F, colour = "orangered") +
labs(x = "IQ", y = "Bilingualism score",
title = "Relationship between intelligence and bilingualism") +
scale_colour_manual(name = "", values = c("#fac218", "#0d5f8a", "#660a60")) +
scale_shape_manual(name = "", values = 17:19) +
theme_classic() +
theme(legend.position = c(.9, .9))
ggsave(filename = "my_rad_scatterplot.png", plot = p)
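Two optional refinements are sketched below on made-up data (toy_data and the file name are hypothetical). First, ggsave() accepts width=, height=, units= and dpi= arguments, so the saved image does not depend on the size of your plotting window. Second, from ggplot2 3.5.0 onwards, passing numeric coordinates to legend.position is deprecated in favour of legend.position = "inside" together with legend.position.inside, so the sketch picks the syntax that matches the installed version. The file is written to a temporary folder as a PDF here; ggsave() chooses the format from the file extension, so ".png" works in the same way.

```r
library(ggplot2)

# made-up data for illustration only
toy_data <- data.frame(
  wais   = c(95, 105, 115, 100, 110, 120),
  BILING = c(50, 60, 70, 55, 65, 75),
  gender = factor(rep(c("Male", "Female", "Other"), times = 2))
)

# legend placement inside the panel: the syntax depends on the ggplot2 version
legend_inside <- if (utils::packageVersion("ggplot2") >= "3.5.0") {
  theme(legend.position = "inside", legend.position.inside = c(.9, .9))
} else {
  theme(legend.position = c(.9, .9))
}

p <- ggplot(toy_data, aes(x = wais, y = BILING)) +
  geom_point(aes(colour = gender)) +
  theme_classic() +
  legend_inside

# save with explicit dimensions so the result is reproducible
out_file <- file.path(tempdir(), "toy_scatterplot.pdf")
ggsave(filename = out_file, plot = p,
       width = 16, height = 10, units = "cm", dpi = 300)
```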
That was quite a lot of plotting! Let’s pause to think about all the things we’ve covered.
In this session, you:
- saw how the base R plot() function picks different visualisations based on the number and classes of the variables you provide it
- used ggplot() and
  - geom_point() for adding, well, points
  - geom_smooth() for adding trend lines
  - geom_hline() and geom_vline() for horizontal and vertical lines, respectively
  - labs()
  - scale_..._manual()
  - theme()
- learned, for ggplot(), when to use aes() and when not to