In this practice session, we will practise performing basic data cleaning and statistical analyses: We will look at correlations, the t-test, the \(\chi\)2-test and a few regression models. Onwards!
Open a new R
Script file or continue the one from previous practice session.
In this session, we will be using a dataset including data on bilingualism, intelligence, handedness, and language proficiency. Here is a quick data dictionary:
Variable name | Description |
---|---|
id |
Participant ID code. |
gender |
Participant’s gender: 0 = male; 1 = female; 2 = other gender. |
age |
Participant’s age in years. |
age_L2 |
Age of acquisition of second language |
BILING |
Bilingualism score: 0 = fully monolingual; 100 = fully bilingual. |
wais |
IQ score (Wechsler Adult Intelligence Scale) |
DALF_PASS |
Pass/fail result on a test of French language proficiency: 1 = pass; 0 = fail. |
yearsFR |
Years spent learning French |
ehi_ |
Score per item on the Edinburgh Handedness Inventory: -10 = total left hand preference; 0 = no hand preference; 10 = total right hand preference. |
The dataset in CSV format can be accessed at https://is.gd/exam1718.
Read in the data into RStudio, saving it under the name biling_data
. Use either the base R
read.csv()
function or its tidyverse
couterpart readr::read_csv()
.
Have a quick look at the first few rows of the data and some basic summary information to familiarise yourself with the set.
Hint
head()
shows you the first (6, by default) rows of a data frame/tibble.
summary()
provides a rough overview of the data.
Solution
head(biling_data)
id gender age age_L2 BILING wais DALF_PASS yearsFR ehi_01 ehi_02 ehi_03 ehi_04 ehi_05 ehi_06 ehi_07 ehi_08 ehi_09 ehi_10
1 DQQTR 0 31 7 72 111.3249 1 18 -10.0 -10 -10 -10.0 -10 -10.0 -10 -10 -10 -10
2 DEJXF 0 33 6 91 113.3173 1 9 -10.0 -10 -10 -10.0 -10 -10.0 10 10 -10 10
3 ZOZNR 1 34 9 43 116.6215 0 8 10.0 10 10 10.0 10 10.0 10 10 10 10
4 CMMZK 1 31 8 63 122.8463 0 8 2.5 10 10 2.5 10 2.5 10 10 10 10
5 MCJWA 0 28 7 79 107.6733 1 17 10.0 10 10 10.0 10 10.0 10 10 10 10
6 FIZVP 1 33 8 65 118.7081 0 9 10.0 10 10 10.0 10 10.0 10 10 10 10
summary(biling_data)
id gender age age_L2 BILING wais DALF_PASS yearsFR
Length:593 Length:593 Min. : 3.00 Min. : 5.000 Min. : 39.00 Min. :102.4 Min. :0.0000 Min. : 3.00
Class :character Class :character 1st Qu.:27.00 1st Qu.: 7.000 1st Qu.: 58.00 1st Qu.:111.8 1st Qu.:0.0000 1st Qu.: 8.00
Mode :character Mode :character Median :30.00 Median : 8.000 Median : 66.00 Median :115.5 Median :0.0000 Median :11.00
Mean :30.01 Mean : 7.766 Mean : 65.48 Mean :115.2 Mean :0.4566 Mean :11.36
3rd Qu.:33.00 3rd Qu.: 8.000 3rd Qu.: 72.00 3rd Qu.:118.6 3rd Qu.:1.0000 3rd Qu.:14.00
Max. :84.00 Max. :10.000 Max. :100.00 Max. :130.1 Max. :1.0000 Max. :25.00
NA's :6
ehi_01 ehi_02 ehi_03 ehi_04 ehi_05 ehi_06 ehi_07
Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000
1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000
Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000
Mean : 3.583 Mean : 3.558 Mean : 3.568 Mean : 3.419 Mean : 3.537 Mean : 3.602 Mean : 3.587
3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000
Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000
NA's :1 NA's :2 NA's :1 NA's :3 NA's :2 NA's :2
ehi_08 ehi_09 ehi_10
Min. :-10.000 Min. :-10.000 Min. :-10.00
1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.00
Median : 10.000 Median : 10.000 Median : 10.00
Mean : 3.609 Mean : 3.564 Mean : 3.44
3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.00
Max. : 10.000 Max. : 10.000 Max. : 10.00
NA's :1 NA's :1 NA's :4
All numeric variables are looking OK except for age
. There seem to be at least improbably young participant. However, the summary did not give us a good feel for the gender
variable. It would have produced a more useful information if the variable were a factor but currently it is just a character vector.
Get a more useful overview of the gender
variable. There are several ways of going about it. You can:
R
to show you the unique values stored in the gender
column.R
for a frequency table of the variable.All of the above can be done either the base R
way or the tidyverse
way.
Solution
# tidyverse
biling_data %>%
dplyr::mutate(gender = factor(gender)) %>%
summary()
id gender age age_L2 BILING wais DALF_PASS yearsFR
Length:593 ) : 1 Min. : 3.00 Min. : 5.000 Min. : 39.00 Min. :102.4 Min. :0.0000 Min. : 3.00
Class :character 0 :251 1st Qu.:27.00 1st Qu.: 7.000 1st Qu.: 58.00 1st Qu.:111.8 1st Qu.:0.0000 1st Qu.: 8.00
Mode :character 1 :307 Median :30.00 Median : 8.000 Median : 66.00 Median :115.5 Median :0.0000 Median :11.00
2 : 7 Mean :30.01 Mean : 7.766 Mean : 65.48 Mean :115.2 Mean :0.4566 Mean :11.36
NA's: 27 3rd Qu.:33.00 3rd Qu.: 8.000 3rd Qu.: 72.00 3rd Qu.:118.6 3rd Qu.:1.0000 3rd Qu.:14.00
Max. :84.00 Max. :10.000 Max. :100.00 Max. :130.1 Max. :1.0000 Max. :25.00
NA's :6
ehi_01 ehi_02 ehi_03 ehi_04 ehi_05 ehi_06 ehi_07
Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000 Min. :-10.000
1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.000
Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000 Median : 10.000
Mean : 3.583 Mean : 3.558 Mean : 3.568 Mean : 3.419 Mean : 3.537 Mean : 3.602 Mean : 3.587
3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.000
Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000 Max. : 10.000
NA's :1 NA's :2 NA's :1 NA's :3 NA's :2 NA's :2
ehi_08 ehi_09 ehi_10
Min. :-10.000 Min. :-10.000 Min. :-10.00
1st Qu.:-10.000 1st Qu.:-10.000 1st Qu.:-10.00
Median : 10.000 Median : 10.000 Median : 10.00
Mean : 3.609 Mean : 3.564 Mean : 3.44
3rd Qu.: 10.000 3rd Qu.: 10.000 3rd Qu.: 10.00
Max. : 10.000 Max. : 10.000 Max. : 10.00
NA's :1 NA's :1 NA's :4
biling_data %>%
dplyr::pull(gender) %>%
unique()
[1] "0" "1" NA "2" ")"
biling_data %>%
dplyr::pull(gender) %>%
table()
.
) 0 1 2
1 251 307 7
# base R way
# biling_data$gender <- factor(biling_data$gender)
# summary(biling_data)
# unique(biling_data$gender)
# table(biling_data$gender)
You can tell from the new summary and the table that there is a single instance of ")"
in the variable, most likely a typo.
Let’s get rid of the value by replacing it with an NA
. For this task, let’s stick to base R
as the tidyverse
way requires conditional evaluation which is a topic we will cover later.
First, use logical subsetting to get to the value you wish to modify.
Solution
biling_data$gender
which is equal to ")"
as well as all the NA
because, for those, the test biling_data$gender == ")"
returns NA
and not a TRUE/FALSE
. Since we want to replace the closing bracket with NA
this does not matter: we will simply replace all of these values with NA
. If, however, you want to change your data to some other non-NA
value, you need to be mindful of how R
deals with NA
in logical tests.
Now change the values returned by the code you just wrote to NA
.
NA
to it.
Let’s deal with the age
variable next.
Get rid of data from participants who’s entered age is less than 18.
R
way or filtering if you’re a tidyverse
fan.
Solution
Once again, check that your manipulation yielded the desired outcome.
Now that the data is a little cleaner, let’s do some stats!
First of all, correlations.
Produce a Pearson’s correlation matrix of the EHI items (only complete observations).
NA
s in the data, we need to tell the cor()
function how to deal with them. The use=
argument gives you several options (e.g., use complete observations or pairwise complete observations, see documentation).
Solution
Perform a significance test of Spearman’s rho coefficient to test for correlation between ehi_01
and ehi_02
.
The rcorr()
function from the Hmisc
package takes a numeric matrix and returns a list of 3 matrices, a correlation matrix (Pearson or Spearman), a matrix of Ns per comparison, and a matrix of p-values associated with individual coefficients.
Use the Hmisc::rcorr()
function on the EHI items, saving the output as ehi_cor
.
Solution
Pull out the matrix of p-values from ehi_cor
.
Solution
ehi_cor$P
ehi_01 ehi_02 ehi_03 ehi_04 ehi_05 ehi_06 ehi_07 ehi_08 ehi_09 ehi_10
ehi_01 NA 0 0 0 0 0 0 0 0 0
ehi_02 0 NA 0 0 0 0 0 0 0 0
ehi_03 0 0 NA 0 0 0 0 0 0 0
ehi_04 0 0 0 NA 0 0 0 0 0 0
ehi_05 0 0 0 0 NA 0 0 0 0 0
ehi_06 0 0 0 0 0 NA 0 0 0 0
ehi_07 0 0 0 0 0 0 NA 0 0 0
ehi_08 0 0 0 0 0 0 0 NA 0 0
ehi_09 0 0 0 0 0 0 0 0 NA 0
ehi_10 0 0 0 0 0 0 0 0 0 NA
So that’s correlations, let’s move on to the \(\chi\)2-test.
Run a \(\chi\)2-test (goodness-of-fit test) to see if the distribution of gender categories is uniform.
chisq.test()
function takes a frequency table as its only required argument.
Solution
Now let’s run a \(\chi\)2-test of independence looking at the relationship between gender and pass/fail result on the DALF exam.
Run the test storing the result in an object called gender_dalf
.
Look at the structure of the object.
Solution
str(gender_dalf)
List of 9
$ statistic: Named num 1.04
..- attr(*, "names")= chr "X-squared"
$ parameter: Named int 2
..- attr(*, "names")= chr "df"
$ p.value : num 0.593
$ method : chr "Pearson's Chi-squared test"
$ data.name: chr "table(biling_data$gender, biling_data$DALF_PASS)"
$ observed : 'table' int [1:3, 1:2] 131 165 5 118 139 2
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "0" "1" "2"
.. ..$ : chr [1:2] "0" "1"
$ expected : num [1:3, 1:2] 133.84 163.4 3.76 115.16 140.6 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "0" "1" "2"
.. ..$ : chr [1:2] "0" "1"
$ residuals: 'table' num [1:3, 1:2] -0.245 0.125 0.638 0.264 -0.135 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "0" "1" "2"
.. ..$ : chr [1:2] "0" "1"
$ stdres : 'table' num [1:3, 1:2] -0.484 0.272 0.944 0.484 -0.272 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "0" "1" "2"
.. ..$ : chr [1:2] "0" "1"
- attr(*, "class")= chr "htest"
Query the object for the value of \(\chi\)2 rounded to 2 decimal places.
Let’s move on to the t-test.
Let’s see if there is a difference in age between males and females in the sample. Save the output of the test under age_gen
Solution
Print out the object to see the results of the t-test.
As you can see, R
runs the two-tailed Welch’s t-test by default because it does not rely on the assumption of equal variances. Can you figure out how to run a one-tailed Student’s t-test?
t.test()
function should help you with this.
Solution
biling_data %>%
dplyr::filter(gender != 2) %>%
t.test(age ~ gender, data = .,
alternative = "l", # difference is less than 0 (females are older than males)
var.equal = T) # Student's t-test
Two Sample t-test
data: age by gender
t = -0.67468, df = 555, p-value = 0.2501
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 0.3989591
sample estimates:
mean in group 0 mean in group 1
29.98805 30.26471
Finally let’s run some regressions!
Create an intercept only model of BILING
called m_0
.
R
, an intercept only model is specified as outcome predicted by 1.
Create a model m_1
, predicting BILING
by age_L2
.
Create a model m_2
, adding wais
to the model in m_1
.
Solution
Finally, create a model m_3
, adding an interaction term between the predictors.
Solution
Perform an F-test, comparing all of our models.
anova()
Solution
anova(m_0, m_1, m_2, m_3)
Analysis of Variance Table
Model 1: BILING ~ 1
Model 2: BILING ~ age_L2
Model 3: BILING ~ age_L2 + wais
Model 4: BILING ~ age_L2 + wais + age_L2:wais
Res.Df RSS Df Sum of Sq F Pr(>F)
1 591 63168
2 590 16715 1 46453 1636.8946 <2e-16 ***
3 589 16701 1 14 0.4763 0.4904
4 588 16687 1 15 0.5227 0.4700
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It looks like m_1
is the winner. Let’s look at the summary of the model.
Solution
summary(m_1)
Call:
lm(formula = BILING ~ age_L2, data = biling_data)
Residuals:
Min 1Q Median 3Q Max
-18.8746 -3.2342 0.1254 3.7658 13.4062
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 140.3572 1.8620 75.38 <2e-16 ***
age_L2 -9.6404 0.2381 -40.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.323 on 590 degrees of freedom
Multiple R-squared: 0.7354, Adjusted R-squared: 0.7349
F-statistic: 1640 on 1 and 590 DF, p-value: < 2.2e-16
It’s useful to know what information is stored in the model object. Have a look at its structure.
Solution
str(m_1)
List of 12
$ coefficients : Named num [1:2] 140.36 -9.64
..- attr(*, "names")= chr [1:2] "(Intercept)" "age_L2"
$ residuals : Named num [1:592] -0.875 8.485 -10.594 -0.234 6.125 ...
..- attr(*, "names")= chr [1:592] "1" "2" "3" "4" ...
$ effects : Named num [1:592] -1593.232 -215.529 -10.113 -0.107 5.9 ...
..- attr(*, "names")= chr [1:592] "(Intercept)" "age_L2" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:592] 72.9 82.5 53.6 63.2 72.9 ...
..- attr(*, "names")= chr [1:592] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:592, 1:2] -24.3311 0.0411 0.0411 0.0411 0.0411 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:592] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "age_L2"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.04 1.08
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 590
$ xlevels : Named list()
$ call : language lm(formula = BILING ~ age_L2, data = biling_data)
$ terms :Classes 'terms', 'formula' language BILING ~ age_L2
.. ..- attr(*, "variables")= language list(BILING, age_L2)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "BILING" "age_L2"
.. .. .. ..$ : chr "age_L2"
.. ..- attr(*, "term.labels")= chr "age_L2"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: 0x000002007c33b498>
.. ..- attr(*, "predvars")= language list(BILING, age_L2)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "BILING" "age_L2"
$ model :'data.frame': 592 obs. of 2 variables:
..$ BILING: int [1:592] 72 91 43 63 79 65 72 55 76 62 ...
..$ age_L2: int [1:592] 7 6 9 8 7 8 7 9 7 9 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language BILING ~ age_L2
.. .. ..- attr(*, "variables")= language list(BILING, age_L2)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "BILING" "age_L2"
.. .. .. .. ..$ : chr "age_L2"
.. .. ..- attr(*, "term.labels")= chr "age_L2"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x000002007c33b498>
.. .. ..- attr(*, "predvars")= language list(BILING, age_L2)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "BILING" "age_L2"
- attr(*, "class")= chr "lm"
Ask R
to only show you the regression coefficients.
coef()
function on the model.
For the last task, let’s try fitting a logistic model.
Using the glm()
(generalised linear model) function, fit a logistic model m_log
predicting DALF_PASS
by gender
, wais
, yearsFR
, and BILING
.
Hint
You will need to specify the family=
argument to specify the family of error distributions (in our case binomial) and the link function (logit). The binomial()
function specifies both by default so you can just pass it to the argument.
Solution
# all of these work
m_log <- glm(DALF_PASS ~ gender + wais + yearsFR + BILING, data = biling_data, family = binomial)
m_log <- glm(DALF_PASS ~ gender + wais + yearsFR + BILING, data = biling_data, family = binomial())
m_log <- glm(DALF_PASS ~ gender + wais + yearsFR + BILING, data = biling_data, family = "binomial")
Pass the model to the anova()
function along with the test = "LRT"
argument to get a likelihood ratio test for each sequentially added predictor.
Solution
anova(m_log, test = "LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: DALF_PASS
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 559 773.17
gender 2 1.081 557 772.09 0.5825
wais 1 2.139 556 769.95 0.1436
yearsFR 1 219.267 555 550.69 <2e-16 ***
BILING 1 156.139 554 394.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Look at the model summary.
Solution
summary(m_log)
Call:
glm(formula = DALF_PASS ~ gender + wais + yearsFR + BILING, family = "binomial",
data = biling_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.53712 -0.52971 -0.09647 0.45444 2.57087
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -36.94642 4.43629 -8.328 < 2e-16 ***
gender1 -0.14627 0.25580 -0.572 0.567
gender2 -1.66909 1.29827 -1.286 0.199
wais 0.15534 0.02934 5.294 1.19e-07 ***
yearsFR 0.59747 0.05378 11.109 < 2e-16 ***
BILING 0.18526 0.01961 9.447 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 773.17 on 559 degrees of freedom
Residual deviance: 394.55 on 554 degrees of freedom
(32 observations deleted due to missingness)
AIC: 406.55
Number of Fisher Scoring iterations: 6
Applying the exponential function (ex) converts the regression coefficients into odds ratios. Try doing it.
R
is exp()
.
Completing these tasks you practised:
Well done!