In this practice session, we will practise performing basic data cleaning and statistical analyses: We will look at correlations, the t-test, the \(\chi\)2-test and a few regression models. Onwards!

 

Task 1

Open a new R Script file or continue the one from previous practice session.

 

Task 2

Load the tidyverse package.

Solution

library(tidyverse)

Data set

In this session, we will be using a dataset including data on bilingualism, intelligence, handedness, and language proficiency. Here is a quick data dictionary:

Variable name Description
id Participant ID code.
gender Participant’s gender: 0 = male; 1 = female; 2 = other gender.
age Participant’s age in years.
age_L2 Age of acquisition of second language
BILING Bilingualism score: 0 = fully monolingual; 100 = fully bilingual.
wais IQ score (Wechsler Adult Intelligence Scale)
DALF_PASS Pass/fail result on a test of French language proficiency: 1 = pass; 0 = fail.
yearsFR Years spent learning French
ehi_ Score per item on the Edinburgh Handedness Inventory: -10 = total left hand preference; 0 = no hand preference; 10 = total right hand preference.

The dataset in CSV format can be accessed at https://is.gd/exam1718.

 

Task 3

Read in the data into RStudio, saving it under the name biling_data. Use either the base R read.csv() function or its tidyverse couterpart readr::read_csv().

Hint You can either read it straight into RStudio using either of the functions or first save it to your computer and read it in from the local copy.

Solution

# base R way, result is a data.frame
biling_data <- read.csv("https://is.gd/exam1718")
# tidyverse way, result is a tibble
biling_data <- readr::read_csv("https://is.gd/exam1718")

Inspect and clean

 

Task 4

Have a quick look at the first few rows of the data and some basic summary information to familiarise yourself with the set.

Hint head() shows you the first (6, by default) rows of a data frame/tibble.

summary() provides a rough overview of the data.

Solution

head(biling_data)
     id gender age age_L2 BILING     wais DALF_PASS yearsFR ehi_01 ehi_02 ehi_03 ehi_04 ehi_05 ehi_06 ehi_07 ehi_08 ehi_09 ehi_10
1 DQQTR      0  31      7     72 111.3249         1      18  -10.0    -10    -10  -10.0    -10  -10.0    -10    -10    -10    -10
2 DEJXF      0  33      6     91 113.3173         1       9  -10.0    -10    -10  -10.0    -10  -10.0     10     10    -10     10
3 ZOZNR      1  34      9     43 116.6215         0       8   10.0     10     10   10.0     10   10.0     10     10     10     10
4 CMMZK      1  31      8     63 122.8463         0       8    2.5     10     10    2.5     10    2.5     10     10     10     10
5 MCJWA      0  28      7     79 107.6733         1      17   10.0     10     10   10.0     10   10.0     10     10     10     10
6 FIZVP      1  33      8     65 118.7081         0       9   10.0     10     10   10.0     10   10.0     10     10     10     10
summary(biling_data)
      id               gender               age            age_L2           BILING            wais         DALF_PASS         yearsFR     
 Length:593         Length:593         Min.   : 3.00   Min.   : 5.000   Min.   : 39.00   Min.   :102.4   Min.   :0.0000   Min.   : 3.00  
 Class :character   Class :character   1st Qu.:27.00   1st Qu.: 7.000   1st Qu.: 58.00   1st Qu.:111.8   1st Qu.:0.0000   1st Qu.: 8.00  
 Mode  :character   Mode  :character   Median :30.00   Median : 8.000   Median : 66.00   Median :115.5   Median :0.0000   Median :11.00  
                                       Mean   :30.01   Mean   : 7.766   Mean   : 65.48   Mean   :115.2   Mean   :0.4566   Mean   :11.36  
                                       3rd Qu.:33.00   3rd Qu.: 8.000   3rd Qu.: 72.00   3rd Qu.:118.6   3rd Qu.:1.0000   3rd Qu.:14.00  
                                       Max.   :84.00   Max.   :10.000   Max.   :100.00   Max.   :130.1   Max.   :1.0000   Max.   :25.00  
                                                                                                         NA's   :6                       
     ehi_01            ehi_02            ehi_03            ehi_04            ehi_05            ehi_06            ehi_07       
 Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000  
 1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000  
 Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000  
 Mean   :  3.583   Mean   :  3.558   Mean   :  3.568   Mean   :  3.419   Mean   :  3.537   Mean   :  3.602   Mean   :  3.587  
 3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000  
 Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000  
 NA's   :1         NA's   :2         NA's   :1         NA's   :3                           NA's   :2         NA's   :2        
     ehi_08            ehi_09            ehi_10      
 Min.   :-10.000   Min.   :-10.000   Min.   :-10.00  
 1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.00  
 Median : 10.000   Median : 10.000   Median : 10.00  
 Mean   :  3.609   Mean   :  3.564   Mean   :  3.44  
 3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.00  
 Max.   : 10.000   Max.   : 10.000   Max.   : 10.00  
 NA's   :1         NA's   :1         NA's   :4       

 

All numeric variables are looking OK except for age. There seem to be at least improbably young participant. However, the summary did not give us a good feel for the gender variable. It would have produced a more useful information if the variable were a factor but currently it is just a character vector.

 

Task 5

Get a more useful overview of the gender variable. There are several ways of going about it. You can:

  • Turn the variable into a factor and ask for summary again.
  • Ask R to show you the unique values stored in the gender column.
  • Ask R for a frequency table of the variable.

All of the above can be done either the base R way or the tidyverse way.

Solution

# tidyverse
biling_data %>%
  dplyr::mutate(gender = factor(gender)) %>%
  summary()
      id             gender         age            age_L2           BILING            wais         DALF_PASS         yearsFR     
 Length:593         )   :  1   Min.   : 3.00   Min.   : 5.000   Min.   : 39.00   Min.   :102.4   Min.   :0.0000   Min.   : 3.00  
 Class :character   0   :251   1st Qu.:27.00   1st Qu.: 7.000   1st Qu.: 58.00   1st Qu.:111.8   1st Qu.:0.0000   1st Qu.: 8.00  
 Mode  :character   1   :307   Median :30.00   Median : 8.000   Median : 66.00   Median :115.5   Median :0.0000   Median :11.00  
                    2   :  7   Mean   :30.01   Mean   : 7.766   Mean   : 65.48   Mean   :115.2   Mean   :0.4566   Mean   :11.36  
                    NA's: 27   3rd Qu.:33.00   3rd Qu.: 8.000   3rd Qu.: 72.00   3rd Qu.:118.6   3rd Qu.:1.0000   3rd Qu.:14.00  
                               Max.   :84.00   Max.   :10.000   Max.   :100.00   Max.   :130.1   Max.   :1.0000   Max.   :25.00  
                                                                                                 NA's   :6                       
     ehi_01            ehi_02            ehi_03            ehi_04            ehi_05            ehi_06            ehi_07       
 Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000   Min.   :-10.000  
 1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.000  
 Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000   Median : 10.000  
 Mean   :  3.583   Mean   :  3.558   Mean   :  3.568   Mean   :  3.419   Mean   :  3.537   Mean   :  3.602   Mean   :  3.587  
 3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.000  
 Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000   Max.   : 10.000  
 NA's   :1         NA's   :2         NA's   :1         NA's   :3                           NA's   :2         NA's   :2        
     ehi_08            ehi_09            ehi_10      
 Min.   :-10.000   Min.   :-10.000   Min.   :-10.00  
 1st Qu.:-10.000   1st Qu.:-10.000   1st Qu.:-10.00  
 Median : 10.000   Median : 10.000   Median : 10.00  
 Mean   :  3.609   Mean   :  3.564   Mean   :  3.44  
 3rd Qu.: 10.000   3rd Qu.: 10.000   3rd Qu.: 10.00  
 Max.   : 10.000   Max.   : 10.000   Max.   : 10.00  
 NA's   :1         NA's   :1         NA's   :4       
biling_data %>%
  dplyr::pull(gender) %>%
  unique()
[1] "0" "1" NA  "2" ")"
biling_data %>%
  dplyr::pull(gender) %>%
  table()
.
  )   0   1   2 
  1 251 307   7 

# base R way
# biling_data$gender <- factor(biling_data$gender)
# summary(biling_data)

# unique(biling_data$gender)
# table(biling_data$gender)

 

You can tell from the new summary and the table that there is a single instance of ")" in the variable, most likely a typo.

 

Task 6

Let’s get rid of the value by replacing it with an NA. For this task, let’s stick to base R as the tidyverse way requires conditional evaluation which is a topic we will cover later.

 

Task 6.1

First, use logical subsetting to get to the value you wish to modify.

Solution

biling_data$gender[biling_data$gender == ")"]
 [1] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  ")" NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
# alternatively
# biling_data[biling_data$gender == ")", "gender"]
As you can see the code returned the element of biling_data$gender which is equal to ")" as well as all the NA because, for those, the test biling_data$gender == ")" returns NA and not a TRUE/FALSE. Since we want to replace the closing bracket with NA this does not matter: we will simply replace all of these values with NA. If, however, you want to change your data to some other non-NA value, you need to be mindful of how R deals with NA in logical tests.

 

Task 6.2

Now change the values returned by the code you just wrote to NA.

Hint The code represents a subset of the data so you can manipulate it just like any other data frame. Simply assign NA to it.

Solution

biling_data$gender[biling_data$gender == ")"] <- NA

 

Task 6.3

It’s always a good idea to check the result of your data manipulation. Have a look at the summary of the variable again. Maybe try a different approach this time?

Solution

table(biling_data$gender)

  0   1   2 
251 307   7 

 

Task 7

Let’s deal with the age variable next.

 

Task 7.1

Get rid of data from participants who’s entered age is less than 18.

Hint More logical subsetting if you prefer the base R way or filtering if you’re a tidyverse fan.

Solution

biling_data <- biling_data[biling_data$age > 17, ]

biling_data <- biling_data %>% dplyr::filter(age > 17)

# if you prefer left-to-right flow
biling_data %>% dplyr::filter(age > 17) -> biling_data

 

Task 7.2

Once again, check that your manipulation yielded the desired outcome.

Hint There are numerous ways of checking the variable. For instance, you can ask what its minimum value is, look at its range, or literally ask if any element of the vector is less than 18.

Solution

min(biling_data$age)
[1] 20
range(biling_data$age)
[1] 20 84
# na.rm=T removes NAs before running the test, otherwise the command returns NA
any(biling_data$age < 18, na.rm = T)
[1] FALSE

Basic stats tests

Now that the data is a little cleaner, let’s do some stats!

 

Task 8

First of all, correlations.

 

Task 8.1

Produce a Pearson’s correlation matrix of the EHI items (only complete observations).

Because there are NAs in the data, we need to tell the cor() function how to deal with them. The use= argument gives you several options (e.g., use complete observations or pairwise complete observations, see documentation).
Hint There are several ways to subset only the desired columns of the dataset

Solution

cor(biling_data[ , 9:18], use = "complete.obs")
# pattern matching in base R
cor(biling_data[ , grep("ehi", names(biling_data))], use = "complete.obs")
# pattern matching in tidyverse
biling_data %>%
  dplyr::select(dplyr::matches("ehi")) %>%
  cor(use = "complete.obs")

 

Task 8.2

Perform a significance test of Spearman’s rho coefficient to test for correlation between ehi_01 and ehi_02.

Hint

cor.test(var1, var2, method = )

Solution

cor.test(biling_data$ehi_01, biling_data$ehi_02, method = "spearman")

The rcorr() function from the Hmisc package takes a numeric matrix and returns a list of 3 matrices, a correlation matrix (Pearson or Spearman), a matrix of Ns per comparison, and a matrix of p-values associated with individual coefficients.

 

Task 8.3

Use the Hmisc::rcorr() function on the EHI items, saving the output as ehi_cor.

Hint The subsetted dataset must be converted to a matrix.

Solution

# base R
ehi_cor <- Hmisc::rcorr(as.matrix(biling_data[ , 9:18]))
# tidyverse
ehi_cor <- biling_data %>%
  dplyr::select(dplyr::matches("ehi")) %>%
  as.matrix() %>%
  Hmisc::rcorr()

 

Task 8.4

Pull out the matrix of p-values from ehi_cor.

Hint If in doubt, look at the structure of the object.

Solution

ehi_cor$P
       ehi_01 ehi_02 ehi_03 ehi_04 ehi_05 ehi_06 ehi_07 ehi_08 ehi_09 ehi_10
ehi_01     NA      0      0      0      0      0      0      0      0      0
ehi_02      0     NA      0      0      0      0      0      0      0      0
ehi_03      0      0     NA      0      0      0      0      0      0      0
ehi_04      0      0      0     NA      0      0      0      0      0      0
ehi_05      0      0      0      0     NA      0      0      0      0      0
ehi_06      0      0      0      0      0     NA      0      0      0      0
ehi_07      0      0      0      0      0      0     NA      0      0      0
ehi_08      0      0      0      0      0      0      0     NA      0      0
ehi_09      0      0      0      0      0      0      0      0     NA      0
ehi_10      0      0      0      0      0      0      0      0      0     NA
Since the correlations are all ≥ .95 and Ns are reasonably large, the p-values are all tiny (< 10−16), hence the 0s everywhere.

 

So that’s correlations, let’s move on to the \(\chi\)2-test.

 

Task 9

Run a \(\chi\)2-test (goodness-of-fit test) to see if the distribution of gender categories is uniform.

Hint The chisq.test() function takes a frequency table as its only required argument.

Solution

chisq.test(table(biling_data$gender))

    Chi-squared test for given probabilities

data:  table(biling_data$gender)
X-squared = 269.44, df = 2, p-value < 2.2e-16

 

Task 10

Now let’s run a \(\chi\)2-test of independence looking at the relationship between gender and pass/fail result on the DALF exam.

 

Task 10.1

Run the test storing the result in an object called gender_dalf.

Solution

gender_dalf <- chisq.test(table(biling_data$gender, biling_data$DALF_PASS))

 

Task 10.2

Look at the structure of the object.

Solution

str(gender_dalf)
List of 9
 $ statistic: Named num 1.04
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 2
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 0.593
 $ method   : chr "Pearson's Chi-squared test"
 $ data.name: chr "table(biling_data$gender, biling_data$DALF_PASS)"
 $ observed : 'table' int [1:3, 1:2] 131 165 5 118 139 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:2] "0" "1"
 $ expected : num [1:3, 1:2] 133.84 163.4 3.76 115.16 140.6 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:2] "0" "1"
 $ residuals: 'table' num [1:3, 1:2] -0.245 0.125 0.638 0.264 -0.135 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:2] "0" "1"
 $ stdres   : 'table' num [1:3, 1:2] -0.484 0.272 0.944 0.484 -0.272 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "0" "1" "2"
  .. ..$ : chr [1:2] "0" "1"
 - attr(*, "class")= chr "htest"

 

Task 10.3

Query the object for the value of \(\chi\)2 rounded to 2 decimal places.

Solution

round(gender_dalf$statistic, 2)
X-squared 
     1.04 

 

Task 10.4

Now get just the p-value, rounded to 3 dp.

Solution

round(gender_dalf$p.value, 3)
[1] 0.593

 

Task 11

Let’s move on to the t-test.

 

Task 11.1

Let’s see if there is a difference in age between males and females in the sample. Save the output of the test under age_gen

Hint There are currently three gender categories in the sample so you’ll have to filter/subset the data when running the t-test.

Solution

# tidyverse
age_gen <-
  biling_data %>%
  dplyr::filter(gender != 2) %>%
  t.test(age ~ gender, data = .)

# base R
age_gen <- t.test(age ~ gender,
                  data = biling_data[biling_data$gender != 2, ])

 

Task 11.2

Print out the object to see the results of the t-test.

 

Task 11.3

As you can see, R runs the two-tailed Welch’s t-test by default because it does not rely on the assumption of equal variances. Can you figure out how to run a one-tailed Student’s t-test?

Hint The documentation for the t.test() function should help you with this.

Solution

biling_data %>%
  dplyr::filter(gender != 2) %>%
  t.test(age ~ gender, data = .,
         alternative = "l", # difference is less than 0 (females are older than males)
         var.equal = T) # Student's t-test

    Two Sample t-test

data:  age by gender
t = -0.67468, df = 555, p-value = 0.2501
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf 0.3989591
sample estimates:
mean in group 0 mean in group 1 
       29.98805        30.26471 

Fitting (generalised) linear models

 

Task 12

Finally let’s run some regressions!

 

Task 12.1

Create an intercept only model of BILING called m_0.

Hint In R, an intercept only model is specified as outcome predicted by 1.

Solution

m_0 <- lm(BILING ~ 1, data = biling_data)

 

Task 12.2

Create a model m_1, predicting BILING by age_L2.

Solution

m_1 <- lm(BILING ~ age_L2, data = biling_data)

 

Task 12.3

Create a model m_2, adding wais to the model in m_1.

Solution

m_2 <- lm(BILING ~ age_L2 + wais, data = biling_data)
# alternative way is to update existing model
m_2 <- update(m_1, ~ . + wais)

 

Task 12.4

Finally, create a model m_3, adding an interaction term between the predictors.

Solution

# x1 * x2 is shorthand for x1 + x2 + x1:x2 (interaction)
m_3 <- lm(BILING ~ age_L2 * wais, data = biling_data)
# alternatively, just add the interaction term
m_3 <- update(m_2, ~ . + age_L2:wais)

 

Task 12.5

Perform an F-test, comparing all of our models.

Hint Rather regrettably, the function that performs linear model comparison tests is called anova()

Solution

anova(m_0, m_1, m_2, m_3)
Analysis of Variance Table

Model 1: BILING ~ 1
Model 2: BILING ~ age_L2
Model 3: BILING ~ age_L2 + wais
Model 4: BILING ~ age_L2 + wais + age_L2:wais
  Res.Df   RSS Df Sum of Sq         F Pr(>F)    
1    591 63168                                  
2    590 16715  1     46453 1636.8946 <2e-16 ***
3    589 16701  1        14    0.4763 0.4904    
4    588 16687  1        15    0.5227 0.4700    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Task 12.6

It looks like m_1 is the winner. Let’s look at the summary of the model.

Solution

summary(m_1)

Call:
lm(formula = BILING ~ age_L2, data = biling_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.8746  -3.2342   0.1254   3.7658  13.4062 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 140.3572     1.8620   75.38   <2e-16 ***
age_L2       -9.6404     0.2381  -40.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.323 on 590 degrees of freedom
Multiple R-squared:  0.7354,    Adjusted R-squared:  0.7349 
F-statistic:  1640 on 1 and 590 DF,  p-value: < 2.2e-16

 

Task 12.7

It’s useful to know what information is stored in the model object. Have a look at its structure.

Solution

str(m_1)
List of 12
 $ coefficients : Named num [1:2] 140.36 -9.64
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "age_L2"
 $ residuals    : Named num [1:592] -0.875 8.485 -10.594 -0.234 6.125 ...
  ..- attr(*, "names")= chr [1:592] "1" "2" "3" "4" ...
 $ effects      : Named num [1:592] -1593.232 -215.529 -10.113 -0.107 5.9 ...
  ..- attr(*, "names")= chr [1:592] "(Intercept)" "age_L2" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:592] 72.9 82.5 53.6 63.2 72.9 ...
  ..- attr(*, "names")= chr [1:592] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:592, 1:2] -24.3311 0.0411 0.0411 0.0411 0.0411 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:592] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "age_L2"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.04 1.08
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 590
 $ xlevels      : Named list()
 $ call         : language lm(formula = BILING ~ age_L2, data = biling_data)
 $ terms        :Classes 'terms', 'formula'  language BILING ~ age_L2
  .. ..- attr(*, "variables")= language list(BILING, age_L2)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "BILING" "age_L2"
  .. .. .. ..$ : chr "age_L2"
  .. ..- attr(*, "term.labels")= chr "age_L2"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x000002007c33b498> 
  .. ..- attr(*, "predvars")= language list(BILING, age_L2)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "BILING" "age_L2"
 $ model        :'data.frame':  592 obs. of  2 variables:
  ..$ BILING: int [1:592] 72 91 43 63 79 65 72 55 76 62 ...
  ..$ age_L2: int [1:592] 7 6 9 8 7 8 7 9 7 9 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language BILING ~ age_L2
  .. .. ..- attr(*, "variables")= language list(BILING, age_L2)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "BILING" "age_L2"
  .. .. .. .. ..$ : chr "age_L2"
  .. .. ..- attr(*, "term.labels")= chr "age_L2"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x000002007c33b498> 
  .. .. ..- attr(*, "predvars")= language list(BILING, age_L2)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "BILING" "age_L2"
 - attr(*, "class")= chr "lm"

 

Task 12.8

Ask R to only show you the regression coefficients.

Hint There are two ways of doing this: You can look at the model structure and find the name of the element that contains the coefficients or just use the coef() function on the model.

Solution

coef(m_1)
(Intercept)      age_L2 
 140.357249   -9.640385 
# m_1$coefficients

 

Task 12.9

Can you figure out how to get to the model residuals?

Solution

resid(m_1)
# m_1$residuals

 

Task 12.10

When used on a linear model, the plot() function produces a sequence of standard diagnostic plots. Try running it now. To move on to the next plot, press Enter.

Solution

plot(m_1)

 

Task 13

For the last task, let’s try fitting a logistic model.

 

Task 13.1

Using the glm() (generalised linear model) function, fit a logistic model m_log predicting DALF_PASS by gender, wais, yearsFR, and BILING.

Hint You will need to specify the family= argument to specify the family of error distributions (in our case binomial) and the link function (logit). The binomial() function specifies both by default so you can just pass it to the argument.

binomial()

Family: binomial 
Link function: logit 

Solution

# all of these work
m_log <- glm(DALF_PASS ~ gender + wais + yearsFR + BILING, data = biling_data, family = binomial)
m_log <- glm(DALF_PASS ~ gender + wais + yearsFR + BILING, data = biling_data, family = binomial())
m_log <- glm(DALF_PASS ~ gender + wais + yearsFR + BILING, data = biling_data, family = "binomial")

 

Task 13.2

Pass the model to the anova() function along with the test = "LRT" argument to get a likelihood ratio test for each sequentially added predictor.

Solution

anova(m_log, test = "LRT")
Analysis of Deviance Table

Model: binomial, link: logit

Response: DALF_PASS

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
NULL                      559     773.17             
gender   2    1.081       557     772.09   0.5825    
wais     1    2.139       556     769.95   0.1436    
yearsFR  1  219.267       555     550.69   <2e-16 ***
BILING   1  156.139       554     394.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Task 13.3

Look at the model summary.

Solution

summary(m_log)

Call:
glm(formula = DALF_PASS ~ gender + wais + yearsFR + BILING, family = "binomial", 
    data = biling_data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.53712  -0.52971  -0.09647   0.45444   2.57087  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -36.94642    4.43629  -8.328  < 2e-16 ***
gender1      -0.14627    0.25580  -0.572    0.567    
gender2      -1.66909    1.29827  -1.286    0.199    
wais          0.15534    0.02934   5.294 1.19e-07 ***
yearsFR       0.59747    0.05378  11.109  < 2e-16 ***
BILING        0.18526    0.01961   9.447  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 773.17  on 559  degrees of freedom
Residual deviance: 394.55  on 554  degrees of freedom
  (32 observations deleted due to missingness)
AIC: 406.55

Number of Fisher Scoring iterations: 6

 

Task 13.4

Applying the exponential function (ex) converts the regression coefficients into odds ratios. Try doing it.

Hint The function’s name in R is exp().

Solution

exp(coef(m_log))
 (Intercept)      gender1      gender2         wais      yearsFR       BILING 
9.002685e-17 8.639216e-01 1.884185e-01 1.168055e+00 1.817521e+00 1.203536e+00 

Reflect

Completing these tasks you practised:

  • Reading in, inspecting, and cleaning data
  • Running basic statistical tests and negotiating the output
  • Fitting linear and logistic models
  • Doing basic model diagnostics.

Well done!