This page shows how to analyze relationships between more than two variables using R.

Explanation

3-way contingency tables: tabyl

Use the tabyl() function from the janitor package. With several adorn_*() helper functions you can adjust the table to your liking. You can force the table to display values such as ‘none’, ‘some’ and ‘many’ in the correct order by making sure the variable is stored as an ‘ordered factor’ (see chapter 2).

library(janitor)

mtcars %>% 
  tabyl(am, gear, cyl, show_na = FALSE) %>% 
  adorn_title("combined") %>% # both variable names in the title
  adorn_totals("col") %>% # column totals
  adorn_totals("row") %>% # row totals
  adorn_percentages("col") %>% # columnwise; use "row" for rowwise or "all" for total percentages
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns() # show the numbers of cases
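As noted above, the display order of the categories follows the factor levels. A minimal sketch in base R, using a made-up variable `satisfaction` for illustration:

```r
# hypothetical variable; force the display order 'none' < 'some' < 'many'
satisfaction <- c("some", "none", "many", "none", "some")
satisfaction <- factor(satisfaction,
                       levels = c("none", "some", "many"),
                       ordered = TRUE)

levels(satisfaction)  # tables built on this variable follow this order
```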

Groups in a scatterplot and faceting: ggplot

If you want to visualize different groups in the scatterplot, you can add extra variables to the aesthetics of a ggplot. In addition to x and y, you can use color (the “outside” color of points), fill (the “inside” color of the points), shape (of the points) and size.

mtcars %>% 
  ggplot(aes(x = cyl, 
             y = mpg, 
             color = factor(gear)
             )
         ) + 
  geom_point()
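The other aesthetics mentioned above work the same way. A sketch (an illustration, not part of the original example) that additionally maps a fourth variable to shape:

```r
library(ggplot2)

# color distinguishes the gears, shape distinguishes the transmission type (am)
p <- ggplot(mtcars, aes(x = wt, y = mpg,
                        color = factor(gear),
                        shape = factor(am))) +
  geom_point(size = 2)
p
```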

For different subplots, add the layer facet_wrap(~ variable) to the plot. This is another way to introduce a third variable into a data visualization.

mpg %>% 
  ggplot(aes(x = factor(cyl), y = hwy)) +
  geom_boxplot() + 
  facet_wrap(~ year)

Linear models

For the ordinary linear model, we use lm(). For a clear presentation of the regression table, we use the tidy() function from the broom package.

library(broom)

model <- mtcars %>% 
  lm(qsec ~ wt + cyl, data = .) 

# the regression table  
model %>% 
  tidy()

# the ANOVA table
model %>% 
  anova() %>% 
  tidy()

# overall output, including R squared, is provided by
model %>% 
  summary()
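If you prefer the overall fit statistics in a tidy data frame rather than the printed summary() output, broom’s glance() collects them in a one-row tibble (an addition to the workflow above, not a replacement):

```r
library(broom)

model <- lm(qsec ~ wt + cyl, data = mtcars)

# one row with r.squared, adj.r.squared, sigma, AIC, BIC, ...
glance(model)
```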

Diagnostics: residuals, equal variances, outliers and influence

There are various ways to create plots with the residuals and/or the predicted values.

# always create a model you want to diagnose first

model <- mtcars %>% 
  lm(qsec ~ wt + cyl, data = .) 

# the object 'model' now contains the residuals and the predicted values.
# residuals and predicted values can be added to the data frame in two ways:

# directly
mtcars$residuals <- model$residuals
mtcars$predictions <- model$fitted.values

# or by using the modelr package:
library(modelr)
mtcars <- mtcars %>% 
  add_predictions(model) %>% 
  add_residuals(model)

# after adding the residuals and the predicted values, a residual plot can be made with ggplot
mtcars %>% 
  ggplot(aes(x = pred, y = resid)) + 
  geom_point()
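Base R offers comparable diagnostics without adding columns to the data frame: plot() on an lm object cycles through residuals vs. fitted values, a normal Q-Q plot, a scale-location plot and residuals vs. leverage. A sketch showing only the Q-Q plot:

```r
model <- lm(qsec ~ wt + cyl, data = mtcars)

# which = 2 selects the normal Q-Q plot of the standardized residuals
plot(model, which = 2)
```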

There are various ways to test for equal variances in a model, available in several packages, including car and lmtest. Levene’s test is introduced in chapter 4.

# Breusch-Pagan Test
library(lmtest)

# estimate the model first and then:
bptest(model)
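The car package offers a closely related check, a score test for non-constant error variance; under equal variances it should point to the same conclusion as the Breusch-Pagan test. A sketch:

```r
library(car)

model <- lm(qsec ~ wt + cyl, data = mtcars)

# score test for non-constant error variance
ncvTest(model)
```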

For outliers and influential cases, use:

# After estimating and storing a model, leverage as measured by the hat values
# can be added to the original data frame:
data$leverage_cases_model_1 <- hatvalues(model_1)

# Influence as measured by Cook's distance can be added in the same way:
data$influential_cases_model_1 <- cooks.distance(model_1)

# Also, one of the base R diagnostic plots of the model shows the Cook's distances:
plot(model_1)
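Common rules of thumb (conventions, not hard cutoffs) flag cases with leverage above 2(k + 1)/n or Cook’s distance above 4/n, where k is the number of predictors and n the number of cases. A sketch:

```r
model_1 <- lm(qsec ~ wt + cyl, data = mtcars)

n <- nrow(mtcars)
k <- length(coef(model_1)) - 1  # number of predictors

# row numbers of cases exceeding the rule-of-thumb cutoffs
high_leverage <- which(hatvalues(model_1) > 2 * (k + 1) / n)
influential   <- which(cooks.distance(model_1) > 4 / n)
```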

After estimating the model, multicollinearity can be detected with the vif() function from the car package.

library(car)
vif(model)

Linear mixed models

For the linear mixed model, we use the lmer() function from the lme4 package. Note that the output shows neither p-values nor residual degrees of freedom for the fixed effects. This is deliberate: for mixed models the residual degrees of freedom of the fixed effects are not well defined, so lme4 does not report p-values based on them.

library(lme4)
library(broom.mixed) # the tidy() methods for mixed models live in this package

mtcars %>% 
  lmer(qsec ~ wt + (1|gear), data = .) %>% 
  tidy()

# When we have factors as fixed variables:
mtcars %>% 
  lmer(qsec ~ wt + factor(cyl) + (1|gear), data = .) %>% 
  anova() %>% 
  tidy()

If you want approximate p-values, you can use Satterthwaite’s degrees of freedom method, implemented in the lmerTest package. The same method is used in SPSS.

library(lmerTest)
mtcars %>% 
  lmer(qsec ~ wt + (1|gear), data = .) %>% 
  summary()

For a residual plot, we can use syntax similar to that for the linear model, using modelr:

model <- mtcars %>% 
  lmer(qsec ~ wt + (1|gear), data = .)  

mtcars %>% 
  add_predictions(model) %>% 
  add_residuals(model) %>% 
  ggplot(aes(x = pred, y = resid)) +
  geom_point()

Logistic regression

For a logistic regression model, we use the glm() function.

mtcars %>% 
  glm(am ~ wt, family = binomial, data = .) %>% 
  tidy()
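The coefficients of a logistic model are on the log-odds scale. tidy() can exponentiate them into odds ratios, usually together with confidence intervals:

```r
library(broom)

model <- glm(am ~ wt, family = binomial, data = mtcars)

# exponentiate = TRUE turns the log-odds into odds ratios
or_table <- tidy(model, exponentiate = TRUE, conf.int = TRUE)
or_table
```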

Similarly for a Poisson regression.

mtcars %>% 
  glm(carb ~ wt, family = poisson, data = .) %>% 
  tidy()

Nonparametric tests for repeated measures

For a non-parametric test for repeated measures, we use friedman.test(). When given a matrix, it treats each row as a subject and each column as a condition:

iris %>% 
  select(Sepal.Length, Petal.Length) %>% 
  as.matrix() %>%
  friedman.test()
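friedman.test() also accepts long-format data through a formula of the form y ~ groups | blocks. A sketch with made-up ratings (one score per judge-condition combination, as the test requires):

```r
# hypothetical data: four judges each rate three conditions once
ratings <- data.frame(
  score     = c(7, 5, 6,  8, 6, 7,  6, 4, 5,  9, 7, 8),
  condition = rep(c("A", "B", "C"), times = 4),
  judge     = rep(1:4, each = 3)
)

friedman.test(score ~ condition | judge, data = ratings)
```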

Examples

No examples available yet

Functions

No function overview yet

FAQ

No Frequently Asked Questions yet

Resources

No resources yet