This page will show you how you can analyze and visualize single variables using R.

Explanation

A variable can be described with statistics like the mode, mean, median, variance and standard deviation. Also, it is interesting to visualize a variable using a plot.

Summarizing numerical variables: mean, median, variation and standard deviation

Below you will see several examples of obtaining descriptive statistics for items in a data set called data_new.

# To summarise some main characteristics of one numerical item (called Item1):
data_new %>%
    summarise(mean = mean(Item1), sd = sd(Item1), var = var(Item1), minimum = min(Item1),
        maximum = max(Item1))
# You can also ask for only one of the statistics.

# To summarize the main characteristics (mean, median, minimum, maximum, Q1, Q3
# and number of missings) of multiple variables (in this example: the first 10
# variables) in the data set:
data_new %>%
    select(1:10) %>%
    summary()

# To compute a certain statistic for the first 10 variables in the data set,
# you can also use the map() function:

data_new %>%
    select(1:10) %>%
    map(var)

# instead of var, you can also use mean, sd, or other functions

Frequency tables: tabyl

The tabyl() function from the janitor package allows you to create frequency tables.

# Make sure 'janitor' in loaded into the library

# Create one frequency table
gss_cat %>%
    tabyl(race)

# Create frequency tables for all variables of a data set:
gss_cat %>%
    map(tabyl)

# If the values are in alphabetical, rather than in a meaningful order, make
# sure the variable is stored as an 'ordered factor' (see chapter 2).

# The frequency table can be made nicer using the adorn commands, for example:
gss_cat %>%
    tabyl(marital) %>%
    adorn_totals("row") %>%
    adorn_pct_formatting()

Univariate graphs: bar charts, box plots and histograms

We will use the package ggplot2 (part of the core tidyverse) for creating visualizations. The function ggplot() is extremely flexible. Below we present some basics only.

The basic idea of ggplot commands is that you
(1) have a data frame, pipe (%>%) that into a
(2) ggplot(), add a
(3) geom_…() to select the type of display you want to have (a bar chart, histogram or a scatterplot), and
(4) use aesthetics (aes()) to select the variable(s) you will use from that data frame.

There are more ‘layers’ in a ggplot, but these are the most important.

# creating a bar chart using the dataset gss_cat with the nominal variable marital
gss_cat %>% 
  ggplot(aes(x = marital)) +
  geom_bar()

# to create a box plot for the ratio variable tvhours in the dataset gss_cat 
# (please note that x = and y = in the aesthetics tilt the picture)
gss_cat %>% 
  ggplot(aes(y = tvhours)) +
  geom_boxplot()

# to create a histogram for the ratio variable tvhours in gss_cat
gss_cat %>% 
  ggplot(aes(x = tvhours)) +
  geom_histogram()

# The aesthetics are often put in ggplot command itself. This is efficient if you combine different geom's (scatter and line, for example) displaying the same variables. Sometimes it makes sense to put the aes() in the geom_ .
gss_cat %>% 
  ggplot() +
  geom_histogram(aes(x = tvhours))

## when creating QQ plots to assess deviations from normality use:
gss_cat %>% 
  ggplot(aes(sample = tvhours)) +
  geom_qq() +
  geom_qq_line()

Univariate inferential statistics

Confidence intervals and tests for proportions, medians or means of one variable can be constructed for dichotomies (proportions), nominal variables (goodness of fit) and interval/ratio variables (means testing).

# For a proportion, use
n_total <- 1000 # total number of (valid) cases
n_pos <- 690 # number of positives
binom.test(n_pos, n_total, alternative = "two.sided", conf.level = 0.95) 
# for confidence interval and two sided test

# For a goodness of fit test (only a test), use
n_observed <- c(166, 142, 80, 160)
p_expected <- c(0.25, 0.25, 0.25, 0.25) # these have to add to 1 (proportions)
g_o_f <- chisq.test(n_observed, p = p_expected)
g_o_f

# For a test of the median (there is no confidence interval here), use
# assuming the data contains a vector with a difference, called data$t2_t1
wilcox.test(data$t2_t1, alternative = "two.sided")
# this can be done using the idea of a signed rank and the linear model too.
# assuming you constructed the signed ranks yourself, use ...
data %>% lm(signed_rank_t2_t1 ~ 1, .)

# For means testing use either t.test or the lm command
# Assuming the data are stored in a data frame as data$var
mu_1 <- 100 # if you want to check whether it is different from 100

# using the somewhat old fashioned t-test
t.test(data$var, mu = mu_1)

# using the linear model framework:
model <- data %>% 
  lm((var - mu_1) ~ 1, .)

model(summary)

# the confidence interval in the lm() framework can then be found by using
predict.lm(model, 
           as.data.frame(1), 
           interval="confidence")

Sometimes you calculate a t-statistic or a chi-square statistic by hand and you want to find the associated p-value, OR you want to know which critical (t or chi-square) value is associated with a specific alpha. These commands allow you to not use ‘tables’.

# For finding the p-value for a specific t-value
q <- 1.96 # the t-value
df <- 499 # the degrees of freedom
pt(q, df, lower.tail = FALSE)
# NOTE: this gives the percentage on ONE side

# For finding the critical t-value
p <- 0.025 # the p value (half)
df <- 499 # the degrees of freedom
qt(p, df, lower.tail = FALSE)

chi <- 3.84 # the chi square value
df <- 1 # the degrees of freedom
pchisq(chi, df, lower.tail = FALSE)

# finding the critical value in a chi square distribution
p <- 0.05 # the chi square value
df <- 1 # the degrees of freedom
qchisq(p, df, lower.tail = FALSE)

Testing whether data are coming from a normally distributed population.

# For the Shapiro-Wilk test use:
shapiro.test(mtcars$residuals)

Examples

No examples available yet

Functions

  • adorn_pct_formatting()
    package: janitor
    Change numeric columns to percentages in a frequency table (tabyl).
  • adorn_totals()
    package: janitor
    Add a row or column with totals to a frequency table (tabyl).
  • aes()
    package: ggplot2 (tidyverse)
    Create an aestetic map, which explains which visual properties (aesthetics) are linked (mapped) to which variables in the dataframe.
  • ggplot()
    package: ggplot2 (tidyverse)
    Initialize any type of graph.
  • geom_bar()
    package: ggplot2 (tidyverse)
    Create a bar chart using the initialized ggplot.
  • geom_boxplot()
    package: ggplot2 (tidyverse)
    Create a boxplot using the initialized ggplot.
  • geom_histogram()
    package: ggplot2 (tidyverse)
    Create a histogram using the initialized ggplot.
  • map()
    package: purrr (tidyverse)
    Apply a transformation to every variable in the dataframe.
  • max()
    package: base
    Find the highest number from a list of numbers.
  • mean()
    package: utils (base)
    Find the mean of a list of numbers.
  • median()
    package: stats (base)
    Find the median of a list of numbers.
  • min()
    package: base
    Find the lowest number from a list of numbers.
  • sd()
    package: stats (base)
    Find the standard deviation of a list of numbers.
  • summarise()
    package: dplyr (tidyverse)
    Create a new dataframe summarizing the values in another dataframe.
  • tabyl()
    package: janitor
    Generate a frequency table.
  • var()
    package: stats (base)
    Find the variation of a list of numbers.

FAQ

When creating a graph, i get the message Did you use %>% instead of +? When using the ggplot() command together with another command like geom_bar() or geom_boxplot() you don’t use the pipe (%>%) between them, but instead you use the +. That is because these two commands need to be executed together, and instead of that the result of one needs to be used in the next.

Resources