This page will show you how you can analyze and visualize single variables using R.
A variable can be described with statistics like the mode, mean, median, variance and standard deviation. Also, it is interesting to visualize a variable using a plot.
Below you will see several examples of obtaining descriptive
statistics for items in a data set called data_new
.
# To summarise some main characteristics of one numerical item (called Item1):
data_new %>%
summarise(mean = mean(Item1), sd = sd(Item1), var = var(Item1), minimum = min(Item1),
maximum = max(Item1))
# You can also ask for only one of the statistics.
# To summarize the main characteristics (mean, median, minimum, maximum, Q1, Q3
# and number of missings) of multiple variables (in this example: the first 10
# variables) in the data set:
data_new %>%
select(1:10) %>%
summary()
# To compute a certain statistic for the first 10 variables in the data set,
# you can also use the map() function:
data_new %>%
select(1:10) %>%
map(var)
# instead of var, you can also use mean, sd, or other functions
The tabyl() function from the janitor
package allows you
to create frequency tables.
# Make sure 'janitor' in loaded into the library
# Create one frequency table
gss_cat %>%
tabyl(race)
# Create frequency tables for all variables of a data set:
gss_cat %>%
map(tabyl)
# If the values are in alphabetical, rather than in a meaningful order, make
# sure the variable is stored as an 'ordered factor' (see chapter 2).
# The frequency table can be made nicer using the adorn commands, for example:
gss_cat %>%
tabyl(marital) %>%
adorn_totals("row") %>%
adorn_pct_formatting()
We will use the package ggplot2
(part of the core
tidyverse) for creating visualizations. The function
ggplot()
is extremely flexible. Below we present some
basics only.
The basic idea of ggplot commands is that you
(1) have a data frame, pipe (%>%
) that into a
(2) ggplot(), add a
(3) geom_…() to select the type of display you want to have (a bar
chart, histogram or a scatterplot), and
(4) use aesthetics (aes()) to select the variable(s) you will use from
that data frame.
There are more ‘layers’ in a ggplot, but these are the most important.
# creating a bar chart using the dataset gss_cat with the nominal variable marital
gss_cat %>%
ggplot(aes(x = marital)) +
geom_bar()
# to create a box plot for the ratio variable tvhours in the dataset gss_cat
# (please note that x = and y = in the aesthetics tilt the picture)
gss_cat %>%
ggplot(aes(y = tvhours)) +
geom_boxplot()
# to create a histogram for the ratio variable tvhours in gss_cat
gss_cat %>%
ggplot(aes(x = tvhours)) +
geom_histogram()
# The aesthetics are often put in ggplot command itself. This is efficient if you combine different geom's (scatter and line, for example) displaying the same variables. Sometimes it makes sense to put the aes() in the geom_ .
gss_cat %>%
ggplot() +
geom_histogram(aes(x = tvhours))
## when creating QQ plots to assess deviations from normality use:
gss_cat %>%
ggplot(aes(sample = tvhours)) +
geom_qq() +
geom_qq_line()
Confidence intervals and tests for proportions, medians or means of one variable can be constructed for dichotomies (proportions), nominal variables (goodness of fit) and interval/ratio variables (means testing).
# For a proportion test, use
n_total <- 1000 # total number of (valid) cases
n_pos <- 690 # number of positives
binom.test(n_pos, n_total, alternative = "two.sided", conf.level = 0.95)
# for confidence interval and two sided test
# For a goodness of fit test (only a test), use
n_observed <- c(166, 142, 80, 160)
p_expected <- c(0.25, 0.25, 0.25, 0.25) # these have to add to 1 (proportions)
g_o_f <- chisq.test(n_observed, p = p_expected)
g_o_f
# For a test of the median (there is no confidence interval here), use
# assuming the data contains a vector with a difference, called data$t2_t1
wilcox.test(data$t2_t1, alternative = "two.sided")
# this can be done using the idea of a signed rank and the linear model too.
# assuming you constructed the signed ranks yourself, use ...
data %>% lm(signed_rank_t2_t1 ~ 1, .)
# For means testing use either t.test or the lm command
# assuming the data of this variable are stored in a data frame as data$var
mu_1 <- 100 # if you want to check whether it is different from 100. When omitted, you test mu_1 = 0.
# using the somewhat old fashioned t-test
t.test(data$var,
mu = mu_1)
# or in the linear model framework:
model <- data %>%
lm((var - mu_1) ~ 1, .)
summary(model) # gives a summary of that estimated model
# when stored as an lm model, the confidence interval can be found by using:
predict.lm(model,
as.data.frame(1),
interval="confidence")
# the confidence interval in the lm() framework can then be found by using
predict.lm(model,
as.data.frame(1),
interval="confidence")
Sometimes you calculate a t-statistic or a chi-square statistic by hand and you want to find the associated p-value, OR you want to know which critical (t or chi-square) value is associated with a specific alpha. These commands allow you to not use ‘tables’.
# For finding the p-value for a specific t-value
q <- 1.96 # the t-value
df <- 499 # the degrees of freedom
pt(q, df, lower.tail = FALSE) # use this when t is positive
# NOTE: this gives the percentage on ONE side
# For finding the critical t-value
p <- 0.025 # the p value (half)
df <- 499 # the degrees of freedom
qt(p, df, lower.tail = FALSE)
chi <- 3.84 # the chi square value
df <- 1 # the degrees of freedom
pchisq(chi, df, lower.tail = FALSE)
# finding the critical value in a chi square distribution
p <- 0.05 # the chi square value
df <- 1 # the degrees of freedom
qchisq(p, df, lower.tail = FALSE)
Testing whether data are coming from a normally distributed population.
# For the Shapiro-Wilk test use:
shapiro.test(mtcars$residuals)
No examples available yet
adorn_pct_formatting()
adorn_totals()
aes()
ggplot()
geom_bar()
geom_boxplot()
geom_histogram()
map()
max()
mean()
median()
min()
sd()
summarise()
tabyl()
var()
When creating a graph, i get the message
Did you use %>% instead of +?
When using the
ggplot()
command together with another command like
geom_bar()
or geom_boxplot()
you don’t use the
pipe (%>%
) between them, but instead you use the
+
. That is because these two commands need to be executed
together, and instead of that the result of one needs to be used in the
next.
ggplot()
function.