This page shows how you can import different types of data sets in R and how you can process their data.
Creating your own data set can be done in many ways. We mention two:
library("tidyverse") # provides the pipe operator %>% that is used on this page
# creating a data frame by variable
age <- c(44, 30, 20, 67) %>% as.integer()
gender <- c("male", "female", "male", "female") %>% as.factor()
length <- c(1.67, 1.70, 1.80, 1.81) %>% as.numeric()
membership <- c("T", "F", "T", "T") %>% as.logical()
# merge the variables
dataset_2 <- data.frame(age, gender, length, membership)
# creating a data frame with a random variable based on the normal
# distribution, assumed mean of the normal distribution is 50, sd is 20, with
# 100 units.
n <- 100
dataset_3 <- rnorm(n, 50, 20) %>%
as.data.frame()
To import a data set, you first need to load the correct package. These packages are readr (for csv files; it is part of the tidyverse core and is thus loaded when using library("tidyverse")), foreign (for SPSS files), and readxl (for Excel files; it is also part of the tidyverse, but not of the core, so it needs to be loaded separately). Next to that, you need to know the location of the data set on your computer.
Suppose you have two versions of a data file called ‘dataset_4’, both located in “C:/My Documents”. The data sets can be loaded into the global environment with the following commands:
# set working directory to the correct folder and put all files to be imported
# in that folder
install.packages(c("tidyverse", "foreign", "haven")) # you do this only once every few months or so
# you need to load the packages every time after (re)starting R;
# library() loads one package per call
library("tidyverse")
library("foreign")
library("readxl")
library("haven")
setwd("C:/My Documents") # setting the working directory
# load a csv data file, in this case separated by commas, with base R
data_1a <- read.csv("dataset_4.csv", sep = ",")
# load the same csv data file with the package 'readr' (part of the tidyverse core)
data_1b <- read_csv("dataset_4.csv")
# load an spss file with the package foreign
data_2 <- read.spss("dataset_4.sav", to.data.frame = TRUE)
# load an spss file with the package haven
data_3 <- read_spss("dataset_4.sav")
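If you have no data file at hand, you can try out the import step with a temporary file. The sketch below is only an illustration: the data and the object names (`example_data`, `tmp_file`, `imported`) are made up, and only base R is used.

```r
# A small made-up data set (hypothetical values, for illustration only)
example_data <- data.frame(
  id = 1:3,
  score = c(4.5, 3.2, 5.0)
)

# Write it to a temporary csv file, then import it again
tmp_file <- tempfile(fileext = ".csv")
write.csv(example_data, tmp_file, row.names = FALSE)
imported <- read.csv(tmp_file)

# Check that the import gives back the same columns and number of rows
str(imported)
```

Because the file is written and read in the same session, no working directory needs to be set for this check.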
Sometimes imported data sets are ‘labeled’. For example, the variable ‘gender’ is stored as a series of 1’s and 2’s and a label is added for the variable (‘gender as derived from the sampling frame’) and for the values (1 means ‘woman’ and 2 means ‘man’). The labels are stored as attributes of the variables in the data frame.
To inspect the labels in a dataframe:
# finding the variable label
data$variable %>%
attr("label")
# finding the value labels
data$variable %>%
attr("labels")
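To see what these attributes look like without an SPSS file, you can attach them to a vector by hand. This is a minimal sketch with made-up data; it assumes the labels are stored under the attribute names "label" and "labels", as in the commands above.

```r
# A made-up variable coded as 1 and 2 (hypothetical example)
gender <- c(1, 2, 2, 1)

# Attach a variable label and value labels by hand
attr(gender, "label") <- "gender as derived from the sampling frame"
attr(gender, "labels") <- c(woman = 1, man = 2)

# Inspect the labels
attr(gender, "label")
attr(gender, "labels")

# Look up the value label that belongs to each observation
value_labels <- attr(gender, "labels")
gender_text <- names(value_labels)[match(gender, value_labels)]
gender_text
```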
To illustrate the commands used below, we will use the data set gss_cat (from the tidyverse package forcats). This is a sample of data from the General Social Survey in the US. It contains the variables: year (year of survey, 2000–2014); age (age, truncated at a maximum of 89); marital (marital status); race (race); rincome (reported income); partyid (party affiliation); relig (religion); denom (denomination); tvhours (hours per day watching tv).
Suppose you have a data frame and want to inspect it.
# View the data from the data set gss_cat in a spreadsheet.
gss_cat %>%
View() # mind the capital V
# Get a quick overview of the types of data in your data frame and their names
gss_cat %>%
str()
# Only get the column names of a data set
gss_cat %>%
colnames()
# Viewing a subset of the data:
gss_cat %>%
select(marital, age, race) %>%
View()
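# Viewing only the observations from 2000 (no assignment here: View() only
# shows the data and does not return them):
gss_cat %>%
  filter(year == 2000) %>%
  View()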
Selected variables can also be stored as a separate object in the global environment.
# Make a new data frame with only the items (variables) that you want:
gss_cat_social <- gss_cat %>%
select(marital, age, race)
# Using column number to select variables:
gss_cat_social <- gss_cat %>%
select(1:5, 7, 8)
# Make a new data frame with only observations from 2000:
gss_cat_2000 <- gss_cat %>%
filter(year == 2000)
# Make a new data frame with observations from 2002 and later:
gss_cat_2002plus <- gss_cat %>%
filter(year >= 2002)
Adding a variable to the data set can be done with mutate().
# Adding a standardized variable (a z-score) to a data frame
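# tvhours contains missing values, so na.rm = TRUE is needed; without it,
# mean() and sd() return NA and the whole new variable becomes NA
gss_cat <- gss_cat %>%
  mutate(ztvhours = (tvhours - mean(tvhours, na.rm = TRUE))/sd(tvhours, na.rm = TRUE))
# Or use the function `scale()` to get the same standardized variable
# (scale() skips missing values and returns a matrix, hence as.numeric())
gss_cat <- gss_cat %>%
  mutate(ztvhours2 = as.numeric(scale(tvhours)))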
# Creating an index and adding that index to the data frame
dataset5 <- dataset5 %>%
  mutate(index = item3 + item4 + item5)
# Creating a logged version of an existing variable (the index in this example)
dataset5 <- dataset5 %>%
mutate(log_version_index = log(index))
# If you want to sum values of a lot of columns and ignore the missings, use
dataset5 <- dataset5 %>%
mutate(sum = rowSums(.[c(1:4, 7:20, 22)], na.rm = TRUE))
# The 'c' in this command stands for 'combine' and is part of base R.
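# Change the old name 'relig' to the new variable name 'religion' (new = old).
# The result is stored in a new object, so that gss_cat keeps the name 'relig'
# that is used in the examples below.
gss_cat_renamed <- gss_cat %>%
  rename(religion = relig)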
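When you standardize a variable by hand, as in the z-score example above, missing values need attention. The sketch below uses a made-up vector (not gss_cat) to show what happens with and without `na.rm = TRUE`.

```r
# Made-up viewing hours with one missing value (hypothetical data)
tvhours <- c(2, 4, NA, 6)

# Without na.rm = TRUE the mean is NA, so every z-score becomes NA
z_without <- (tvhours - mean(tvhours)) / sd(tvhours)

# With na.rm = TRUE the missing value is skipped in the mean and the sd
z_with <- (tvhours - mean(tvhours, na.rm = TRUE)) / sd(tvhours, na.rm = TRUE)

z_without
z_with
```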
R uses only one way of declaring a specific observation as missing: NA (Not Available). If your data set includes missings, but the missings are coded with a number (for example: 99), you need to replace these values before analyzing the data.
# To change other values (here: the words 'Don't know') to NA, use na_if()
gss_cat <- gss_cat %>%
  mutate(relig = na_if(relig, "Don't know"))
# This means: in the data set gss_cat, take the existing variable relig (which
# contains cases with the words 'Don't know') and declare these cases missing (NA).
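The same function works for missings that are coded with a number, such as the 99 mentioned above. A minimal sketch, assuming a made-up data frame `survey` with a variable `age`:

```r
library(dplyr)

# Made-up data in which 99 stands for 'missing' (hypothetical example)
survey <- data.frame(age = c(25, 99, 41, 99, 33))

# Replace the code 99 with NA
survey <- survey %>%
  mutate(age = na_if(age, 99))

survey$age
```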
There are several ways to find out how many missings are included in the data. Also, there are multiple ways to deal with these missings (pairwise deletion versus listwise deletion).
# To see which variables in the data set gss_cat have missings use:
summary(gss_cat)
# Or use 'is.na' to detect the number of missing values in a specific column
# (the 'relig' variable).
gss_cat$relig %>%
is.na() %>%
sum()
# Pairwise deletion of cases (exclude cases that have a missing value on a
# variable, but keep them when working with other variables): 'na.rm'
mean(gss_cat$tvhours, na.rm = TRUE)
# Listwise deletion of cases (drop cases that have a missing value on any of
# the variables used): 'na.omit'
gss_cat_no_missings <- na.omit(gss_cat)
# Make a new data frame only containing units without a missing value on the
# variable 'relig'. Please note that you now drop many cases ONLY because this
# variable is missing.
gss_cat_no_missings_on_relig <- gss_cat %>%
filter(!is.na(relig))
# Please note that in R we can use the '!' sign, to say 'not'.
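The difference between pairwise and listwise deletion is easiest to see on a very small data set. The data frame `small` below is made up for this illustration.

```r
# Made-up data with missing values on both variables (hypothetical example)
small <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(10, NA, 30, 40)
)

# Pairwise: each mean uses all cases that are valid on that variable
mean_x <- mean(small$x, na.rm = TRUE)   # based on 3 cases
mean_y <- mean(small$y, na.rm = TRUE)   # based on 3 cases

# Listwise: only cases that are complete on all variables remain
small_complete <- na.omit(small)
nrow(small_complete)                    # only the first and last case are left
```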
If you have a character variable with the ‘words’ ‘man’ and ‘woman’, you probably want to treat this variable as a factor, not merely as a column of words. And if the factor in your data frame has three ordered ‘values’/‘attributes’ (low, medium and high), you have to make sure this variable is stored as an ‘ordered factor’.
# because gss_cat only has numerical variables and factors, we first change a
# factor to a character variable
gss_cat <- gss_cat %>%
mutate(marital_char = as.character(marital))
# changing text back to factor
gss_cat <- gss_cat %>%
mutate(marital_factor = as.factor(marital_char))
# changing factor to ordered factor
gss_cat <- gss_cat %>%
  mutate(marital_ord = factor(marital_factor, ordered = TRUE, levels = c("Never married",
    "Married", "Divorced", "Separated", "Widowed", "No answer")))
# In this example, we have put the brackets in a way that avoids omitting a
# bracket (which happens a LOT!)
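For the low, medium and high case mentioned above, a minimal sketch could look as follows. The vector `level_char` is made up for this illustration.

```r
# Made-up character variable (hypothetical example)
level_char <- c("low", "high", "medium", "low")

# Store it as an ordered factor; 'levels' gives the order from low to high
level_ord <- factor(level_char, levels = c("low", "medium", "high"), ordered = TRUE)

# Comparisons now respect the order of the levels
below_high <- level_ord < "high"
below_high
```

Without the `levels` argument, R would order the levels alphabetically (high, low, medium), which is not the intended order.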
Suppose you have a data file with a large number of countries over a large number of years (every ‘country x year’ is one observation). You want to add the continents of these countries to the data set. You have another data set with the countries (stored as the same words as in the other data set) and the continents.
# gapm1945to2020 is the original data set; gapmcountries is the set with
# countries and continents. Both data sets are loaded in the global environment.
# The country variable is called 'geo' in both data sets and has the same
# values (the same words) in both.
gapm1945joined <- gapm1945to2020 %>%
left_join(gapmcountries, by = "geo")
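To see what left_join() does with cases that have no match, you can try it on two small, made-up data frames. The names `country_years` and `continents` and their values are assumptions for this sketch.

```r
library(dplyr)

# Made-up country-year data (hypothetical example)
country_years <- data.frame(
  geo = c("nld", "nld", "bra", "jpn"),
  year = c(2000, 2001, 2000, 2000)
)

# Made-up lookup table; 'jpn' is left out on purpose
continents <- data.frame(
  geo = c("nld", "bra"),
  continent = c("Europe", "South America")
)

# Every row of country_years is kept; countries without a match get NA
joined <- country_years %>%
  left_join(continents, by = "geo")

joined
```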
Sometimes, you want to add data to an existing data file. For this, you can use the function add_row(). With .before and .after, you can specify where new cases should be added.
# to use the add_row() function, the tidyverse packages should be installed and loaded
# name the variables you want to fill in; variables that are not named get NA
# data will be added after the last case
dataset1 <- dataset1 %>%
  add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1)
# Note the confusing ".,": the dot is a placeholder for the data that come out of the pipe. In this case it means something like "we are really using dataset1".
# You can also specify where to add the new case (for example: before case 51)
dataset1 <- dataset1 %>%
add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1, .before = 51)
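A small runnable version of the commands above, using a made-up tibble called `dataset_small`. Only the tibble package is loaded here; it is installed together with the tidyverse.

```r
library(tibble)

# Made-up data set (hypothetical example)
dataset_small <- tibble(id = c(1, 2, 3), score = c(10, 20, 30))

# Add a case before the first row; 'score' is not named, so it becomes NA
dataset_small <- add_row(dataset_small, id = 4, .before = 1)

dataset_small
```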
Sometimes you want to change the values of a variable. Usually it is best to make a new variable with mutate() (see above). Say you have items x10 and x11 of type integer (whole numbers such as -1, 0, 1, 2, etc.) that are scored from 1 to 3, and you would like to change 3 into 1 and 1 into 3 (reverse coding; 2 stays 2). We then make new variables x10_R and x11_R of type integer in the following way. Note that if you use the L suffix (as in 3L), the new data type will be integer (which saves memory). Leave the L out if you want the data type to be numeric.
# keep in mind that recode() uses the 'OLD = NEW' order of values, whereas
# rename() uses 'NEW = OLD', which is very confusing.
df_psychology <- df_psychology %>%
mutate(x10_R = recode(x10, `1` = 3L, `2` = 2L, `3` = 1L), x11_R = recode(x11,
`1` = 3L, `2` = 2L, `3` = 1L))
# it is often simpler to temporarily ignore the fact a variable is an integer,
# and to simply add the 'as.integer()' command later.
# It is also possible to recode character/factor variables.
data <- data %>%
mutate(var1b = recode(var1a, word = "newword"), var2b = recode(var1b, word2 = "anothernewword"))
The code above does not seem to work for all data files. When you have, for example, imported an SPSS data file with labels, another method should be used.
# for variable x3: 1 -> 0, 2 -> 1
# use: value - 1
data_new <- data_new %>%
  mutate(x3_R = x3 - 1)
# for variable x1 and x2: 1 -> 3, 2 -> 2, 3 -> 1
# use: value * (-1) + 4
data_new <- data_new %>%
mutate(
x1_R = x1 * (-1) + 4,
x2_R = x2 * (-1) + 4
)
# again, make sure the data are now stored as integers.
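The arithmetic above follows a general pattern: the reversed score equals the lowest plus the highest possible score, minus the old score. The sketch below applies this to a made-up item scored from 1 to 5; the names `lowest` and `highest` are chosen for this illustration.

```r
# Made-up item scored from 1 to 5, with one missing value (hypothetical example)
x <- c(1L, 2L, 5L, 3L, NA)

# Reverse coding: new value = (lowest + highest possible score) - old value
lowest <- 1L
highest <- 5L
x_R <- (lowest + highest) - x

x_R
class(x_R)
```

Because all numbers carry the L suffix, the result stays an integer, and missing values stay missing.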
When you would like to recode an entire range of values of a variable into a single value in a new variable, case_when() can be used.
# recode values of the variable "old_var":
# lower than 20 into 1, 20 - 39 into 2, 40 or higher into 3
dataset1 <- dataset1 %>%
mutate(new_var = case_when(
old_var < 20 ~ 1,
old_var >= 20 & old_var < 40 ~ 2,
old_var >= 40 ~ 3
)
)
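A runnable version of this recode on made-up data is sketched below. Because case_when() checks the conditions from top to bottom and uses the first one that is TRUE, the second condition can be shortened here; this shortcut is a choice made for this sketch.

```r
library(dplyr)

# Made-up values, including the boundaries and a missing value (hypothetical)
ages <- data.frame(old_var = c(5, 19, 20, 39, 40, NA))

ages <- ages %>%
  mutate(new_var = case_when(
    old_var < 20 ~ 1,   # lower than 20
    old_var < 40 ~ 2,   # 20 - 39, because lower values were already handled
    old_var >= 40 ~ 3   # 40 or higher
  ))

ages
```

Cases that match none of the conditions, such as the missing value here, get NA on the new variable.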
A simple way to create three dummy variables (dummy1, dummy2 and dummy3) out of one nominal variable with three values is:
data$dummy1 <- ifelse(data$nom == 1, 1, 0)
data$dummy2 <- ifelse(data$nom == 2, 1, 0)
data$dummy3 <- ifelse(data$nom == 3, 1, 0)
# when the nominal variable is stored as words use:
data$england <- ifelse(data$country == "england", 1, 0)
This can be used for nominal variables with more values too.
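With many values, typing one line per dummy becomes tedious. One possible approach, sketched here on made-up data, is to loop over the values; the `dummy_` prefix for the new names is an assumption of this example.

```r
# Made-up nominal variable stored as words (hypothetical example)
data <- data.frame(country = c("england", "wales", "scotland", "wales"))

# Create one dummy variable per value of 'country'
for (value in unique(data$country)) {
  data[[paste0("dummy_", value)]] <- ifelse(data$country == value, 1, 0)
}

data
```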
Sometimes data are stored in a relatively wide format (a lot of variables). For example, all the countries are rows and there is a variable ‘unemployment_2000’, a variable ‘unemployment_2001’, etc. In order to change wide format data into long format data, in which there are only three variables (country_name, year, level of unemployment), use pivot_longer().
# To create 1 new variable containing the names of 4 variables (Sepal.Length,
# Sepal.Width, Petal.Length, Petal.Width) and 1 variable with the scores:
iris %>%
pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
names_to = "variable", values_to = "score", values_drop_na = TRUE)
# If you want to restructure variables beginning with the same name (for
# example variables for each week, starting with 'wk'), you can use
# starts_with(). The example uses the data set 'billboard' from the package
# tidyr and stores the long version as 'billboardlong'.
billboardlong <- billboard %>%
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank",
    values_drop_na = TRUE)
And sometimes you want to go from long format to wide format. Use pivot_wider().
# using the same example: from the long version back to a wide version
billboardwide <- billboardlong %>%
  pivot_wider(names_from = week,
    values_from = rank)
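The unemployment example described above can be sketched with a small, made-up data frame. The object names and values are assumptions for this illustration; `names_prefix` is used to strip the text ‘unemployment_’ from the year.

```r
library(tidyr)

# Made-up wide data: one row per country, one column per year (hypothetical)
wide <- data.frame(
  country_name = c("A", "B"),
  unemployment_2000 = c(5.1, 7.3),
  unemployment_2001 = c(4.8, NA)
)

# Wide to long: the year is taken from the column names
long <- pivot_longer(
  wide,
  cols = starts_with("unemployment_"),
  names_to = "year",
  names_prefix = "unemployment_",
  values_to = "unemployment"
)
long

# Long back to wide: the prefix is put back in front of the year
wide_again <- pivot_wider(
  long,
  names_from = year,
  values_from = unemployment,
  names_prefix = "unemployment_"
)
wide_again
```

Note that the new variable `year` is stored as text; use as.integer() if you need it as a number.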
A short overview of the functions used in this page, which package they are in, and what they are used for, with a link to their official documentation
add_row()
as.character()
as.factor()
as.integer()
as.logical()
as.numeric()
attr()
case_when()
colnames()
data.frame()
filter()
getwd()
is.na()
read.csv()
read_spss()
rnorm()
sd()
select()
setwd()
str()
sum()
summary()
left_join()
mean()
mutate()
na_if() (changes a value to NA if it is equal to a specific value)
pivot_longer()
pivot_wider()
rename()
rowSums()
View()
When trying to open a dataset I get the message
Error : no such file or directory
When loading a file, R will only search for that file in one directory: the working directory. You can find out what the working directory is by using the command getwd(). If that is not the folder where your file is, you need to change it. You can do that using setwd(). When typing the path of this directory, you can’t use the single backslash (\) that Windows normally uses in paths. Instead, you need to change it to either a forward slash (/) or a double backslash (\\). So if your file is in the folder “C:\Users\username\Downloads”, you type setwd("C:\\Users\\username\\Downloads"). If you do not know the full path of the folder, you can type rstudioapi::selectDirectory() in your console. This will allow you to pick a folder by hand, and will give you the full path you need to use in your command with the slashes already changed.
If the file still cannot be found, check for any typos in the folder path or filename. When typing the filename, make sure to include the file extension, for example .csv or .sav. File extensions may be hidden in your file explorer, which makes them difficult to find out. To see file extensions on Windows, open file explorer, go to the View tab, and under Show/Hide check the box File name extensions. On Mac, select a file, then choose File > Get Info. Click the arrow next to Name & Extension and deselect Hide extension.
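The checks described above can also be done from within R. The sketch below only uses base R; the folder and file name are the examples used on this page and will most likely not exist on your computer, in which case file.exists() simply returns FALSE.

```r
# Show the current working directory
getwd()

# Build a path with forward slashes without typing them by hand
path <- file.path("C:", "Users", "username", "Downloads", "dataset_4.csv")
path

# Check whether the file exists before trying to import it
file.exists(path)

# List the csv and sav files in the working directory, including their extensions
list.files(pattern = "\\.(csv|sav)$")
```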
My computer can’t open .sav files. I don’t have the right program installed.
You don’t need to open .sav files directly with your computer. You only have to import the .sav files into R. You can do that using the foreign package and the read.spss() command, or the haven package and the read_spss() command.