This page will show you how you van import different types of datasets in R and how you can process its data.

Explanation

Creating data frames

Creating your own data set can be done in many ways. We mention two:

typing in data:

#creating a data frame by variable

age <- c(44, 30, 20, 67) %>% as.integer()
gender <- c("male", "female", "male", "female") %>% as.factor()
length <- c(1.67, 1.70, 1.80, 1.81) %>% as.numeric()
membership <- c("T", "F", "T", "T") %>% as.logical()

# merge the variables
dataset_2 <- data.frame(age, gender, length, membership)

generating random data: If you do not have actual data, but want to show something in R, you can randomly select observations from a distribution.

# creating a data frame with a random variable based on the normal
# distribution, assumed mean of the normal distribution is 50, sd is 20, with
# 100 units.

n <- 100
dataset_3 <- rnorm(n, 50, 20) %>%
    as.data.frame()

Importing .csv, .sav and .spss files as data frames: read

To import a data set, you first need to activate the correct package. These pacakges are readr (for csv files, although this is part of the tidyverse core and is thus installed when using library("tidyverse")), foreign (for SPSS files), and readxl (for excel files, also part of the tidyverse, but not of the core, so it needs to be called separately). Next to that, you need to know the location of the data set on your computer.

Suppose you have two version of a data file called ‘dataset_4’ both located in “C:/My Documents”. The data sets can be loaded into the global environment with the following commands:

# set working directory to the correct folder and put all files to be imported
# in that folder

install.packages("tidyverse", "foreign", "haven")  # you do this only once every few months or so
library("tidyverse", "foreign", "readxl", "haven")  # you need to do that every time after (re)starting R

setwd("C:/My Documents")  # setting the working directory

# load a csv data file, in this case separated by comma's, with the package
# 'readr' or with standard tidyverse
data_1a <- read.csv("dataset_4.csv", sep = ",")
data_1b <- read_csv("dataset_4.csv")

# load an spss file with the package foreign
data_2 <- read.spss("dataset_4.sav", to.data.frame = TRUE)

# load an spss file with the package haven
data_3 <- read_spss("dataset_4.sav")

Working with labelled data (mainly SPSS and SAV files): attr

Sometimes imported datasets are ‘labeled’. For example, the variable ‘gender’ is stored as a series of 1’s and 2’s and a label is added for the variable (‘gender as derived from the sampling frame’) and for the values (1 means ‘woman’ and 2 means ‘man’). The labels are attributes of a data frame.

To inspect the labels in a dataframe:

# finding the variable label
data$variable %>%
    attr("label")

# finding the value labels
data$variable %>%
    attr("labels")

View data frames: view, str and colnames

To illustrate the commands used below, we will use the data set gss_cat (from the tidyverse package forcats). This is a sample of data from the general social survey in the US. It contains the variables: year (year of survey, 2000–2014); age (Maximum age truncated to 89); marital (marital status); race (race); rincome (reported income); partyid (party affiliation); relig (religion); denom (denomination); tvhours (hours per day watching tv).

Suppose we have a data frame and you want to inspect it.

# View the data from the data set dataset4 in a spreadsheet.
gss_cat %>%
    View()  # mind the capital V

# Get a quick overview of the types of data in your matrix and their names
gss_cat %>%
    str()

# Only get the column names of a data set
gss_cat %>%
    colnames()

Subsets of a data frame: select, filter and the assignment operator <-

# Viewing a subset of the data:
gss_cat %>%
    select(marital, age, race) %>%
    View()

gss_cat_2000 <- gss_cat %>%
    filter(year == 2000) %>%
    View()

Selected variables can also be stored as a separate object in the global environment.

# Make a new data frame with only the items (variables) that you want:
gss_cat_social <- gss_cat %>%
    select(marital, age, race)

# Using column number to select variables:
gss_cat_social <- gss_cat %>%
    select(1:5, 7, 8)

# Make a new data frame with only observations from 2000:
gss_cat_2000 <- gss_cat %>%
    filter(year == 2000)

# Make a new data frame with observations from 2002 and later:
gss_cat_2002plus <- gss_cat %>%
    filter(year >= 2002)

Adding a new variable to an existing data frame: mutate

Adding a variable to the data set can be done with mutate().

# Adding a standardized variable (a z-score) to a data frame
gss_cat <- gss_cat %>%
    mutate(ztvhours = (tvhours - mean(tvhours))/sd(tvhours))

# Or use the function `scale()` to get the same standardized variable
gss_cat <- gss_cat %>%
    mutate(ztvhours2 = scale(tvhours))


# Creating an index and adding that index to the data frame
dataset5 <- dataset5 %>%
    mutate(index = item3 + item4 + item4)

# Creating a logged version of an existing variable (the index in this example)
dataset5 <- dataset5 %>%
    mutate(log_version_index = log(index))

# If you want to sum values of a lot of columns and ignore the missings, use
dataset5 <- dataset5 %>%
    mutate(sum = rowSums(.[c(1:4, 7:20, 22)], na.rm = TRUE))
# The 'c' in this command stands for 'combine' and is part of base R.

Renaming a variable in a data frame: rename

# Change the old name 'relig' to the new variable name 'religion' (new = old)

gss_cat <- gss_cat %>% 
  rename(religion = relig)

Dealing with missing data: na_if, is.na, na.rm, na.omit

R uses only one way of declaring a specific observation as missing: NA (Not Available). If your data set includes missings, but the missings are coded with a number (for example: 99), you need to replace these values before analyzing the data.

# To change other values (here: the word 'Don't know)' to NA, use na_if()

gss_cat <- gss_cat %>%
  mutate(relig = na_if(relig, "Don't know"))
# This means: in the data set gss_cat, change the existing variable relig (which contains contains cases with the word 'Don't know') and declares these units missing (NA).

There are several ways to find out how many missings are included in the data. Also, there are multiple ways to deal with these missings (pairwise deletion versus listwise deletion).

# To see which variables in the data file called 'mydata' have missings use:
summary(gss_cat)

# Or use 'is.na' to detect the number of missing values in a specific column
# (the 'religion' variable).
gss_cat$relig %>%
    is.na() %>%
    sum()

# Pairwise deletion of cases (exclude cases that have a missing value on a
# variable, but keep them when working with other variables): 'na.rm'
mean(gss_cat$tvhours, na.rm = TRUE)

# Listwise deletion of cases (drop cases that have a missing value on any of
# the variables used): 'na.omit'
gss_cat_no_missings <- na.omit(gss_cat)

# Make a new data frame only containing units without a missing value on the
# variable 'relig'. Please note that you now drop many cases ONLY because this
# variable is missing.
gss_cat_no_missings_on_relig <- gss_cat %>%
    filter(!is.na(relig))

# Please note that in R we can use the '!' sign, to say 'not'.

Changing the type of variable: from character to factor and from factor to ordered factor

If you have a character variable with the ‘words’ ‘man’ and ‘women’ you probably want to treat this variable as a factor, not merely as a column of words. And if the factor in your data frame has three ‘values’/‘attributes’ (low, medium and high), you have to make sure this variable is stored as an ‘ordered factor’.

# because gss_cat only has numerical variables and factors, we first change a
# factor to a character variable
gss_cat <- gss_cat %>%
    mutate(marital_char = as.character(marital))

# changing text back to factor
gss_cat <- gss_cat %>%
    mutate(marital_factor = as.factor(marital_char))

# changing factor to ordered factor
gss_cat <- gss_cat %>%
    mutate(marital_ord = factor(marital_factor, order = TRUE, levels = c("Never married",
        "Married", "Divorced", "Separated", "Widowed", "No answer")))
# In this example, we have put the brackets in a way that avoids omitting a
# bracket (which happens a LOT!)

Adding a new variable from another datafile to an existing data frame: left_join

Suppose you have a data file with a large number of countries over a large number of years (every ‘country x year’ is one observation). You want to add the continents of these countries to the data set. You have another data set with the countries (stored as the same words as in the other data set) and the continents.

# gapm1945to2020 is the original dataset gapmcountries is the set with
# countries and continents Both datasets are loaded in the global environment
# country in both data sets is called 'geo' and both have the same values (the
# same words)

gapm1945joined <- gapm1945to2020 %>%
    left_join(gapmcountries, by = "geo")

Adding observations to a data file: add_row

Sometimes, you want to add data to an existing data file. For this, you can use the function add_row(). With .before and .after, you can specify where new cases should be added.

# to use the add_row() function, the tidyverse packages should be installed and loaded
# all variable names should be included in the command

# data will be added after the last case
dataset1 <- dataset1 %>% 
   add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1)

# Note the  very confusing ".,", which is sometimes used when combining base R commands with tidyverse commands. This case it means something like "we are really using dataset1".

# You can also specify where to add the new case (for example: before case 51)
dataset1 <- dataset1 %>% 
   add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1, .before = 51)

Recoding and changing variables in an existing data frame: recode and case_when

Recoding single values of a variable into different values in one new variable

Sometimes you want to change the values of a variable. Usually it is best to simply make a new variable using mutate (see above). Let us say you have items x10 and x11 of type integer (meaning, only the numbers -1, 0, 1, 2 …. etc.) that are scored from 1 to 3 and you’d like to change number 3 into integers 1, and number 1 into integer 3 (reverse coding, 2 will stay 2). We then make new variables of type integer x10_R and x11_R in the following way. Note that if you use the L, your new data type will be integer (which saves memory). You leave them out if you want the data type to be numeric.

# keep in mind R uses in most cases (but NOT with rename, which is very
# confusing) the 'OLD is now NEW' order of values.

df_psychology <- df_psychology %>%
    mutate(x10_R = recode(x10, `1` = 3L, `2` = 2L, `3` = 1L), x11_R = recode(x11,
        `1` = 3L, `2` = 2L, `3` = 1L))

# it is often simpler to temporarily ignore the fact a variable is an integer,
# and to simply add the 'as.integer()' command later.

# It is also possible to recode character/factor variables.

data <- data %>%
    mutate(var1b = recode(var1a, word = "newword"), var2b = recode(var1b, word2 = "anothernewword"))

The code above does not seem to work for all data files. When you for example imported an SPSS data file with labels, another method should be used.

# for variable x3: 1 -> 0, 2 -> 1
# use: value - 1

data_new <- data_new %>% 
  mutate(x3_R = x3 -1)

# for variable x1 and x2: 1 -> 3, 2 -> 2, 3 -> 1
# use: value * (-1) + 4
data_new <- data_new %>% 
  mutate(
    x1_R = x1 * (-1) + 4,
    x2_R = x2 * (-1) + 4
  )

# again, make sure the data are now stored as integers.

Recoding a range of values of a variable into different values in a new variable: case_when

When you would like to recode an entire range of values of a variable into the same different value in a different variable, case_when() can be used.

# recode values of the variable "old_var":
# lower than 20 into 1, 20 - 39 into 2, 40 or higher into 3

dataset1 <- dataset1 %>%
      mutate(new_var = case_when(
        old_var < 20 ~ 1,
          old_var >= 20 & old_var < 40 ~ 2,
        old_var >= 40 ~ 3
  )
)

Creating dummy variables

A simple way to create three dummy variables (dummy1 etc..) out of one nominal variable with three values is:

data$dummy1 <- ifelse(data$nom == 1, 1, 0)
data$dummy2 <- ifelse(data$nom == 2, 1, 0)
data$dummy3 <- ifelse(data$nom == 3, 1, 0)

# when the nominal variable is stored as words use:
data$england <- ifelse(data$country == "england", 1, 0)

This can be used for nominal variables with more values too.

Creating longer or broader data frames using pivot_longer and pivot_wider

Sometimes data are stored in a relatively wide format (a lot of variables). For example, all the countries are rows and there is a variable ‘unemployment_2000’ and a variable ‘unemployment_2001’ etc… In order to change wide format data into long format data, in which there are three variables only: country_name, year, level of unemployment, use pivot_longer().

# To create 1 new variable containing the names of 4 variables (Sepal.Length,
# Sepal.Width, Petal.Length, Petal.Width) and 1 variable with the scores:

iris %>%
    pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
        names_to = "variable", values_to = "score", values_drop_na = TRUE)

# If you want to restructure variables beginning with the same name (for
# example variables for each week, starting with 'wk'), you can use
# starts_with()

billboardlong %>%
    pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank",
        values_drop_na = TRUE)

And sometimes you want to go from long format to wide format. Use pivot_wider().

#using the same example
billboardwide %>%
  pivot_wider(names_from = week, 
              values_from = rank)

Examples

No examples available yet

Functions

A short overview of the functions used in this page, which package they are in, and what they are used for, with a link to their official documentation

add_row()
package: tibble (tidyverse)
Add a row to an existing dataset
as.character()
package: base
Converts the data type of a variable or object into a character (text)
as.factor()
package: base
Converts the data type of a variable or object into a factor (categories)
as.integer()
package: base
Converts the data type of a variable or object into a integer (whole numbers)
as.logical()
package: base
Converts the data type of a variable or object into a locical variable (TRUE or FALSE)
as.numeric()
package: base
Converts the data type of a variable or object into a number (real number)
attr()
package: base
Get or set specific attributes of an object.
case_when()
package: dplyr (tidyverse)
Return one or an other variable based on a condition
colnames()
package: base
Only get the column names of a data set
data.frame()
package: base
Creates a dataframe from multiple lists
filter()
package: dplyr (tidyverse)
Get specific rows from the dataset based on certain conditions
getwd()
package: base
Tells you what the current working directory is
is.na()
package: base
Check if a value is equal to NA
read.csv()
package: utils (base)
Lets you import a .sav file made for SPSS into R as a dataset
read_spss()
package: haven
rnorm()
package: stats (base)
Creates random numbers with a normal distribution Lets you import a .sav file made for SPSS into R as a dataset
sd()
package: stats
Find the standard deviation of a list of numbers.
select()
package: dplyr (tidyverse)
Get a specific column or columns from a dataset
setwd()
package: base
Sets the working directory (the folder in which R will look for your files)
str()
package: utils (base)
Get a quick overview of the types of data in your matrix and their names
sum()
package: base
Find the sum of a list of numbers.
summary()
package: base
Give a summary of all variables in a dataset, giving information like minimum, maximum, mean, and frequencies
left_join()
package: dplyr (tidyverse)
Merge two datasets together using a variable they have in common
mean()
package: utils (base)
Find the mean of a list of numbers.
mutate()
package: dplyr (tidyverse)
Change the values of an entire column in a dataset
na_if()
package: dplyr (tidyverse)
Change a value to NA if it is equal to a specific value
pivot_longer()
package: tidyr (tidyverse)
Change a dataset from short format to long format
pivot_wider()
package: tidyr (tidyverse)
Change a dataset from long format to short format
rename()
package: dplyr (tidyverse)
Changes the name of a variable in a dataset
rowSums()
package: base
Get the sum of the values of the rows in a dataset
View()
package: utils (base)
Shows a dataframe as a table in RStudio

FAQ

When trying to open a dataset I get the message Error : no such file or directory
When loading a file, R will only search for that file in one directory: The working directory. You can find out what the working directory is by using the command getwd(). If that is not the folder where your file is in, you need to change it. You can do that using setwd(). When typing the path of this directory, you can’t use the default backslash (\) that is normally used in paths. Therefore, you need to change it to either a forward slash (/) or a double backslash (\\). So if your file is in the folder “C:\Users\username\Downloads”, you type setwd("C:\\Users\\username\\Downloads"). If you do not know the full path of the folder, you can type rstudioapi::selectDirectory() in your console. This will allow you to pick a folder by hand, and will give you the full path you need to use in your command with the slashes already changed.

If the file still cannot be found, check for any typo’s in the folder path or filename. When typing the filename, make sure to include the file extension. This is for example .csv or .sav. File extensions may be hidden in your file explorer, making them difficult to know. To see file extensions on Windows, open file explorer, go to the View tab, and under Show/Hide check the box File name extensions. On Mac, select a file, then choose File > Get Info. Click the arrow next to Name & Extension and deselect Hide extension

My computer can’t open .sav files. I don’t have the right program installed.
You don’t need to open .sav files directly with your computer. You only have to import the sav files into R. You can do that using the foreign library and the read.spss() command.

Resources

Data transformation cheatsheet
An overview of all the functions you can use to find and manipulate data in your dataframe.

Handling data files