This page explains how to install R and RStudio, what they are used for, and some basic terminology and commands, such as installing packages.

Explanation

What is R?

R is a software environment in which you can analyze data by typing instructions (programming). In the field of behavioral and social sciences, researchers and teachers are increasingly working with R instead of programs like STATA, SAS, SPSS and EXCEL, because R has some advantages compared to these programs:

  • it is free, so you can also use it after you finished your studies;
  • it is a collective enterprise, and researchers all over the world write code in R, simplifying tasks, which you can use for free too;
  • it is open-source and therefore transparent;
  • it is extremely flexible, and data analysis and data visualization possibilities in R are almost endless.

R allows you to manage data sets, to change data in order to facilitate analysis, to analyze data and to visualize data. R also allows you to integrate data in text files, which reduces the amount of ‘copy-pasting’ and thus reduces the number of mistakes made in this process. This improves the reproducibility of research.

What is RStudio?

RStudio is a ‘shell’ designed to help you to be more productive with R. RStudio is an Integrated Development Environment (IDE) that helps you develop programs and scripts in R. RStudio has a set of integrated tools that are helpful when analyzing and visualizing your data. More information about RStudio can be found here: https://rstudio.com/products/rstudio/features/. You will probably only encounter R via RStudio.

Why R and RStudio at BMS?

At BMS we think students in the behavioral and social sciences should have a clear understanding of ‘data’ and using R and RStudio will give you both a clear understanding of data and a flexible and transparent way to handle these data. R also allows us to teach statistics in a flexible way, adding new ideas (like ‘big data analysis’) to our curriculum more easily.

Installing R and RStudio on your laptop

For most purposes, installing R and RStudio on your laptop is straightforward. For R go to

For RStudio go to:

and select the free version.

How to learn R?

Although we will offer you a lot of materials to learn the R basics, keep in mind that to learn R well you will need to find your way on the web to learn additional procedures and to practice even more. How?

Some suggestions:

  • use a search engine (like Google) to ask questions like “how do I enter data into R?”;
  • use stackoverflow (https://stackoverflow.com) to check out answers to similar questions;
  • find your own pet project (data with your running times, for example, or gpx files, or IMDB data);
  • read blogs in which data scientists show what can be done with R.

A nice feature in R is the help function. If you need help on a function, just type a “?” in front of the function name. If you type “??” in front of a search term, R searches the help pages of all installed packages for that term.

# For example use this:

?data.frame
??tidyverse

After typing and running these commands, the answer can be found in the lower-right pane of RStudio.

Terminology: working directory, console, script and more

Learning R includes learning a lot of new words like ‘working directory’, ‘script’ and ‘console’. Although it might be hard to understand the terms completely by reading the explanations below while not yet working in R, you will see that in a few weeks you will understand exactly what is meant by the terms. Reread this document a few times after you start working with R.

The working directory is the folder on your laptop or PC or in the cloud, where all kinds of data, output (including pictures), and scripts are stored as files, so you can quickly find them again later. It is wise to create a working directory for each separate course or data project and to tell R you will use that directory for this specific project.

setwd("[...]")

# The [...] has to be replaced by something applicable to your computer, for
# example:

setwd("/Volumes/myname/Userdata/My Documents/Documenten/Research/2020_Local_elections")

# The current working directory (in case you forget) can be found with the
# command

getwd()
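
The steps above can be tried out safely with a minimal sketch like the one below. It assumes a temporary folder, so it runs on any computer; the folder name `my_project` and the file name `note.txt` are made up for illustration.

```r
# Remember the current working directory, so we can go back at the end
old_wd <- getwd()

# Build a path in a platform-independent way; "my_project" is a made-up name
project_dir <- file.path(tempdir(), "my_project")

# Create the folder if it does not exist yet
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

# Tell R to use this folder as the working directory
setwd(project_dir)
getwd()

# Files are now written to and read from this folder by default
writeLines("hello", "note.txt")
list.files()

# Go back to the original working directory
setwd(old_wd)
```

Using file.path() instead of typing slashes yourself avoids problems with the different path separators on Windows and Mac.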

Make sure you can find the path of a working directory on your Mac or PC:

  • on a Mac, in Finder, the small ‘radar’ on top allows you to copy the location as ‘pathname’;
  • in Windows, in File Explorer, select View in the toolbar, then Options, and select Change folder and search options to open the Folder Options dialogue box. Click View to open the View tab. In Advanced settings, add a check mark for Display the full path in the title bar, click Apply, and click OK to close the dialogue box.

A script is a set of commands, for example a command meaning ‘read in my data’ or ‘create a nice frequency table’. Commands are preferably combined with some # comments explaining what happens in the command lines. In order to distinguish between command lines and comments, comments are preceded by the # sign.

Use - - - - - to break up your script into easily readable chunks.

If you want to run a line of the script, move the cursor to that line and press Ctrl+Enter (Cmd+Enter on a Mac). The instruction is then automatically sent to the R console, and the cursor in the script jumps to the next line.

The console panel (at the bottom in RStudio, with the prompt “>”) can be used to type commands directly. For example, when using R as a simple calculator you can just type the numbers in the console. However, we advise against working in the console: do as much as possible via the script.

# adding 1 + 1
> 1 + 1
[1] 2

Objects, including imported data, are stored in the global environment. When you create new objects, these objects are stored in the global environment too. For example, if you create a frequency table, you can store the table (the object) in the global environment. The object can always be called back later in the analysis. The global environment can be saved separately and called later, although for many practical purposes it is sufficient to create a global environment anew, by using the script (remember: the set of commands and comments) which created that environment in the first place. The global environment can be found on the right upper side of RStudio.
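
A minimal sketch of how objects enter and leave the environment, using only base R. The object names `n_cases` and `scores` are made up for illustration; when the lines are run from a script or the console, the objects end up in the global environment.

```r
# Create two objects; they now appear in the environment
n_cases <- 10
scores <- c(3, 2, 1, 4)

# List the names of all objects that currently exist
ls()

# Check whether a specific object exists
exists("scores")     # TRUE

# Remove an object that is no longer needed
rm(n_cases)
exists("n_cases")    # FALSE
```

Because the script recreates the objects each time it is run, removing an object is never a problem: rerun the script and it is back.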

Data sets, numbers, and output can be stored as objects in the global environment, using the (left) assignment operator “<-”.1 An object can be a lot of things. For example, if you use R to create a boxplot, the picture of the boxplot can be stored as an object, to be called later when needed.

When naming objects, use only lowercase letters, numbers, and _. Use underscores (_) to separate words within a name. For example, only use names like ‘data_dpes_2021’, or ‘freq_tab_001’.

# for example, to use the number of cases (10) throughout a script, you can
# assign the value 10 to the object 'n'
n <- 10

# or to create a vector of numbers, you can write ...
dataset_1 <- c(3, 2, 1, 4, 9, 6, 7, 100)

# calling this object in a script or in the console gives the contents of the
# thus created object. This gives the following output in the console:
dataset_1
## [1]   3   2   1   4   9   6   7 100

Functions in R are ‘commands’ telling R what to do: they are ‘do this’ statements. R has a lot of those, and users are adding more every day. Functions end with parentheses: “()”.2 For example:

# The command head() means 'Show me the first part (the head) of a data set'.
# In this example the called data set is an object called dataset_1 (see
# above).

head(dataset_1)
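
Most functions accept additional arguments inside the parentheses. As a small sketch in base R (the vector is repeated here so the example runs on its own), the argument `n` of head() changes how many elements are shown:

```r
# Same vector as above, repeated so this example runs on its own
dataset_1 <- c(3, 2, 1, 4, 9, 6, 7, 100)

# By default, head() shows the first 6 elements
head(dataset_1)

# The argument n changes how many elements are shown
head(dataset_1, n = 3)

# Functions can be combined: the mean of the first three elements
mean(head(dataset_1, n = 3))
```

The help page (?head) lists which arguments a function accepts and what their default values are.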

Data frames and data types

Data frames in R are tables (numbers and words stored in rows and columns) in which the columns are variables and the rows are units of observation (observations). Sometimes a data frame is called a data matrix, but in the R language a matrix is a separate kind of object in which all values have the same type (for example, only numbers), whereas the columns of a data frame can hold numbers, strings (text) or other types of variables.
One single column (or: a variable) with numbers or with strings is called a vector.3 Sometimes objects are not just data frames, but combinations of data frames and something more, or combinations of different data frames. These objects are called lists.

For reasons beyond the scope of this introduction, some functions only work with data frames (even if they contain only numbers), not with matrices. In these cases you have to explicitly tell R your matrix is a data frame.
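
Telling R that a matrix is a data frame can be done with the base R function as.data.frame(). A minimal sketch, with made-up numbers and column names:

```r
# A matrix holds values of one type only (here: numbers)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
colnames(m) <- c("x", "y")
is.data.frame(m)     # FALSE

# Explicitly convert the matrix to a data frame
d <- as.data.frame(m)
is.data.frame(d)     # TRUE

# The columns of the matrix are now variables in the data frame
d$x
```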

Variables in a data frame have a name (like ‘x’ or ‘gender’). These can be called directly using the ‘$’ sign. So ‘data$gender’ refers to the column/variable ‘gender’ in the data frame. More generally, the ‘$’ sign refers to a ‘component’ of an object. If you store output as an object to the global environment, you can call a specific element of that output using the ‘$’ sign.
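
As a small illustration in base R, the sketch below uses a made-up data frame called `survey` (playing the role of ‘data’ in the text) and the output of t.test() as an example of stored output with components:

```r
# A small made-up data frame
survey <- data.frame(
  gender = c("female", "male", "female"),
  age = c(23, 31, 27)
)

# The $ sign picks one column (variable) from the data frame
survey$gender
mean(survey$age)

# Output stored as an object has components too
t_result <- t.test(survey$age)
names(t_result)       # shows which components are available
t_result$estimate     # the estimated mean
```

names() is a quick way to find out which components can be reached with ‘$’.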

There can be different types of variables in a vector/data frame. The most important are variables containing:

  • logical values (TRUE or FALSE) (used for some dummy variables);
  • a factor consisting of a limited set of attributes (‘low’, ‘middle’ and ‘high’ for example) (used for nominal and ordinal variables). A factor can also be stored as an ‘ordered factor’ and is then not merely ‘nominal’ but ‘ordinal’;
  • integer values (-1, 0, 1, 2, 3 etc) (mainly used for ordinal variables and for count variables);
  • real values (1.001, 1.002, 10000) (used for interval and ratio variables);
  • characters/ words (‘male’, ‘female’) (if these variables are used for dummies (like in ‘male’ / ‘female’) and for nominal variables, you have to change them to (ordered) ‘factor’ variables).
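
The types listed above can be inspected with class(). The sketch below uses made-up vectors in base R and also shows how a character variable is changed into a factor or an ordered factor:

```r
is_student <- c(TRUE, FALSE, TRUE)         # logical
children <- c(0L, 2L, 1L)                  # integer (note the L)
income <- c(1.001, 1.002, 10000)           # real values, called 'numeric' in R
gender <- c("male", "female", "female")    # character
education <- c("low", "high", "middle")    # character

class(is_student)
class(children)
class(income)
class(gender)

# Change a character variable into a factor (nominal)
gender_f <- factor(gender)
levels(gender_f)

# Change a character variable into an ordered factor (ordinal)
education_o <- factor(education,
  levels = c("low", "middle", "high"),
  ordered = TRUE
)
education_o < "high"   # comparisons are meaningful for ordered factors
```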

Operators in R

The most frequently used operators in R, beyond the well-known - (minus), + (plus), * (multiplication), / (division) and <- (the left assignment operator), are:

operator   meaning
^          Exponentiation
!          Not
>          Greater than
==         Equal to (note: two equals signs, not simply ‘=’)
>=         Greater than or equal to
<=         Less than or equal to
:          Sequence (in model formulae: interaction)
$          List subset, including a column in a data frame
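
A few of these operators in action, as a short base R sketch with made-up numbers:

```r
2^3          # exponentiation: 8
5 > 3        # TRUE
3 == 4       # FALSE
!(3 == 4)    # TRUE: 'not' reverses a logical value
4 >= 4       # TRUE
2 <= 1       # FALSE
1:5          # the sequence 1 2 3 4 5

# Comparisons work on whole vectors and can be used to select values
x <- c(10, 20, 30)
x >= 20      # FALSE TRUE TRUE
x[x >= 20]   # 20 30
```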

Some notes on writing readable scripts

Commands can all be put in the console (the one at the bottom-left of RStudio, starting with the “>”). But since you often want to redo the same thing (trying and tweaking), and because you want to ask others for help and show them exactly what you did, we urge you to store all commands in a script.

Open a new script in RStudio (file -> new file -> R script). This will appear in the upper-left pane. Type all commands in the script. This script can be stored (extension is .R) and is basically a .txt file. The rule to use the script as much as possible also applies to things you can do in RStudio. Importing your data can be done by using RStudio, but we urge you to use commands like read.csv() in the script too.
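
Importing data with a command could look like the sketch below. To keep the example self-contained it first writes a tiny made-up file to a temporary folder; the file name `example_data.csv` and its contents are assumptions for illustration only.

```r
# Write a tiny example file to a temporary location (made-up data)
csv_file <- file.path(tempdir(), "example_data.csv")
writeLines(c("id,age,gender", "1,23,female", "2,31,male"), csv_file)

# Import the file with a command instead of the RStudio menu
my_data <- read.csv(csv_file)

# Inspect the result
str(my_data)
head(my_data)
```

Because the import is part of the script, anyone rerunning the script reads the data in exactly the same way.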

When writing a script, start with a title at the top, preceded by # (because it is not a command), for example:

# This script is for computing the mean maximum temperature for several days

# This vector includes the highest temperatures recorded on several days
temperature <- c(19, 17, 20, 20, 13, 13, 15, 17)

mean(temperature)  # to compute the mean maximum temperature

When writing commands in the script, always put a space after a comma, and never before a comma, just like in regular English.4

Downloading and activating R Packages

R and RStudio can be used more efficiently with ‘add-ons’ called packages. These packages simplify programming in R because they include new functions. A package needs to be downloaded (= installed) only once, not every time you use R, although it is wise to check for updates now and then. After installing and opening R and RStudio, you can install packages using a command in a script.

# This script installs the packages 'foreign' and 'tidyverse'. To install
# several packages at once, combine their names in a vector with c():

install.packages(c("foreign", "tidyverse"))

Downloading (= installing) a package is not the same as ‘calling’ that package for usage in R: packages are not automatically opened when you are using R and RStudio. And that is a good thing: packages can use conflicting short cuts (functions with the same name, but with very different outcomes). Therefore, each time you are using a specific package you have to load that package into the library (aka ‘activate’). A script therefore often starts with ‘calling’ the relevant packages for that script.

# Loading the packages into the library is done with commands like these:

library(foreign)
library(tidyverse)
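
Since installing is only needed once, a script can first check which packages are already on the computer. The following is one possible sketch, not a required procedure; it uses the base R function requireNamespace(), and the install line is commented out so that nothing is downloaded by accident.

```r
# Packages this (hypothetical) script needs
needed <- c("foreign", "tidyverse")

# Check for each package whether it is installed (TRUE or FALSE)
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of the packages that still have to be installed
missing_packages <- needed[!installed]

if (length(missing_packages) > 0) {
  message("Not yet installed: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to download them
}
```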

R packages to be used at BMS

A lot of statistical analyses can be done by using ‘base R’ and its associated functions, but for some more specialized analyses we will use specialized packages.

Since the number of packages is huge5, we have decided to use only a limited number of packages at BMS. The use of R at BMS will be based on the tidyverse6. We will therefore mainly use packages belonging to the tidyverse (that are installed and loaded when installing and loading the tidyverse package).7 We will also use some other packages.

Here is an overview of the packages you will be using at BMS:

  • tidyverse is a set of packages for the most important data manipulation and visualization tasks. This is a ‘meta-package’ containing (among others):
    • ggplot2 to visualize your data;
    • dplyr and tidyr to change data, and to make sure that each variable is in a column, each observation is a row and each value is a cell;
    • readr to import .csv data.

These packages are thus loaded in the library when using the function library("tidyverse").

When installing the tidyverse, some other packages are downloaded too, some of which we will use:

  • readxl to import .xls and .xlsx sheets;
  • haven to import SPSS, Stata, and SAS data;
  • broom to use a tidy() function, which may come handy when doing statistical analysis. This package is also part of tidymodels;
  • modelr to make plots with the residuals and/or the predicted values.

Make sure to explicitly put such a tidyverse related package in the library if you want to use it. Unlike the core packages, these packages are NOT automatically loaded when using the library("tidyverse") command.

Finally, we will use a set of specialized packages, including:

  • foreign to import SPSS, Stata, and SAS data;
  • janitor to create frequency tables and cross tabulations;
  • psych for psychometric analysis, including scale construction and factor analysis;
  • lmerTest to approximate p-values;
  • lmtest for various diagnostic tests in the context of linear models (the Breusch-Pagan test for example);
  • lme4 for the linear mixed model;
  • car for some additional diagnostic tests in the context of linear models (including the vif value);
  • CTT for classical test theory (psychometrics);
  • Lambda4 for a collection of internal consistency reliability coefficients (psychometrics);
  • mirt for Item Response Theory (psychometrics).

At the beginning of a meeting or course we will tell you which packages to download (install.packages()) and call (library()).

Datasets in R-packages

Apart from a set of commands, most packages also contain data sets, to enable easy illustrations of what packages can do.

# to see which datasets are available in the packages loaded in the library,
# use:

data()

# to see all 'available' data in installed packages on your computer, use:

data(package = .packages(all.available = TRUE))

# sometimes it makes sense to put a data set in the Global Environment:

gss_cat <- gss_cat  # loaded in the package forcats, which is part of the core tidyverse
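
The same idea can be tried without any additional packages. The sketch below uses `mtcars`, a data set from the ‘datasets’ package that is part of every R installation; it is used here only as a stand-in for the data sets you will meet in the courses.

```r
# Put the data set mtcars in the environment
data("mtcars")

# Look at the first rows and at the size of the data set
head(mtcars, n = 3)
dim(mtcars)

# List the data sets of one specific package, with a short description
dataset_list <- data(package = "datasets")
head(dataset_list$results[, c("Item", "Title")])
```

Most data sets that come with a package also have a help page, for example ?mtcars.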

The ‘pipe’ operator

Some packages add additional operators to the set of base R operators. The ‘pipe’ operator (%>%) is the most important one. We urge you to use the ‘pipe’ operator as much as possible.

Suppose we wanted to create a new object with some gss_cat data from 2000 after we imported these data as an object into R. In this data set each row is an individual in a specific year. Religion and partyid are variables in this dataset. Suppose we want to select only data from 2000, and focus only on age and religion, and we want to store these data in a smaller object called ‘gss_cat_small’. We could use:

gss_cat <- gss_cat  # storing the data set in the global environment
gss_cat_small <- filter(select(gss_cat, year, age, relig), year == 2000)  # selecting variables and filtering cases

This is called a nested command (with the functions ‘filter’ and ‘select’). A nested command is often difficult to understand. Much simpler is:

# using a 'pipe' to simplify commands
gss_cat_small <- gss_cat %>%
  select(year, age, relig) %>%
  filter(year == 2000)

This second command will do the same thing and means: you have a big data frame (gss_cat, which is an easily called object in the package), you select some variables from this data frame, and then you filter those cases (from a specific year).
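
That the nested and the piped version give the same result can be checked directly. The sketch below assumes the dplyr package (part of the core tidyverse) is installed, and uses the built-in data set `mtcars` instead of gss_cat so that no other package is needed; the choice of variables is arbitrary.

```r
library(dplyr)  # assumed to be installed; part of the core tidyverse

# Nested version: read from the inside out
nested <- filter(select(mtcars, mpg, cyl, hp), cyl == 4)

# Piped version: read from top to bottom
piped <- mtcars %>%
  select(mpg, cyl, hp) %>%
  filter(cyl == 4)

# Both versions give exactly the same data frame
identical(nested, piped)   # TRUE
nrow(piped)
```

Note that inside a pipe the data frame is not repeated as the first argument: the pipe passes it on automatically.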

When writing code, use the pipe operator %>% as much as possible. There is a special hotkey in RStudio for the pipe operator: Ctrl+Shift+M (Windows & Linux), Cmd+Shift+M (Mac). When indenting after a pipe operator %>%, use two spaces (you can set this in RStudio’s preferences: RStudio / Preferences / Code / Insert spaces for tab / tab width = 2).

This document will be made available as a reference for what we expect you to know in various phases of your study program. The document will be updated regularly. We want to stick to about 20 to 25 pages. Suggestions for improvement are welcome.

Please note: some commands below work without additional packages (in ‘base R’), however throughout the remainder of this document we assume you have installed the (core) tidyverse packages and loaded them by using library("tidyverse"). For additional commands we will tell which R-packages you need.

Examples

No examples available yet

Functions

A short overview of the functions used in this page, which package they are in, and what they are used for, with a link to their official documentation

  • c()
    package: base
    Combine Values into a Vector or List.
  • data()
    package: utils (base)
    Loads specified data sets, or list the available data sets.
  • filter()
    package: dplyr (tidyverse)
    Find rows where conditions are true
  • install.packages()
    package: utils (base)
    Download and install packages from the internet on your computer
  • library()
    package: base
    Loads an installed package into the current R session
  • mean()
    package: base
    Find the mean of a vector of numbers.
  • select()
    package: dplyr (tidyverse)
    Get a specific column or columns from a dataset

FAQ

After loading a new package, R tells me that a certain object is masked from another package. What does this mean?
Some objects are included in several packages. For example, the function alpha() is included in both ‘ggplot2’ and ‘psych’. The functions are not the same though. Loading ‘psych’ after loading ‘ggplot2’ will give you the message that alpha is masked from the ggplot2 package. In this case, if you are using the function alpha(), you will use the function from the psych package (as the one from the ggplot2 package is masked).
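
If you need a specific version of a masked function, you can name the package explicitly with the package::function() notation. The sketch below shows the principle with base R only: instead of loading two packages, it defines a made-up function called sd, which masks the sd() function from the ‘stats’ package in the same way.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Define our own object called sd; it now masks the sd() function from 'stats'
sd <- function(x) "not the standard deviation"

sd(x)          # uses the masking version defined above
stats::sd(x)   # package::function() always uses the version from that package

# Remove our own version; sd() refers to stats::sd() again
rm(sd)
sd(x)
```

In the same way, psych::alpha() and ggplot2::alpha() each call one specific version, regardless of which package was loaded last.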

When trying to install a package I get the message that I need to install Rtools. How do I do that?
Rtools is a set of tools required to build R packages yourself. Since you will not be doing that, you should not need Rtools for any of the things you will be doing. However, RStudio sometimes still gives the message that Rtools is required when installing a package, even when the installation succeeds. Therefore, when you try to install a package, your console might look something like this (this example installs the package “tidyverse”, but it will look similar for other packages):

> install.packages("tidyverse")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/username/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/tidyverse_1.3.0.zip'
Content type 'application/zip' length 440017 bytes (429 KB)
downloaded 429 KB

package ‘tidyverse’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\username\AppData\Local\Temp\RtmpM9Rqjf\downloaded_packages

This may look like the installation failed and that you need to install Rtools, however, at the end it says that in fact the installation was successful. Usually, the package is successfully installed, despite the message saying that you need to install Rtools. You can test if the package is in fact installed properly, by calling:

> library(tidyverse)

Of course, replace “tidyverse” here with the name of the package you need. If you keep having problems, you can try to install Rtools anyway (even though you probably don’t need it). You can download Rtools for Windows from here: https://cran.uni-muenster.de/bin/windows/Rtools/.

When using a command I get the message Error : could not find function
Not all functions you need to use are always available. Many functions are part of packages. That means you need to first install and load the package before the function can be used. For example, the mutate() function is part of the tidyverse. Therefore, if the function mutate can’t be found, you need to load the tidyverse package with library(tidyverse). You can find which package a function belongs to in the function list of the relevant chapter, or by looking up the function on rdocumentation.org.

When loading a package, I get the message Error : there is no package called ‘...’
Before being able to load a package, it first needs to be installed. You can install a package using the function install.packages(). You only need to install the package once, after that it will be available for use in all your projects.

Resources


  1. To be fair, “=” will also work, but most people now use the “<-” assignment operator.

  2. The magrittr package (included in the tidyverse package) allows you to omit parentheses () on functions that do not have arguments. Avoid this feature, because it may be confusing. Be consistent in that functions always have parentheses, objects do not.

  3. A vector can contain numbers (1, 1.5, 3.333 etc.), or strings (words), or factors (to be explained later), or integers (1, 2, 3, 4, 5), etc. A single vector cannot have different types of data.

  4. In addition: never put spaces inside or outside parentheses for regular function calls. Most infix operators (==, +, -, <-, etc.) should always be surrounded by spaces. Use " for quoting text. Only use ' when the text already contains double quotes and no single quotes. If the arguments to a function do not all fit on one line (of max 80 characters), put each argument on its own line and indent with two spaces.

  5. click here to see all available packages

  6. click here to see the tidyverse style guide

  7. What is a bit confusing is that the tidyverse is both a set of packages and a ‘programming philosophy’. There are many more packages adhering to that tidyverse philosophy.