This page contains information on how to install R and RStudio, what it is used for and some basic terminology and commands, like installing packages.
R is a software environment in which you can analyze data by typing instructions (programming). In the field of behavioral and social sciences, researchers and teachers are increasingly working with R instead of programs like STATA, SAS, SPSS and EXCEL, because R has some advantages compared to these programs:
R allows you to manage data sets, to change data in order to facilitate analysis, to analyze data and to visualize data. R also allows you to integrate data in text files, which reduces the amount of ‘copy-pasting’ and thus reduces the number of mistakes made in this process. This improves the reproducibility of research.
RStudio is a ‘shell’ designed to help you to be more productive with R. RStudio is an Integrated Development Environment (IDE) that helps you develop programs and scripts in R. RStudio has a set of integrated tools that are helpful when analyzing and visualizing your data. More information about RStudio can be found here: https://rstudio.com/products/rstudio/features/. You will probably only encounter R via RStudio.
At BMS we think students in the behavioral and social sciences should have a clear understanding of ‘data’ and using R and RStudio will give you both a clear understanding of data and a flexible and transparent way to handle these data. R also allows us to teach statistics in a flexible way, adding new ideas (like ‘big data analysis’) to our curriculum more easily.
For most purposes, installing R and RStudio on your laptop is straightforward. For R go to
For RStudio go to:
and select the free version.
Although we will offer you a lot of materials to learn the R basics, keep in mind that for learning R even better you need to find your way on the web to learn additional procedures and to practice even more. How?
Some suggestions:
A nice feature in R is the help function. If you need help on a function, just type a “?” in front of a function. If you type “??” in front of the function, you will even get help from all over the web.
# For example use this:
?data.frame
??tidyverse
After typing and running these commands, the answer can be found in the right side lower corner of the RStudio panel.
Learning R includes learning a lot of new words like ‘working directory’, ‘script’ and ‘console’. Although it might be hard to understand the terms completely by reading the explanations below while not yet working in R, you will see that in a few weeks you will understand exactly what is meant with the terms. Reread this document a few times after starting working with R.
The working directory is the folder on your laptop or pc or in the cloud, where all kinds of data, output (including pictures), and scripts are stored as files, so you can quickly find them back later. It is wise to create a working directory for each separate course or data project and to tell R you will use that directory for this specific project.
setwd("[...]")
# The [...] has to be replaced by something applicable to your computer, for
# example.
setwd("/Volumes/myname/Userdata/My Documents/Documenten/Research/2020_Local_elections")
# The current working directory (in case you forget) can be found with the
# command
getwd()
Make sure you can find the path of a working directory in your mac or pc: - on Mac, in Finder, the small ‘radar’ on top allows you to copy the location as ‘pathname’ - in Windows, in the File Explorer, select View in the toolbar, -> Options, select Change folder and search options, to open the Folder Options dialogue box, Click View to open the View tab. In Advanced settings, add a check mark for Display the full path in the title bar, Click Apply, Click OK to close the dialogue box.
A script is a set of commands, for example a command meaning ‘read in my data’ or ‘create a nice a frequency table’. Commands are preferably combined with some # comments explaining what happens in the command lines. In order to distinguish between command lines and comments, comments are preceded by the # sign.
Use - - - - - to break up your script into easily readable chunks.
If you want to run a line of the script, you move the cursor to the respective line and you use ctrl+enter (cmd+enter on a Mac). The instruction will then then automatically move to the R Console and the cursor in the script automatically jumps to the next line.
The console panel (at the bottom in RStudio, with the prompt “>” ) can be used to directly type commands. For example, when using R as a simple calculator you can just type in the numbers in the console. However, you are not advised using the console: do as much as possible via the script.
# adding 1 + 1
> 1 + 1
[1] 2
Objects, including imported data, are stored in the global environment. When you create new objects, these objects are stored in the global environment too. For example, if you create a frequency table, you can store the table (the object) in the global environment. The object can always be called back later in the analysis. The global environment can be saved separately and called later, although for many practical purposes it is sufficient to create a global environment anew, by using the script (remember: the set of commands and comments) which created that environment in the first place. The global environment can be found on the right upper side of RStudio.
Data sets, numbers, and output can be stored as objects in the global environment, using the (left) assignment operator “<-”.1 An object can be a lot of things. For example, if you use R to create a boxplot, the picture of the boxplot can be stored as an object, to be called later when needed.
When naming objects, use only lowercase letters, numbers, and _. Use underscores (_) to separate words within a name. For example, only use names like ‘data_dpes_2021’, or ‘freq_tab_001’.
# for example, to use the 10 cases throughout a script, you can define the
# object 'n' to '10'
n <- 10
# or to create a vector of numbers, you can write ...
dataset_1 <- c(3, 2, 1, 4, 9, 6, 7, 100)
# calling this object in a script or in the console gives the contents of the
# thus created object.
dataset_1
## [1] 3 2 1 4 9 6 7 100
# this give the following output in the console:
Functions in R are ‘commands’ telling R what to do: they are ‘do this’ statements. R has a lot of those, and users are adding more every day. Functions end with parentheses: “()”.2 For example:
# The command head() means 'Show me the first part (the head) of a data set'.
# In this example the called data set is an object called dataset_1 (see
# above).
head(dataset_1)
Data frames in R are matrices (numbers and words stored in
rows and columns) in which the columns are variables and the rows are
units of observation (observations). Sometimes this data frame is called
a data matrix, but in the R language a data matrix is a data
frame with only numbers, not with strings (text) or other types of
variables.
One single column (or: a variable) with numbers or with strings is
called a vector.3 Sometimes objects are not just data frames,
but combinations of data frames and something more, or combinations of
different data frames. These objects are called lists.
For reasons beyond the scope of this introduction, some functions only work with data frames (even if they contain only numbers), not with matrices. In these cases you have to explicitly tell R your matrix is a data frame.
Variables in a data frame have a name (like ‘x’ or ‘gender’). These can be be called directly using the ‘$’ sign. So ‘data$gender’ refers to the column/variable ‘gender’ in the data frame. More generally, the ‘$’ sign refers to a ‘component’ of an object. If you store output as an object to the global environment, you can call a specific element of that output using the ‘$’ sign.
There can be different types of variables in a vector/data frame. The most important are variables containing:
The most frequently used operators used in R, beyond the well known - (Minus), + (Plus), * (Multiplication), / (Division) and <- (the Left assignment operator), are:
operator | meaning |
---|---|
^ | Exponentiation |
! | Not |
> | Greater than |
== | Equal to not simply ‘=’ |
>= | Greater than or equal to, binary |
<= | Less than or equal to, binary |
: | Sequence (in model formulae: interaction) |
$ | List subset, including a column in a data frame |
Commands can all be put in the console (the one at the bottom-left of RStudio, starting with the “>”). But since you often want to redo the same thing (trying and tweaking), and because you want to ask others for help and show them exactly what you did, we urge you to store all commands in a script.
Open a new script in RStudio (file -> new file -> R script).
This will appear in the upper-left pane. Type all commands in the
script. This script can be stored (extension is .R) and is basically a
.txt file. The rule to use the script as much as possible also applies
to things you can do in RStudio. Importing your data
can be done by using RStudio, but we urge you to use
commands like read.csv()
in the script too.
When writing a script, start with a title, preceded with the # (because it is not a command) on top, for example:
# This script is for computing the mean maximum temperature for several days
# This vector includes the highest temperatures recorded on several days
temperature <- c(19, 17, 20, 20, 13, 13, 15, 17)
mean(temperature) # to compute the mean maximum temperature
When writing commands in the script, always put a space after a comma, and never before a comma, just like in regular English.4
R and RStudio can be used more efficiently, by ‘add ons’ called packages. These packages simplify programming in R as they include new functions. After downloading (= installing) a package, there is no need to do that every time you are using R, although it may be wise to sometimes check for updates. After installing and opening R and RStudio, you can install packages using a command in a script.
# This script installs the packages 'foreign' and 'tidyverse':
install.packages("foreign", "tidyverse")
Downloading (= installing) a package is not the same as ‘calling’ that package for usage in R: packages are not automatically opened when you are using R and RStudio. And that is a good thing: packages can use conflicting short cuts (functions with the same name, but with very different outcomes). Therefore, each time you are using a specific package you have to load that package into the library (aka ‘activate’). A script therefore often starts with ‘calling’ the relevant packages for that script.
# Loading the packages into the library is done with commands like these:
library(foreign)
library(tidyverse)
A lot of statistical analyses can be done by using ‘base R’ and its associated functions, but for some more specialized analyses we will use specialized packages.
Since the number of packages is huge5, we have decided to use only a limited number of packages at BMS. The use of R at BMS will be based on the tidyverse6. We will therefore mainly use packages belonging to the tidyverse (that are installed and loaded when installing and loading the tidyverse package).7 We will also use some other packages.
Here is an overview of the packages you will be using at BMS:
tidyverse
is a set of packages for the most important
data manipulation is visualization tasks. This is a ‘meta-package’
containing (among others):
ggplot2
to visualize your data;dplyr
and tidyr
to change data, and to
make sure that each variable is in a column, each observation is a row
and each value is a cell;readr
to import .csv data.These packages are thus loaded in the library when using the function
library("tidyverse")
.
When installing the tidyverse, some other packages are downloaded too, some of which we will use:
readxl
to import .xls and .xlsx sheets;haven
to import SPSS, Stata, and SAS data;broom
to use a tidy()
function, which may
come handy when doing statistical analysis. This package is also part of
tidymodels
;modelr
to make plots with the residuals and/or the
predicted values (this package is also part of
tidymodels
).Make sure to explicitly put such a tidyverse related package in the library if you want to use it. Unlike the core packages, these packages are NOT automatically loaded when using the library(“tidyverse”) command.
Finally, we will use a set of specialized packages, including:
foreign
to import SPSS, Stata, and SAS data;janitor
to create frequency tables and cross
tabulations;psych
for psychometric analysis, including scale
construction and factor analysis;lmerTest
to approximate p-values;lmtest
for various diagnostic tests in the context of
linear models (Levene’s test for example);lme4
for the linear mixed model;car
for some additional diagnostic tests in the context
of linear model (including the vif value);CTT
for classical test theory (psychometrics)Lambda4
for Collection of Internal Consistency
Reliability Coefficients (psychometrics)mirt
for Item Response Theory (psychometrics).At the beginning of a meeting or course we will tell you which
packages to download (install.packages()
) and call
(library()
).
Apart from a set of commands, most packages also contain data sets, to enable easy illustrations of what packages can do.
# to see which datasets are available in the packages loaded in the library,
# use:
data()
# to see all 'available' data in installed packages on your computer, use:
data(package = .packages(all.available = TRUE))
# sometimes it makes sense to put a data set in the Global Environment:
gss_cat <- gss_cat # loaded in the package forcats, which is part of the core tidyverse
Some packages add additional operators to the set of base R
operators. The ‘pipe’ operator (%>%
) is the most
important one. We urge you to use the ‘pipe’ function as much as
possible.
Suppose we wanted to create a new object with some gss_cat data from 2000 after we imported these data as an object into R. In this data set each row is an individual in a specific year. Religion and partyid are variables in this dataset. Suppose we want to select only data from 2000, and focus only on age and religion, and we want to store these data in a smaller object called ‘gss_cat_small’. We could use:
gss_cat <- gss_cat # storing the data set in the global environment
gss_cat_small <- filter(select(gss_cat, year, age, relig), year == 2000) # filtering variables and selecting cases
This is called a nested command (with the functions ‘filter’ and ‘select’). A nested command is often difficult to understand. Much simpler is:
# using a 'pipe' to simplify commands
gss_cat_small <- gss_cat %>%
select(gss_cat, year, age, relig) %>%
filter(year == 2000)
This second command will do the same thing and means: you have a big data frame (gss_cat, which is an easily called object in the package), you select some variables from this data frame, and then you filter those cases (from a specific year).
When writing code, use the pipe %>%
operator as much
as possible. There is special hotkey in RStudio for the pipe operator:
Ctrl+Shift+M (Windows & Linux), Cmd+Shift+M (Mac). When indenting
after a pipe >%> operator, you need to use two spaces (you can
automatically set this in RStudio’s preferences (RStudio / Preferences /
Code/ Insert spaces for tab / tab width = 2).
This document will be made available as a reference for what we expect you to know in various phases of your study program. The document will be updated regularly. We want to stick to about 20 to 25 pages. Suggestions for improvement are welcome.
Please note: some commands below work without additional packages (in
‘base R’), however throughout the remainder of this document we assume
you have installed the (core) tidyverse packages by using
library("tidyverse")
. For additional commands we will tell
which R-packages you need.
No examples available yet
A short overview of the functions used in this page, which package they are in, and what they are used for, with a link to their official documentation
c()
data()
filter()
install.packages()
library()
mean()
select()
After loading a new package, R tells me that a certain object
is masked from another package. What does this mean?
Some objects are included in several packages. For example, the function
alpha() is included in both ‘ggplot2’ and ‘psych’. The functions are not
the same though. Loading ‘psych’ after loading ‘ggplot’ will give you
the warning that alpha is masked from the ggplot2 package. In this case,
if you are using the function alpha(), you will use the function from
the psych package (as the one from the ggplot2 package is masked).
When trying to install a package I get the message that I
need to install Rtools. How do I do that?
R-Tools is a program that is required to program and make packages
yourself. Since you will not be doing that, you should not need R-Tools
for any of the things you will be doing. However, sometimes RStudio
still gives the message that R-Tools is required when installing a
package, even when the installation of the package is going
successfully. Therefore, when you try to install a package, your console
might look something like this: (I will install the library “tidyverse”
in this example, but it will look similar with other libraries)
> install.packages("tidyverse")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/username/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/tidyverse_1.3.0.zip'
Content type 'application/zip' length 440017 bytes (429 KB)
downloaded 429 KB
package ‘tidyverse’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\username\AppData\Local\Temp\RtmpM9Rqjf\downloaded_packages
This may look like the installation failed and that you need to install Rtools, however, at the end it says that in fact the installation was successful. Usually, the package is successfully installed, despite the message saying that you need to install Rtools. You can test if the package is in fact installed properly, by calling:
> library(tidyverse)
Change “tidyverse” here of course in the name of the library you need. If you still keep having problems, you can try to install Rtools anyway (even though you probably don’t need it). You can download Rtools for Windows from here: https://cran.uni-muenster.de/bin/windows/Rtools/.
When using a command I get the message
Error : could not find function
Not all functions
you need to use are always available. Many functions are part of
packages. That means you need to first install and load the package
before the function can be used. For example, the mutate()
function is part of the tidyverse
. Therefore, if the
function mutate can’t be found, you need to load the tidyverse package
with library(tidyverse)
. You can find which library a
function belongs to in the function list of the
relevant chapter, or by looking up the function on rdocumentation.org.
When loading a package, I get the message
Error : there is no package called ‘...’
Before
being able to load a package, it first needs to be installed. You can
install a package using the function packages.install()
.
You only need to install the package once, after that it will be
available for use in all your projects.
To be fair, also “=” will work, but most people now use the “<-” assignment operator↩︎
The magrittr
package (included in the
tidyverse
package) allows you to omit parentheses () on
functions that do not have arguments. Avoid this feature, because it may
be confusing. Be consistent in that functions always have parentheses,
objects do not.↩︎
A vector can contain numbers (1, 1.5, 3.333 etc.), or strings (words), or factors (to be explained later), or integers (1,2,3,4,5), etc. A single vector cannot have different types of data.↩︎
In addition: never put spaces inside or outside parentheses for regular function calls. Most infix operators (==, +, -, <-, etc.) should always be surrounded by spaces. Use ” for quoting text. Only use ’ when the text already contains double quotes and no single quotes. If the arguments to a function do not all fit on one line (of max 80 characters), put each argument on its own line and indent with two spaces.↩︎
What is a bit confusing, is that the tidyverse is both a set of packages, and a ‘programming philosophy’. There are many more packages adhering to that tidyverse philosophy.↩︎