Running R jobs in parallel
Different R packages and strategies can be used to run calculations in parallel. Some popular packages for this task are parallel, Rmpi, doParallel, future, and many others. These packages mainly help you parallelize lapply (and friends) calls. However, most online resources teach us how to use these packages to parallelize processes on a single machine. To achieve parallelization across multiple nodes, one can resort to future.batchtools (link) or rslurm (link).
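As a point of reference, a minimal single-machine sketch using the base parallel package could look like the following (the toy function and the number of cores are just placeholders):

library(parallel)

## toy function applied to each element of the input
slow_square <- function(x) {
  Sys.sleep(0.1) ## pretend this is an expensive computation
  x^2
}

## parallel drop-in for lapply, using 4 forked workers (Unix-like systems only)
results <- mclapply(1:100, slow_square, mc.cores = 4)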
Embarrassingly parallel jobs
Often, users need to run simulations using R. In these cases, using other tools to achieve parallelization is easier (and more efficient). Currently, we have two options: job arrays and GNU Parallel. However, the former must be used cautiously, as it can easily lead to a significant waste of computational resources, impacting other cluster users. The latter gives us more flexibility but is slightly harder to set up.
Toy example: estimating variance
To illustrate our two options, assume we are working with a simple simulation study where we simulate data from a Normal distribution with mean 0 and variance 4. Our end goal is to verify how well we are estimating the variance. Although extremely simple, this example follows the same structure as many simulation studies, and it is an embarrassingly parallel task. In addition, we will use R in this toy example.
Job Arrays approach
Firstly, let us create some directories to store our results. The data directory will store the results per se, and the out directory will store any outputs from our script.
mkdir data out
Now, let’s write the slurm script (sim_ja.sh) for the job array approach (for more information on job arrays, see SLURM Job Arrays). Our job array will run the sim_ja.R script 100 times (25 at a time).
#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/ja_%A_%a.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --array=1-100%25
#SBATCH --ntasks=1 ## OBS 1
#SBATCH --cpus-per-task=1 ## OBS 2
#SBATCH --mem-per-cpu=500M ## OBS 3
module load r/4.2.2
## avoiding implicit parallelism
export OMP_NUM_THREADS=1
Rscript sim_ja.R
OBS 1: Most R jobs are sequential. It is very unlikely that you will need more than 1 task.
OBS 2: It is unlikely that you are using any multithreaded code in R.
OBS 3: Debug your code beforehand to know how much RAM you need when executing your script (see the sacct example after these notes).
These three observations are good practices meant to avoid waste of resources.
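For example (a hypothetical check; replace <jobid> with the ID of a finished test run), SLURM's accounting tool can report the peak memory a job actually used:

sacct -j <jobid> --format=JobID,MaxRSS,Elapsed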
The sim_ja.R script will look as follows:
## storing the "task id"
i <- Sys.getenv("SLURM_ARRAY_TASK_ID")
##--- setting seed for reproducibility ----
set.seed(as.integer(i))
##--- simulating data ----
n <- 100
sigma2 <- 4
y <- rnorm(n, sd = sqrt(sigma2))
##--- computing estimates ----
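## var() divides by n - 1; rescaling by (n - 1)/n = 1 - 1/n gives the MLE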
estimates <- c("s2"  = var(y),
               "mle" = (1 - (1/n)) * var(y))
##--- saving results ----
filename <- sprintf("data/var_ja_%s.rds", i)
saveRDS(estimates, file = filename)
As usual, the sbatch command is used to get our simulation started:
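sbatch sim_ja.sh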
Looking at the out/ directory, we will see one ja_<jobid>_<taskid>.out file per array task. Similarly, the data/ folder will hold 100 var_ja_*.rds files.
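For example, a quick way to confirm that all 100 result files were written:

ls data/var_ja_*.rds | wc -l   ## should print 100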
GNU Parallel approach
We can execute the same task using GNU Parallel. All the files to be created here will be placed in the same directory as the ones we created for the job array approach.
The submission script, in this case, will be named sim_gp.sh and looks as follows:
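(The exact contents depend on how the parallel module is set up on your cluster. In the sketch below, the output file name, the way PAROPTS is passed to parallel, and the srun options are assumptions.)

#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/gp_%j.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --ntasks=25            ## number of replicates run at a time
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M

module load r/4.2.2
module load parallel

## avoiding implicit parallelism
export OMP_NUM_THREADS=1

## creating the "ID" variable by ourselves (one ID per replicate)
IDS=$(seq 1 100)

## running 25 replicates at a time; each ID is passed as an argument to the R script
parallel $PAROPTS -j $SLURM_NTASKS "srun -N1 -n1 --exclusive Rscript sim_gp.R {}" ::: $IDS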
The differences here are:
- We need to create an “ID” variable by ourselves;
- We pass these IDs as arguments to the R script;
- PAROPTS is a helper bash script loaded when loading the parallel module.
We need to modify the R script (this one we name sim_gp.R). The script for the GNU Parallel approach looks as follows:
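(In the sketch below, reading the ID from commandArgs() and the var_gp_ file prefix are assumptions; the rest mirrors sim_ja.R.)

## reading the "task id" passed as a command-line argument by gnu parallel
args <- commandArgs(trailingOnly = TRUE)
i <- args[1]

##--- setting seed for reproducibility ----
set.seed(as.integer(i))

##--- simulating data ----
n <- 100
sigma2 <- 4
y <- rnorm(n, sd = sqrt(sigma2))

##--- computing estimates ----
estimates <- c("s2"  = var(y),
               "mle" = (1 - (1/n)) * var(y))

##--- saving results ----
filename <- sprintf("data/var_gp_%s.rds", i)
saveRDS(estimates, file = filename)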
We are ready to execute our GNU Parallel script. To do so, it suffices to run:
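sbatch sim_gp.sh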
Unlike the job array approach, we have a single output file for this job. You can double-check the data directory to make sure all the output files were generated.
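Assuming the var_gp_ prefix from the sketch above, the same kind of count works here:

ls data/var_gp_*.rds | wc -l   ## should print 100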
Processing results
Regardless of the method chosen, the results need to be processed. We can either transfer the raw output files to our machine and process them locally or write a script to process them on the cluster. In this post, we will take a look at the latter option. Extending it to the former is trivial.
A single core will be enough to process the files. Consider the following submission file:
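(A sketch; the file name process.sh and the resource values are assumptions.)

#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/process_%j.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G

module load r/4.2.2

Rscript process.R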
where the process.R script is as follows:
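(In the sketch below, the var_ja_ and var_gp_ file prefixes follow the scripts above, and the name of the processed results file is an assumption. It stacks the per-replicate estimates, averages them, prints the summaries to the job output, and saves them in the data directory.)

##--- listing the raw result files ----
files_ja <- list.files("data", pattern = "^var_ja_.*\\.rds$", full.names = TRUE)
files_gp <- list.files("data", pattern = "^var_gp_.*\\.rds$", full.names = TRUE)

##--- reading and stacking the estimates (one row per replicate) ----
res_ja <- do.call(rbind, lapply(files_ja, readRDS))
res_gp <- do.call(rbind, lapply(files_gp, readRDS))

##--- averaging over the replicates ----
summary_ja <- colMeans(res_ja)
summary_gp <- colMeans(res_gp)

##--- printing the summaries to the job output file ----
print(summary_ja)
print(summary_gp)

##--- saving processed results ----
saveRDS(list("job_array" = summary_ja, "gnu_parallel" = summary_gp),
        file = "data/processed_results.rds")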
Finally, run the following command to submit the processing script:
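sbatch process.sh   ## assuming the submission file above was saved as process.sh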
If we check the data directory once the job finishes, we will now also see the processed results.
The output file will contain the summarized estimates from both approaches. The estimates for the GNU Parallel and job array approaches look exactly the same because we used the same seeds for the random number generators.