
Several R packages and strategies can be used to run calculations in parallel. Popular choices include parallel, Rmpi, doParallel, future, and many others. These packages mainly help you parallelize lapply (and friends) calls. However, most online resources show how to use them to parallelize processes on a single machine.
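
For instance, a minimal single-machine sketch using the base parallel package could look like the following (simulate_once is a placeholder for whatever one replication of your analysis does):

library(parallel)

## placeholder standing in for one replication of an analysis
simulate_once <- function(i) {
  set.seed(i)
  var(rnorm(100, sd = 2))
}

## run 100 replications over 4 cores; mclapply relies on forking (Linux/macOS),
## on Windows use makeCluster() + parLapply() instead
results <- mclapply(1:100, simulate_once, mc.cores = 4)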

To parallelize across multiple nodes, one can resort to packages such as future.batchtools or rslurm.
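
As an example, a minimal future.batchtools sketch could look like the one below; the template file name and the resources list are assumptions that must match a batchtools Slurm template configured for your cluster:

library(future.batchtools)
library(future.apply)

## each chunk of work is submitted as its own Slurm job via a batchtools template
plan(batchtools_slurm,
     template  = "batchtools.slurm.tmpl",  ## assumed template file
     resources = list(ncpus = 1, memory = "500M", walltime = "00:10:00"))

## distribute 100 replications across the submitted jobs,
## with parallel-safe random number streams
results <- future_lapply(1:100,
                         function(i) var(rnorm(100, sd = 2)),
                         future.seed = TRUE)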

Embarrassingly parallel jobs

Often, users need to run simulations using R. In these cases, using other tools to achieve parallelization is easier (and more efficient). Currently, we have two options: job arrays and GNU Parallel. However, the former must be used cautiously, as it can easily lead to a significant waste of computational resources and impact other cluster users. The latter gives us more flexibility but is slightly harder to set up.

Toy example: estimating variance

To illustrate the two options, assume we are working with a simple simulation study where we simulate data from a Normal distribution with mean 0 and variance 4. Our goal is to verify how well we estimate the variance. Although extremely simple, this example has the same structure as many simulation studies, and it is an embarrassingly parallel task. We will use R for this toy example.
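
Concretely, for a sample y_1, ..., y_n, the two estimators computed in the scripts below (labelled s2 and mle) are the usual unbiased sample variance and the maximum likelihood estimator:

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad
\hat{\sigma}^2_{\text{MLE}} = \frac{n-1}{n}\, s^2 = \left(1 - \frac{1}{n}\right) s^2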

Job Arrays approach

Firstly, let us create some directories to store our results. The data directory will store the results themselves, and the out directory will store any output from our scripts.

mkdir data out

Now, let’s write the slurm script (sim_ja.sh) for the job array approach (for more information on job arrays, see SLURM Job Arrays). Our job array will run the sim_ja.R script 100 times (25 at a time).

#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/ja_%A_%a.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --array=1-100%25
#SBATCH --ntasks=1 ## OBS 1
#SBATCH --cpus-per-task=1 ## OBS 2
#SBATCH --mem-per-cpu=500M ## OBS 3

module load r/4.2.2

## avoiding implicit parallelism
export OMP_NUM_THREADS=1

Rscript sim_ja.R

OBS 1: Most R jobs are sequential. It is very unlikely that you will need more than 1 task.

OBS 2: It is unlikely that you are using any multithreaded code in R.

OBS 3: Debug your code beforehand to know how much RAM you need when executing your script.

These three observations are good practices meant to avoid waste of resources.
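
One way to check how much memory a finished job actually used, assuming Slurm accounting (sacct) is available on the cluster, is:

## maximum resident memory and elapsed time per job step
sacct -j 2038716 --format=JobID,JobName,MaxRSS,Elapsed,State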

The sim_ja.R script looks as follows:

## storing the "task id"
i <- Sys.getenv("SLURM_ARRAY_TASK_ID")

##--- setting seed for reproducibility ----
set.seed(as.integer(i))
##--- simulating data ----
n <- 100
sigma2 <- 4
y <- rnorm(n, sd = sqrt(sigma2))

##--- computing estimates ----
estimates <- c("s2" = var(y),
               "mle" = (1 - (1/n)) * var(y))
##--- saving results ----
filename <-
    sprintf("data/var_ja_%s.rds", i)
saveRDS(estimates, file = filename)
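
Before submitting the whole array, the script can be tested for a single replication by setting the task id manually in a shell where R is available (for example, in an interactive session after module load r/4.2.2):

SLURM_ARRAY_TASK_ID=1 Rscript sim_ja.R
ls data/var_ja_1.rds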

As usual, the sbatch command is used to get our simulation started:

[ldc19002@login4 r-example]$ sbatch sim_ja.sh
Submitted batch job 2038716

Looking at the out/ dir, we have the following files

[ldc19002@login4 r-example]$ ls out/
ja_2038580_100.out  ja_2038580_23.out  ja_2038580_37.out  ja_2038580_50.out  ja_2038580_64.out  ja_2038580_78.out  ja_2038580_91.out
ja_2038580_10.out   ja_2038580_24.out  ja_2038580_38.out  ja_2038580_51.out  ja_2038580_65.out  ja_2038580_79.out  ja_2038580_92.out
ja_2038580_11.out   ja_2038580_25.out  ja_2038580_39.out  ja_2038580_52.out  ja_2038580_66.out  ja_2038580_7.out   ja_2038580_93.out
ja_2038580_12.out   ja_2038580_26.out  ja_2038580_3.out   ja_2038580_53.out  ja_2038580_67.out  ja_2038580_80.out  ja_2038580_94.out
ja_2038580_13.out   ja_2038580_27.out  ja_2038580_40.out  ja_2038580_54.out  ja_2038580_68.out  ja_2038580_81.out  ja_2038580_95.out
ja_2038580_14.out   ja_2038580_28.out  ja_2038580_41.out  ja_2038580_55.out  ja_2038580_69.out  ja_2038580_82.out  ja_2038580_96.out
ja_2038580_15.out   ja_2038580_29.out  ja_2038580_42.out  ja_2038580_56.out  ja_2038580_6.out   ja_2038580_83.out  ja_2038580_97.out
ja_2038580_16.out   ja_2038580_2.out   ja_2038580_43.out  ja_2038580_57.out  ja_2038580_70.out  ja_2038580_84.out  ja_2038580_98.out
ja_2038580_17.out   ja_2038580_30.out  ja_2038580_44.out  ja_2038580_58.out  ja_2038580_71.out  ja_2038580_85.out  ja_2038580_99.out
ja_2038580_18.out   ja_2038580_31.out  ja_2038580_45.out  ja_2038580_59.out  ja_2038580_72.out  ja_2038580_86.out  ja_2038580_9.out
ja_2038580_19.out   ja_2038580_32.out  ja_2038580_46.out  ja_2038580_5.out   ja_2038580_73.out  ja_2038580_87.out
ja_2038580_1.out    ja_2038580_33.out  ja_2038580_47.out  ja_2038580_60.out  ja_2038580_74.out  ja_2038580_88.out
ja_2038580_20.out   ja_2038580_34.out  ja_2038580_48.out  ja_2038580_61.out  ja_2038580_75.out  ja_2038580_89.out
ja_2038580_21.out   ja_2038580_35.out  ja_2038580_49.out  ja_2038580_62.out  ja_2038580_76.out  ja_2038580_8.out
ja_2038580_22.out   ja_2038580_36.out  ja_2038580_4.out   ja_2038580_63.out  ja_2038580_77.out  ja_2038580_90.out

Similarly, the data/ folder has 100 files:

[ldc19002@login4 r-example]$ ls data/
var_ja_100.rds  var_ja_19.rds  var_ja_28.rds  var_ja_37.rds  var_ja_46.rds  var_ja_55.rds  var_ja_64.rds  var_ja_73.rds  var_ja_82.rds  var_ja_91.rds
var_ja_10.rds   var_ja_1.rds   var_ja_29.rds  var_ja_38.rds  var_ja_47.rds  var_ja_56.rds  var_ja_65.rds  var_ja_74.rds  var_ja_83.rds  var_ja_92.rds
var_ja_11.rds   var_ja_20.rds  var_ja_2.rds   var_ja_39.rds  var_ja_48.rds  var_ja_57.rds  var_ja_66.rds  var_ja_75.rds  var_ja_84.rds  var_ja_93.rds
var_ja_12.rds   var_ja_21.rds  var_ja_30.rds  var_ja_3.rds   var_ja_49.rds  var_ja_58.rds  var_ja_67.rds  var_ja_76.rds  var_ja_85.rds  var_ja_94.rds
var_ja_13.rds   var_ja_22.rds  var_ja_31.rds  var_ja_40.rds  var_ja_4.rds   var_ja_59.rds  var_ja_68.rds  var_ja_77.rds  var_ja_86.rds  var_ja_95.rds
var_ja_14.rds   var_ja_23.rds  var_ja_32.rds  var_ja_41.rds  var_ja_50.rds  var_ja_5.rds   var_ja_69.rds  var_ja_78.rds  var_ja_87.rds  var_ja_96.rds
var_ja_15.rds   var_ja_24.rds  var_ja_33.rds  var_ja_42.rds  var_ja_51.rds  var_ja_60.rds  var_ja_6.rds   var_ja_79.rds  var_ja_88.rds  var_ja_97.rds
var_ja_16.rds   var_ja_25.rds  var_ja_34.rds  var_ja_43.rds  var_ja_52.rds  var_ja_61.rds  var_ja_70.rds  var_ja_7.rds   var_ja_89.rds  var_ja_98.rds
var_ja_17.rds   var_ja_26.rds  var_ja_35.rds  var_ja_44.rds  var_ja_53.rds  var_ja_62.rds  var_ja_71.rds  var_ja_80.rds  var_ja_8.rds   var_ja_99.rds
var_ja_18.rds   var_ja_27.rds  var_ja_36.rds  var_ja_45.rds  var_ja_54.rds  var_ja_63.rds  var_ja_72.rds  var_ja_81.rds  var_ja_90.rds  var_ja_9.rds

GNU Parallel approach

We can execute the same task using GNU Parallel. All the files to be created here will be placed in the same directory as the ones we created for the job array approach.

The submission script, in this case, will be named sim_gp.sh and looks as follows:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/gp_%A.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M 

module load r/4.2.2 parallel

NSIM=100
ID_SIM=$(eval echo {1..$NSIM})
PARSC=$(/gpfs/sharedfs1/gnuparallel/parallel_opts.sh)
export OMP_NUM_THREADS=1

SCRIPT="Rscript sim_gp.R"

parallel $PARSC \
	 $SCRIPT \
	 ::: ${ID_SIM[@]}
	
rm *sshloginfile

The differences here are:

  1. We need to create an “ID” variable by ourselves;

  2. We pass these IDs as arguments to the R script (see the brief illustration after this list);

  3. The PARSC variable stores the options generated by parallel_opts.sh, a helper bash script made available when the parallel module is loaded.
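
As a quick illustration of point 2, GNU Parallel appends each value listed after ::: to the command as an argument (or substitutes it wherever {} appears), one invocation per value:

parallel --keep-order echo "running replication {}" ::: 1 2 3
## running replication 1
## running replication 2
## running replication 3

In sim_gp.sh there is no {} in the command, so each ID is simply appended after sim_gp.R, which is why the R script reads it with commandArgs().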

We also need to modify the R script (this one we name sim_gp.R). The script for the GNU Parallel approach looks as follows:

## storing the "task id"
i <- commandArgs(trailingOnly = TRUE)[1]

## setting seed for reproducibility
set.seed(as.integer(i))

## simulating data
n <- 100
sigma2 <- 4
y <- rnorm(n, sd = sqrt(sigma2))

## computing estimates
estimates <- c("s2" = var(y),
               "mle" = (1 - (1/n)) * var(y))

## saving results
filename <-
    sprintf("data/var_gp_%s.rds", i)
saveRDS(estimates, file = filename)

We are ready to execute our GNU Parallel script. To do so, it suffices to run:

[ldc19002@login4 r-example]$ sbatch sim_gp.sh
Submitted batch job 2039098

Unlike the job array approach, we have a single output file for this job. You can double-check the data directory to make sure all the output files were generated.
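
For instance, counting the GNU Parallel result files should return 100 if every replication finished:

[ldc19002@login4 r-example]$ ls data/var_gp_*.rds | wc -l
100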

Processing results

Regardless of the method chosen, the results need to be processed. We can either transfer the raw output files to our machine and process them locally or write a script to process them on the cluster. In this post, we will take a look at the latter option. Extending it to the former is trivial.
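
If you prefer to process the results locally, the raw files can be copied off the cluster with rsync or scp; the hostname and destination below are placeholders to adapt to your own setup:

rsync -av user@hpc.example.edu:~/r-example/data/ ./r-example-data/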

A single core is enough to process the files. Consider the following submission file (process.sh):

#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/processing_%A.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G

module load r/4.2.2

Rscript process.R

where the process.R script is as follows:

## value used to simulate the data
sigma2 <- 4

## reading and processing
files_ja <- list.files("data/", pattern = "^var_ja", full.names = TRUE)
results_ja <- sapply(files_ja, readRDS)
saveRDS(results_ja, "data/processed_ja.rds")

## reading and processing
files_gp <- list.files("data/", pattern = "^var_gp", full.names = TRUE)
results_gp <- sapply(files_gp, readRDS)
saveRDS(results_gp, "data/processed_gp.rds")

## mean of the estimates
cat("\nExpected values of the estimates: \n")
apply(results_gp, 1, mean)
apply(results_ja, 1, mean)

## median of the estimates
cat("\nMedian of the estimates: \n")
apply(results_gp, 1, median)
apply(results_ja, 1, median)

## Estimated bias
cat("\nBias: \n")
apply(results_gp - sigma2, 1, mean)
apply(results_ja - sigma2, 1, mean)

## MSE
cat("\nMSE: \n")
apply((results_gp - sigma2) *
     (results_gp - sigma2),
      1, mean)
apply((results_ja - sigma2) *
      (results_ja - sigma2),
      1, mean)

Finally, submit the processing job:

sbatch process.sh

If we check the data directory, we now see the processed files:

[ldc19002@login5 r-example]$ ls data/processed_*
data/processed_gp.rds  data/processed_ja.rds

The output file should contain the following output:

Loading r/4.2.2
  Loading requirement: gcc/11.3.0

Expected values of the estimates: 
      s2      mle 
3.957173 3.917601 
      s2      mle 
3.957173 3.917601 

Median of the estimates: 
      s2      mle 
3.912197 3.873075 
      s2      mle 
3.912197 3.873075 

Bias: 
         s2         mle 
-0.04282711 -0.08239884 
         s2         mle 
-0.04282711 -0.08239884 

MSE: 
       s2       mle 
0.3410326 0.3392379 
       s2       mle 
0.3410326 0.3392379

The estimates for the GNU Parallel and job array approaches are identical because we used the same seeds for the random number generators.
