Different R packages and strategies can be used to run computations in parallel. Popular packages for this task include parallel, Rmpi, doParallel, future, and many others. These packages mainly help you parallelize lapply (and friends) calls. However, most online resources only teach how to use these packages to parallelize processes on a single machine. To parallelize across multiple nodes, one can resort to future.batchtools (link) or rslurm (link).
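Before moving to the cluster, here is a minimal single-machine sketch using the base parallel package (the sim_once helper is made up for illustration; assumes a Unix-like system):

```r
## Minimal single-machine parallelism with the base 'parallel' package.
## sim_once is a hypothetical replicate function for illustration.
library(parallel)

sim_once <- function(i) {
  set.seed(i)             # one seed per replicate
  y <- rnorm(100, sd = 2) # simulate data
  var(y)                  # return the estimate
}

## mclapply forks workers (Unix-only); on Windows, use parLapply instead
estimates <- mclapply(1:100, sim_once, mc.cores = 4)
length(estimates)  # 100
```

This works well until the job outgrows one machine, which is where the approaches below come in.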
Embarrassingly parallel jobs
Often, users need to run simulations in R. In these cases, it is easier (and more efficient) to use other tools to achieve parallelization. Currently, we have two options: job arrays and GNU Parallel. However, the former must be used cautiously, as it can easily lead to a significant waste of computational resources, impacting other cluster users. The latter gives us more flexibility but is slightly harder to set up.
Toy example: estimating variance
To illustrate the two options, assume we are working with a simple simulation study in which we simulate data from a Normal distribution with mean 0 and variance 4. Our goal is to verify how well we estimate the variance. Although extremely simple, this example follows the same structure as many simulation studies, and it is an embarrassingly parallel task. We will use R throughout this toy example.
Job Arrays approach
First, let us create some directories to store our results. The data directory will store the results themselves, and the out directory will store any output from our scripts.
mkdir data out
Now, let's write the slurm submission script (sim_ja.sh) for the job array approach (for more information on job arrays, see SLURM Job Arrays). Our job array will run the sim_ja.R script 100 times (25 tasks at a time).
#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/ja_%A_%a.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --array=1-100%25
#SBATCH --ntasks=1         ## OBS 1
#SBATCH --cpus-per-task=1  ## OBS 2
#SBATCH --mem-per-cpu=500M ## OBS 3

module load r/4.2.2

## avoiding implicit parallelism
export OMP_NUM_THREADS=1

Rscript sim_ja.R
OBS 1: Most R jobs are sequential. It is very unlikely that you will need more than 1 task.

OBS 2: It is unlikely that you are using any multithreaded code in R.

OBS 3: Debug your code beforehand so you know how much RAM your script needs.

These three observations are good practices meant to avoid wasting resources.
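Regarding OBS 3, one rough way to gauge memory usage (a sketch; gc() only tracks R-level allocations, not memory used by external libraries) is to run the simulation body once locally and inspect the garbage collector's summary:

```r
## Rough local check of how much RAM the script needs (R-level only):
## run the body once, then look at gc()'s "max used" columns (Ncells/Vcells).
n <- 100
y <- rnorm(n, sd = sqrt(4))
estimates <- c("s2" = var(y), "mle" = (1 - (1/n)) * var(y))
mem <- gc()
mem  # the "max used" (Mb) column bounds R's peak memory so far
```

Add a safety margin on top of what you observe before setting --mem-per-cpu.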
The sim_ja.R script looks as follows:
## storing the "task id"
i <- Sys.getenv("SLURM_ARRAY_TASK_ID")

##--- setting seed for reproducibility ----
set.seed(as.integer(i))

##--- simulating data ----
n <- 100
sigma2 <- 4
y <- rnorm(n, sd = sqrt(sigma2))

##--- computing estimates ----
estimates <- c("s2" = var(y), "mle" = (1 - (1/n)) * var(y))

##--- saving results ----
filename <- sprintf("data/var_ja_%s.rds", i)
saveRDS(estimates, file = filename)
As usual, the sbatch command is used to get our simulation started:
[ldc19002@login4 r-example]$ sbatch sim_ja.sh
Submitted batch job 2038716
Looking at the out/ directory, we have the following files:
[ldc19002@login4 r-example]$ ls out/
ja_2038580_100.out ja_2038580_23.out ja_2038580_37.out ja_2038580_50.out ja_2038580_64.out ja_2038580_78.out ja_2038580_91.out
ja_2038580_10.out  ja_2038580_24.out ja_2038580_38.out ja_2038580_51.out ja_2038580_65.out ja_2038580_79.out ja_2038580_92.out
ja_2038580_11.out  ja_2038580_25.out ja_2038580_39.out ja_2038580_52.out ja_2038580_66.out ja_2038580_7.out  ja_2038580_93.out
ja_2038580_12.out  ja_2038580_26.out ja_2038580_3.out  ja_2038580_53.out ja_2038580_67.out ja_2038580_80.out ja_2038580_94.out
ja_2038580_13.out  ja_2038580_27.out ja_2038580_40.out ja_2038580_54.out ja_2038580_68.out ja_2038580_81.out ja_2038580_95.out
ja_2038580_14.out  ja_2038580_28.out ja_2038580_41.out ja_2038580_55.out ja_2038580_69.out ja_2038580_82.out ja_2038580_96.out
ja_2038580_15.out  ja_2038580_29.out ja_2038580_42.out ja_2038580_56.out ja_2038580_6.out  ja_2038580_83.out ja_2038580_97.out
ja_2038580_16.out  ja_2038580_2.out  ja_2038580_43.out ja_2038580_57.out ja_2038580_70.out ja_2038580_84.out ja_2038580_98.out
ja_2038580_17.out  ja_2038580_30.out ja_2038580_44.out ja_2038580_58.out ja_2038580_71.out ja_2038580_85.out ja_2038580_99.out
ja_2038580_18.out  ja_2038580_31.out ja_2038580_45.out ja_2038580_59.out ja_2038580_72.out ja_2038580_86.out ja_2038580_9.out
ja_2038580_19.out  ja_2038580_32.out ja_2038580_46.out ja_2038580_5.out  ja_2038580_73.out ja_2038580_87.out
ja_2038580_1.out   ja_2038580_33.out ja_2038580_47.out ja_2038580_60.out ja_2038580_74.out ja_2038580_88.out
ja_2038580_20.out  ja_2038580_34.out ja_2038580_48.out ja_2038580_61.out ja_2038580_75.out ja_2038580_89.out
ja_2038580_21.out  ja_2038580_35.out ja_2038580_49.out ja_2038580_62.out ja_2038580_76.out ja_2038580_8.out
ja_2038580_22.out  ja_2038580_36.out ja_2038580_4.out  ja_2038580_63.out ja_2038580_77.out ja_2038580_90.out
Similarly, the data/ folder contains 100 files:
[ldc19002@login4 r-example]$ ls data/
var_ja_100.rds var_ja_19.rds var_ja_28.rds var_ja_37.rds var_ja_46.rds var_ja_55.rds var_ja_64.rds var_ja_73.rds var_ja_82.rds var_ja_91.rds
var_ja_10.rds  var_ja_1.rds  var_ja_29.rds var_ja_38.rds var_ja_47.rds var_ja_56.rds var_ja_65.rds var_ja_74.rds var_ja_83.rds var_ja_92.rds
var_ja_11.rds  var_ja_20.rds var_ja_2.rds  var_ja_39.rds var_ja_48.rds var_ja_57.rds var_ja_66.rds var_ja_75.rds var_ja_84.rds var_ja_93.rds
var_ja_12.rds  var_ja_21.rds var_ja_30.rds var_ja_3.rds  var_ja_49.rds var_ja_58.rds var_ja_67.rds var_ja_76.rds var_ja_85.rds var_ja_94.rds
var_ja_13.rds  var_ja_22.rds var_ja_31.rds var_ja_40.rds var_ja_4.rds  var_ja_59.rds var_ja_68.rds var_ja_77.rds var_ja_86.rds var_ja_95.rds
var_ja_14.rds  var_ja_23.rds var_ja_32.rds var_ja_41.rds var_ja_50.rds var_ja_5.rds  var_ja_69.rds var_ja_78.rds var_ja_87.rds var_ja_96.rds
var_ja_15.rds  var_ja_24.rds var_ja_33.rds var_ja_42.rds var_ja_51.rds var_ja_60.rds var_ja_6.rds  var_ja_79.rds var_ja_88.rds var_ja_97.rds
var_ja_16.rds  var_ja_25.rds var_ja_34.rds var_ja_43.rds var_ja_52.rds var_ja_61.rds var_ja_70.rds var_ja_7.rds  var_ja_89.rds var_ja_98.rds
var_ja_17.rds  var_ja_26.rds var_ja_35.rds var_ja_44.rds var_ja_53.rds var_ja_62.rds var_ja_71.rds var_ja_80.rds var_ja_8.rds  var_ja_99.rds
var_ja_18.rds  var_ja_27.rds var_ja_36.rds var_ja_45.rds var_ja_54.rds var_ja_63.rds var_ja_72.rds var_ja_81.rds var_ja_90.rds var_ja_9.rds
GNU Parallel approach
We can execute the same task using GNU Parallel. All the files to be created here will be placed in the same directory as the ones we created for the job array approach.
The submission script, in this case, is named sim_gp.sh and looks as follows:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/gp_%A.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M

module load r/4.2.2 parallel

NSIM=100
ID_SIM=$(eval echo {1..$NSIM})
PARSC=$(/gpfs/sharedfs1/gnuparallel/parallel_opts.sh)

export OMP_NUM_THREADS=1

SCRIPT="Rscript sim_gp.R"

parallel $PARSC \
         $SCRIPT \
         ::: ${ID_SIM[@]}

rm *sshloginfile
The differences here are:

We need to create an "ID" variable ourselves;

We pass these IDs as arguments to the R script;

The PARSC variable is populated by a helper bash script made available when loading the parallel module.
We also need to modify the R script (this one we name sim_gp.R). The script for the GNU Parallel approach looks as follows:
## storing the "task id"
i <- commandArgs(trailingOnly = TRUE)[1]

## setting seed for reproducibility
set.seed(as.integer(i))

## simulating data
n <- 100
sigma2 <- 4
y <- rnorm(n, sd = sqrt(sigma2))

## computing estimates
estimates <- c("s2" = var(y), "mle" = (1 - (1/n)) * var(y))

## saving results
filename <- sprintf("data/var_gp_%s.rds", i)
saveRDS(estimates, file = filename)
We are ready to execute our GNU Parallel script. To do so, it suffices to run:
[ldc19002@login4 r-example]$ sbatch sim_gp.sh
Submitted batch job 2039098
Unlike the job array approach, this job produces a single output file. You can double-check the data directory to make sure all the result files were generated.
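One quick way to run that check (a sketch, assuming it is executed from the project directory) is to compare the expected task IDs against the files actually written:

```r
## Find task IDs (1..100) whose result file is missing from data/.
done <- list.files("data", pattern = "^var_gp_[0-9]+\\.rds$")
ids_done <- as.integer(sub("^var_gp_([0-9]+)\\.rds$", "\\1", done))
missing <- setdiff(1:100, ids_done)
if (length(missing) > 0) {
  cat("Missing IDs:", missing, "\n")  # rerun only these replicates
} else {
  cat("All 100 result files present\n")
}
```

This is handy when a few tasks fail: instead of resubmitting everything, you can rerun only the missing IDs.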
Processing results
Regardless of the method chosen, the results need to be processed. We can either transfer the raw output files to our machine and process them locally or write a script to process them on the cluster. In this post, we will take a look at the latter option. Extending it to the former is trivial.
A single core will be enough to process the files. Consider the following submission file:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --output=out/processing_%A.out
#SBATCH --constraint="cpuonly" ## avoiding gpu (preventing waste of resources)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G

module load r/4.2.2

Rscript process.R
where the process.R script is as follows:
## value used to simulate the data
sigma2 <- 4

## reading and processing (job array results)
files_ja <- list.files("data/", pattern = "^var_ja", full.names = TRUE)
results_ja <- sapply(files_ja, readRDS)
saveRDS(results_ja, "data/processed_ja.rds")

## reading and processing (gnu parallel results)
files_gp <- list.files("data/", pattern = "^var_gp", full.names = TRUE)
results_gp <- sapply(files_gp, readRDS)
saveRDS(results_gp, "data/processed_gp.rds")

## mean of the estimates
cat("\nExpected values of the estimates: \n")
apply(results_gp, 1, mean)
apply(results_ja, 1, mean)

## median of the estimates
cat("\nMedian of the estimates: \n")
apply(results_gp, 1, median)
apply(results_ja, 1, median)

## Estimated bias
cat("\nBias: \n")
apply(results_gp - sigma2, 1, mean)
apply(results_ja - sigma2, 1, mean)

## MSE
cat("\nMSE: \n")
apply((results_gp - sigma2) * (results_gp - sigma2), 1, mean)
apply((results_ja - sigma2) * (results_ja - sigma2), 1, mean)
Finally, submit the processing script with:
sbatch process.sh
If we check the data directory, we now see the processed files:
[ldc19002@login5 r-example]$ ls data/processed_* data/processed_gp.rds data/processed_ja.rds
The output file should contain the following output:
Loading r/4.2.2
  Loading requirement: gcc/11.3.0

Expected values of the estimates:
      s2      mle
3.957173 3.917601
      s2      mle
3.957173 3.917601

Median of the estimates:
      s2      mle
3.912197 3.873075
      s2      mle
3.912197 3.873075

Bias:
         s2         mle
-0.04282711 -0.08239884
         s2         mle
-0.04282711 -0.08239884

MSE:
       s2       mle
0.3410326 0.3392379
       s2       mle
0.3410326 0.3392379
The estimates for the GNU Parallel and job array approaches are exactly the same because we used the same seeds for the random number generators.
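This match is easy to verify: seeding R's RNG with the same integer reproduces exactly the same draws, and both scripts seed with the task ID.

```r
## Same seed => identical data, hence identical estimates in both runs.
set.seed(7)
a <- rnorm(100, sd = sqrt(4))
set.seed(7)
b <- rnorm(100, sd = sqrt(4))
identical(a, b)  # TRUE
```

This is also why seeding with the task ID, rather than with a random or time-based value, is good practice: any replicate can be reproduced or rerun on its own.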