...

When the Apache Spark Shell job is finished, feel free to close the browser tab and end the Apache Spark Shell session described in the previous section.

Local Spark standalone cluster installation to allow for cluster worker processes

The global Apache Spark module available on HPC only allows basic, single-node Spark command usage and does not spawn a cluster or worker processes.

Users do not have permission to write the log files that Spark generates when a cluster is spawned from the global installation; a permission denied message will occur.

As an alternative, users wishing to spawn worker processes on compute nodes through Spark’s standalone cluster feature need a local installation of Apache Spark on HPC.

The following steps show how to install Apache Spark locally under a user’s account, how to create a local loadable Spark module, and how to set up a submission script that submits to compute node(s) on HPC and spawns worker processes.

Create a couple of directories to begin installing Apache Spark

Code Block
[netidhere@login5 ~]$ mkdir -pv ApacheSpark/3.5.0
[netidhere@login5 ~]$ mkdir -pv ApacheSpark/buildfiles

Download Apache Spark into the buildfiles directory and unpack the file into the ApacheSpark/3.5.0 directory

Code Block
[netidhere@login5 ~]$ cd ApacheSpark/buildfiles

Click the Pre-Built Apache Spark Binary link on the Apache Spark Download page, then copy the download link from the page that opens.

Code Block
[netidhere@login5 buildfiles]$ wget PasteHTTPSLinkForApacheSparkHere
...
...
[netidhere@login5 buildfiles]$ ls
spark-3.5.0-bin-hadoop3.tgz

[netidhere@login5 buildfiles]$ tar -xzvf spark-3.5.0-bin-hadoop3.tgz -C ../3.5.0
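
As an optional check, list the target directory to confirm the archive unpacked into the location the module file created below will point to:

Code Block
[netidhere@login5 buildfiles]$ ls ../3.5.0
spark-3.5.0-bin-hadoop3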

Create a loadable Apache Spark module to point to the new installation.

The following example shows how to create a local directory called mod and a new module file for Apache Spark 3.5.0.

Code Block
[netidhere@login5 ~]$ mkdir mod
[netidhere@login5 ~]$ cd mod
[netidhere@login5 mod]$ mkdir spark
[netidhere@login5 mod]$ cd spark
[netidhere@login5 spark]$ vi 3.5.0-mine

Copy and paste the following contents into the 3.5.0-mine Apache Spark module file:

Code Block
#%Module1.0
## Spark 3.5.0 modulefile
##
proc ModulesHelp { } {
    puts stderr "Provides the Spark environment for large-scale data processing"
    puts stderr "This version has been built with hadoop 3.3"
}

module-whatis "Apache Spark™ is a unified analytics engine for large-scale data processing."

set prefix /gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3

prepend-path PATH $prefix/bin
prepend-path PATH $prefix/sbin

Update the prefix path by replacing netidhere with your NetID.

Save the file with :wq!
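
As an optional sanity check before editing .bashrc (a temporary, current-shell-only alternative to the step below), the module directory can be added to the module path and the new module inspected:

Code Block
[netidhere@login5 ~]$ module use ${HOME}/mod
[netidhere@login5 ~]$ module show spark/3.5.0-mine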

Set the MODULEPATH variable in the user’s .bashrc file to add the new Apache Spark module path, so the module can be loaded within a submission script.

Code Block
vi ~/.bashrc

Add the following lines to the .bashrc file:

Code Block
# My modules
source /etc/profile.d/modules.sh
MODULEPATH=${HOME}/mod:${MODULEPATH}

Save the .bashrc file with :wq!

The new spark/3.5.0-mine module will become available to load and use in job submissions.
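
As an optional check (not part of the original steps), reload the shell configuration and confirm that the new module is visible and that spark-submit resolves to the local installation. The path shown assumes the installation location used above:

Code Block
[netidhere@login5 ~]$ source ~/.bashrc
[netidhere@login5 ~]$ module avail spark
[netidhere@login5 ~]$ module load spark/3.5.0-mine
[netidhere@login5 ~]$ which spark-submit
/gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3/bin/spark-submit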

Create a job submission script:

The following example provides the contents of a submission script that loads the new Spark module, submits to 2 nodes with 252 cores and 20GB of RAM per node, spawns a master process on one compute node and worker(s) on the assigned compute nodes, runs a given script, and stops all worker/master processes when the script finishes.

Update the #SBATCH --job-name declaration with a short name for the current job.

Update the #SBATCH --time declaration with the needed run time for the job.

Code Block
#!/bin/bash
#SBATCH --job-name=jobNameHere      # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks=252             # total number of tasks (cores) across all nodes
#SBATCH --mem=20G                # memory per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --no-requeue             # do not requeue the job if it is interrupted

module load spark/3.5.0-mine

start-all.sh                     # start the Spark standalone master and worker processes
echo $MASTER | tee master.txt    # write the value of $MASTER to master.txt (assumes MASTER is set in the environment)

# run the application on the cluster; replace the placeholder with your script
spark-submit --total-executor-cores 252  <code script looking to run here>

stop-all.sh                      # stop all worker and master processes

Once the needed changes are made, save the above submission script.
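
The script can then be submitted with sbatch and monitored with squeue. The filename spark_job.sh below is only an example; use whatever name the script was saved under:

Code Block
[netidhere@login5 ~]$ sbatch spark_job.sh
[netidhere@login5 ~]$ squeue -u netidhere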

Access the Apache Spark web interface after submitting the above submission script:

Apache Spark will create an output file for all logging and information regarding worker deployment.

To access the web GUI, a log file will be generated with the following name format (where XX is the compute node number for the worker):

Code Block
/gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3/logs/spark-netidhere-org.apache.spark.deploy.worker.Worker-1-cnXX.out
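
To see which worker log files were created for the job (the path assumes the installation location used above), list the logs directory:

Code Block
[netidhere@login5 ~]$ ls /gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3/logs/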

View the contents of the file to find the line where the worker web UI was started.

The WorkerWebUI address is the link to use within a browser.

Code Block
 WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cnXX.storrs.hpc.uconn.edu:8083
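
As a shortcut (not part of the original instructions), the line can be pulled out of the log directly with grep; replace netidhere and XX as appropriate:

Code Block
[netidhere@login5 ~]$ grep WorkerWebUI /gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3/logs/spark-netidhere-org.apache.spark.deploy.worker.Worker-1-cnXX.out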

Here XX is the compute node number; however, the compute node hostname does not resolve outside the cluster. To access the WebUI, enter the IP address of the compute node instead.

Code Block
http://IPaddressOfNodeHere:8083

To find the IP address of the node, one can use the following command (replace XX with the compute node number):

Code Block
nslookup cnXX

Copy the IP address from the output and paste it into the http link above, using the port that Apache Spark assigned to the worker process.
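
Alternatively (an equivalent lookup, not from the original steps), getent prints the node's address on a single line; replace XX with the compute node number:

Code Block
[netidhere@login5 ~]$ getent hosts cnXX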

The WebUI should show up successfully and look like the following:

...

Info

If port 8083 does not work, try port 8082 at the end of the link instead. Also, make sure the link uses http and NOT https.

When the submission script finishes or its time limit is reached, the Apache Spark stop-all.sh script will stop all workers and the master process.