
Apache Spark is a data processing framework that can be used for big data and machine learning workloads.

There are several options for running Apache Spark on HPC to process machine learning data.

Spark Shell:

Apache Spark provides an interactive shell that can be used to run code and explore data within the shell environment.

Interactive job:

The Apache Spark Shell can be launched through an interactive SLURM job on HPC.

An interactive SLURM job can be spawned on HPC with either of the following two commands:

srun -N 1 -n 126 -p general --pty bash

or

fisbatch -N 1 -n 126 --partition=general

Once a node is assigned to the interactive SLURM job, Apache Spark can be loaded and the Spark Shell called.
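
For example, once the interactive session starts, the shell prompt moves from the login node to the assigned compute node, which can be confirmed with hostname (output illustrative; XX is the assigned node number):

[netid@cnXX ~]$ hostname
cnXX.storrs.hpc.uconn.edu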

Loading the Apache Spark module and calling the Apache Spark interactive shell:

The following commands will load the spark/3.1.1 module and set up the Apache Spark Spack environment.

[netid@cnXX ~]$ module load spark/3.1.1
[netid@cnXX ~]$ source /gpfs/sharedfs1/admin/hpc2.0/apps/spark/3.1.1/spack/share/spack/setup-env.sh
[netid@cnXX ~]$ spack load spark
[netid@cnXX ~]$ spark-shell

The Apache Spark Shell will load but will print warning messages:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/gpfs/sharedfs1/admin/hpc2.0/apps/spark/3.1.1/spack/opt/spack/linux-rhel8-zen2/gcc-11.3.0/spark-3.1.1-5asotiovqn6j5vhujukzig73hoajf23s/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2023-12-01 10:02:32,858 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://cnXX.storrs.hpc.uconn.edu:4040
Spark context available as 'sc' (master = local[*], app id = local-1701442957118).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.20.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
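
As a quick sanity check, a small Scala expression can be evaluated at the scala> prompt using the pre-created Spark context sc (the expression and output below are illustrative only):

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0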

To quit spark-shell, type :quit at the scala> prompt:

scala> :quit

Spark Shell Web Browser UI

When the Spark Shell loads, it starts a Web UI on the compute node where the Spark Shell is running.

To use the Web UI from a local browser, replace the compute node name in the link that Spark Shell prints with the compute node's IP address.

Spark context Web UI available at http://cnXX.storrs.hpc.uconn.edu:4040

To find the IP address of the assigned compute node, open a separate terminal session on HPC and run the following command in the new terminal:

nslookup cnXX  

Where XX is the compute node number.

The IP address will be returned in the 137.99.x.x format.
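
For example (output abbreviated and illustrative; the actual address will differ):

[netid@login5 ~]$ nslookup cnXX
Name:    cnXX.storrs.hpc.uconn.edu
Address: 137.99.x.x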

Copy the 137.99.x.x IP address and replace the compute node name in the above link with the IP address for the compute node.

http://IPAddressOfComputeNode.storrs.hpc.uconn.edu:4040

Once this address is entered in a browser on the local PC, the Apache Spark Web UI should load.

When the Apache Spark Shell job is finished, close the browser tab and quit the Apache Spark Shell as described in the previous section.

Local Spark Standalone cluster install to allow for cluster worker processes

The global Apache Spark module available on HPC only allows basic single-node Spark usage and does not spawn a cluster or worker processes.

If a Spark cluster is spawned from the global version, users do not have permission to write the log files that Spark generates, and a permission denied message will occur.

As an alternative, users who wish to spawn worker processes on compute nodes through Spark's standalone cluster feature need a local installation of Apache Spark on HPC.

The following steps show how to install Apache Spark locally under a user's account, how to create a locally loadable Spark module, and how to set up a submission script that submits to compute node(s) on HPC and spawns worker processes.

Create a couple of directories to begin installing Apache Spark

[netidhere@login5 ~]$ mkdir -pv ApacheSpark/3.5.0
[netidhere@login5 ~]$ mkdir -pv ApacheSpark/buildfiles

Download Apache Spark into the buildfiles directory and unpack the file into the ApacheSpark/3.5.0 directory

[netidhere@login5 ~]$ cd ApacheSpark/buildfiles

Click the pre-built Apache Spark binary link on the Apache Spark download page.
Copy the download link from the new page that opens.

[netidhere@login5 buildfiles]$ wget PasteHTTPSLinkForApacheSparkHere
...
...
[netidhere@login5 buildfiles]$ ls
spark-3.5.0-bin-hadoop3.tgz

[netidhere@login5 buildfiles]$ tar -xzvf spark-3.5.0-bin-hadoop3.tgz -C ../3.5.0
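
A quick listing confirms where Spark was unpacked (output illustrative):

[netidhere@login5 buildfiles]$ ls ../3.5.0
spark-3.5.0-bin-hadoop3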

Create a loadable Apache Spark module to point to the new installation.

The following example shows how to create a local directory called mod to hold a new module file for Apache Spark 3.5.0.

[netidhere@login5 ~]$ mkdir mod
[netidhere@login5 ~]$ cd mod
[netidhere@login5 mod]$ mkdir spark
[netidhere@login5 mod]$ cd spark
[netidhere@login5 spark]$ vi 3.5.0-mine

Copy and paste the following contents into the 3.5.0-mine Apache Spark module file:

#%Module1.0
## Spark 3.5.0 modulefile
##
proc ModulesHelp { } {
    puts stderr "Provides the Spark environment for large-scale data processing"
    puts stderr "This version has been built with hadoop 3.3"
}

module-whatis "Apache Spark™ is a unified analytics engine for large-scale data processing."

set prefix /gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3

prepend-path PATH $prefix/bin
prepend-path PATH $prefix/sbin

Update the prefix path, replacing netidhere with your NetID.

Save the file with :wq!

Set the MODULEPATH variable in the user's .bashrc file so that the new Apache Spark module can be found and loaded within a submission script.

vi ~/.bashrc

Add the following lines to the .bashrc file:

# My modules
source /etc/profile.d/modules.sh
MODULEPATH=${HOME}/mod:${MODULEPATH}

Save the .bashrc file with :wq!

The new spark/3.5.0-mine module will become available to load and use in job submissions.
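
To confirm the module is visible, open a new shell (or re-source ~/.bashrc) and load it; for example (the exact module avail output will vary):

[netidhere@login5 ~]$ source ~/.bashrc
[netidhere@login5 ~]$ module avail spark
[netidhere@login5 ~]$ module load spark/3.5.0-mine
[netidhere@login5 ~]$ spark-submit --version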

Create a job submission script:

The following example shows the contents of a submission script that loads the new Spark module, requests 2 nodes, 252 total cores, and 20GB of RAM per node, spawns a master process on one compute node and worker process(es) on the assigned compute nodes, runs a given script, and stops all worker/master processes when the script finishes.

Update the #SBATCH --job-name declaration with a job name for current job.

Update the #SBATCH --time declaration with the needed run time for the job.

#!/bin/bash
#SBATCH --job-name=jobNameHere   # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks=252             # total number of tasks across all nodes
#SBATCH --mem=20G                # memory per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --no-requeue             # never requeue the job after node failure or preemption

module load spark/3.5.0-mine

# spawn the standalone master and worker processes on the allocated node(s)
start-all.sh
# record the master address (if set) in master.txt for reference
echo $MASTER | tee master.txt

# run the application, replacing the placeholder with the script to execute
spark-submit --total-executor-cores 252  <code script looking to run here>

# stop all worker and master processes when the application finishes
stop-all.sh

Once the needed changes are made, save the above submission script.
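
As an illustration only, the <code script looking to run here> placeholder could point to a small PySpark script such as the one below; the file name sparkpi.py, the sample count, and the app name are arbitrary examples and not part of the cluster setup itself.

# sparkpi.py - minimal illustrative PySpark job for spark-submit
from operator import add
from random import random

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; spark-submit supplies the cluster settings
spark = SparkSession.builder.appName("PiEstimate").getOrCreate()
sc = spark.sparkContext

n = 1000000  # number of random samples to draw

def inside(_):
    # Return 1 if a random point lands inside the unit quarter-circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(n)).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()

The job is then submitted like any other SLURM script; assuming the submission script above was saved as spark_cluster.slurm:

[netidhere@login5 ~]$ sbatch spark_cluster.slurm
[netidhere@login5 ~]$ squeue -u netidhere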

Access Apache Spark web browser interface after submitting the above submission script:

Apache Spark will create an output file for all logging and information regarding worker deployment.

To access the Web GUI interface, Apache Spark generates a worker log file with the following name format (where XX is the compute node number for the worker):

/gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3/logs/spark-netidhere-org.apache.spark.deploy.worker.Worker-1-cnXX.out

View the contents of this log file to find the address of the worker Web UI.

The WorkerWebUI line shown below provides the link to use in a browser.

 WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://cnXX.storrs.hpc.uconn.edu:8083

Where XX is the compute node number. The compute node name will not resolve in a local browser, however, so the compute node's IP address must be entered instead:

http://IPaddressOfNodeHere.storrs.hpc.uconn.edu:8083

To find the IP address of the node, one can use the following command (replace XX with the compute node number):

nslookup cnXX

Copy the returned IP address and paste it into the above http link, keeping the port that Apache Spark assigned to the worker process.

The WebUI should show up successfully and look like the following:

[Image: ApacheSPARKWebUI.png]

If port 8083 does not work, port 8082 sometimes works at the end of the link; replace 8083 with 8082 if a connection to the Web UI cannot be established. Also, make sure the link uses http and NOT https.

When the time limit for the submission script is reached, the Apache Spark stop-all.sh script will stop all workers and the master process.

PySpark

The Apache Spark installation with Hadoop functionality provides a built-in PySpark and a bundled Python version used for PySpark calculations.

PySpark can be enabled by adding the python and pyspark paths to the local loadable module created earlier.

Here is an example of the local Apache Spark module file with the PySpark paths added:

#%Module1.0
## Spark 3.5.0 modulefile
##
proc ModulesHelp { } {
    puts stderr "Provides the Spark environment for large-scale data processing"
    puts stderr "This version has been built with hadoop 3.3"
}

module-whatis "Apache Spark™ is a unified analytics engine for large-scale data processing."

set prefix /gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3

prepend-path PATH $prefix/bin
prepend-path PATH $prefix/sbin
prepend-path PATH $prefix/python
prepend-path PATH $prefix/python/pyspark

# Optional: uncomment to set these environment variables (modulefiles use setenv)
#setenv SPARK_HOME $prefix
#setenv HADOOP_HOME $prefix

Then once the module is loaded, pyspark can be called:

$ module purge
$ module load spark/3.5.0-mine
$ pyspark
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/28 13:00:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.11.5 (main, Sep 11 2023 13:54:46)
Spark context Web UI available at http://cn462.storrs.hpc.uconn.edu:4040
Spark context available as 'sc' (master = local[*], app id = local-1709143216797).
SparkSession available as 'spark'.
>>>
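
A quick check from the >>> prompt that the session works, using the pre-created spark session (output illustrative):

>>> spark.range(1000).count()
1000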

The Web UI is also available; however, the link that PySpark provides will not work. To access the Web UI, enter the IP address of the compute node that the PySpark job is running on instead:

http://IPaddressOfComputeNodeHere.storrs.hpc.uconn.edu:4040

To find the IP address, the following command can be entered outside of PySpark:

nslookup cn462

Alternatively, a user can ping the node; the response will include the compute node's IP address.

ping cn462

Once the IP address is placed within the HTTP link, the Web UI should load and look like the following:

[Image: ApacheSparkPySPARKWebUI.png]
