Apache Spark is a data processing framework for big data and machine learning environments.
There are several options for running Apache Spark on HPC to process machine learning data.
Spark Shell:
Apache Spark provides an interactive shell that can be used to run code against data directly within the shell environment.
Interactive job:
The Apache Spark Shell can be launched through an interactive SLURM job on HPC.
To spawn an interactive SLURM job on HPC, either of the following two commands can be used (both request one node with 126 tasks on the general partition):
srun -N 1 -n 126 -p general --pty bash
or
fisbatch -N 1 -n 126 --partition=general
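When the interactive job starts, the shell prompt moves onto the assigned compute node. To confirm the node name (it is needed later for the Web UI), the standard hostname command can be run; cnXX in the output below is a placeholder, not a real node:
[netid@cnXX ~]$ hostname
cnXX.storrs.hpc.uconn.edu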
Once a node is assigned to the interactive SLURM job, Apache Spark can be loaded and the Spark Shell called.
Loading the Apache Spark module and calling the Apache Spark interactive shell:
The following commands load the spark/3.1.1 module and set up the Apache Spark Spack environment.
[netid@cnXX ~]$ module load spark/3.1.1
[netid@cnXX ~]$ source /gpfs/sharedfs1/admin/hpc2.0/apps/spark/3.1.1/spack/share/spack/setup-env.sh
[netid@cnXX ~]$ spack load spark
[netid@cnXX ~]$ spark-shell
The Apache Spark Shell will load, printing several warning messages along the way:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/gpfs/sharedfs1/admin/hpc2.0/apps/spark/3.1.1/spack/opt/spack/linux-rhel8-zen2/gcc-11.3.0/spark-3.1.1-5asotiovqn6j5vhujukzig73hoajf23s/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2023-12-01 10:02:32,858 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://cnXX.storrs.hpc.uconn.edu:4040
Spark context available as 'sc' (master = local[*], app id = local-1701442957118).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.20.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
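As the startup messages show, the shell provides a Spark context as 'sc' and a Spark session as 'spark', and Scala expressions can be typed directly at the scala> prompt. A minimal sketch of a session (the values and RDD output shown are illustrative, not part of the module output):
scala> val nums = sc.parallelize(1 to 100)
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> nums.filter(_ % 2 == 0).count()
res0: Long = 50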
To quit out of spark-shell:
scala> :quit
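After quitting spark-shell, the interactive SLURM allocation can also be released. Assuming the srun or fisbatch session from above, typing exit at the compute node prompt ends the job and returns to the login node:
[netid@cnXX ~]$ exit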
Spark Shell Web Browser UI
When Spark Shell loads, it starts a Web UI on the compute node it is running on.
To use the Web UI from a browser on a local PC, the compute node's IP address needs to be entered in place of the compute node name in the link that Spark Shell prints:
Spark context Web UI available at http://cnXX.storrs.hpc.uconn.edu:4040
To find the IP address of the assigned compute node, open a separate terminal session on HPC and enter the following command:
nslookup cnXX
Where XX is the compute node number.
The IP address will be provided in the 137.99.x.x format.
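For reference, the nslookup output will look roughly like the following (cnXX and 137.99.x.x are placeholders, not a real node or address):
Name:    cnXX.storrs.hpc.uconn.edu
Address: 137.99.x.x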
Copy the 137.99.x.x IP address and use it in place of the compute node hostname in the above link:
http://137.99.x.x:4040
Once the link is entered in a browser on the local PC, the Apache Spark UI should load.
When finished with the Apache Spark Shell job, close the browser tab and exit the Apache Spark Shell as described in the previous section.