Apache Spark

Apache Spark is a data processing framework used for big data and machine learning workloads.

There are several options for running Apache Spark on HPC to process data; they are described in the sections below.

Spark Shell:

Apache Spark provides an interactive shell that can be used to run code and explore data within the shell environment.

Interactive job:

The Apache Spark Shell can be launched through an interactive SLURM job on HPC.

To spawn an interactive SLURM job on HPC, there are two command options that can be used:

srun -N 1 -n 126 -p general --pty bash

or

fisbatch -N 1 -n 126 --partition=general

Once a node is assigned to the interactive SLURM job, Apache Spark can be loaded and the Spark Shell called.

Loading the Apache Spark module and calling the Apache Spark interactive shell:

The following commands will load the spark/3.1.1 module and set up the Apache Spark Spack environment.

[netid@cnXX ~]$ module load spark/3.1.1
[netid@cnXX ~]$ source /gpfs/sharedfs1/admin/hpc2.0/apps/spark/3.1.1/spack/share/spack/setup-env.sh
[netid@cnXX ~]$ spack load spark
[netid@cnXX ~]$ spark-shell

The Apache Spark Shell will load, but it may print warning messages while starting up.

To quit out of spark-shell:
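The shell is a Scala REPL, so the :quit command exits it (Ctrl+D also works):

scala> :quit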

Spark Shell Web Browser UI

When the Spark Shell loads, it sets up a Web UI on the compute node where the shell is running.

To use the Web UI, the compute node name in the link that Spark Shell prints must be replaced with the compute node's IP address, since the node name will not resolve from a browser on a local machine.

To find the IP address of the assigned compute node, open a separate window/terminal session on HPC and run the following command:
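The exact command from the original guide is not reproduced here; one common way to look up a node's address (assuming the node name resolves on the cluster) is:

nslookup cnXX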

Where XX is the compute node number.

The IP address will be returned in the 137.99.x.x format.

Copy the 137.99.x.x IP address and use it to replace the compute node name in the link that Spark Shell provides.

Once entered in a browser on the local PC, the Apache Spark UI should load and look like the following:

When the Apache Spark Shell job is finished, close the browser tab and exit the Apache Spark Shell as described in the previous section.

Local Spark Standalone cluster install to allow for cluster worker processes

The global Apache Spark module available on HPC only supports basic, single-node Spark usage and does not spawn a cluster or worker processes.

If a Spark cluster is spawned with the global installation, users do not have permission to write the log files that Spark generates, and a "permission denied" message will occur.

As an alternative, users who wish to spawn worker processes on compute nodes through Spark's standalone cluster feature will need a local installation of Apache Spark on HPC.

The following steps show how Apache Spark can be installed locally under a user's account, how to create a local loadable Spark module, and how to set up a submission script that submits to compute node(s) on HPC and spawns worker processes.

Create a couple of directories to begin installing Apache Spark
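For example, assuming both directories are created under the home directory (the names match the steps below):

mkdir -p ~/buildfiles
mkdir -p ~/ApacheSpark/3.5.0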

Download Apache Spark into the buildfiles directory and unpack the file into the ApacheSpark/3.5.0 directory
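A sketch of this step, assuming the prebuilt "with Hadoop 3" package from the Apache archive and the directories created above:

cd ~/buildfiles
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz -C ~/ApacheSpark/3.5.0 --strip-components=1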

Create a loadable Apache Spark module to point to the new installation.

The following example shows how to create a local directory called mod and a new module file for Apache Spark 3.5.0:
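For example, assuming the mod directory lives in the home directory with a spark subdirectory so that the module is named spark/3.5.0-mine:

mkdir -p ~/mod/spark
vi ~/mod/spark/3.5.0-mine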

Copy and paste the following contents into the 3.5.0-mine Apache Spark module file.
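The exact contents on the cluster may differ; the following is a minimal sketch in Tcl modulefile syntax that sets SPARK_HOME and adds the installation's bin and sbin directories to PATH (the home directory path is an assumption):

#%Module1.0
## Local Apache Spark 3.5.0 module (sketch)

# Installation prefix; adjust to the actual home directory path on the cluster
set prefix /home/netidhere/ApacheSpark/3.5.0

setenv        SPARK_HOME $prefix
prepend-path  PATH       $prefix/bin
prepend-path  PATH       $prefix/sbin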

Update the prefix path with your netid in place of the netidhere specification.

Save the file with :wq!

Set the MODULEPATH variable in the user's .bashrc file so the new Apache Spark module can be loaded within a submission script.

Add the following lines to the .bashrc file:
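Assuming the mod directory was created in the home directory as above:

# Make the local module directory visible to the module system
export MODULEPATH=$HOME/mod:$MODULEPATH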

Save the .bashrc file with :wq!

The new spark/3.5.0-mine module will become available to load and use in job submissions.

Create a job submission script:

The following example provides the contents of a submission script that loads the new Spark module, submits to 2 nodes with 252 cores, allocates 20 GB of RAM, spawns a master process on one compute node, spawns worker(s) on the assigned compute nodes, runs a given script, and stops all worker/master processes when the script finishes.
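A minimal sketch of such a script is shown below. The partition, time value, sleep interval, and the script name my_spark_job.py are assumptions; adjust them to the actual job and the cluster's configuration:

#!/bin/bash
#SBATCH --job-name=spark-cluster        # update with a job name for the current job
#SBATCH --nodes=2
#SBATCH --ntasks=252
#SBATCH --mem=20G
#SBATCH --partition=general
#SBATCH --time=01:00:00                 # update with the needed run time for the job

# Load the local Apache Spark module created above
module load spark/3.5.0-mine

# Start the master process on the first allocated compute node (7077 is Spark's default master port)
export SPARK_MASTER_HOST=$(hostname -f)
start-master.sh
MASTER_URL="spark://${SPARK_MASTER_HOST}:7077"

# Start one worker per allocated node and point it back at the master;
# 'sleep infinity' keeps each srun step alive so the worker it started is not cleaned up early
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
    bash -c "start-worker.sh ${MASTER_URL} && sleep infinity" &

# Give the workers time to register with the master
sleep 30

# Run the given script (my_spark_job.py is a placeholder) against the standalone cluster
spark-submit --master "${MASTER_URL}" my_spark_job.py

# Stop the master and any workers the stop-all script can reach when the script finishes;
# remaining worker steps are cleaned up when the job ends
stop-all.sh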

Update the #SBATCH --job-name declaration with a job name for current job.

Update the #SBATCH --time declaration with the needed run time for the job.

Once the needed changes are made, save the above submission script.

Access Apache Spark web browser interface after submitting the above submission script:

Apache Spark will create an output file for all logging and information regarding worker deployment.

To access the Web GUI interface through the master, check the log file that is generated, which has the following format (where XX is the compute node number for the worker):
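With a default daemon configuration, these logs land in the logs directory of the local installation ($SPARK_HOME/logs) and follow a naming pattern similar to:

spark-<netid>-org.apache.spark.deploy.worker.Worker-1-cnXX.out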

View the contents of the file to find the Master process.
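One way to pull out the relevant lines is with grep (a hedged example; the path assumes the module is loaded in the current shell):

grep -iE "master|webui" $SPARK_HOME/logs/spark-*Worker*.out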

The worker WebUI would be the link to use within a browser.

Here XX is the compute node number; however, the compute node name does not resolve from a local browser. To access the WebUI, the IP address of the compute node must be entered instead.

To find the IP address of the node, one can use the following command (replace XX with the compute node number):
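For example, using the same style of lookup as in the Spark Shell section (an assumption; the original command may differ):

nslookup cnXX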

The IP address should be returned. Copy the IP address and paste it into the above http link, keeping the port that Apache Spark assigned to the worker process.

The WebUI should show up successfully and look like the following:

(Screenshot: Apache Spark Web UI)

If port 8083 does not work, sometimes port 8082 works at the end of the link. Replace 8083 with 8082 if a connection to the web UI cannot be established. Also, make sure the link uses http and NOT https.

When the time limit is reached for the submission script, the Apache Spark stop-all script will stop all workers and the master process.

PySpark

The installation of the Apache Spark suite with Hadoop functionality provides a built-in PySpark and a Python version used for PySpark calculations.

PySpark can be enabled by adding the python and pyspark paths to the previously created local loadable module.

Here is an example of the local Apache Spark module file with the PySpark paths added.
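A sketch of the updated module file, extending the earlier Tcl example; the py4j zip file name should be checked against the actual file under the installation's python/lib directory:

#%Module1.0
## Local Apache Spark 3.5.0 module with PySpark paths (sketch)

# Installation prefix; adjust to the actual home directory path on the cluster
set prefix /home/netidhere/ApacheSpark/3.5.0

setenv        SPARK_HOME $prefix
prepend-path  PATH       $prefix/bin
prepend-path  PATH       $prefix/sbin

# PySpark: make the bundled Python packages importable
prepend-path  PYTHONPATH $prefix/python
prepend-path  PYTHONPATH $prefix/python/lib/py4j-0.10.9.7-src.zip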

Then once the module is loaded, pyspark can be called:
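For example, from an interactive job on a compute node:

module load spark/3.5.0-mine
pyspark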

The web UI is also available; however, the link that PySpark provides will not work. To access the web UI, the IP address of the compute node that the PySpark job is running on must be entered instead.

To find the IP address, the same command shown in the previous sections can be entered outside of PySpark, or a user can ping the node; the ping response will include the compute node's IP address.
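For example, a single ping reports the node's address (replace XX with the assigned compute node number):

ping -c 1 cnXX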

Once the IP address is placed within the HTTP link, the Web UI should load and look like the following:

Submission script

The spark-submit command from the previous submission script example should be able to run the Spark Python code.

Another way to call PySpark and pass the Python script would be the following.
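The original command is not reproduced here; one possible approach (a hedged sketch with a placeholder script name) is to feed the Python file into the interactive PySpark shell:

pyspark < my_spark_job.py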