...
When the submission script reaches its time limit, the Apache Spark stop-all script stops all workers and the master process.
PySpark
The Apache Spark suite installed with Hadoop functionality provides a built-in PySpark and the Python version used for PySpark calculations.
PySpark can be enabled by adding the python and pyspark paths to the local loadable module created earlier.
Here is an example of the local Apache Spark modulefile:
Code Block |
---|
#%Module1.0
## Spark 3.5.0 modulefile
##
proc ModulesHelp { } {
puts stderr "Provides the Spark environment for large-scale data processing"
puts stderr "This version has been built with hadoop 3.3"
}
module-whatis "Apache Spark™ is a unified analytics engine for large-scale data processing."
set prefix /gpfs/homefs1/netidhere/ApacheSpark/3.5.0/spark-3.5.0-bin-hadoop3
prepend-path PATH $prefix/bin
prepend-path PATH $prefix/sbin
prepend-path PATH $prefix/python
prepend-path PATH $prefix/python/pyspark
#set SPARK_HOME $prefix
#set HADOOP_HOME $prefix
|
Once the module is loaded, pyspark can be called:
Code Block |
---|
$ module purge
$ module load spark/3.5.0-mine
$ pyspark
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/28 13:00:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Python version 3.11.5 (main, Sep 11 2023 13:54:46)
Spark context Web UI available at http://cn462.storrs.hpc.uconn.edu:4040
Spark context available as 'sc' (master = local[*], app id = local-1709143216797).
SparkSession available as 'spark'.
>>> |
The web UI is also available; however, the link that PySpark prints will not work. To access the web UI, enter the IP address of the compute node that the PySpark job is running on instead:
Code Block |
---|
http://IPaddressOfComputeNodeHere.storrs.hpc.uconn.edu:4040 |
To find the IP address, enter the following command outside of PySpark:
Code Block |
---|
nslookup cnXX   # where XX is the compute node number |
Alternatively, ping the node; the response will include the compute node's IP address.
Code Block |
---|
ping cnXX   # where XX is the compute node number |
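The lookup can also be scripted. The following is a minimal sketch in Python, not part of the cluster's tooling: the function name `web_ui_url` and its `resolve` parameter are hypothetical, and it assumes that the bare IP address followed by the port is what goes into the browser. It resolves the node name through the system resolver with the standard-library `socket.gethostbyname` and builds the link on the default port 4040 shown in the PySpark banner.

```python
import socket


def web_ui_url(node, port=4040, resolve=socket.gethostbyname):
    """Build a Spark Web UI link from a compute node name.

    `node` is a name such as "cnXX"; `resolve` is a hypothetical hook
    that maps a host name to an IPv4 address string (by default the
    system resolver), so the lookup can be swapped out when testing.
    """
    ip_address = resolve(node)  # look up the node's IP address
    # Assumption: the link uses the bare IP address plus the port.
    return f"http://{ip_address}:{port}"


# Example call on the cluster (replace cnXX with the real node name):
# print(web_ui_url("cnXX"))
```

Run it from a node that can resolve compute node names, then paste the printed link into a browser.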
Once the IP address is placed within the HTTP link, the Web UI should load and look like the following:
...
Submission script
The spark-submit command from the previous submission script example should be able to run the Spark Python code.
Another option is to call pyspark directly and pass it the Python script on standard input, as follows:
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=jobNameHere # create a short name for your job
#SBATCH --nodes=2 # node count
#SBATCH --ntasks=252
#SBATCH --mem=20G # memory per node
#SBATCH --time=00:05:00 # total run time limit (HH:MM:SS)
#SBATCH --no-requeue
module load spark/3.5.0-mine
start-all.sh
echo $MASTER | tee master.txt
pyspark < script.py
stop-all.sh |