Spark

Apache Spark version 1.6.1 is installed in /share/apps/spark. Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It allows users to combine the memory and CPUs of multiple compute nodes into a single Spark cluster and use the aggregated memory and CPUs to run a single task.

The example PBS script below sets up a 3-node Spark cluster in standalone mode using 3 compute nodes on Trestles.

#PBS -l walltime=30:00
#PBS -q q30m32c
#PBS -l nodes=3:ppn=32

module load java/sunjdk_1.8.0_92
module load spark

MASTER=`head -1 $PBS_NODEFILE`;
PORT=7077
spark_master="spark://$MASTER:$PORT"

echo -n "starting spark master on $MASTER... ";
$SPARK_HOME/sbin/start-master.sh
echo "done";
sleep 2;
echo "spark cluster web interface: http://$MASTER:8080"  >$HOME/spark-info
echo "           spark master URL: spark://$MASTER:7077" >>$HOME/spark-info

# start a Spark worker on every node assigned to the job (including the master node)
for n in `uniq $PBS_NODEFILE`; do
   echo -n "starting spark slave on $n..."
   if [ "$n" == "$MASTER" ]; then
      $SPARK_HOME/sbin/start-slave.sh $spark_master
   else
      ssh $n "module load java/sunjdk_1.8.0_92 spark; $SPARK_HOME/sbin/start-slave.sh $spark_master";
   fi
   echo "done";
done

# keep the job (and with it the Spark cluster) alive for the full requested walltime
sleep $PBS_WALLTIME;
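
To launch the cluster, save the script above to a file and submit it with qsub. A minimal sketch, assuming the script was saved as spark-cluster.pbs (any file name will do):

<code>
tres-l1:pwolinsk:$ qsub spark-cluster.pbs
tres-l1:pwolinsk:$ qstat -u $USER      # wait until the job is running before connecting to the cluster
</code>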

When the job starts, all 3 nodes run the Spark worker service, and the job head node also runs the Spark master service. The log file from the Spark master is saved to:

/home/$USER/spark-$USER-org.apache.spark.deploy.master.Master-1-$HOST.out

and worker logs are in:

/home/$USER/spark-$USER-org.apache.spark.deploy.worker.Worker-<workernum>-$HOST.out
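
Since the home directory is shared between the login and compute nodes, the logs can be followed from a login node while the job runs. A sketch; the trailing hostname in the file name matches the node the daemon runs on (tres1005 in the example session below), so a wildcard is used here:

<code>
tres-l1:pwolinsk:$ tail -f $HOME/spark-$USER-org.apache.spark.deploy.master.Master-1-*.out
</code>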

In addition to the log files, a new file, $HOME/spark-info, contains the URL of the Spark master web interface:

tres-l1:pwolinsk:$ cat spark-info 
spark cluster web interface: http://tres1005:8080
           spark master URL: spark://tres1005:7077
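
To confirm that the master and worker JVMs are actually running, you can list the Java processes on each node with jps (shipped with the JDK). A sketch; the node names here are only examples and should be replaced with the nodes assigned to your job:

<code>
for n in tres1005 tres1008 tres1009; do
   echo "== $n =="
   ssh $n "module load java/sunjdk_1.8.0_92; jps | grep -E 'Master|Worker'"
done
</code>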

You can use the lynx text-based web browser to connect to the Spark cluster web interface and check the status of the cluster:

tres-l1:pwolinsk:$ lynx http://tres1005:8080
                                                                                             
Spark Master at spark://tres1005:7077
[spark-logo-77x50px-hd.png] 1.6.1 Spark Master at spark://tres1005:7077

     * URL: spark://tres1005:7077
     * REST URL: spark://tres1005:6066 (cluster mode)
     * Alive Workers: 3
     * Cores in use: 96 Total, 0 Used
     * Memory in use: 185.7 GB Total, 0.0 B Used
     * Applications: 0 Running, 0 Completed
     * Drivers: 0 Running, 0 Completed
     * Status: ALIVE

Workers

   Worker Id                                Address            State  Cores        Memory
   worker-20160512160005-172.16.10.5-47819 172.16.10.5:47819 ALIVE 32 (0 Used) 61.9 GB (0.0 B Used)
   worker-20160512160007-172.16.10.8-53663 172.16.10.8:53663 ALIVE 32 (0 Used) 61.9 GB (0.0 B Used)
   worker-20160512160009-172.16.10.9-60317 172.16.10.9:60317 ALIVE 32 (0 Used) 61.9 GB (0.0 B Used)

Running Applications

   Application ID   Name   Cores   Memory per Node   Submitted Time   User   State   Duration

Completed Applications

   Application ID   Name   Cores   Memory per Node   Submitted Time   User   State   Duration

The Spark cluster will remain running for the requested duration of the job (i.e., the job's requested walltime), 30 minutes in this example.
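
As an alternative to a browser, the cluster status can be fetched from the command line with curl. A sketch, assuming the standalone master web UI also serves the status in JSON form at the /json path (handy for scripts); tres1005 is the master host from the example above:

<code>
tres-l1:pwolinsk:$ curl -s http://tres1005:8080/json
</code>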

Submitting Jobs to Spark

The spark-submit script can be used to submit a job to the Spark cluster directly from one of the Trestles login nodes (i.e. tres-l1 or tres-l2). The --master option must be used to specify the Spark master URL (found in the $HOME/spark-info file). Below, the SparkPi example from $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar is run using 3 executors (workers), each using 32 cores and 60 GB of memory.

<code>
tres-l1:pwolinsk:$ module load java/sunjdk_1.8.0_92 spark/1.6.1
tres-l1:pwolinsk:$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://tres1005:7077 --num-executors 3 --executor-memory 60g --executor-cores 32 /share/apps/spark/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10000
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/12 16:11:20 INFO SparkContext: Running Spark version 1.6.1
16/05/12 16:11:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/12 16:11:21 INFO SecurityManager: Changing view acls to: pwolinsk
16/05/12 16:11:21 INFO SecurityManager: Changing modify acls to: pwolinsk
16/05/12 16:11:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(pwolinsk); users with modify permissions: Set(pwolinsk)
16/05/12 16:11:21 INFO Utils: Successfully started service 'sparkDriver' on port 45770.
16/05/12 16:11:22 INFO Slf4jLogger: Slf4jLogger started
16/05/12 16:11:22 INFO Remoting: Starting remoting
16/05/12 16:11:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.16.6.14:56961]
...
</code>
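
The cluster can also be used interactively. As a sketch, the pyspark script included with the Spark distribution starts an interactive Python session whose work runs on the cluster when pointed at the master URL from $HOME/spark-info:

<code>
tres-l1:pwolinsk:$ module load java/sunjdk_1.8.0_92 spark/1.6.1
tres-l1:pwolinsk:$ pyspark --master spark://tres1005:7077 --executor-memory 60g --executor-cores 32
>>> sc.parallelize(range(1000)).sum()      # quick check that tasks are distributed across the workers
</code>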
