Hadoop Job
We provide the Hadoop framework and an auxiliary script to help users run Hadoop jobs on the HPC clusters in a Hadoop On Demand (HOD) fashion. The auxiliary script “hadoop_helper.sh” is located in /global/home/groups/allhands/bin/hadoop_helper.sh and can be used interactively or from a job script. Please note that this script only provides functions that help build a Hadoop environment, so it should never be run directly. The proper way to use it is to source it into your current environment by running “source /global/home/groups/allhands/bin/hadoop_helper.sh” (only bash is supported right now). After that, run “hadoop-usage” to see how to run Hadoop jobs. You will need to run “hadoop-start” to initialize an HOD environment and “hadoop-stop” to destroy the HOD environment after your Hadoop job completes.
The example below shows how to use it interactively.
[joe@n0000.scs00 ~]$ srun -p lr2 --qos=lr_debug -A lr_abc -N 4 -t 10:0 --pty bash
[joe@n0138.lr2 ~]$ source /global/home/groups/allhands/bin/hadoop_helper.sh
[joe@n0138.lr2 ~]$ hadoop-start
starting jobtracker, …
[joe@n0138.lr2 bash.738294]$ hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar pi 4 10000
Number of Maps = 4
…
Estimated value of Pi is 3.14140000000000000000
[joe@n0138.lr2 bash.738294]$ hadoop-stop
stopping jobtracker
…
The example below shows how to use it in a job script.
#!/bin/bash
#SBATCH --job-name=hadoop
#SBATCH --partition=lr2
#SBATCH --qos=lr_debug
#SBATCH --account=ac_abc
#SBATCH --nodes=4
#SBATCH --time=00:10:00
#
source /global/home/groups/allhands/bin/hadoop_helper.sh
#
# Start Hadoop On Demand
hadoop-start
#
# Example 1: estimate pi with 4 map tasks and 10000 samples per map
hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar pi 4 10000
#
# Example 2: stage input files locally, then count the words in them
mkdir in
cp /foo/bar in/
hadoop jar $HADOOP_DIR/hadoop-examples-1.2.1.jar wordcount in out
#
# Stop Hadoop On Demand
hadoop-stop
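You can run your own MapReduce code the same way, between “hadoop-start” and “hadoop-stop”. The sketch below uses Hadoop Streaming with a Python mapper and reducer to count words. The streaming jar path is an assumption (in a stock Hadoop 1.2.1 tree it lives under contrib/streaming), so check your installation for the exact location; “mapper.py” and “reducer.py” are hypothetical user files, not part of our installation, and both must be executable (chmod +x).

mapper.py:
#!/usr/bin/env python
# Emit a tab-separated (word, 1) pair for every token read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

reducer.py:
#!/usr/bin/env python
# Sum the counts for each word; streaming delivers keys to the reducer sorted,
# so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

To submit it (the jar path below is the assumed one):
hadoop jar $HADOOP_DIR/contrib/streaming/hadoop-streaming-1.2.1.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py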
Spark Job
We provide the Spark framework and an auxiliary script to help users run Spark jobs on the HPC clusters in a Spark On Demand (SOD) fashion. The auxiliary script “spark_helper.sh” is located in /global/home/groups/allhands/bin/spark_helper.sh and can be used interactively or from a job script. Please note that this script only provides functions that help build a Spark environment, so it should never be run directly. The proper way to use it is to source it into your current environment by running “source /global/home/groups/allhands/bin/spark_helper.sh” (only bash is supported right now). After that, run “spark-usage” to see how to run Spark jobs. You will need to run “spark-start” to initialize an SOD environment and “spark-stop” to destroy the SOD environment after your Spark job completes.
The example below shows how to use it interactively.
[joe@n0000.scs00 ~]$ srun -p lr2 --qos=lr_debug -A lr_abc -N 4 -t 10:0 --pty bash
[joe@n0138.lr2 ~]$ source /global/home/groups/allhands/bin/spark_helper.sh
[joe@n0138.lr2 ~]$ spark-start
starting org.apache.spark.deploy.master.Master, …
[joe@n0138.lr2 bash.738307]$ spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/pi.py
Spark assembly has been built with Hive
…
Pi is roughly 3.147280
[joe@n0138.lr2 bash.738307]$ spark-stop
…
The example below shows how to use it in a job script.
#!/bin/bash
#SBATCH --job-name=spark
#SBATCH --partition=lr2
#SBATCH --qos=lr_debug
#SBATCH --account=ac_abc
#SBATCH --nodes=4
#SBATCH --time=00:10:00
source /global/home/groups/allhands/bin/spark_helper.sh
# Start Spark On Demand
spark-start
# Example 1: estimate pi
spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/pi.py
# Example 2: count the words in /foo/bar
spark-submit --master $SPARK_URL $SPARK_DIR/examples/src/main/python/wordcount.py /foo/bar
# Stop Spark On Demand
spark-stop
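You can submit your own PySpark applications the same way, between “spark-start” and “spark-stop”. The script below is a minimal word-count sketch written against the standard pyspark API; “my_wordcount.py” is a hypothetical user file, not something shipped with our Spark installation.

my_wordcount.py:
#!/usr/bin/env python
# Minimal PySpark word count: reads the input path given as the first argument.
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="MyWordCount")
    counts = (sc.textFile(sys.argv[1])                  # one record per input line
                .flatMap(lambda line: line.split())     # split each line into words
                .map(lambda word: (word, 1))            # pair every word with a count of 1
                .reduceByKey(lambda a, b: a + b))       # sum the counts per word
    for word, count in counts.collect():                # gather results on the driver
        print("%s\t%d" % (word, count))
    sc.stop()

To submit it:
spark-submit --master $SPARK_URL my_wordcount.py /foo/bar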