General user info: Spark

From Begrid Wiki


Apache Spark can be used on our cluster in the so-called "Standalone Mode". In short, it works as follows: you launch the Spark master on a user interface (UI) machine, and then you launch the Spark slave daemons through local job submission.

However, there is a serious limitation: HDFS is not available on our cluster, so this setup is not suitable for handling data files whose size exceeds a few gigabytes.
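In practice this means your data lives on the ordinary shared filesystem, and Spark reads it through a plain path (or a file:// URI) that every worker node can see. A minimal sketch, where the input file and the master URL are hypothetical placeholders:

```shell
# Pipe one Scala statement into the spark-shell: count the lines of
# a file on the shared filesystem. The path "mydata.txt" and the
# master URL are placeholders; since there is no HDFS, the path must
# be visible from every node of the cluster.
echo 'sc.textFile("file:///swmgrs/beappssgm/mydata.txt").count()' \
    | ./bin/spark-shell --master spark://<master_private_ip>:7077
```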

How to manage and use your Spark standalone cluster?

Starting a cluster is really child's play, provided that you use our home-brewed script. But first, you must get a local account so that you can log in to one of our UIs (explained here).

Log in to a UI (m6 in the following example):

ssh <your_login>

Once on the UI, go to the Spark directory:

cd /swmgrs/beappssgm/spark-1.6.1

Start a cluster like this:

./ --start

The default number of workers is 4. If you need more (8, for example), use the '--nbslaves' argument:

./ --start --nbslaves 8

Check the status of your cluster:

./ --status

The output of this command shows the status of each worker. Repeat the command until all the workers are running.
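If you prefer not to re-run the status command by hand, a small polling loop works. This is only a sketch: the script name is elided above, so `<cluster_script>` is a placeholder, and it assumes the status output prints the word RUNNING for each live worker:

```shell
# Hypothetical polling loop: check every 10 seconds until no worker
# reports a non-RUNNING state. <cluster_script> and the RUNNING
# keyword are placeholders; adapt them to the script's real output.
while ./<cluster_script> --status | grep -qv RUNNING; do
    sleep 10
done
echo "all workers are running"
```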

Start using your cluster. For example, let's use the spark-shell:

./bin/spark-shell --master spark://

You can also monitor your cluster and your tasks through a handy web UI by opening the following URL in your browser:

(This is an example; the actual URL to use appears in the output of the start command.)

Once you're done with your work, don't forget to stop the cluster:

./ --stop

Skip this section if you are happy with the script explained above: what follows describes how to start the master and the slaves by hand.

First, you must get a local account on one of our UIs. (It is explained here.)

Log in to a UI (m6 in the following example):

ssh <your_login>

After logging in, execute the following commands:

export SPARK_MASTER_IP=<private_ip_of_the_ui>
cd /swmgrs/beappssgm/spark-1.6.1
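The command that actually starts the master is not shown above; in a stock Spark 1.6 tree, the standalone master is normally launched with the bundled sbin script. This is a sketch of the standard way, not necessarily the exact command used on our cluster:

```shell
# Stock Spark 1.6 launcher for a standalone master: it picks up the
# SPARK_MASTER_IP variable exported above and writes its logfile
# under ./logs/.
./sbin/start-master.sh
```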

Have a look at the logfile to check that everything is OK:

tail -f /swmgrs/beappssgm/spark-1.6.1/logs/spark-sgerard-org.apa....out

(Please adapt the last command: the exact name of the logfile will show up in the output of the start command.)

Type CTRL+C when you want to stop the tail command.

If everything went well, you now have a Spark master machine, and you need to add some slaves to it to get a real cluster. You will create the slaves using local submission. First, you need a script with the following content:


cd /swmgrs/beappssgm/spark-1.6.1
./sbin/ spark://
sleep 180
./sbin/ spark://

This is just an example: you need to adapt the sleep duration (180 seconds here; it must of course be larger than the estimated duration of your workflow) and the URL of the Spark master (only the private IP is expected to change).
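For reference, a fleshed-out version of the job script could look like the sketch below. The sbin script names are the stock Spark 1.6 single-worker launchers, and `<master_private_ip>` with the default port 7077 is an assumption, so adapt everything to your setup:

```shell
#!/bin/bash
# Hypothetical job script (e.g. spark_worker.pbs): start one
# standalone worker on the allocated node, keep it alive for the
# lifetime of your workflow, then stop it cleanly.
cd /swmgrs/beappssgm/spark-1.6.1
./sbin/start-slave.sh spark://<master_private_ip>:7077
sleep 180    # must exceed the estimated duration of your workflow
./sbin/stop-slave.sh
```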

Now, it's time to submit:

qsub -q localgrid@cream02

Repeat the previous command as many times as the number of cores you need.
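Rather than typing the command repeatedly, you can submit all the jobs in one loop. A sketch, where N=8 and the job script name "spark_worker.pbs" are hypothetical:

```shell
# Submit one local job per worker you want (8 here). The leading
# "echo" makes this a dry run that only prints the commands;
# remove it to really submit.
N=8
for i in $(seq 1 "$N"); do
    echo qsub -q localgrid@cream02 spark_worker.pbs
done
```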

The jobs might not start immediately (if the cluster is heavily loaded, you might spend some time in the waiting queue). Check the status of your jobs with the command:

qstat -u <your_user_name>

Once all the jobs are running, you've got your Spark standalone cluster at the desired capacity, and you are ready to test it:

./bin/spark-shell --master spark://
scala> val testFile = sc.textFile("/user/sgerard/README")
scala> testFile.count()

Once you've finished your work, you must shut down the cluster: