
Spark Cluster Setup on EC2

A Spark cluster can be set up in multiple ways:

  • Standalone mode
  • Using Hadoop YARN
  • Using Apache Mesos
  • Using Amazon EC2


Here, we will discuss how to set up a Spark cluster on Amazon EC2.

These are the minimum requirements for setting up Spark on EC2:

  • An AWS account
  • An IAM (Identity & Access Management) user with a secure Access Key ID & Secret Access Key
  • An IAM group associated with the above user
  • An EC2 key pair
  • The latest Spark version downloaded to our local box (Linux/Unix)


Let’s go through the above points step by step:

  • Account creation in AWS: If there is an existing AWS account, it can be used; otherwise, a new account has to be created.
  • IAM group creation: After logging in to the Amazon Console with the account credentials, an IAM group can be created using the path below. The group needs to be created with the AmazonEC2FullAccess policy.

AWS Console home > IAM (Identity & Access Management) > Groups

  • IAM user creation: An IAM user can be created using the path below.

AWS Console home > IAM (Identity & Access Management) > Users

While creating the user, a group which has the AmazonEC2FullAccess policy should be assigned. After the user is created successfully, the Access Key ID and Secret Access Key will be provided. We need to save this information for future use.
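
If the AWS CLI is installed and configured, the same group and user can also be created from the command line instead of the console. The following is a minimal sketch; the names spark-ec2-group and spark-ec2-user are only illustrative.

aws iam create-group --group-name spark-ec2-group
aws iam attach-group-policy --group-name spark-ec2-group --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
aws iam create-user --user-name spark-ec2-user
aws iam add-user-to-group --group-name spark-ec2-group --user-name spark-ec2-user
aws iam create-access-key --user-name spark-ec2-user    # prints the Access Key ID and Secret Access Key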

  • EC2 key pair creation:

The EC2 key pair can be created using the path mentioned below:

AWS Console home > EC2 > Key Pairs > Create Key Pair

Once the key pair is created, it has to be downloaded to the local machine.
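
Alternatively, the key pair can be created with the AWS CLI; the sketch below assumes the key pair name my_key_pair that is reused in the launch example later.

aws ec2 create-key-pair --key-name my_key_pair --query 'KeyMaterial' --output text > my_key_pair.pem
chmod 400 my_key_pair.pem    # restrict permissions so ssh accepts the private key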

After implementing the above steps, the task related to AWS is completed. Now we need to download the latest Spark release to our local box. Once the download is done, navigate to the Spark ec2 folder (<SPARK_HOME>/ec2). Spark provides a dedicated script to set up a Spark cluster on EC2.
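
For example, assuming the Spark 1.6.2 package pre-built for Hadoop 2.6 (any release that still bundles the ec2 scripts works the same way), the download and navigation steps look like this:

wget https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
cd spark-1.6.2-bin-hadoop2.6/ec2    # the <SPARK_HOME>/ec2 folder containing spark-ec2 and spark_ec2.py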

Before running the spark_ec2.py script, we need to export the Amazon Access Key ID and Secret Access Key using the commands below:

export AWS_ACCESS_KEY_ID="<key_ID_created_in_step_2>"
export AWS_SECRET_ACCESS_KEY="<key_created_in_step_2>"

Once this is done, we are ready to run the script. Below is the syntax to launch a Spark cluster on EC2.

Launch the cluster

./spark-ec2 -k <key_pair> -i <key_file> -s <num-slaves> launch <cluster-name>

where:
<key_pair> is the name of the EC2 key pair (as given when it was created),
<key_file> is the private key file for that key pair,
<num-slaves> is the number of slave nodes to launch (try 1 at first), and
<cluster-name> is the name of the cluster.

For example:

./spark-ec2 -k my_key_pair -i my_key_pair.pem -s 1 launch spark_ec2

Apart from the above arguments, there are various additional arguments; unless provided with new values, their default values will be used. Some of them are mentioned below:

-t or --instance-type: By default, the m1.large EC2 instance type will be selected. We can specify a different instance type, but we need to make sure that it is 64-bit. Please keep in mind that the m1.large instance will cost you.

-r or --region: Used to select the EC2 region; the default value is us-east-1.

-z or --zone: Used to select a particular availability zone; by default, AWS will automatically select an available zone for creating the EC2 instances. If we provide our own value, we need to make sure there are instances available in that zone, otherwise the script will throw an error at the time of cluster launch. Hence, it is better not to mention any value for this argument and let AWS decide.

Similarly, there are several other arguments that take default values. If interested, we can go through them in the spark_ec2.py file, which is under the Spark ec2 folder. An example launch command that overrides some of these defaults is shown below.
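
For instance, a launch command that overrides the instance type and region (the values here are only illustrative) would look like:

./spark-ec2 -k my_key_pair -i my_key_pair.pem -s 1 -t m3.large -r us-west-2 launch spark_ec2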

In the above example, the number of slaves is chosen as 1. Any number of slaves can be chosen based on the requirement; for testing purposes, it is better to start with one slave. Once the launch is successful, we should be able to see logs like the below in the shell:

...........

Shutting down GANGLIA gmetad:                              [FAILED]
Starting GANGLIA gmetad:                                   [  OK  ]
Stopping httpd:                                            [FAILED]
Starting httpd:                                            [  OK  ]
Connection to ec2-54-152-174-199.compute-1.amazonaws.com closed.
Spark standalone cluster started at
http://ec2-54-152-174-199.compute-1.amazonaws.com:8080
Ganglia started at
http://ec2-54-152-174-199.compute-1.amazonaws.com:5080/ganglia
Done!

We can access the Spark cluster using the below URL:
http://ec2-54-152-174-199.compute-1.amazonaws.com:8080

Once we access the master node, we will be able to see a worker/slave entry on the page. Clicking that link navigates to the respective slave/worker page. Apart from the Spark cluster, Ganglia is additionally installed on the cluster. We can access it using the URL http://ec2-54-152-174-199.compute-1.amazonaws.com:5080/ganglia
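
The same script can also be used to log in to the master node over SSH; a minimal sketch, reusing the key pair from the launch example:

./spark-ec2 -k my_key_pair -i my_key_pair.pem login spark_ec2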

Stop the cluster

If we want to stop the cluster, we can use the below command:

./spark-ec2 stop spark_ec2

Then it will ask for confirmation with a message like the below on the shell:

Are you sure you want to stop the cluster spark_ec2?
DATA ON EPHEMERAL DISKS WILL BE LOST, BUT THE CLUSTER WILL
KEEP USING SPACE ON AMAZON EBS IF IT IS EBS-BACKED!!
All data on spot-instance slaves will be lost.
Stop cluster spark_ec2 (y/N):
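
A stopped cluster can be brought back later with the script's start action; a minimal sketch, assuming the same key pair and cluster name:

./spark-ec2 -k my_key_pair -i my_key_pair.pem start spark_ec2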


Destroy the cluster

If we want to destroy the cluster, we can use the below command:

./spark-ec2 destroy spark_ec2

Then, it will ask for confirmation with a message like the below on the shell:

Are you sure you want to destroy the cluster spark_ec2?
The following instances will be terminated:
Searching for existing cluster spark_ec2...
Found 1 master(s), 1 slaves


ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster spark_ec2 (y/N):

If you would like to find out more about how Big Data could help you make the most out of your current infrastructure while enabling you to open your digital horizons, do give us a call at +44 (0)203 475 7980 or email us at Salesforce@coforge.com
