Build your own CDH5 QuickStart VM with Spark on CentOS

A great way to jump into CDH5 and Spark (with the latest version of Hue) is to build your own CDH5 setup on a VM.  As of this writing, a CDH5 QuickStart VM is not available (though you can download the Cloudera QuickStart VM for CDH4.5). Below are the steps to build your own CDH5 / Spark setup on CentOS 6.5.  Note that installing CDH5 through Cloudera Manager is actually quite straightforward, so these instructions focus on the steps prior to installing Cloudera Manager 5 (and the express install of CDH5) to minimize the hiccups you may run into.  They pick up after you’ve set up your CentOS VM; in my case I am using CentOS 6.5 (the latest download as of this writing) and VMware Fusion (for my Mac … and no, I’m not going to get into the Parallels vs. Fusion debate!).

Basic Configuration

In this case, I’ve set up a VMware Fusion VM so that I can get it up and running on my Mac (and take backups / snapshots if and when I mess up the configuration).  Its basic configuration is 4GB RAM, 2 cores, and 80GB of disk space, with Bridged (Autodetect) networking so it can have its own IP address.

Ensure your login has sudo access or is able to log in as root

For this setup, I have a login of spark and I’ve added the spark login to the list of sudoers:

– log in as root

– edit the sudoers list:

visudo -f /etc/sudoers

– add an entry for the spark login, for example:

spark    ALL=(ALL)    ALL

Validate Hostname

Ensure that the hostname and hosts file are set up correctly so that both Cloudera Manager and CDH can work correctly.  You also need to keep the localhost entry so that, if you choose the embedded PostgreSQL database (for the Hive and Oozie metastores), it will install correctly.  Note that this configuration works if you’re developing; for anything larger, it is recommended that you go with the remote database setup.  For example, Oozie will not be able to execute Hive jobs if the Hive metastore is configured locally.

/etc/sysconfig/network
    HOSTNAME=sparky 

/etc/hosts
    10.0.0.16   sparky
    127.0.0.1   localhost

A good way to validate that the hostname is set up correctly is to run the Python one-liner below (this is the script CM5 uses to validate the hostname):

python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
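
If the one-liner above prints something unexpected, it can help to look at the hosts file itself. The sketch below is my own illustration, not anything Cloudera Manager runs: the function name check_hosts is hypothetical, and the rule that the hostname should not map to a loopback address is an assumption based on the layout shown above (sparky on the bridged IP, localhost on 127.0.0.1).

```python
def check_hosts(hosts_text, hostname):
    """Return a list of problems found in /etc/hosts-style text (empty list = OK)."""
    mapping = {}  # name -> list of addresses it is mapped to
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, []).append(address)

    problems = []
    if hostname not in mapping:
        problems.append("no entry for " + hostname)
    elif any(a.startswith("127.") or a == "::1" for a in mapping[hostname]):
        # Assumption: the hostname should resolve to the VM's own IP, not loopback.
        problems.append(hostname + " maps to a loopback address")
    if "127.0.0.1" not in mapping.get("localhost", []):
        problems.append("localhost is not mapped to 127.0.0.1")
    return problems


# The hosts file from this article, used as sample input.
SAMPLE_HOSTS = "10.0.0.16   sparky\n127.0.0.1   localhost\n"
print(check_hosts(SAMPLE_HOSTS, "sparky"))  # prints [] when nothing is wrong
```

To check a real machine, read the contents of /etc/hosts into a string and pass it in along with your own hostname instead of the sample text.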

Opening up for connectivity

Another way to say this is that I’m opening up the attack surface.  For dev systems behind a firewall that contain non-sensitive data, these actions should be okay, but please do so at your own risk (sorry for the legalese here).

Disable SELinux
/etc/sysconfig/selinux (set SELINUX=disabled)

Disable Firewall
System > Administration > Firewall 

Restart

These actions are required because you will need to disable SELinux in order to install Cloudera Manager.  I disabled the firewall so as not to need to open all of the different ports that CDH uses (i.e. being lazy here).  For these changes to take effect, you will need to restart.
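
If you would rather script the SELinux change than edit the file by hand, here is a minimal sketch of the edit. It only transforms the text of the config file; the function name disable_selinux is hypothetical, and the sample input is a typical-looking file that I am assuming for illustration, not a copy of my own.

```python
import re


def disable_selinux(config_text):
    """Return config text with the SELINUX= setting changed to 'disabled'."""
    # Matches an uncommented "SELINUX=..." line only; "SELINUXTYPE=" and
    # "# SELINUX=" comment lines are left alone.
    pattern = re.compile(r"^SELINUX=.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub("SELINUX=disabled", config_text)
    # No existing setting: append one, keeping the file newline-terminated.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + "SELINUX=disabled\n"


# Assumed sample contents, for illustration only.
sample = (
    "# SELINUX= can take one of these three values:\n"
    "SELINUX=enforcing\n"
    "SELINUXTYPE=targeted\n"
)
print(disable_selinux(sample))
```

Writing the result back to /etc/sysconfig/selinux needs root, and as noted above the change only takes effect after a restart.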

Install Java

While not strictly required, as Cloudera Manager and CDH5 typically include the JDK, I usually do it anyway.  Since this is a dev setup on my Mac (VMware Fusion running CentOS 6.5), I chose the latest version of Java (as of this writing, JDK 7u51).   You can download the latest Linux x64 RPM of Java at: http://www.java.com/en/download/help/linux_x64rpm_install.xml

rpm -ivh jdk-7u51-linux-x64.rpm

Optional Database Installation

If this is a production system, it is highly recommended that you follow these optional steps to install Postgres as a remote database (instead of an embedded database).   If you are building this for your own development purposes, using the automated installation of an embedded database works fine and is much easier.

Install and Configure Postgresql: Install and configure Postgres for use with Cloudera Manager / CDH

Post-Install Steps for Postgresql: Validate that you can utilize postgresql.

Install Cloudera Manager and then CDH 5

Now that you’ve done all the above steps, you can run the automated installation of Cloudera Manager.  Once this completes, it will jump into the express installation of CDH5.  The handy instructions include:

CDH5 Installation Guide

Installation Path A – Automated Installation by Cloudera Manager

The above link is handy because you can just click through the installation in the web browser and choose the appropriate configurations (e.g. YARN, Spark, etc.).  Spark is included in the default CDH5 installation, so you should be good to go provided you do not uncheck it.  As noted above, the installation of CM5 and CDH5 is relatively straightforward.

Some Installation Tips

Swappiness Error Message

During the installation, you may get the following error message:

Cloudera recommends setting /proc/sys/vm/swappiness to 0. Current setting is 60. Use the sysctl command to change this setting at runtime and edit /etc/sysctl.conf for this setting to be saved after a reboot. You may continue with installation, but you may run into issues with Cloudera Manager reporting that your hosts are unhealthy because they are swapping. The following hosts are affected:

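To change the setting at runtime, run the command below; as the message notes, you also need to set vm.swappiness to 0 in /etc/sysctl.conf for the setting to be saved after a reboot: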

sudo sysctl -w vm.swappiness=0
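
The sysctl -w command only lasts until the next reboot, so the /etc/sysctl.conf edit mentioned in the message is worth doing as well. Below is a rough sketch of that edit as a text transformation; the function name persist_swappiness and the sample file contents are my own assumptions for illustration.

```python
def persist_swappiness(sysctl_conf_text, value=0):
    """Return sysctl.conf text that sets vm.swappiness to `value` exactly once."""
    setting = "vm.swappiness = {0}".format(value)
    kept = []
    for line in sysctl_conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == "vm.swappiness":
            continue  # drop any existing uncommented setting; comments are kept
        kept.append(line)
    kept.append(setting)  # add the desired setting at the end
    return "\n".join(kept) + "\n"


# Assumed sample contents, for illustration only.
before = (
    "# Kernel sysctl configuration\n"
    "net.ipv4.ip_forward = 0\n"
    "vm.swappiness = 60\n"
)
print(persist_swappiness(before))
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to re-run. After writing the result to /etc/sysctl.conf (as root), sudo sysctl -p reloads the file without a reboot.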

Reconfigure Disk space

When I built my CentOS VM, I originally used the default 20GB of disk space and later wanted to expand it to 80GB.  In addition to wanting more disk space for data, Cloudera Manager has health alerts for the /opt and /var folders: if they go below 10GB of free space, you will typically get alerts.  The /opt folder contains third-party software, including Cloudera’s parcels and parcel cache, while the /var folder will typically contain the Cloudera logs.  A VM built by VMware Fusion will typically have three partitions: sda1 (the boot device), sda2 (contains everything), and sda3 (the Linux swap).

To reconfigure the disk space, there is an excellent blog post: Live Resizing of an EXT4 FileSystem on Linux

My original setup had the configuration below:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      616447      307200   83  Linux
/dev/sda2          616448    37814271    18598912   83  Linux
/dev/sda3        37814272    41943039     2064384   82  Linux swap / Solaris

After resizing based on the linked instructions, this is my setup:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      616447      307200   83  Linux
/dev/sda2          616448   163643392    81513472+  83  Linux
/dev/sda3       163643393   167772159     2064383+  82  Linux swap / Solaris
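
To sanity-check the two tables, the Start and End columns can be converted into sizes. The sketch below assumes 512-byte sectors, which is consistent with the Blocks column (1K units) being half the sector count in each row; the function name partition_gib is hypothetical.

```python
SECTOR_BYTES = 512  # assumption: fdisk is reporting 512-byte sectors


def partition_gib(start_sector, end_sector):
    """Size in GiB of a partition spanning start..end sectors, both inclusive."""
    sectors = end_sector - start_sector + 1
    return sectors * SECTOR_BYTES / float(1024 ** 3)


# Start/End values copied from the two fdisk listings above.
layouts = {
    "before": {
        "sda1": (2048, 616447),
        "sda2": (616448, 37814271),
        "sda3": (37814272, 41943039),
    },
    "after": {
        "sda1": (2048, 616447),
        "sda2": (616448, 163643392),
        "sda3": (163643393, 167772159),
    },
}

for label in ("before", "after"):
    for device in ("sda1", "sda2", "sda3"):
        start, end = layouts[label][device]
        size = partition_gib(start, end)
        print("{0:6} {1}: {2:.2f} GiB".format(label, device, size))
```

By this arithmetic sda2 grows from roughly 17.7 GiB to roughly 77.7 GiB, while the boot and swap partitions keep their sizes; the swap partition simply moves to the end of the larger disk.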

 

Quick Links to Spark Tutorials

Below are some links to get you jump started on how to work with Spark:

Enjoy!
