A great way to jump into CDH5 and Spark (with the latest version of Hue) is to build your own CDH5 setup on a VM. As of this writing, a CDH5 QuickStart VM is not available (though you can download the Cloudera QuickStart VM for CDH4.5). Below are the steps to build your own CDH5 / Spark setup on CentOS 6.5. Note that installing CDH5 through Cloudera Manager is actually quite straightforward, so these instructions focus on the steps prior to installing Cloudera Manager 5 (and the express install of CDH5) to minimize the hiccups you may run into. These instructions start after you’ve set up your CentOS VM. In my case, I am using CentOS 6.5 (the latest download as of this writing) and VMware Fusion (for my Mac … and no, I’m not going to get into the Parallels vs. VMware debate!).
In this case, I’ve set up a VMware Fusion VM so that I can get it up and running on my Mac (and take backups / snapshots if and when I mess up the configuration). Its basic configuration is 4GB RAM, 2 cores, and 80GB of disk space with Bridged (Autodetect) networking so it can have its own IP address.
Ensure your login has sudo access or that you can log in as root
For this setup, I have a login of spark and I’ve added the spark login to the list of sudoers:
– log in as root
– edit the sudoers list:
Ensure that the hostname and hosts file are set up correctly so that both Cloudera Manager and CDH can work correctly. You also need to keep the localhost entry so that, if you choose the embedded PostgreSQL database (for the Hive and Oozie metastores), it will install correctly. Note that this configuration works if you’re developing; if you are doing anything larger, it is recommended that you go with the remote database setup. For example, Oozie will not be able to execute Hive jobs if the Hive metastore is configured locally.
A good way to validate that the hostname is set up correctly is to check with the python script below (this is the script CM5 uses to validate the hostname)
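As a rough sketch of what a sane hosts file looks like for this kind of setup, the Python snippet below checks hosts-file text for both a localhost entry and an entry for the VM’s own hostname. The IP address 192.168.1.50 and the hostname cdh5.example.com are hypothetical placeholders, not values from my setup, and I am assuming the hostname should resolve to the VM’s bridged address rather than to loopback.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


def check_hosts(text, hostname):
    """Return a list of problems; an empty list means the file looks sane."""
    mapping = parse_hosts(text)
    problems = []
    if mapping.get("localhost") != "127.0.0.1":
        problems.append("localhost should map to 127.0.0.1")
    ip = mapping.get(hostname)
    if ip is None:
        problems.append(f"{hostname} has no entry")
    elif ip.startswith("127."):
        problems.append(f"{hostname} maps to loopback address {ip}")
    return problems


# Hypothetical example values, not taken from a real machine.
SAMPLE_HOSTS = """\
127.0.0.1      localhost localhost.localdomain
192.168.1.50   cdh5.example.com cdh5
"""

print(check_hosts(SAMPLE_HOSTS, "cdh5.example.com"))
```

To check a real machine, you could pass in the contents of /etc/hosts and the VM’s actual hostname instead of the sample values.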
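A minimal Python 3 sketch of this kind of check is shown here. As far as I can tell, the check amounts to looking up the fully qualified domain name and the address it resolves to; treat this as an approximation of that idea rather than the exact script CM5 runs. The function name check_hostname is my own.

```python
import socket


def check_hostname():
    """Return (fqdn, ip); ip is None if the FQDN does not resolve."""
    fqdn = socket.getfqdn()
    try:
        ip = socket.gethostbyname(fqdn)
    except OSError:  # covers socket.gaierror when the lookup fails
        ip = None
    return fqdn, ip


fqdn, ip = check_hostname()
print("FQDN:", fqdn)
print("Resolves to:", ip if ip else "nothing (check /etc/hosts)")
```

If the FQDN comes back as localhost, or the address is a loopback address, that is usually a sign the hosts file still needs work.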
Opening up for connectivity
Another way to say this is that I’m opening up the surface area of attack. For dev systems behind a firewall that contain non-sensitive data, these actions should be okay. But please do so at your own risk. (sorry for the legal-ese here).
These actions are required because you will need to disable SELinux in order to install Cloudera Manager. I also disabled the firewall so that I would not need to open all of the different ports that CDH uses (i.e. being lazy here). For these changes to take effect, you will need to restart.
While installing the JDK yourself is not strictly required, since Cloudera Manager and CDH5 typically include the JDK, I usually do it anyway. Since this is a dev setup on my Mac (VMware Fusion running CentOS 6.5), I chose the latest version of Java (as of this writing, it is JDK 7u51). You can download the latest Linux x64 RPMs of Java at: http://www.java.com/en/download/help/linux_x64rpm_install.xml
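To illustrate the SELinux part, here is a small Python sketch that rewrites SELinux config text so that the SELINUX= setting is disabled. I am assuming the standard CentOS 6 location /etc/selinux/config; the function name disable_selinux and the sample text are mine, and the sketch only transforms text rather than touching the real file.

```python
import re


def disable_selinux(config_text):
    """Return config text with the SELINUX= line set to disabled."""
    new_text, count = re.subn(
        r"^SELINUX=.*$", "SELINUX=disabled", config_text, flags=re.MULTILINE
    )
    if count == 0:  # no existing setting, so append one
        new_text = config_text.rstrip("\n") + "\nSELINUX=disabled\n"
    return new_text


# Sample text resembling a default config; SELINUXTYPE must be left alone.
SAMPLE = """\
# This file controls the state of SELinux on the system.
SELINUX=enforcing
SELINUXTYPE=targeted
"""

print(disable_selinux(SAMPLE), end="")
```

On CentOS 6, the firewall side is typically handled with `service iptables stop` and `chkconfig iptables off`, though that is my assumption about the setup rather than something shown here.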
Optional Database Installation
If this is a production system, it is highly recommended that you follow these optional steps to install Postgres as a remote database (instead of an embedded database). If you are building this for your own development purposes, using the automated installation of an embedded database works fine and is much easier.
Install and Configure Postgresql: Install and configure Postgres for use with Cloudera Manager / CDH
Post-Install Steps for Postgresql: Validate that you can utilize postgresql.
Install Cloudera Manager and then CDH 5
Now that you’ve done all the above steps, you can run the automated installation of Cloudera Manager. Once this completes, it will jump into the express installation of CDH5. The handy instructions include:
The above link is handy because you can just click through the installation in the web browser and choose the appropriate configurations (e.g. YARN, Spark, etc.). Spark is included with the default CDH5 installation, so you should be good to go provided you do not uncheck it. As noted above, the installation of CM5 and CDH5 is relatively straightforward.
Some Installation Tips
Swappiness Error Message
During the installation, you may get the following error message:
Cloudera recommends setting /proc/sys/vm/swappiness to 0. Current setting is 60. Use the sysctl command to change this setting at runtime and edit /etc/sysctl.conf for this setting to be saved after a reboot. You may continue with installation, but you may run into issues with Cloudera Manager reporting that your hosts are unhealthy because they are swapping. The following hosts are affected:
To resolve this, you can run the command:
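The message itself points at the two pieces: sysctl for the running value (typically `sudo sysctl -w vm.swappiness=0`) and /etc/sysctl.conf so the setting survives a reboot. The Python sketch below reads the current value and shows how the sysctl.conf text could be updated; the function names and the sample text are my own, and it only transforms text rather than editing the real file.

```python
from pathlib import Path


def current_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the running kernel's swappiness, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def set_swappiness(sysctl_conf_text, value=0):
    """Return sysctl.conf text with vm.swappiness set to value."""
    lines = sysctl_conf_text.splitlines()
    # Drop any existing (uncommented) vm.swappiness line, then append ours.
    kept = [l for l in lines if l.split("=")[0].strip() != "vm.swappiness"]
    kept.append(f"vm.swappiness = {value}")
    return "\n".join(kept) + "\n"


# Hypothetical sysctl.conf contents for illustration.
SAMPLE = "net.ipv4.ip_forward = 0\nvm.swappiness = 60\n"

print(set_swappiness(SAMPLE), end="")
print("Running value:", current_swappiness())
```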
Reconfigure Disk space
When I built my CentOS VM, I originally used the default 20GB of disk space and later wanted to expand it to 80GB. In addition to wanting more disk space for data, Cloudera Manager has health alerts for the /opt and /var folders: if their free space drops below 10GB, you will typically get alerts. The /opt folder contains third-party software, including Cloudera’s parcels and parcel cache, while the /var folder typically contains the Cloudera logs. A VM built by VMware Fusion will typically have three partitions: sda1 (the boot device), sda2 (contains everything), and sda3 (contains the Linux swap).
To reconfigure the disk space, there is an excellent blog post: Live Resizing of an EXT4 FileSystem on Linux
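If you want to see ahead of time whether those folders are likely to trip the alerts, a quick Python sketch like the one below reports any path with less free space than a threshold. The 10GB figure comes from the alerts described above; the function name low_space is mine, and this is only an approximation of the idea, not how Cloudera Manager itself computes its health checks.

```python
import shutil

GIB = 1024 ** 3


def low_space(paths, min_free_gib=10):
    """Return {path: free GiB} for each path with less than min_free_gib free."""
    low = {}
    for path in paths:
        try:
            free_gib = shutil.disk_usage(path).free / GIB
        except OSError:  # skip paths that do not exist or cannot be read
            continue
        if free_gib < min_free_gib:
            low[path] = round(free_gib, 1)
    return low


print(low_space(["/opt", "/var"]))
```

An empty result means both folders currently have at least the threshold amount of free space.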
My original setup had the configuration below
After resizing based on the linked instructions, this is now my setup.
Quick Links to Spark Tutorials
Below are some links to jump-start you on working with Spark:
- Spark Quick Start Documentation
- Data Exploration Using Spark
- Movie Recommendation with MLLib
- Graph Analytics with GraphX
- Working with Tachyon