Learnings from Running Spark at Twitter

As part of the Seattle Spark Meetup series, we held a great Learnings from Running Spark at Twitter session at the @TwitterSeattle offices. We (the Seattle Spark Meetup organizers) want to thank Sriram Krishnan (@krishnansriram) and Benjamin Hindman (@benh) for presenting, and Jeff Currier (@jeff_currier) and @TwitterSeattle for hosting us! We also raffled off Paco Nathan’s (@pacoid) Hands-On Apache Spark Workshop (Seattle); the winner is Monir Abu Hilal (@monirabuhilal)! If you want to learn more about Spark, I highly recommend his course! Below are the links to the two sessions: Spark at Twitter: Evaluation & Lessons Learnt by @krishnansriram and Mesos for Spark Users…


Build your own CDH5 QuickStart VM with Spark on CentOS

A great way to jump into CDH5 and Spark (with the latest version of Hue) is to build your own CDH5 setup on a VM. As of this writing, a CDH5 QuickStart VM is not available (though you can download the Cloudera QuickStart VM for CDH4.5). Below are the steps to build your own CDH5 / Spark setup on CentOS 6.5. Note that the installation of CDH5 through Cloudera Manager is actually quite straightforward; instead, these instructions focus on the steps prior to installing Cloudera Manager 5 (and the express install of CDH5) to minimize the hiccups you may run…
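To give a flavor of those pre-install steps, here is a minimal shell sketch, assuming a single-node CentOS 6.5 VM; the hostname, IP address, and installer URL below are placeholders and assumptions of mine, not necessarily the post’s exact steps.

    # Run as root on the CentOS 6.5 VM; hostname and IP are placeholders.
    # Set SELinux to permissive (the Cloudera Manager installer expects this).
    setenforce 0
    sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

    # Stop the firewall so the Cloudera Manager wizard can reach its agents.
    service iptables stop
    chkconfig iptables off

    # Make sure the hostname resolves consistently.
    echo "192.168.56.101  cdh5-node.localdomain cdh5-node" >> /etc/hosts

    # Download and run the Cloudera Manager 5 installer.
    wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
    chmod u+x cloudera-manager-installer.bin
    ./cloudera-manager-installer.bin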


Quick Tip for Compressing Many Small Text Files within HDFS via Pig

One of the good (or bad, depending on your point of view) habits when working with Hadoop is that you can push your files into the Hadoop cluster and worry about making sense of this data at a later time. One of the many issues with this approach is that you may rapidly run out of disk space on your cluster or your cloud storage. A good way to alleviate this issue (outside of deleting the data) is to compress the data within HDFS. More information on how the script works is embedded within the comments. /* ** Pig Script:…
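The full script, with its explanatory comments, is in the post itself. As a flavor of the general approach, here is a hypothetical Pig sketch (not the post’s actual script; paths and alias names are placeholders) that reads a directory of small text files and writes them back out compressed:

    -- Hypothetical sketch: consolidate and compress small text files in HDFS.
    -- Enable compressed output for PigStorage (gzip here; bzip2 also works).
    SET output.compression.enabled true;
    SET output.compression.codec 'org.apache.hadoop.io.compress.GzipCodec';

    -- Load every small file under the input directory as raw lines of text.
    raw_lines = LOAD '/data/logs/small_files' USING PigStorage() AS (line:chararray);

    -- Write the lines back out as a handful of compressed part files.
    STORE raw_lines INTO '/data/logs/compressed' USING PigStorage();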


Quick Tip for extracting SQL Server data to Hive

While I have documented various techniques to transfer data from Hadoop to SQL Server / Analysis Services (e.g. How Klout changed the landscape of social media with Hadoop and BI Slides Updated, SQL Server Analysis Services to Hive, etc.), this post calls out the reverse: how to quickly extract SQL Server data to Hadoop / Hive. This is a common scenario where SQL Server is being used as your transactional store and you want to push data to some other repository for analysis, where you are mashing together semi-structured and structured data. How to minimize impact on SQL Server…
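The post walks through its own technique; as one common route (an assumption of mine, not necessarily the post’s method), Apache Sqoop can pull a SQL Server table straight into Hive. Server, database, table, and account names below are placeholders, and the Microsoft SQL Server JDBC driver jar must be on Sqoop’s classpath.

    # Hypothetical sketch using Apache Sqoop; all names are placeholders.
    # -P prompts for the password instead of embedding it in the command.
    sqoop import \
      --connect "jdbc:sqlserver://sqlprod01:1433;databaseName=Sales" \
      --username hadoop_reader -P \
      --table FactOrders \
      --hive-import --hive-table fact_orders \
      --num-mappers 1

A single mapper (and, if needed, a --where clause) keeps the read load on the transactional SQL Server light, at the cost of a slower transfer.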


Hive and Windows Auth – the curse of the backslash

Captain Avery: Put down the sword. A sword could kill us all, girl.
Amy: Yeah. Thanks. That’s actually why I’m pointing it at you.
— from “Doctor Who: The Curse of the Black Spot”

Background: Typically, when you get Hive / Hadoop up and running, everything runs pretty smoothly, especially if you use one of the demo VMs (e.g. I’m currently using the Cloudera QuickStart VM). But if you are in production and you want to secure login access to your environment, you may have Windows authentication turned on for access to one of the boxes on your Hadoop cluster…
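As a taste of the curse (a made-up illustration, not necessarily the post’s exact scenario): a Windows-authenticated account name such as DOMAIN\jdoe carries a backslash, which the Linux shell treats as an escape character, so even simple HDFS commands can silently go wrong:

    # The unquoted backslash is consumed by the shell, so this actually
    # looks up /user/DOMAINjdoe:
    hadoop fs -ls /user/DOMAIN\jdoe

    # Single quotes (or a doubled backslash) preserve the path as intended:
    hadoop fs -ls '/user/DOMAIN\jdoe'
    hadoop fs -ls /user/DOMAIN\\jdoe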


Quick Tech Tip: SETting Cloudera Hue Beeswax to create a compressed Hive table

I’m currently playing with CDH 4.1 and having fun with Hue, specifically Beeswax, which lets you execute Hive queries from a nice web UI. As noted in Hadoop compression codecs and optimizing Hive joins (and using compression to do it), using compression gives you more space and in many cases can improve query performance. Yet to my dismay, when I tried to execute a bunch of SET statements, I ended up getting the OK FAILED parse exception. Of course, this is what happens when you haven’t played with a particular tech in a while and don’t bother to do the tutorials! On the…
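For reference, the settings themselves are standard Hive. From the Hive CLI, something along these lines works (the table names are made up); in Beeswax, such key/value pairs are generally supplied through the query’s Settings section rather than as inline SET statements:

    -- Standard Hive settings (CDH4-era property names) for compressed output:
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Hypothetical example: materialize a compressed copy of an existing table.
    CREATE TABLE page_views_gz AS
    SELECT * FROM page_views;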


Why all this interest in Spark?

“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
— Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead

The above quote, from the Wired article “Spark: Open Source Superstar Rewrites Future of Big Data”, captures why I am a fan of Spark. If you are an avid hiker or outdoors-person, you already appreciate the flexibility of a Swiss Army knife (or Leatherman). It is the perfect compact tool for a variety of simple but necessary tasks, bordering on life-saving (below is a picture from my ascent to Mount…
