Apache Spark is the Smartphone of Big Data

Similar to the way the smartphone changed the way we communicate – far beyond its original goal of mobile voice telephony – Apache Spark is revolutionizing Big Data. While portability may have been the catalyst of the mobile revolution, the keys to the smartphone’s ubiquity were its ability to perform multiple tasks very well on one device and the ease of building and using a diverse range of applications. Ultimately, with the smartphone we have a general platform that has changed the way we communicate, socialize, work, and play. The smartphone has not only replaced older technologies…

Interested in a career in Data Sciences? Read Freakonomics first!

Over the last few years, a common question I’ve been asked is what it takes to become a data scientist. Often my answers centered on the technology – i.e. learn Spark, Python, and/or R; take courses in Data Sciences; play with data sets; etc. Yet, I was never fully satisfied with that answer because I had always felt that the heart of Data Sciences (and Big Data in more generic terms) is the data – or more specifically, the ability to understand the data. Recently, I re-read “Freakonomics: A Rogue Economist Explores the Hidden Side of Everything” and it dawned…

Data Engineering Reading Materials: Spark, Machine Learning, and Distributed Systems Resources

Over the last few weeks, a regular question that I’ve been asked is where to find resources about Spark, Machine Learning, and Distributed Systems. While they seem to be disparate problems, the fact is that as a Data Engineer (or someone in Data Sciences Engineering or a Data Scientist that loves scalability and performance) you need to get your feet wet in all three disciplines to truly excel.

Apache Spark

Let’s start with Apache Spark (disclosure: I am with Databricks, the company founded by the creators of Apache Spark). I am a big fan of Apache Spark because of its…

Simplify Machine Learning on Spark with Databricks

As many data scientists and engineers can attest, the majority of their time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy, and schedule jobs. More disconcerting is the need for data engineers to re-implement the models developed by data scientists for production. With Databricks, data scientists and engineers can simplify these logistical issues and spend more of their time focusing on their data problems.

Simplify Visualization

An important perspective for data scientists and engineers is the ability to quickly visualize the data and the model…

Notebook Gallery

Here are some of the notebooks created to showcase various Apache Spark use cases. These all use Databricks Community Edition, which you can get at Try Databricks. You can also access the source from: https://github.com/dennyglee/databricks.

JSON Support, GLM in SparkR, Window Functions, Random Forests, DataFrame API, ML Operations, Decision Trees, Statistical Functions, Data Import, Data Exploration, Quick Start Python, Quick Start Scala, Ad-Tech Example, Flight Delays, Genomics, Mobile Sample, Pop vs. Price LR, Pop vs. Price DF, Salesforce Leads, Spark 1.6 (Multiple), Spark 1.6

Join the WearHacks experience!

If you are interested in learning more about wearables, check out WearHacks.com – one of the fastest growing communities empowering hackers on wearable technology.   Last year, WearHacks Montréal was the biggest wearable hackathon in North America.  In 2015, there are already 13 cities hosting hackathons with mentors, industry experts, and access to technology so you can build new wearable and connected technologies. Below is an impressive infographic on how much WearHacks has grown in the last year. Check out the full infographic at: https://magic.piktochart.com/output/5369103-wearhacks_-2014-vs-2015#

Spark Ecosystem & Spark Streaming Fundamentals

This is a re-post of the Spark Ecosystem & Spark Streaming Fundamentals post on the Concur blog. In March 2015, the Seattle Spark Meetup group convened for our monthly meeting. It was a special meeting since it was actually a joint Seattle Spark Meetup and Pacific Northwest Cloudera User Group session held at the Concur Technologies headquarters in Bellevue. We had special guest speakers, Mike Olson (@mikeolson) and Hari Shreedharan (@harisr1234), whose presentations discussed the basics of Spark and Spark Streaming as well as the Spark ecosystem, specifically Cloudera’s investments in it. For those who missed it, we’re sharing Mike and…
