Spark Ecosystem & Spark Streaming Fundamentals

This is a re-post of the Spark Ecosystem & Spark Streaming Fundamentals post from the Concur blog.

In March 2015, the Seattle Spark Meetup group convened for our monthly meeting. It was a special meeting, since it was a joint Seattle Spark Meetup and Pacific Northwest Cloudera User Group session held at the Concur Technologies headquarters in Bellevue. Our special guest speakers, Mike Olson (@mikeolson) and Hari Shreedharan (@harisr1234), covered the basics of Spark and Spark Streaming as well as the Spark ecosystem, specifically Cloudera’s investments in it.

For those who missed it, we’re sharing Mike and Hari’s presentations below, along with the abstracts of their talks and a few images from the meeting.

If these topics are of interest to you, go here to learn more about the monthly Seattle Spark Meetup.

Spark Ecosystem

Mike Olson, Founder and Chief Strategy Officer, Cloudera

Cloudera has been an active sponsor of, and participant in, the UC Berkeley AMPLab since 2009, and was involved in some of the earliest design discussions for Spark. Matei Zaharia spent two summers as an intern at Cloudera while in graduate school, and we continued to monitor the progress of the project over his years at Berkeley. In 2013, after the formation of Databricks, we negotiated a reseller relationship with Databricks and brought Spark into the Cloudera product, as yet another execution engine, alongside MapReduce, Impala, HBase and others.

We were the first Hadoop vendor to see the potential of Spark for the Hadoop ecosystem and to pull it into our offering. In the years since, most other vendors have followed suit. We are, however, still the major commercial distributor for Spark, and the only company with a Hadoop-based platform to be actively involved in Spark, with contributors and committers on staff. We work closely with the global Spark community to enhance the software and to integrate it with the security, multi-tenancy, data governance and other services in our enterprise big data platform.

In this talk, I will describe Cloudera’s strategic commitment to Spark, our practical investment in the community, and how our customers are using the software. While I will touch on some technical features of Spark that make it valuable to Cloudera, this will not be primarily a technical talk.

Slides: Cloudera’s Investments in the Spark Ecosystem

Mike Olson and Hari Shreedharan presenting at the March 2015 Seattle Spark Meetup

Spark Streaming Fundamentals

Hari Shreedharan, Software Engineer, Cloudera, and Contributor to Apache Spark

Apache Spark is a flexible, scalable and fault-tolerant data processing framework that specializes in processing large amounts of data. Spark Streaming builds on top of the core library to consume data in real time from ingest systems such as Apache Kafka, Apache Flume and Amazon Kinesis, and processes the incoming data in micro-batches every few seconds.

In this talk, we will cover the basics of Spark and Spark Streaming. We will discuss Spark Streaming’s basic programming framework and how it can be used to process data in real time. We will also discuss recent advances in Spark Streaming: the design of several new features that have improved performance and eliminated any possibility of data loss.
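For readers new to the micro-batch idea mentioned in the abstract, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API; the function names (`micro_batches`, `word_counts`), the 2-second interval, and the sample events are all made up for illustration. It only shows the general shape of the model: incoming records are grouped into fixed time windows, and each window is then processed as a small batch.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape for this sketch: (arrival time in seconds, text payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive windows of `interval` seconds."""
    batch: List[str] = []
    window_end = interval
    for ts, payload in events:
        # Emit every window that closed before this event arrived.
        # A window with no events is emitted as an empty batch.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += interval
        batch.append(payload)
    yield batch  # flush the final, possibly partial, batch


def word_counts(batch: List[str]) -> Counter:
    """Per-batch processing step: count the words seen in one micro-batch."""
    return Counter(word for line in batch for word in line.split())


# Illustrative input only; a real deployment would read from an ingest system.
events = [
    (0.5, "spark streaming"),
    (1.2, "spark"),
    (2.1, "kafka flume"),
    (5.0, "spark"),
]

batches = list(micro_batches(events, interval=2.0))
counts = [word_counts(b) for b in batches]

for i, c in enumerate(counts):
    print(f"batch {i}: {dict(c)}")
```

In Spark Streaming itself the batching, distribution and fault tolerance are handled by the framework; the sketch above runs in a single process and is only meant to make the "small batch every few seconds" idea concrete.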

Slides: Real Time Data Processing using Spark Streaming
