As many data scientists and engineers can attest, the majority of their time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy, and schedule jobs. More disconcerting is the need for data engineers to re-implement the models developed by data scientists for production. With Databricks, data scientists and engineers can simplify these logistical issues and spend more of their time focusing on their data problems. Simplify Visualization An important perspective for data scientists and engineers is the ability to quickly visualize the data and the model…
Category: Spark
Notebook Gallery
Here are some of the notebooks created to showcase various Apache Spark use cases. All of them use Databricks Community Edition, which you can get at Try Databricks. You can also access the source at https://github.com/dennyglee/databricks. Notebooks: JSON Support, GLM in SparkR, Window Functions, Random Forests, DataFrame API, ML Operations, Decision Trees, Statistical Functions, Data Import, Data Exploration, Quick Start Python, Quick Start Scala, Ad-Tech Example, Flight Delays, Genomics, Mobile Sample, Pop vs. Price LR, Pop vs. Price DF, Salesforce Leads, Spark 1.6 (Multiple), Spark 1.6
Spark Ecosystem & Spark Streaming Fundamentals
This is a re-post of the Spark Ecosystem & Spark Streaming Fundamentals post on the Concur blog. In March 2015, the Seattle Spark Meetup group convened for our monthly meeting. It was a special meeting since it was actually a joint Seattle Spark Meetup and Pacific Northwest Cloudera User Group session held at the Concur Technologies headquarters in Bellevue. We had special guest speakers, Mike Olson (@mikeolson) and Hari Shreedharan (@harisr1234), whose presentations discussed the basics of Spark and Spark Streaming as well as the Spark ecosystem, specifically Cloudera’s investments in it. For those who missed it, we’re sharing Mike and…
Presentation: Concur Discovers the True Value of Data
Concur, the leading provider of spend management solutions and services, will be joining us to discuss how they implemented Cloudera for data discovery and analytics. Using an enterprise data hub, Concur was able to provide their data scientists a centralized environment that allowed for faster and smarter analytic development.
Feb Spark Events: Data Discovery, Dato & Spark, and Spark Camp
An awesomely busy February is coming up for those who are interested in all things Apache Spark! Concur Discovers the True Value of Data A joint Cloudera and Concur webinar on February 2nd, 2015, where we will discuss the benefits of utilizing CDH5 within Concur’s modern Big Data architecture (including Spark, of course!). Better Together: Dato + GraphLab We’ve got a great Seattle Spark Meetup event featuring three speakers highlighting the integration between Dato’s GraphLab Create and Apache Spark. Come join us at Concur’s new Bellevue Training Center to network and learn more about GraphLab Create and Apache Spark. Strata +…
Spark atop Mesos on Google Cloud Platform querying Google Cloud Storage
A great reason to jump into Spark on Mesos on Google Cloud Platform is that you can quickly spin up a development environment to work with Spark, Mesos, Google Cloud, and Marathon together. A good way to set this up is to follow the steps in Paco Nathan’s (@pacoid) blog post Spark atop Mesos on Google Cloud Platform. But what’s missing from this configuration is the ability to connect to Google Cloud Storage (GCS) so you can run your Spark queries against persistent, elastic storage. As noted in the diagram below, you will first install Spark…
Yes, you can connect Tableau to SparkSQL (Spark 1.1)
As a data scientist and engineer, I appreciate that Apache Spark has many components that make it easy to analyze data, gain insight, and generate recommendations. However, as noted in my previous presentation, one of the things missing is an easy way for analysts to visualize their data. The good news is that there is an easy way to visualize your data by connecting Tableau to SparkSQL! As noted in my Tableau Data14 presentation (slides are embedded below), there is an unofficial method to connect Tableau to SparkSQL. For more information, please read on at An Absolutely…
The Future of Hadoop: A deeper look at Apache Spark
Understand why Apache Spark has experienced such wide adoption and learn about some Spark use cases today. There is also a technical deep dive into the architecture, as well as our vision for the Hadoop ecosystem and why we believe Spark is the successor to MapReduce for Hadoop data processing. Here’s the link to The Future of Hadoop: A deeper look at Apache Spark webinar.