On-Time Flight Performance with GraphFrames for Apache Spark

Feature Image: NASA Goddard Space Flight Center: City Lights of the United States 2012

This is an abridged version of the full blog post On-Time Flight Performance with GraphFrames. You can also reference the webinar GraphFrames: DataFrame-based graphs for Apache Spark and the On-Time Flight Performance with GraphFrames for Apache Spark notebook. An intuitive approach to understanding flight departure delays is to use graph structures.

Why Graph?
Graph structures are a more intuitive approach to many classes of data problems: social networks, restaurant recommendations, or flight paths. It is easier to understand these data problems…
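The graph framing described above can be illustrated with a small sketch. This is plain Python rather than the GraphFrames API, and the airports and delay values are made-up sample data: airports act as vertices, and each flight is an edge carrying its departure delay in minutes.

```python
from collections import defaultdict

# Hypothetical sample data: each edge is one flight,
# (origin airport, destination airport, departure delay in minutes).
flights = [
    ("SEA", "SFO", 12),
    ("SEA", "JFK", 45),
    ("SFO", "SEA", -3),
    ("SFO", "JFK", 20),
    ("JFK", "SEA", 0),
]


def out_degrees(edges):
    """Count departing flights per airport (the vertex out-degree)."""
    counts = defaultdict(int)
    for src, _dst, _delay in edges:
        counts[src] += 1
    return dict(counts)


def average_delay_by_origin(edges):
    """Average departure delay over all edges leaving each airport."""
    totals = defaultdict(lambda: [0, 0])  # origin -> [delay sum, flight count]
    for src, _dst, delay in edges:
        totals[src][0] += delay
        totals[src][1] += 1
    return {src: total / n for src, (total, n) in totals.items()}


def delayed_routes(edges, threshold=15):
    """Return (origin, destination) pairs whose delay exceeds the threshold."""
    return [(src, dst) for src, dst, delay in edges if delay > threshold]


if __name__ == "__main__":
    print(out_degrees(flights))
    print(average_delay_by_origin(flights))
    print(delayed_routes(flights))
```

Questions such as "which airports have the most departures" or "which routes are most delayed" then become simple operations over vertices and edges, which is the intuition the graph approach relies on.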

Data Exploration with Databricks

Today, it was also featured on InsideBigData: Data Exploration with Databricks. Awesome! This Data Exploration on Databricks jump start video shows you how to go from data source to visualization in a few easy steps. Specifically, we take semi-structured logs, extract and transform them, then analyze and visualize the data using Spark SQL so we can quickly understand our data. For more information and other Spark notebooks, see Selected Notebooks > Databricks Jump Start.
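The extract, transform, and analyze flow described above can be sketched in a few lines. The video itself uses Spark SQL on Databricks; the version below is a standard-library Python stand-in, and the log layout, sample lines, and field names are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical log lines in a common web-server access-log layout.
sample_logs = [
    '10.0.0.1 - - [01/Jan/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2016:10:00:01 +0000] "GET /missing HTTP/1.1" 404 0',
    '10.0.0.1 - - [01/Jan/2016:10:00:02 +0000] "POST /login HTTP/1.1" 200 512',
    'not a valid log line',
]

# Extract: named groups pull structured fields out of each raw line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)$'
)


def parse_line(line):
    """Return a dict of fields for a well-formed line, or None otherwise."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Transform: convert numeric fields from strings to integers.
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


# Malformed lines are dropped rather than raising an error.
records = [r for r in map(parse_line, sample_logs) if r is not None]

# Analyze: the equivalent of a GROUP BY status with COUNT(*).
status_counts = Counter(r["status"] for r in records)

# Analyze: the equivalent of a GROUP BY ip with SUM(size).
bytes_by_ip = Counter()
for r in records:
    bytes_by_ip[r["ip"]] += r["size"]

if __name__ == "__main__":
    print(status_counts)
    print(bytes_by_ip)
```

In Spark the same shape applies at scale: parse lines into rows, register them as a table, and express the aggregations as SQL queries.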

Apache Spark is the Smartphone of Big Data

Similar to the way the smartphone changed the way we communicate, far beyond its original goal of mobile voice telephony, Apache Spark is revolutionizing Big Data. While portability may have been the catalyst of the mobile revolution, the key to the smartphone's ubiquity was that one device could perform multiple tasks very well and make it easy to build and use a diverse range of applications. Ultimately, with the smartphone we have a general platform that has changed the way we communicate, socialize, work, and play. The smartphone has not only replaced older technologies…

Data Engineering Reading Materials: Spark, Machine Learning, and Distributed Systems Resources

Over the last few weeks, a question I’ve regularly been asked is where to find resources about Spark, Machine Learning, and Distributed Systems. While they seem to be disparate problems, the fact is that as a Data Engineer (or someone in Data Sciences Engineering, or a Data Scientist who loves scalability and performance) you need to get your feet wet in all three disciplines to truly excel.

Apache Spark
Let’s start with Apache Spark (disclosure: I am with Databricks, the company founded by the creators of Apache Spark). I am a big fan of Apache Spark because of its…

Notebook Gallery

Here are some of the notebooks created to showcase various Apache Spark use cases. These all use Databricks Community Edition, which you can get at Try Databricks. You can also access the source from: https://github.com/dennyglee/databricks.

JSON Support, GLM in SparkR, Window Functions, Random Forests, DataFrame API, ML Operations, Decision Trees, Statistical Functions, Data Import, Data Exploration, Quick Start Python, Quick Start Scala, Ad-Tech Example, Flight Delays, Genomics, Mobile Sample, Pop vs. Price LR, Pop vs. Price DF, Salesforce Leads, Spark 1.6 (Multiple), Spark 1.6

Quick Tip: Dropping Phantom Hive Databases (e.g. CDH5 Canary test dB)

While I’m a big fan of CDH5 and Hue, sometimes I will see some funkiness that’s a tad irritating. Specifically, there is a database with a name similar to cloudera_manager_metastore_canary_test_db_hive_hivemetastore_$guid$_2014_10_06_11_20_41. Even more irritating, there is a table called cm_test_table which cannot be deleted (or renamed or even described).

hive> describe cm_test_table;
FAILED: SemanticException [Error 10001]: Table not found cm_test_table
hive> alter table cm_test_table RENAME to cm_test_table2;
FAILED: SemanticException [Error 10001]: Table not found cm_test_table
hive> drop table cm_test_table;
FAILED: SemanticException [Error 10001]: Table not found cm_test_table

To work around this problem, it’s a matter of using the CASCADE reference to…

Spark atop Mesos on Google Cloud Platform querying Google Cloud Storage

A great reason to jump into Spark on Mesos on Google Cloud Platform is that you can quickly spin up a development environment to work with Spark, Mesos, Google Cloud, and Marathon together. A good way to set this up is to follow the steps in Paco Nathan’s (@pacoid) blog post Spark atop Mesos on Google Cloud Platform. What’s missing from this configuration is the ability to connect to Google Cloud Storage (GCS) so you can run your Spark queries against persistent, elastic storage. As noted in the diagram below, you will first install Spark…
