On-Time Flight Performance with GraphFrames for Apache Spark

Feature Image: NASA Goddard Space Flight Center: City Lights of the United States 2012

This is an abridged version of the full blog post On-Time Flight Performance with GraphFrames. You can also reference the webinar GraphFrames: DataFrame-based graphs for Apache Spark and the On-Time Flight Performance with GraphFrames for Apache Spark notebook.


An intuitive approach to understanding flight departure delays is to use graph structures.

Why Graph?

Graph structures are a more intuitive way to approach many classes of data problems: social networks, restaurant recommendations, or flight paths. These problems are easier to understand in terms of graph structures: vertices, edges, and properties. For example, flight data analysis is a classic graph problem:

  • airports are represented by vertices
  • flights are represented by edges
  • numerous properties are associated with these flights, including but not limited to departure delays, plane type, and carrier
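
As a minimal sketch of this mapping in plain Python (no Spark required), the toy records below show airports as vertices and flights as edges that carry properties. Every value, including the carrier codes and the `flights_from` helper, is made up for illustration and is not taken from the dataset.

```python
# Toy graph: airports as vertices, flights as edges with properties.
# All values below are hypothetical and only illustrate the structure.
vertices = {
    "SFO": {"city": "San Francisco"},
    "JFK": {"city": "New York"},
    "IAH": {"city": "Houston"},
}

edges = [
    {"src": "IAH", "dst": "SFO", "delay": -4, "carrier": "XX"},
    {"src": "SFO", "dst": "JFK", "delay": 536, "carrier": "YY"},
]

def flights_from(airport, edges):
    """Return every edge (flight) that departs from the given vertex (airport)."""
    return [e for e in edges if e["src"] == airport]

for flight in flights_from("SFO", edges):
    print(flight["src"], "->", flight["dst"], "delay:", flight["delay"])
```

Keeping the properties on the edges is what later lets a query filter on delay while following connections between airports.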

Jumping into GraphFrames

GraphFrames for Apache Spark supports general graph processing and takes advantage of the performance and distribution capabilities of Apache Spark DataFrames. For more information, please refer to Introducing GraphFrames. For example, the code snippet below lets us quickly build a GraphFrame from two tables (DataFrames): airports (the vertices of the graph) and flights between those airports (the edges of the graph).

# Import graphframes (from Spark-Packages)
from graphframes import *

# Create Vertices (airports) and Edges (flights)
tripVertices = airports.withColumnRenamed("IATA", "id").distinct()
tripEdges = departureDelays.select("tripid", "delay", "src", "dst", "city_dst", "state_dst")

# This GraphFrame builds upon the vertices and edges based on our trips (flights)
tripGraph = GraphFrame(tripVertices, tripEdges)
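
The rename of IATA to id above matters because GraphFrames looks for an "id" column on the vertex DataFrame and "src" and "dst" columns on the edge DataFrame. The sketch below illustrates that convention with plain lists of column names. The `missing_graph_columns` helper and the "City" column are hypothetical and are not part of GraphFrames or the notebook.

```python
# GraphFrames looks for an "id" column on the vertex DataFrame and
# "src"/"dst" columns on the edge DataFrame. This helper is hypothetical
# and only checks plain lists of column names, not real DataFrames.
REQUIRED_VERTEX_COLUMNS = {"id"}
REQUIRED_EDGE_COLUMNS = {"src", "dst"}

def missing_graph_columns(vertex_columns, edge_columns):
    """Return the required column names absent from each side, sorted."""
    return {
        "vertices": sorted(REQUIRED_VERTEX_COLUMNS - set(vertex_columns)),
        "edges": sorted(REQUIRED_EDGE_COLUMNS - set(edge_columns)),
    }

edge_columns = ["tripid", "delay", "src", "dst", "city_dst", "state_dst"]

# Before the rename, the airports table exposes "IATA" rather than "id".
# The "City" column name is assumed for illustration.
print(missing_graph_columns(["IATA", "City"], edge_columns))

# After withColumnRenamed("IATA", "id"), nothing is missing.
print(missing_graph_columns(["id", "City"], edge_columns))
```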

As we’re using Databricks Community Edition to view this notebook, we can use the display() command to view our DataFrames.

Image: the tripEdges DataFrame rendered with display()

You can also run simple queries using GraphFrames like the query below:

# Use print() as a function so the snippet runs on both Python 2 and Python 3
print("Airports: %d" % tripGraph.vertices.count())
print("Trips: %d" % tripGraph.edges.count())

which returns the output:

Airports: 279
Trips: 1361141

 

Using Motif Finding to understand flight delays

The real power of graphs comes from graph algorithms. A great example is motif finding, which helps us understand the complex relationships between airports (vertices) and flights (edges), and ultimately flight delays. In this example, we ask the question: what delays might we blame on SFO?

# tripGraphPrime is not defined in this abridged post (see the full notebook)
motifs = tripGraphPrime.find("(a)-[ab]->(b); (b)-[bc]->(c)")\
 .filter("(b.id = 'SFO') and (ab.delay > 500 or bc.delay > 500) and bc.tripid > ab.tripid and bc.tripid > ab.tripid + 10000")
display(motifs)

With SFO as the connecting city (b), we are looking for all flights [ab] from any origin city (a) that connect to SFO (b) before flying [bc] to any destination city (c). The filter keeps matches where the delay for either flight ([ab] or [bc]) is greater than 500 minutes and the second flight [bc] occurred within approximately a day of the first flight [ab].
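
To make the pattern concrete, here is a minimal plain-Python sketch of what the two-hop motif matches: pairs of edges where the first flight lands at the hub and the second departs from it. It is not how GraphFrames evaluates motifs. The edge records and the `two_hop_motifs` helper are hypothetical, and the sketch keeps only the hub, delay, and ordering conditions, leaving out the tripid-gap condition from the filter.

```python
from itertools import product

# Hypothetical toy edges; the real query runs on the full flights DataFrame.
edges = [
    {"tripid": 1010900, "src": "IAH", "dst": "SFO", "delay": -4},
    {"tripid": 1011500, "src": "SFO", "dst": "JFK", "delay": 536},
    {"tripid": 1011600, "src": "SFO", "dst": "SEA", "delay": 12},
    {"tripid": 1011200, "src": "LAX", "dst": "JFK", "delay": 700},
]

def two_hop_motifs(edges, hub, min_delay):
    """Find edge pairs (ab, bc) where ab ends at hub and bc starts at hub,
    either leg is delayed by more than min_delay, and bc has a later tripid."""
    matches = []
    for ab, bc in product(edges, repeat=2):
        connects_at_hub = ab["dst"] == hub and bc["src"] == hub
        long_delay = ab["delay"] > min_delay or bc["delay"] > min_delay
        in_order = bc["tripid"] > ab["tripid"]
        if connects_at_hub and long_delay and in_order:
            matches.append((ab, bc))
    return matches

for ab, bc in two_hop_motifs(edges, hub="SFO", min_delay=500):
    print(ab["src"], "->", ab["dst"], "->", bc["dst"],
          "| delays:", ab["delay"], bc["delay"])
```

The nested comparison over all edge pairs is fine for a toy list but would not scale to over a million trips, which is why the query is pushed down to DataFrames instead.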

Below is an abridged subset from this query where the columns are the respective motif keys.

Image: abridged results of the motif finding query

With this motif finding query, we have quickly determined that passengers in this dataset left Houston and Tucson for San Francisco on time or a little early [1011126]. But any of those passengers who were flying on to New York through this connecting flight in SFO [1021507] were delayed by 536 minutes.

Visualizing these flights in D3

To get a powerful visualization of the flight paths and connections in this dataset, we can use the Airports D3 visualization within our Databricks notebook. By connecting our GraphFrames, DataFrames, and D3 visualizations, we can see the scope of all the flight connections, shown below for all on-time or early departing flights in this dataset. The blue circles represent the vertices (airports), and the size of each circle reflects the number of edges (flights) in and out of that airport. The black lines are the edges themselves (flights) and their connections to the other vertices (airports). Any edges that go offscreen lead to vertices (airports) in the states of Hawaii and Alaska.
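
The circle sizes correspond to the degree of each vertex. As a rough sketch of that count in plain Python, the example below tallies the edges touching each airport. The flight pairs and the `degree_by_airport` helper are hypothetical, and this is not the notebook's D3 code.

```python
from collections import Counter

# Hypothetical (src, dst) pairs standing in for the flight edges.
flights = [
    ("SFO", "JFK"),
    ("SFO", "SEA"),
    ("IAH", "SFO"),
    ("SEA", "SFO"),
    ("JFK", "SEA"),
]

def degree_by_airport(flights):
    """Count the edges touching each airport (in-degree plus out-degree)."""
    degrees = Counter()
    for src, dst in flights:
        degrees[src] += 1
        degrees[dst] += 1
    return degrees

# Airports with more flights in and out come first.
for airport, degree in degree_by_airport(flights).most_common():
    print(airport, degree)
```

On a GraphFrame, the degrees DataFrame provides the same per-vertex count without collecting the edges locally.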

Image: D3 visualization of airports and their flight connections

 

What’s Next?

There are a lot of good resources for working with GraphFrames and the flight delay datasets including:

 

Enjoy!
