This is an abridged version of the full blog post On-Time Flight Performance with GraphFrames. You can also reference the webinar GraphFrames: DataFrame-based graphs for Apache Spark and the On-Time Flight Performance with GraphFrames for Apache Spark notebook.
An intuitive approach to understanding flight departure delays is to use graph structures.
Graph structures offer a more intuitive way to approach many classes of data problems, such as social networks, restaurant recommendations, or flight paths. These problems are easier to understand in terms of a graph's vertices, edges, and properties. For example, flight data analysis is a classic graph problem:
- airports are represented by vertices
- flights are represented by edges
- numerous properties are associated with these flights, including but not limited to departure delays, plane type, and carrier
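To make the vertex/edge/property mapping concrete, here is a minimal plain-Python sketch. It does not use GraphFrames or the dataset from this post: the airports, flights, delays, and carrier codes are invented for illustration, and `flights_from` is a hypothetical helper.

```python
from statistics import mean

# Vertices: airports keyed by IATA code (toy data, not from the real dataset).
airports = {
    "SFO": {"city": "San Francisco", "state": "CA"},
    "SEA": {"city": "Seattle", "state": "WA"},
    "JFK": {"city": "New York", "state": "NY"},
}

# Edges: one record per flight, carrying properties such as delay and carrier.
flights = [
    {"src": "SFO", "dst": "JFK", "delay": 30, "carrier": "UA"},
    {"src": "SFO", "dst": "SEA", "delay": -4, "carrier": "AS"},
    {"src": "SEA", "dst": "SFO", "delay": 10, "carrier": "AS"},
]


def flights_from(edges, origin):
    """Return every edge (flight) that departs from the given vertex (airport)."""
    return [edge for edge in edges if edge["src"] == origin]


sfo_departures = flights_from(flights, "SFO")
sfo_mean_delay = mean(edge["delay"] for edge in sfo_departures)
print("SFO departures:", len(sfo_departures), "mean delay:", sfo_mean_delay)
```

Questions about an airport become questions about the edges touching a vertex, which is the framing the rest of this post relies on.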
Jumping into GraphFrames
GraphFrames for Apache Spark supports general graph processing and can take advantage of the performance and distribution capabilities of Apache Spark DataFrames. For more information, please refer to Introducing GraphFrames. The code snippet below, for example, quickly builds a GraphFrame from two tables (DataFrames): airports (the vertices of the graph) and flights between those airports (the edges).
# Import graphframes (from Spark-Packages)
from graphframes import *

# Create Vertices (airports) and Edges (flights)
tripVertices = airports.withColumnRenamed("IATA", "id").distinct()
tripEdges = departureDelays.select("tripid", "delay", "src", "dst", "city_dst", "state_dst")

# This GraphFrame builds upon the vertices and edges based on our trips (flights)
tripGraph = GraphFrame(tripVertices, tripEdges)
As we’re using Databricks Community Edition to view this notebook, we can use the display() command to view our DataFrames.
You can also run simple queries using GraphFrames like the query below:
# Count the vertices (airports) and edges (flights) in the graph
print("Airports: %d" % tripGraph.vertices.count())
print("Trips: %d" % tripGraph.edges.count())
which returns the output:
Airports: 279
Trips: 1361141
Using Motif Finding to understand flight delays
But the real power of graphs comes from graph algorithms. A great example is motif finding, which helps us understand the complex relationships between airports (vertices) and flights (edges), and ultimately flight delays. In this example, we’re asking the question: what delays might we blame on SFO?
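# Find two-leg trips (a) -> (b) -> (c) that connect through SFO, where either
# leg is delayed by more than 500 minutes and the second leg departs after the
# first, within roughly a day (tripid + 10000)
motifs = tripGraphPrime.find("(a)-[ab]->(b); (b)-[bc]->(c)")\
  .filter("(b.id = 'SFO') and (ab.delay > 500 or bc.delay > 500) and bc.tripid > ab.tripid and bc.tripid < ab.tripid + 10000")
display(motifs)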
With SFO as the connecting city (b), we are looking for all flights [ab] from any origin city (a) that connect to SFO (b) before flying [bc] to any destination city (c). We filter the results so that the delay for either flight ([ab] or [bc]) is greater than 500 minutes and the second flight [bc] occurred within approximately a day of the first flight [ab].
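The logic of this motif query can be sketched in plain Python, without Spark. This illustrates what the pattern and filter select; it is not how GraphFrames evaluates the query. The flight rows are made up (the 536-minute delay only echoes the result discussed in this post), `find_connections` is a hypothetical helper, and the sketch assumes `tripid` is an integer of the form MMDDHHMM, so that adding 10000 moves the window forward by about one day.

```python
# Toy edges; tripid is assumed to encode the departure time as MMDDHHMM.
flights = [
    {"tripid": 1011635, "delay": -6, "src": "IAH", "dst": "SFO"},
    {"tripid": 1021507, "delay": 536, "src": "SFO", "dst": "JFK"},
    {"tripid": 1050800, "delay": 600, "src": "SFO", "dst": "JFK"},  # days later: outside the window
    {"tripid": 1011700, "delay": 5, "src": "IAH", "dst": "ORD"},    # never touches SFO
]


def find_connections(edges, hub, min_delay=500, window=10000):
    """Return (ab, bc) edge pairs matching (a)-[ab]->(hub); (hub)-[bc]->(c)."""
    matches = []
    for ab in edges:
        if ab["dst"] != hub:
            continue  # first leg must arrive at the hub
        for bc in edges:
            if bc["src"] != hub:
                continue  # second leg must depart from the hub
            # Either leg is delayed by more than min_delay minutes.
            delayed = ab["delay"] > min_delay or bc["delay"] > min_delay
            # Second leg departs after the first, within the tripid window.
            in_window = ab["tripid"] < bc["tripid"] < ab["tripid"] + window
            if delayed and in_window:
                matches.append((ab, bc))
    return matches


matches = find_connections(flights, "SFO")
for ab, bc in matches:
    print(ab["src"], "->", ab["dst"], "->", bc["dst"], "delays:", ab["delay"], bc["delay"])
```

Only the Houston (IAH) to SFO to JFK pair survives: the later SFO departure is also heavily delayed, but it falls outside the one-day window.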
Below is an abridged subset from this query where the columns are the respective motif keys.
With this motif finding query, we have quickly determined that passengers in this dataset left Houston and Tucson for San Francisco on time or a little early. But any of those passengers who were flying on to New York through this connecting flight in SFO were delayed by 536 minutes.
Visualizing these flights in D3
To get a powerful visualization of the flight paths and connections in this dataset, we can leverage the Airports D3 visualization within our Databricks notebook. By connecting our GraphFrames, DataFrames, and D3 visualizations, we can visualize the scope of all of the flight connections, as shown below for all on-time or early departing flights within this dataset. The blue circles represent the vertices (i.e. airports), where the size of each circle represents the number of edges (i.e. flights) in and out of that airport. The black lines are the edges themselves (i.e. flights) and their respective connections to the other vertices (i.e. airports). Note that any edges that go offscreen connect to vertices (i.e. airports) in the states of Hawaii and Alaska.
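The circle sizes described above correspond to each vertex's degree: the number of edges in and out of it. GraphFrames exposes this as a `degrees` DataFrame on a graph; the plain-Python sketch below only illustrates the idea. The flight rows are invented, the `degrees` function is a hypothetical helper, and treating `delay <= 0` as "on time or early" is an assumption.

```python
from collections import Counter

# Toy edges (not from the real dataset).
flights = [
    {"src": "SEA", "dst": "SFO", "delay": -5},
    {"src": "SFO", "dst": "JFK", "delay": 0},
    {"src": "SFO", "dst": "HNL", "delay": -2},
    {"src": "JFK", "dst": "SFO", "delay": 45},  # late departure, excluded below
]


def degrees(edges):
    """Count edges in and out of each vertex (in-degree plus out-degree)."""
    counts = Counter()
    for edge in edges:
        counts[edge["src"]] += 1
        counts[edge["dst"]] += 1
    return counts


# Keep only on-time or early departures (assumed here to mean delay <= 0).
on_time = [edge for edge in flights if edge["delay"] <= 0]
circle_sizes = degrees(on_time)
print(circle_sizes.most_common())
```

In this toy graph SFO has the largest degree, so it would be drawn as the largest circle.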
There are a lot of good resources for working with GraphFrames and the flight delay datasets including:
- The full blog post On-Time Flight Performance with GraphFrames
- GraphFrames: DataFrame-based graphs for Apache Spark webinar
- On-Time Flight Performance with GraphFrames for Apache Spark notebook