An intuitive approach to understanding flight departure delays is to use graph structures with Apache Spark™ GraphFrames.
This is an abridged version of On-Time Flight Performance with GraphFrames. The associated notebook works on Apache Spark 2.0+.
Feature Image: NASA Goddard Space Flight Center: City Lights of the United States 2012
Graph structures offer a more intuitive approach to many classes of data problems, such as social networks, restaurant recommendations, or flight paths. These problems are easier to understand in terms of a graph's vertices, edges, and properties. For example, flight data analysis is a classic graph problem:
- Airports are represented by vertices.
- Flights are represented by edges.
- Numerous properties are associated with these flights, including but not limited to departure delays, plane type, and carrier.
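As a rough illustration of this model outside Spark, the sketch below holds a few airports and flights in plain Python structures. The `Flight` class, the `departures` helper, and all codes, delays, and carriers are made-up illustration values, not rows from the flight dataset.

```python
from dataclasses import dataclass

# Hypothetical toy data for illustration only; not taken from the flight dataset.
@dataclass(frozen=True)
class Flight:
    src: str      # origin airport code (vertex id)
    dst: str      # destination airport code (vertex id)
    delay: int    # departure delay in minutes (negative means early)
    carrier: str  # one of many possible edge properties

# Vertices: airports keyed by their code.
airports = {"SEA": "Seattle", "SFO": "San Francisco", "JFK": "New York"}

# Edges: each flight connects two vertices and carries its own properties.
flights = [
    Flight("SEA", "SFO", -3, "AS"),
    Flight("SFO", "JFK", 42, "UA"),
    Flight("SEA", "JFK", 0, "DL"),
]

def departures(code):
    """Return all flights (edges) leaving the given airport (vertex)."""
    return [f for f in flights if f.src == code]

for f in departures("SEA"):
    print(f"{airports[f.src]} -> {airports[f.dst]}: delay {f.delay} min ({f.carrier})")
```

Keeping the properties on the edge, rather than on the airport, mirrors the list above: a delay belongs to a particular flight, not to a place.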
Jumping into GraphFrames
GraphFrames for Apache Spark supports general graph processing and can take advantage of the performance and distribution capabilities of Apache Spark DataFrames. For more information, please refer to Introducing GraphFrames. For example, the code snippet below allows us to quickly build a GraphFrame based on two tables (DataFrames): airports (vertices in a graph) and flights between those airports (edges in a graph).
    # Import graphframes (from Spark-Packages)
    from graphframes import *

    # Create Vertices (airports) and Edges (flights)
    tripVertices = airports.withColumnRenamed("IATA", "id").distinct()
    tripEdges = departureDelays_geo.select("tripid", "delay", "src", "dst", "city_dst", "state_dst")

    # This GraphFrame builds upon the vertices and edges based on our trips (flights)
    tripGraph = GraphFrame(tripVertices, tripEdges)
As we’re using Databricks Community Edition to view this notebook, we can use the display() command to view our DataFrames.
You can also run simple queries using GraphFrames, such as the query below:
    # print() with parentheses runs on both Python 2 and Python 3
    print("Airports: %d" % tripGraph.vertices.count())
    print("Trips: %d" % tripGraph.edges.count())
which returns the output:
    Airports: 279
    Trips: 1361141
Using motif finding to understand flight delays
The real power of graphs, however, lies in graph algorithms. A great example is motif finding, which helps us understand the complex relationships between airports (vertices), flights (edges), and flight delays. So let’s ask the question: what delays might we blame on SFO?
    # tripGraphPrime is not defined in this abridged post; see the full notebook
    motifs = tripGraphPrime.find("(a)-[ab]->(b); (b)-[bc]->(c)")\
      .filter("(b.id = 'SFO') and (ab.delay > 500 or bc.delay > 500) and bc.tripid > ab.tripid and bc.tripid < ab.tripid + 10000")
    display(motifs)
With SFO as the connecting city (b), we are looking for all flights [ab]
- from any origin city (a) that connect to SFO (b)
- before flying [bc] to any destination city (c),
- filtered such that
  - the delay for either flight ([ab] or [bc]) is greater than 500 minutes, and
  - the second flight (bc) departed after the first flight (ab), within approximately a day.
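To make the filter logic concrete, here is a minimal plain-Python sketch of the same two-hop pattern, without Spark. It assumes `tripid` encodes the departure time as MMDDHHMM digits, so that adding 10000 moves forward one day; the post itself only says the window is approximately a day. The flights are invented for illustration, and `Edge` and `find_delayed_connections` are hypothetical names, not part of GraphFrames.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    tripid: int  # assumed MMDDHHMM as an integer, e.g. 1010700 = Jan 1, 07:00
    delay: int   # departure delay in minutes
    src: str
    dst: str

# Hypothetical flights, invented for illustration.
edges = [
    Edge(1010700, -4, "IAH", "SFO"),   # on time, arrives at the hub
    Edge(1011635, 536, "SFO", "JFK"),  # badly delayed onward flight, same day
    Edge(1010800, 510, "TUS", "SFO"),  # badly delayed inbound flight
    Edge(1031000, 5, "SFO", "SEA"),    # more than a day after both inbound flights
    Edge(1011200, 600, "SEA", "ORD"),  # delayed, but does not touch SFO
]

def find_delayed_connections(edges, hub="SFO", min_delay=500, window=10000):
    """Pair each flight into `hub` (ab) with each flight out of it (bc)."""
    matches = []
    for ab in edges:
        if ab.dst != hub:      # ab must end at the hub (vertex b)
            continue
        for bc in edges:
            if bc.src != hub:  # bc must start at the hub
                continue
            delayed = ab.delay > min_delay or bc.delay > min_delay
            in_window = ab.tripid < bc.tripid < ab.tripid + window
            if delayed and in_window:
                matches.append((ab, bc))
    return matches

for ab, bc in find_delayed_connections(edges):
    print(f"{ab.src} -> SFO (delay {ab.delay}) then SFO -> {bc.dst} (delay {bc.delay})")
```

The nested loop is only for readability on a handful of rows; on the real dataset the motif query lets Spark perform the equivalent join in a distributed fashion.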
Below is an abridged subset from this query where the columns are the respective motif keys.
With this motif finding query, we have quickly determined that passengers in this dataset left Houston and Tucson for San Francisco on time or a little early. But any of those passengers who were flying on to New York through this connecting flight in SFO were delayed by 536 minutes.
Visualizing these flights in D3
To get a powerful visualization of the flight paths and connections in this dataset, we can leverage the Airports D3 visualization within our Databricks notebook. By connecting our GraphFrames, DataFrames, and D3 visualizations, we can visualize the scope of all of the flight connections, as noted below, for all on-time or early departing flights within this dataset. The blue circles represent the vertices (i.e., airports), and the circle size represents the number of edges (i.e., flights) in and out of those airports. The black lines are the edges themselves (i.e., flights) and their respective connections to the other vertices (i.e., airports). Note that any edges that go offscreen represent vertices (i.e., airports) in the states of Hawaii and Alaska.
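The circle sizes in such a visualization come down to a per-airport edge count. Below is a minimal sketch of that count in plain Python, using invented flights and treating a delay of zero or less as on-time or early (an assumption about how that filter is defined, not something the post specifies).

```python
from collections import Counter

# Hypothetical (src, dst, delay_minutes) edges; invented for illustration.
flights = [
    ("SEA", "SFO", -3),
    ("SFO", "JFK", 0),
    ("SFO", "SEA", 25),   # late departure, excluded below
    ("JFK", "SFO", -10),
]

# Keep on-time or early departures (assumed here to mean delay <= 0).
on_time = [(src, dst) for src, dst, delay in flights if delay <= 0]

# Degree = flights in plus flights out; this is what would drive circle size.
degree = Counter()
for src, dst in on_time:
    degree[src] += 1
    degree[dst] += 1

for airport, count in degree.most_common():
    print(airport, count)
```

In GraphFrames itself, a graph's `degrees` DataFrame provides the same in-plus-out count per vertex, so a hand-rolled counter like this is only needed for illustration.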
There are a lot of good resources for working with GraphFrames and the flight delay datasets, including:
- Source: On-Time Flight Performance with GraphFrames
- On-Time Flight Performance with GraphFrames for Apache Spark notebook