NASA image acquired April 18 - October 23, 2012

This image of the United States of America at night is a composite assembled from data acquired by the Suomi NPP satellite in April and October 2012. The image was made possible by the new satellite’s “day-night band” of the Visible Infrared Imaging Radiometer Suite (VIIRS), which detects light in a range of wavelengths from green to near-infrared and uses filtering techniques to observe dim signals such as city lights, gas flares, auroras, wildfires, and reflected moonlight.

“Nighttime light is the most interesting data that I’ve had a chance to work with,” says Chris Elvidge, who leads the Earth Observation Group at NOAA’s National Geophysical Data Center. “I’m always amazed at what city light images show us about human activity.” His research group has been approached by scientists seeking to model the distribution of carbon dioxide emissions from fossil fuels and to monitor the activity of commercial fishing fleets. Biologists have examined how urban growth has fragmented animal habitat. Elvidge even learned once of a study of dictatorships in various parts of the world and how nighttime lights had a tendency to expand in the dictator’s hometown or province.

Named for satellite meteorology pioneer Verner Suomi, NPP flies over any given point on Earth's surface twice each day at roughly 1:30 a.m. and p.m. The polar-orbiting satellite flies 824 kilometers (512 miles) above the surface, sending its data once per orbit to a ground station in Svalbard, Norway, and continuously to local direct broadcast users distributed around the world. Suomi NPP is managed by NASA with operational support from NOAA and its Joint Polar Satellite System, which manages the satellite's ground system.

NASA Earth Observatory image by Robert Simmon, using Suomi NPP VIIRS data provided courtesy of Chris Elvidge (NOAA National Geophysical Data Center). Suomi NPP is the result of a partnership between NASA, NOAA, and t

On-Time Flight Performance with GraphFrames for Apache Spark

Feature Image: NASA Goddard Space Flight Center: City Lights of the United States 2012

This is an abridged version of the full blog post On-Time Flight Performance with GraphFrames. You can also reference the webinar GraphFrames: DataFrame-based graphs for Apache Spark and the On-Time Flight Performance with GraphFrames for Apache Spark notebook. An intuitive approach to understanding flight departure delays is to use graph structures. Why Graph? Graph structures offer a more intuitive approach to many classes of data problems: social networks, restaurant recommendations, or flight paths. It is easier to understand these data problems…
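To make the graph idea concrete, here is a minimal sketch in plain Python rather than GraphFrames itself. It treats airports as vertices and flights as directed edges that carry a departure delay. The sample flights, airport codes, and function names are all hypothetical and are not taken from the original notebook.

```python
from collections import defaultdict

# Hypothetical sample data: (origin, destination, departure delay in minutes).
flights = [
    ("SEA", "SFO", 12),
    ("SEA", "SFO", -3),
    ("SFO", "JFK", 45),
    ("SEA", "JFK", 0),
    ("JFK", "SEA", 30),
]


def build_graph(edges):
    """Return an adjacency map: origin -> destination -> list of delays."""
    graph = defaultdict(lambda: defaultdict(list))
    for src, dst, delay in edges:
        graph[src][dst].append(delay)
    return graph


def average_delay_by_route(graph):
    """Average departure delay for each directed edge (route)."""
    return {
        (src, dst): sum(delays) / len(delays)
        for src, dsts in graph.items()
        for dst, delays in dsts.items()
    }


def out_degree(graph):
    """Number of flights departing from each airport (vertex)."""
    return {
        src: sum(len(delays) for delays in dsts.values())
        for src, dsts in graph.items()
    }


graph = build_graph(flights)
route_delays = average_delay_by_route(graph)
departures = out_degree(graph)
```

With the data in this shape, questions such as "which routes out of an airport have the longest average delay?" become lookups over edges, which is the intuition the post builds on with GraphFrames.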


Presentation: Jump Start into Apache® Spark™ and Databricks

These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016. — Apache Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download. You can view the on-demand webinar Jump Start into Apache® Spark™…



Data Exploration with Databricks

Today, it was also featured on InsideBigData: Data Exploration with Databricks. Awesome! This Data Exploration on Databricks jump start video will show you how to go from data source to visualization in a few easy steps. Specifically, we will take semi-structured logs, easily extract and transform them, and analyze and visualize the data using Spark SQL so we can quickly understand our data. For more information and other Spark notebooks, check out Selected Notebooks > Databricks Jump Start.



Apache Spark is the Smartphone of Big Data

Similar to the way the smartphone changed the way we communicate – far beyond its original goal of mobile voice telephony – Apache Spark is revolutionizing Big Data. While portability may have been the catalyst of the mobile revolution, it was the ability to have one device perform multiple tasks very well with the ability to easily build and use a diverse range of applications that are the keys to its ubiquity. Ultimately, with the smartphone we have a general platform that has changed the way we communicate, socialize, work, and play. The smartphone has not only replaced older technologies…


Correlation drowning and Nicholas Cage films

Interested in a career in Data Sciences? Read Freakonomics first!

Over the last few years, a common question I’ve been asked is what it takes to become a data scientist. Often my answers centered on the technology: learn Spark, Python, and/or R; take courses in Data Sciences; play with data sets; etc. Yet, I was never fully satisfied with that answer because I had always felt that the heart of Data Sciences (and Big Data in more generic terms) is the data, or more specifically, the ability to understand the data. Recently, I re-read “Freakonomics: A Rogue Economist Explores the Hidden Side of Everything” and it dawned…



Data Engineering Reading Materials: Spark, Machine Learning, and Distributed Systems Resources

Over the last few weeks, a regular question that I’ve been asked is where to find resources about Spark, Machine Learning, and Distributed Systems. While they seem to be disparate problems, the fact is that as a Data Engineer (or someone in Data Sciences Engineering or a Data Scientist that loves scalability and performance) you need to have your feet wet in all three disciplines to truly excel. Apache Spark Let’s start with Apache Spark (disclosure: I am with Databricks, the company founded by the creators of Apache Spark). I am a big fan of Apache Spark because of its…



Simplify Machine Learning on Spark with Databricks

As many data scientists and engineers can attest, the majority of the time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy, and schedule jobs. More disconcerting is the need for data engineers to re-implement the models developed by data scientists for production. With Databricks, data scientists and engineers can simplify these logistical issues and spend more of their time focusing on their data problems. Simplify Visualization An important perspective for data scientists and engineers is the ability to quickly visualize the data and the model…
