Recently there was a great Delta Lake Stack Overflow question, DeltaTable schema not updating when using `ALTER TABLE ADD COLUMNS`, by wtfzambo. This is a great question because, succinctly, what is happening here is actually expected. To better showcase this, allow me to provide context via the file system. To recreate this exact scenario, please use the Docker image at https://go.delta.io/docker with DELTA_PACKAGE_VERSION set to delta-core_2.12:2.1.0. That is, run the Docker container and follow the PySpark steps: 1. start PySpark, 2. run the basic commands to create a simple table, and 3. run the following command to see the table structure. As…
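As a rough sketch of those steps, the snippet below creates a simple Delta table, adds a column with `ALTER TABLE ADD COLUMNS`, and inspects the schema. It assumes PySpark was launched with the Delta Lake package available (as in the Docker image above); the table path `/tmp/delta/example` and the column `new_col` are made-up names for illustration, not the ones from the original post.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes the Delta Lake package is on the classpath
# (e.g. via the Docker image mentioned above).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a simple Delta table (path and schema are hypothetical).
spark.range(5).write.format("delta").save("/tmp/delta/example")

# Add a column via Spark SQL.
spark.sql("ALTER TABLE delta.`/tmp/delta/example` ADD COLUMNS (new_col STRING)")

# Inspect the table structure, both via a DeltaTable handle
# and via a fresh read of the same path.
DeltaTable.forPath(spark, "/tmp/delta/example").toDF().printSchema()
spark.read.format("delta").load("/tmp/delta/example").printSchema()
```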
Tag: spark
On-Time Flight Performance with GraphFrames for Apache Spark
Feature Image: NASA Goddard Space Flight Center: City Lights of the United States 2012. This is an abridged version of the full blog post On-Time Flight Performance with GraphFrames. You can also reference the webinar GraphFrames: DataFrame-based graphs for Apache Spark and the On-Time Flight Performance with GraphFrames for Apache Spark notebook. An intuitive approach to understanding flight departure delays is to use graph structures. Why Graph? The reason for using graph structures is that they offer a more intuitive approach to many classes of data problems: social networks, restaurant recommendations, or flight paths. It is easier to understand these data problems…
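To make the graph framing concrete, here is a minimal sketch of building a flight graph with GraphFrames: airports as vertices and flights (with their departure delays) as edges. It assumes the graphframes package is available on the cluster; the column names and sample rows are illustrative, not the schema used in the full notebook.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# Vertices: one row per airport (GraphFrames expects an "id" column).
airports = spark.createDataFrame(
    [("SEA", "Seattle"), ("SFO", "San Francisco"), ("JFK", "New York")],
    ["id", "city"],
)

# Edges: one row per flight, with the departure delay in minutes
# (GraphFrames expects "src" and "dst" columns).
flights = spark.createDataFrame(
    [("SEA", "SFO", 5), ("SFO", "JFK", 42), ("JFK", "SEA", 0)],
    ["src", "dst", "delay"],
)

g = GraphFrame(airports, flights)

# Example query: average departure delay per origin airport.
g.edges.groupBy("src").avg("delay").show()
```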
Presentation: Jump Start into Apache® Spark™ and Databricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016. Apache Spark is a fast, easy-to-use, unified engine that allows you to solve many Data Science and Big Data (and many not-so-Big Data) scenarios. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download. You can view the on-demand webinar Jump Start into Apache® Spark™…
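As a tiny illustration of that unified engine, the sketch below uses a single SparkSession for both a SQL query and an MLlib model fit; the toy data and column names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame used by both the SQL and ML examples.
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["x", "y"])
df.createOrReplaceTempView("points")

# Spark SQL over the data...
spark.sql("SELECT COUNT(*) AS n, AVG(y) AS avg_y FROM points").show()

# ...and MLlib on the same DataFrame, in the same session.
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients, model.intercept)
```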
Data Exploration with Databricks
Today, it was also featured on InsideBigData: Data Exploration with Databricks. Awesome! This Data Exploration on Databricks jump start video will show you how to go from data source to visualization in a few easy steps. Specifically, we will take semi-structured logs, easily extract and transform them, and analyze and visualize the data using Spark SQL, so we can quickly understand our data. For more information and other Spark notebooks, check out Selected Notebooks > Databricks Jump Start.
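A compact sketch of that flow, assuming Apache-style access logs: parse the semi-structured lines into a DataFrame, register a temp view, and query it with Spark SQL. The log path, regex, and field names here are assumptions for illustration, not the ones used in the video.

```python
import re
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Common log format: host ident user [timestamp] "METHOD path proto" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)')

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return Row(host=m.group(1), ts=m.group(2), method=m.group(3),
               path=m.group(4), status=int(m.group(5)), bytes=int(m.group(6)))

# Extract and transform the semi-structured logs (path is hypothetical).
raw = spark.sparkContext.textFile("/tmp/access.log")
logs = raw.map(parse_line).filter(lambda r: r is not None).toDF()
logs.createOrReplaceTempView("logs")

# Analyze with Spark SQL, e.g. hits per response code.
spark.sql(
    "SELECT status, COUNT(*) AS hits FROM logs GROUP BY status ORDER BY hits DESC"
).show()
```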
Apache Spark is the Smartphone of Big Data
Similar to the way the smartphone changed the way we communicate – far beyond its original goal of mobile voice telephony – Apache Spark is revolutionizing Big Data. While portability may have been the catalyst of the mobile revolution, it was the ability of one device to perform multiple tasks very well, combined with the ease of building and using a diverse range of applications, that was the key to its ubiquity. Ultimately, with the smartphone we have a general platform that has changed the way we communicate, socialize, work, and play. The smartphone has not only replaced older technologies…
Simplify Machine Learning on Spark with Databricks
As many data scientists and engineers can attest, the majority of the time is spent not on the models themselves but on the supporting infrastructure. Key issues include the ability to easily visualize, share, deploy, and schedule jobs. More disconcerting is the need for data engineers to re-implement the models developed by data scientists for production. With Databricks, data scientists and engineers can simplify these logistical issues and spend more of their time focusing on their data problems. Simplify Visualization An important perspective for data scientists and engineers is the ability to quickly visualize the data and the model…
Spark Ecosystem & Spark Streaming Fundamentals
This is a re-post of the Spark Ecosystem & Spark Streaming Fundamentals post on the Concur blog. In March 2015, the Seattle Spark Meetup group convened for our monthly meeting. It was a special meeting since it was actually a joint Seattle Spark Meetup and Pacific Northwest Cloudera User Group session held at the Concur Technologies headquarters in Bellevue. We had special guest speakers, Mike Olson (@mikeolson) and Hari Shreedharan (@harisr1234), whose presentations discussed the basics of Spark and Spark Streaming as well as the Spark ecosystem, specifically Cloudera’s investments into it. For those who missed it, we’re sharing Mike and…
Spark atop Mesos on Google Cloud Platform querying Google Cloud Storage
A great reason to jump into Spark on Mesos on Google Cloud Platform is that you can quickly spin up a development environment to work with Spark, Mesos, Google Cloud, and Marathon together. A great way to set this up is to follow the steps in Paco Nathan’s (@pacoid) blog post Spark atop Mesos on Google Cloud Platform. But what’s missing from this configuration is the ability to connect to Google Cloud Storage (GCS) so you can run your Spark queries against persistent, elastic storage. As noted in the diagram below, you will first install Spark…
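For reference, connecting Spark to GCS typically means putting the GCS Hadoop connector jar on the classpath and pointing it at your project and credentials. The sketch below shows one way this can look; the jar path, project id, keyfile, and bucket name are placeholders, and the exact property names depend on the connector version you install, so check the connector documentation for your setup.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # GCS Hadoop connector jar (location is a placeholder).
    .config("spark.jars", "/opt/jars/gcs-connector-hadoop2-latest.jar")
    # Register the gs:// filesystem implementation.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # Project and service-account credentials (values are placeholders).
    .config("spark.hadoop.fs.gs.project.id", "my-gcp-project")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/opt/keys/gcs-service-account.json")
    .getOrCreate()
)

# Once configured, gs:// paths behave like any other Hadoop filesystem.
df = spark.read.text("gs://my-bucket/path/to/data/*")
df.show(5)
```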