Why does altering a Delta Lake table schema not show up in the Spark DataFrame?

Recently there was a great Delta Lake Stack Overflow question, DeltaTable schema not updating when using `ALTER TABLE ADD COLUMNS`, by wtfzambo. It succinctly captures a common point of confusion:

  • Make a change to the schema of a Delta Lake table (e.g., add a new column).
  • The change can be seen (i.e., you can see the new column) in the Delta Lake table metadata (e.g., when running DESCRIBE TABLE delta.`mytable`).
  • But when querying the Spark DataFrame that reads this table, the column is not there.
  • Only after running an UPDATE statement (e.g., UPDATE delta.`mytable` SET new_column = 'blah') does the column appear.

What is happening here is actually expected behavior:

  • When you ran the ALTER TABLE statement, the schema did in fact change and the change was registered within the Delta transaction log, but up to this point it was a metadata-only change.
  • Once you ran the UPDATE statement, the change was written out to new Parquet files, so the data itself now reflected it.
  • When you queried the DataFrame, the new column was missing because a Spark DataFrame sees the data and metadata together as of when it was loaded, and the ALTER TABLE statement had only changed the metadata.
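
To make the last point concrete, here is a minimal PySpark sketch (assuming a Spark session configured with the Delta Lake extensions, as in step 1 below; the path and column name mirror the walkthrough that follows). A DataFrame holds on to the schema it was loaded with, so it will not pick up the ALTER TABLE change on its own:

# DataFrame created before the schema change
df = spark.read.format("delta").load("/tmp/delta-table")

# Metadata-only change, registered in the Delta transaction log
spark.sql("ALTER TABLE delta.`/tmp/delta-table` ADD COLUMNS (blah string)")

# The existing DataFrame still reports the original schema (only `id`);
# you need to load the table again to see the new column
df.printSchema()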

To better showcase this, allow me to provide context via the file system. To recreate this exact scenario, use the Delta Lake Docker image at https://go.delta.io/docker with DELTA_PACKAGE_VERSION set to delta-core_2.12:2.1.0. That is, run the Docker container and follow these PySpark steps:

1. To start PySpark, run the command:

$SPARK_HOME/bin/pyspark --packages io.delta:delta-core_2.12:2.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

2. Run the basic commands to create a simple table

# Create Spark DataFrame
data = spark.range(0, 5)

# Write Delta table
data.write.format("delta").save("/tmp/delta-table")

# Read Delta table
df = spark.read.format("delta").load("/tmp/delta-table")

# Show Delta table
df.show()
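
You can also confirm the schema the DataFrame sees with printSchema(); at this point it contains just the id column (the expected output is shown as a comment):

# Print the DataFrame schema
df.printSchema()

# root
#  |-- id: long (nullable = true)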

3. Run the following SQL command to see the table structure (in the PySpark shell, it can be executed via spark.sql(), as shown below the output)

DESCRIBE TABLE delta.`/tmp/delta-table/`

+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+
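
Since we are in a PySpark shell, SQL statements such as the DESCRIBE above (and the ALTER TABLE and UPDATE statements later) can be executed via spark.sql(); for example:

# Run the DESCRIBE statement from PySpark and display the result
spark.sql("DESCRIBE TABLE delta.`/tmp/delta-table/`").show()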

As well, if you list the files of the table:

NBuser@5b0edf0b8779:/tmp/delta-table$ ls -lsgA
total 44
4 drwxr-xr-x 2 NBuser 4096 Mar 12 23:50 _delta_log
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet.crc

4. Now, let’s run the ALTER command

-- Run Alter command
ALTER TABLE delta.`/tmp/delta-table/` ADD COLUMNS (blah string)

-- Describe table
DESCRIBE TABLE delta.`/tmp/delta-table/`

+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|           blah|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+

But if you were to run ls -lsgA against the table directory again, note that the files look exactly the same. That is, there are no changes to the data; the only change is new metadata in the /tmp/delta-table/_delta_log folder.

NBuser@5b0edf0b8779:/tmp/delta-table/_delta_log$ ls -lsgA
total 16
4 -rw-r--r-- 1 NBuser 2082 Mar 12 23:29 00000000000000000000.json
4 -rw-r--r-- 1 NBuser   28 Mar 12 23:29 .00000000000000000000.json.crc
4 -rw-r--r-- 1 NBuser  752 Mar 12 23:38 00000000000000000001.json
4 -rw-r--r-- 1 NBuser   16 Mar 12 23:38 .00000000000000000001.json.crc

Note the 00000000000000000001.json file, which contains the transaction corresponding to your ALTER TABLE command. If you were to read this .json file, it would look like this:

{"metaData": {
    "id":"d583238c-87ab-4de0-a09d-141ef499371d",
    "format":{"provider":"parquet","options":{}},
    "schemaString":"
        {\"type\":\"struct\",
         \"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
                     {\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"
                     ,"partitionColumns":[],"configuration":{},"createdTime":1678663791967}}
    {"commitInfo":{"timestamp":1678664321014,
        "operation":"ADD COLUMNS","operationParameters":{"columns":"[{\"column\":{\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}}]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0","txnId":"b54db68d-652b-4930-82d5-61a542d82100"}}

Notice the schemaString field, whose fields now include the blah column, and the operation field in commitInfo, which points to an ADD COLUMNS command that also includes the blah column.

So the key point here is that while the transaction log records the blah column being added, the root table directory has no changes to its .parquet files, meaning the change was reflected in the metadata but not in the data.

Until you run the UPDATE command, the change is not reflected in the .parquet files (i.e., the data files). And the Spark DataFrame can only pull the schema together with the data it is associated with.
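
To close the loop on the original question, here is a sketch of that final step, using the example column from above. Once the UPDATE rewrites the Parquet files, the data and metadata line up, and the new column shows up when you read the table:

# Rewrite the data files so they reflect the new schema
spark.sql("UPDATE delta.`/tmp/delta-table/` SET blah = 'blah'")

# A fresh read now shows the blah column
spark.read.format("delta").load("/tmp/delta-table").show()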
