Why does altering a Delta Lake table schema not show up in the Spark DataFrame?

Recently there was a great Delta Lake Stack Overflow question, DeltaTable schema not updating when using `ALTER TABLE ADD COLUMNS`, by wtfzambo. It succinctly captures a common point of confusion:

  • Make a change to the schema of a Delta Lake table (e.g., add a new column).
  • The change can be seen (i.e., you can see the new column) in the Delta Lake table metadata (e.g., when running DESCRIBE TABLE delta.`mytable`).
  • But when querying the Spark DataFrame that reads this table, the column is not there.
  • Only after running an UPDATE statement (e.g., UPDATE delta.`mytable` SET new_column = 'blah') does the column appear.

What is happening here is actually expected behavior:

  • When you ran the ALTER TABLE statement, the schema did in fact change and the change was registered within the Delta transaction log, but up to this point it was a metadata-only change.
  • Once you ran the UPDATE statement, the change was written out to new Parquet files, so the data itself now reflected it.
  • When you queried the DataFrame, the new column was missing because a Spark DataFrame sees the data and metadata together as of when it was loaded, and the ALTER TABLE statement had only changed the metadata.
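
To make the last point concrete, here is a minimal PySpark sketch (assuming a Spark session configured with the Delta Lake extensions, as in step 1 below; the path and column name mirror the walkthrough that follows). A DataFrame holds on to the schema it was loaded with, so it will not pick up the ALTER TABLE change on its own:

# DataFrame created before the schema change
df = spark.read.format("delta").load("/tmp/delta-table")

# Metadata-only change, registered in the Delta transaction log
spark.sql("ALTER TABLE delta.`/tmp/delta-table` ADD COLUMNS (blah string)")

# The existing DataFrame still reports the original schema (only `id`);
# you need to load the table again to see the new column
df.printSchema()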

To better showcase this, allow me to provide context via the file system. To recreate this exact scenario, use the Delta Lake Docker image at https://go.delta.io/docker with DELTA_PACKAGE_VERSION set to delta-core_2.12:2.1.0. That is, run the Docker container and follow these PySpark steps:

1. To start PySpark, run the command:

$SPARK_HOME/bin/pyspark --packages io.delta:delta-core_2.12:2.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

2. Run the basic commands to create a simple table

# Create Spark DataFrame
data = spark.range(0, 5)

# Write Delta table
data.write.format("delta").save("/tmp/delta-table")

# Read Delta table
df = spark.read.format("delta").load("/tmp/delta-table")

# Show Delta table
df.show()
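
You can also confirm the schema the DataFrame sees with printSchema(); at this point it contains just the id column (the expected output is shown as a comment):

# Print the DataFrame schema
df.printSchema()

# root
#  |-- id: long (nullable = true)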

3. Run the following SQL command to see the table structure (in the PySpark shell, it can be executed via spark.sql(), as shown below the output)

DESCRIBE TABLE delta.`/tmp/delta-table/`

+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+
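
Since we are in a PySpark shell, SQL statements such as the DESCRIBE above (and the ALTER TABLE and UPDATE statements later) can be executed via spark.sql(); for example:

# Run the DESCRIBE statement from PySpark and display the result
spark.sql("DESCRIBE TABLE delta.`/tmp/delta-table/`").show()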

As well, if you list the files of the table:

NBuser@5b0edf0b8779:/tmp/delta-table$ ls -lsgA
total 44
4 drwxr-xr-x 2 NBuser 4096 Mar 12 23:50 _delta_log
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser  478 Mar 12 23:29 part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser   12 Mar 12 23:29 .part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet.crc

4. Now, let’s run the ALTER command

-- Run Alter command
ALTER TABLE delta.`/tmp/delta-table/` ADD COLUMNS (blah string)

-- Describe table
DESCRIBE TABLE delta.`/tmp/delta-table/`

+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|             id|   bigint|       |
|           blah|   string|       |
|               |         |       |
| # Partitioning|         |       |
|Not partitioned|         |       |
+---------------+---------+-------+

But if you were to run ls -lsgA against the table directory again, note that the files look exactly the same. That is, there are no changes to the data; the only change is new metadata in the /tmp/delta-table/_delta_log folder.

NBuser@5b0edf0b8779:/tmp/delta-table/_delta_log$ ls -lsgA
total 16
4 -rw-r--r-- 1 NBuser 2082 Mar 12 23:29 00000000000000000000.json
4 -rw-r--r-- 1 NBuser   28 Mar 12 23:29 .00000000000000000000.json.crc
4 -rw-r--r-- 1 NBuser  752 Mar 12 23:38 00000000000000000001.json
4 -rw-r--r-- 1 NBuser   16 Mar 12 23:38 .00000000000000000001.json.crc

Note the 00000000000000000001.json file, which contains the transaction corresponding to your ALTER TABLE command. If you were to read this .json file, it would look like this:

{"metaData": {
    "id":"d583238c-87ab-4de0-a09d-141ef499371d",
    "format":{"provider":"parquet","options":{}},
    "schemaString":"
        {\"type\":\"struct\",
         \"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
                     {\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"
                     ,"partitionColumns":[],"configuration":{},"createdTime":1678663791967}}
    {"commitInfo":{"timestamp":1678664321014,
        "operation":"ADD COLUMNS","operationParameters":{"columns":"[{\"column\":{\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}}]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0","txnId":"b54db68d-652b-4930-82d5-61a542d82100"}}

Notice the schemaString field, whose fields now include the blah column, and the operation field in commitInfo, which points to an ADD COLUMNS command that also includes the blah column.

So the key point here is that while the transaction log records the blah column being added, the root table directory has no changes to its .parquet files, meaning the change was reflected in the metadata but not in the data.

Until you run the UPDATE command, the change is not reflected in the .parquet files (i.e., the data files). And the Spark DataFrame can only pull the schema together with the data it is associated with.
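
To close the loop on the original question, here is a sketch of that final step, using the example column from above. Once the UPDATE rewrites the Parquet files, the data and metadata line up, and the new column shows up when you read the table:

# Rewrite the data files so they reflect the new schema
spark.sql("UPDATE delta.`/tmp/delta-table/` SET blah = 'blah'")

# A fresh read now shows the blah column
spark.read.format("delta").load("/tmp/delta-table").show()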
