Recently there was a great Delta Lake Stack Overflow question, DeltaTable schema not updating when using `ALTER TABLE ADD COLUMNS`, by wtfzambo. The problem is as follows:
ADD COLUMN to a Delta Lake table; the column is not visible in the Spark DataFrame
- Make a change to the schema of a Delta Lake table (e.g., add a new column).
- The change can be seen (i.e., you can see the new column) in the Delta Lake table history (e.g., when executing `DESCRIBE TABLE delta.mytable`).
- The column cannot be seen when querying the Spark DataFrame that reads this table.
- But after an `UPDATE` statement (e.g., `UPDATE delta.mytable SET new_column = 'blah'`), the column can now be seen by the Spark DataFrame.
This is happening as expected because:
- When you execute the `ALTER TABLE` command, the schema changes and the column is registered within the Delta transaction log.
- At this point, however, this is a metadata-only change. Therefore, the Parquet (data) files that comprise the table do not contain the column.
- The column would not be visible when querying the table through Spark (or any other API) because it did not yet exist in the data files.
- Upon executing the `UPDATE` command, the change is made physically to the table.
- Now that the column exists in the data files, it is visible when querying the table.

Let’s see this process in action
To showcase this better, let’s look at the code, transaction log, and underlying file system. To recreate this exact scenario, please use the Docker image at https://go.delta.io/docker with the DELTA_PACKAGE_VERSION set to delta-core_2.12:2.1.0. That is, run the Docker container and use the following steps:
1. To start PySpark, run the command:
$SPARK_HOME/bin/pyspark --packages io.delta:delta-core_2.12:2.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
2. Run the basic commands to create a simple table
# Create Spark DataFrame
data = spark.range(0, 5)
# Write Delta table
data.write.format("delta").save("/tmp/delta-table")
# Read Delta table
df = spark.read.format("delta").load("/tmp/delta-table")
# Show Delta table
df.show()
3. Run the following command to see the table structure
DESCRIBE TABLE delta.`/tmp/delta-table/`
+---------------+---------+-------+
| col_name|data_type|comment|
+---------------+---------+-------+
| id| bigint| |
| | | |
| # Partitioning| | |
|Not partitioned| | |
+---------------+---------+-------+
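Since we are in the PySpark shell, SQL statements like the one above are run through spark.sql. A minimal sketch of the same check from Python, assuming the table path created in step 2:
# Run DESCRIBE TABLE from PySpark; the backticks are required for path-based Delta tables
spark.sql("DESCRIBE TABLE delta.`/tmp/delta-table/`").show(truncate=False)
# You can also inspect the schema of the DataFrame loaded in step 2
df.printSchema()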
Run the following command to list the files that make up the table …
NBuser@5b0edf0b8779:/tmp/delta-table$ ls -lsgA
total 44
4 drwxr-xr-x 2 NBuser 4096 Mar 12 23:50 _delta_log
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet.crc
and the underlying metadata within the Delta log.
NBuser@5b0edf0b8779:/tmp/delta-table/_delta_log$ ls -lsgA
total 16
4 -rw-r--r-- 1 NBuser 2082 Mar 12 23:29 00000000000000000000.json
4 -rw-r--r-- 1 NBuser 28 Mar 12 23:29 .00000000000000000000.json.crc
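If you want to peek inside that first commit file, each line of it is a separate JSON "action" (e.g., protocol, metaData, add, commitInfo). A minimal sketch that reads it back with Spark, using the file name from the listing above:
# Each line of a Delta commit file is one JSON action
log = spark.read.json("/tmp/delta-table/_delta_log/00000000000000000000.json")
# The inferred top-level columns correspond to the action types (add, commitInfo, metaData, protocol)
log.printSchema()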
4. Now, let’s run the ALTER command
-- Run Alter command
ALTER TABLE delta.`/tmp/delta-table/` ADD COLUMNS (blah string)
-- Describe table
DESCRIBE TABLE delta.`/tmp/delta-table/`
+---------------+---------+-------+
| col_name|data_type|comment|
+---------------+---------+-------+
| id| bigint| |
| blah| string| |
| | | |
| # Partitioning| | |
|Not partitioned| | |
+---------------+---------+-------+
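At this point, the DataFrame loaded in step 2 still reports only the id column, because it captured the table schema when it was created. You can also confirm that the underlying Parquet files have not changed by reading them directly, bypassing the Delta log; this is a sketch for illustration only, not how you should query a Delta table:
# The DataFrame loaded before the ALTER still carries the old schema
df.printSchema()
# Reading the raw Parquet files (not the Delta table) shows the data files still contain only `id`
spark.read.parquet("/tmp/delta-table").printSchema()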
But, if you were to run `ls -lsgA` for the table directory, note that the data files look exactly the same. There are no changes to the data, only to the metadata in the /tmp/delta-table/_delta_log folder.
NBuser@5b0edf0b8779:/tmp/delta-table/_delta_log$ ls -lsgA
total 16
4 -rw-r--r-- 1 NBuser 2082 Mar 12 23:29 00000000000000000000.json
4 -rw-r--r-- 1 NBuser 28 Mar 12 23:29 .00000000000000000000.json.crc
4 -rw-r--r-- 1 NBuser 752 Mar 12 23:38 00000000000000000001.json
4 -rw-r--r-- 1 NBuser 16 Mar 12 23:38 .00000000000000000001.json.crc
Note the 00000000000000000001.json file, which contains the transaction that corresponds to your `ALTER TABLE` command. If you were to read this .json file, its contents would be similar to the following snippet:
{"metaData": {
"id":"d583238c-87ab-4de0-a09d-141ef499371d",
"format":{"provider":"parquet","options":{}},
"schemaString":"
{\"type\":\"struct\",
\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"
,"partitionColumns":[],"configuration":{},"createdTime":1678663791967}}
{"commitInfo":{"timestamp":1678664321014,
"operation":"ADD COLUMNS",
"operationParameters":
{"columns":"[
{\"column\":
{\"name\":\"blah\",
\"type\":\"string\",
\"nullable\":true,
\"metadata\":{}
}
}
]"
},
"readVersion":0,
"isolationLevel":"Serializable",
"isBlindAppend":true,
"operationMetrics":{},
"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0",
"txnId":"b54db68d-652b-4930-82d5-61a542d82100"}
}
Notice the `schemaString` -> fields, which include the `blah` column, and notice that `operation` points to an `ADD COLUMNS` command, which also includes the `blah` column.
So the key point is that while the transaction log contains the `blah` column being added, there are no changes to the .parquet files in the root table directory, meaning that the change was reflected in the metadata but not the data. Until you run the `UPDATE` command, the changes are not reflected in the .parquet (i.e., data) files. And in the case of the Spark DataFrame, it can only pick up the new column once it is associated with the data.
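To finish the walkthrough, run the `UPDATE` yourself and re-read the table. A minimal sketch, where the value 'foo' is just an example:
# Update every row; with no WHERE clause, Delta rewrites the data files with the new column populated
spark.sql("UPDATE delta.`/tmp/delta-table/` SET blah = 'foo'")
# Re-reading the table now shows the blah column in the DataFrame
spark.read.format("delta").load("/tmp/delta-table").show()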
Run the `ls -lsgA` command in the table folder again and you will see more files, because the `UPDATE` rewrote the data files to include the new column.
Join the community
To learn more about Delta Lake, join us on Slack, LinkedIn, and Google Groups!