Recently there was a great Delta Lake Stack Overflow question, *DeltaTable schema not updating when using `ALTER TABLE ADD COLUMNS`*, by wtfzambo. It is a great question because, succinctly:
- You make a change to the schema of a Delta Lake table (e.g., add a new column).
- The change can be seen (i.e., you can see the new column) in the Delta Lake table history (e.g., when running `DESCRIBE TABLE delta.mytable`).
- When querying the Spark DataFrame that reads this table, the column is not there.
- But after running an `UPDATE` statement (e.g., `UPDATE delta.mytable SET new_column = 'blah'`), the column appears.
What is happening here is actually as expected:
- When you ran the `ALTER TABLE` statement, the schema did in fact change and was registered within the Delta transaction log, but up to this point it was only a metadata change.
- Once you ran the `UPDATE` statement, the change was registered in the data itself, i.e., the Parquet files within the table now reflect the new column.
- When you queried the DataFrame, the column was missing because a Spark DataFrame only sees the data and metadata together, and the `ALTER TABLE` statement was a metadata-only change.
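As a quick check of the first point, you can confirm that the `ALTER TABLE` transaction was recorded by inspecting the table history through the DeltaTable Python API (a minimal sketch, assuming the /tmp/delta-table path used in the steps below and an active `spark` session):

from delta.tables import DeltaTable

# Load the Delta table by path and show its commit history;
# the ADD COLUMNS operation appears as its own version
deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.history().select("version", "operation").show()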
To better showcase this, allow me to provide context via the file system. To recreate this exact scenario, use the Delta Lake Docker image at https://go.delta.io/docker with the DELTA_PACKAGE_VERSION set to delta-core_2.12:2.1.0. That is, run the Docker container and follow the PySpark steps:
1. To start PySpark, run the command:
$SPARK_HOME/bin/pyspark --packages io.delta:delta-core_2.12:2.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
2. Run the basic commands to create a simple table:
# Create Spark DataFrame
data = spark.range(0, 5)
# Write Delta table
data.write.format("delta").save("/tmp/delta-table")
# Read Delta table
df = spark.read.format("delta").load("/tmp/delta-table")
# Show Delta table
df.show()
3. Run the following command to see the table structure:
DESCRIBE TABLE delta.`/tmp/delta-table/`
+---------------+---------+-------+
| col_name|data_type|comment|
+---------------+---------+-------+
| id| bigint| |
| | | |
| # Partitioning| | |
|Not partitioned| | |
+---------------+---------+-------+
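Note that if you are following along inside the PySpark shell rather than a SQL session, you can issue the same statement through spark.sql (a small sketch of the equivalent call):

# Run the DESCRIBE statement via the SQL API and print the full output
spark.sql("DESCRIBE TABLE delta.`/tmp/delta-table/`").show(truncate=False)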
In addition, if you list the files of the table:
NBuser@5b0edf0b8779:/tmp/delta-table$ ls -lsgA
total 44
4 drwxr-xr-x 2 NBuser 4096 Mar 12 23:50 _delta_log
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00000-4abcc1fa-b2c8-441a-a392-8dab57edd819-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00001-6327358c-8c00-4ad6-9e3d-263f0ea66e3f-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00002-eea1d287-be68-4a62-874d-ab4e39c6a825-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00003-c79b4180-5968-4fee-8181-6752d9cfb333-c000.snappy.parquet.crc
4 -rw-r--r-- 1 NBuser 478 Mar 12 23:29 part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet
4 -rw-r--r-- 1 NBuser 12 Mar 12 23:29 .part-00004-c3399acd-75ca-4ea5-85f9-03fa60897161-c000.snappy.parquet.crc
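If you prefer to stay inside the PySpark session instead of switching to a shell, a quick sketch using Python's standard library produces a similar listing:

import os

# List the data files and the _delta_log directory of the table
for name in sorted(os.listdir("/tmp/delta-table")):
    print(name)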
4. Now, let’s run the ALTER command:
-- Run Alter command
ALTER TABLE delta.`/tmp/delta-table/` ADD COLUMNS (blah string)
-- Describe table
DESCRIBE TABLE delta.`/tmp/delta-table/`
+---------------+---------+-------+
| col_name|data_type|comment|
+---------------+---------+-------+
| id| bigint| |
| blah| string| |
| | | |
| # Partitioning| | |
|Not partitioned| | |
+---------------+---------+-------+
But if you were to run ls -lsgA on the table directory again, note that the data files look exactly the same. That is, there are no changes to the data; only the metadata in the /tmp/delta-table/_delta_log
folder has changed.
NBuser@5b0edf0b8779:/tmp/delta-table/_delta_log$ ls -lsgA
total 16
4 -rw-r--r-- 1 NBuser 2082 Mar 12 23:29 00000000000000000000.json
4 -rw-r--r-- 1 NBuser 28 Mar 12 23:29 .00000000000000000000.json.crc
4 -rw-r--r-- 1 NBuser 752 Mar 12 23:38 00000000000000000001.json
4 -rw-r--r-- 1 NBuser 16 Mar 12 23:38 .00000000000000000001.json.crc
Note the 00000000000000000001.json
file, which contains the transaction corresponding to your ALTER TABLE
command. If you were to read this .json
file, it would look like this:
{"metaData": {
"id":"d583238c-87ab-4de0-a09d-141ef499371d",
"format":{"provider":"parquet","options":{}},
"schemaString":"
{\"type\":\"struct\",
\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},
{\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}"
,"partitionColumns":[],"configuration":{},"createdTime":1678663791967}}
{"commitInfo":{"timestamp":1678664321014,
"operation":"ADD COLUMNS","operationParameters":{"columns":"[{\"column\":{\"name\":\"blah\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}}]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.3.1 Delta-Lake/2.1.0","txnId":"b54db68d-652b-4930-82d5-61a542d82100"}}
Notice the fields within the schemaString, which now include the blah
column, and notice the operation
, which records the ADD COLUMNS
command that likewise references the blah
column.
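To inspect this commit programmatically instead of reading the raw file, here is a minimal sketch using Python's json module (assuming the same table path as above):

import json

# Each line of a Delta commit file is a separate JSON action
with open("/tmp/delta-table/_delta_log/00000000000000000001.json") as f:
    for line in f:
        action = json.loads(line)
        if "metaData" in action:
            print("schemaString:", action["metaData"]["schemaString"])
        if "commitInfo" in action:
            print("operation:", action["commitInfo"]["operation"])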
So the key point here is that while the transaction log records the
blah
column being added, there are no changes to the .parquet files in the root table directory, meaning the change was reflected in the metadata but not the data.
Until you ran the UPDATE
command, the change was not reflected in the .parquet files (i.e., the data files). And in the case of the Spark DataFrame, it could only pick up the new schema once the schema was associated with the data.
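To see this for yourself, you can finish the walkthrough with the UPDATE step (a sketch; the exact file names written will differ on your machine):

# Rewrite the data files so they contain the new column's values
spark.sql("UPDATE delta.`/tmp/delta-table/` SET blah = 'blah'")

# Re-reading the table now shows the blah column in the data itself
spark.read.format("delta").load("/tmp/delta-table").show()

After the UPDATE, a fresh ls -lsgA of /tmp/delta-table should show newly written .parquet files alongside the originals, and the _delta_log folder should contain a new commit (00000000000000000002.json) recording the UPDATE operation.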