Instant data replication with Databricks table cloning

The simplest use case to explain why table cloning is helpful is this: Let’s say you have a large table, and you want to test some new process on it, but you don’t want to ruin the data for other processes, so you need a clean copy of your table (or multiple tables) to play with. Copying a large table might take time (Databricks does it very fast, but a big table still takes time to copy), and what happens if you then need to change your code? You have to drop the target table, copy the source table again, and so on.

Here is where cloning can be your friend. There are three types of cloning:

  • Deep Clone
  • Shallow Clone
  • Incremental Clone

Before we dive into the types of clones and how they work, a short reminder on how Delta tables are built.

Delta tables consist of two parts: the actual data, residing in Parquet files in the main table folder, and the transaction log, residing in JSON files in the _delta_log folder under the main table folder. The transaction log files contain versioning information about changes to the table, and can be used to access previous states of the table (which is called time travel) or to restore the table to a previous version.

We can view the history entries using the “describe history table_name” command. In the next parts, we’ll see how we can use this versioning capability to help us easily clone tables.
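
For example, assuming a table named base_table (a name used here just for illustration), we can list its versions and then time travel to an older one:

describe history base_table
select * from base_table version as of 3  -- read version 3, assuming that version exists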

Deep Clone

Deep clone is the easiest to explain since it just copies the entire table. It creates a new table with exactly the same structure (columns, data types, partitioning) and copies all data files to the new table folder.

The syntax to deep clone a table (the deep keyword is optional, since this is the default):

create table deep_clone deep clone base_table

The advantage of a deep clone is that the new copy is not dependent on the source table. The disadvantages are that copying a large source table might take some time, and that the data is stored twice.
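
Note that a clone can also be taken from a specific point in the source table’s history, using the same time travel syntax. A minimal sketch, assuming the source has a version 3 (the target name deep_clone_v3 is hypothetical):

create table deep_clone_v3 deep clone base_table version as of 3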

Shallow Clone

If I want to create a new table in an instant, I can use shallow clone. With this type of cloning, a new table is created with the same metadata as the old table, but no data files are copied. Instead, when you query the new table, Databricks knows to read the data files from the source table. When you change the new table, by inserting, updating, or even deleting data, new versions are created in the new table folder.

Shallow clone create syntax:

create table shallow_clone shallow clone base_table

Shallow clones are really fast since they don’t actually copy any data. The disadvantage here is that they are dependent on the data files of the old table. Deleting data files with the VACUUM command (which removes files older than X days) or dropping the source table will result in an error message when trying to read the new table. So shallow clone is great for testing, when you need the data fast but not for a long time.
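
A quick sketch of that failure mode, continuing with the tables from above (for a managed table, dropping it also deletes its data files; the exact error text varies):

drop table base_table
select count(*) from shallow_clone  -- fails, since the underlying data files are gone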

Before we go on, I want to clarify one more thing about cloning, whether deep or shallow. Changes made to the source table will not break our new tables, but they will also not be reflected in them. So if, for example, I insert new rows into the source table after making a clone, they will not appear in my clone, which will continue to show the number of rows as of clone time.

The reason is that inserting (and other changes) creates new versions, and the cloned table uses the old version (for a shallow clone) or a copy of the old version (for a deep clone).
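
A small sketch to demonstrate, assuming base_table has a single integer column id (a hypothetical schema):

insert into base_table values (999)
select count(*) from base_table     -- includes the new row
select count(*) from shallow_clone  -- unchanged, still the count at clone time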

Incremental Clone

This is actually a sub-type of deep clone, but it’s very useful, and I wasn’t aware of the possibility until a few days ago.

If we try to clone a table that already exists, we’ll get an error message:

create table deep_clone deep clone base_table
[DELTA_TABLE_ALREADY_EXISTS] Table `spark_catalog`.`demo_delta`.`deep_clone` already exists. SQLSTATE: 42P07

But if we run replace table instead…

replace table deep_clone deep clone base_table

Databricks is smart enough to look at deep_clone’s version history, see that it is a clone of base_table, and copy only the changes that occurred on base_table since the last clone. This is, of course, much faster and requires less compute.
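
Equivalently, you can use create or replace table, which works whether or not the target already exists:

create or replace table deep_clone deep clone base_table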

Note: If you make schema changes to the cloned table, for example adding a new column, they will be overwritten by the second clone command.
