Lakehouse spring cleaning – Vacuum your Delta tables
Spring is traditionally the time for the yearly house cleaning. In Jewish tradition, we clean to remove the “Hametz” (leavened food) from the house before the Passover holiday. In other cultures, after a long, cold winter indoors, it is time to open the windows and let the warmer spring air in.
So maybe this is a good time to remember that your data lakehouse also needs cleaning. Here are some questions and answers about the vacuum command on Databricks.
Why should I vacuum?
Every change to a Delta table (insert, update, delete) writes new data files, and the old files are kept so you can “time travel”, i.e. query or restore older versions of your table. This is a wonderful feature, but over time these files accumulate in your storage and increase your storage costs.
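To see what those older versions look like before they are vacuumed away, you can inspect the table history and time travel to a previous version. A minimal sketch, assuming a table named my_catalog.my_schema.my_table and using version 5 purely as an example:

# Minimal sketch, assuming my_catalog.my_schema.my_table is an existing Delta table.
# DESCRIBE HISTORY lists the versions that time travel (and vacuum) operate on.
history_df = spark.sql("DESCRIBE HISTORY my_catalog.my_schema.my_table")
history_df.select("version", "timestamp", "operation").show()

# Query the table as it looked at an older version (version 5 is just an example).
old_df = spark.sql("SELECT * FROM my_catalog.my_schema.my_table VERSION AS OF 5")

# Or restore the table to that older version.
spark.sql("RESTORE TABLE my_catalog.my_schema.my_table TO VERSION AS OF 5")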
How does the Vacuum command work?
The vacuum command removes data files that are no longer referenced by the current version of the table and are older than the retention period you specify.
Examples:
VACUUM my_table RETAIN 168 HOURS
Runs vacuum on my_table. Unreferenced files older than 7 days (168 hours) will be deleted.
VACUUM my_table RETAIN 168 HOURS DRY RUN
This command does not delete anything. It checks which files would be deleted (based on the retention period you specified) and prints the first 1,000 of them.
If successful, a regular (non-dry-run) vacuum returns only the path of the cleaned table folder. It does not return any other details about the operation.
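If you run the command from Python, the DRY RUN variant should return those file paths as a DataFrame, so you can count and inspect them before deleting anything. A minimal sketch, assuming a table named my_catalog.my_schema.my_table:

# Minimal sketch, assuming my_catalog.my_schema.my_table is a Delta table.
# DRY RUN returns (up to 1000) paths of files that a real vacuum would delete.
dry_run_df = spark.sql("VACUUM my_catalog.my_schema.my_table RETAIN 168 HOURS DRY RUN")
print(f"Files that would be deleted: {dry_run_df.count()}")
dry_run_df.show(truncate=False)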
When should I vacuum?
Databricks recommends running vacuum “periodically”, which is a little vague. I usually run a job to vacuum my tables once a week.
How much should I vacuum?
Databricks recommends retaining at least 7 days of files, because long-running jobs or queries may still be reading the older files. You should also remember that once files are deleted, you can no longer time travel to the versions that depended on them. So balance your storage costs against how far back in history you want to be able to go.
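The retention threshold can also be controlled per table through Delta table properties, so that a plain VACUUM without a RETAIN clause uses the value you want. A minimal sketch, assuming a table named my_catalog.my_schema.my_table and using 45 days purely as an example:

# Minimal sketch, assuming my_catalog.my_schema.my_table is a Delta table.
# delta.deletedFileRetentionDuration controls how long removed data files are kept
# (the threshold VACUUM uses when no RETAIN clause is given).
# delta.logRetentionDuration controls how long transaction log history is kept.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 45 days',
        'delta.logRetentionDuration' = 'interval 45 days'
    )
""")

Also note that, by default, Delta refuses to vacuum with a retention of less than 7 days. The Spark setting spark.databricks.delta.retentionDurationCheck.enabled guards against this, and disabling it is rarely a good idea.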
How can I do it automatically?
Recently, Databricks has started to support predictive optimization, which you can enable at the catalog, schema, or table level. When enabled, it runs vacuum when needed, as well as other maintenance operations for your tables (such as optimize and analyze). Note that you pay for the serverless compute used to run these operations, and the feature applies only to Delta tables managed by Unity Catalog. For more details about predictive optimization, see here – https://docs.databricks.com/aws/en/optimizations/predictive-optimization.
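As a rough sketch of what enabling it can look like (the exact syntax may vary with your Databricks version, and my_catalog and my_schema are placeholders):

# Rough sketch, assuming Unity Catalog managed Delta tables; verify the exact
# syntax against the predictive optimization documentation linked above.
# Enable predictive optimization for everything in a catalog...
spark.sql("ALTER CATALOG my_catalog ENABLE PREDICTIVE OPTIMIZATION")
# ...or for a single schema (child objects can also INHERIT or DISABLE it).
spark.sql("ALTER SCHEMA my_catalog.my_schema ENABLE PREDICTIVE OPTIMIZATION")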
Another option is to create your own script. Here is an example that vacuums all the tables in a database:
catalog = "my_catalog"
schema = "my_database"
hours_to_retain = 1080  # 45 days

# SHOW TABLES also returns views, which cannot be vacuumed,
# so each table is wrapped in a try/except and failures are reported.
table_list = spark.sql(f"SHOW TABLES IN {catalog}.{schema}").collect()

for t in table_list:
    table_name = t.tableName
    print(table_name)
    vacuum_sql = f"VACUUM {catalog}.{schema}.{table_name} RETAIN {hours_to_retain} HOURS"
    try:
        spark.sql(vacuum_sql)
    except Exception as e:
        print(f"Skipping {table_name}: {e}")
Documentation:
Remove unused data files with vacuum – https://docs.databricks.com/aws/en/delta/vacuum
VACUUM – https://docs.databricks.com/aws/en/sql/language-manual/delta-vacuum