Chen Hirsh's Data Engineering Blog


Data Engineering / Databricks / Delta Tables

April 21, 2025

Lakehouse spring cleaning – Vacuum your Delta tables

Your Lakehouse tables need some spring cleaning, too. Use the vacuum command to delete older versions and save on storage costs.

Data Engineering / Databricks / Python

March 9, 2025

Databricks job clusters are stricter than interactive clusters

While converting a list to a DataFrame, I got a type error, but only on a job cluster, not on an interactive cluster. Why does that happen, and how do you get around it?

Databricks

February 2, 2025

Think Twice Before Deleting a User: Avoiding Ownership Chaos in Databricks

Deleting a user in Databricks might seem harmless—until workflows start failing, SQL queries break, and ownership chaos unfolds. In this post, I share a hard-learned lesson about Databricks ownership, how to prevent disruptions, and what to do if you’ve already made the mistake. Learn best practices for managing SQL objects, workflows, and user permissions to avoid unexpected failures. Because when it comes to user deletion in Databricks, thinking twice can save you from a major headache.

Data Engineering / Databricks / Python

January 26, 2025

Write data to one CSV file in Databricks

Exporting data to a CSV file in Databricks can sometimes result in multiple files, odd filenames, and unnecessary metadata—issues that aren’t ideal when sharing data externally. This guide explores two practical solutions: using Pandas for small datasets and leveraging Spark’s coalesce to consolidate partitions into a single, clean file. Learn how to choose the right approach for your use case and ensure your CSV exports are efficient, shareable, and hassle-free.

Databricks / Python

November 17, 2024

The Databricks Debugger

Exploring the Databricks Debugger: Writing flawless code on the first try is a dream, but debugging is a reality for most developers. In this post, I dive into the new Databricks code cell debugger, sharing my first impressions and tips for getting started with this powerful tool.

Data Engineering / Python

November 4, 2024

How to enable system tables on Databricks

System tables on Databricks can help us monitor and manage our Data Warehouse. In this post I’ll show how to enable them and how to install the Jobs Dashboard based on system tables.

Data Engineering / Databricks / Fabric / Python / SQL

October 16, 2024

Cleaning Data with Spark

Cleaning data is a very common task for data professionals. In this post, I demonstrate a few common data cleaning tasks with Spark, using Python and SQL.

Databricks

October 6, 2024

Monitor Databricks costs with the new Dashboard and budgets

As data engineers, we need to monitor the usage and costs of our data solutions. Databricks recently released tools to help us do that: the Account Usage Dashboard and Budgets, both based on the “Billing” system schema.

SQL / SQL Server

October 1, 2024

SQL Windows Functions might be non-deterministic

How using SQL window functions with a non-unique ordering column can produce non-deterministic results.

Data Engineering / Databricks

September 24, 2024

Databricks workflows for-each – not quite there yet

Databricks recently added a for-each task to their workflow capability. How does it work and what are its limitations?

Databricks

September 15, 2024

Instant data replication with Databricks table cloning

Cloning tables in Databricks is a fast way to create replicated data for testing or archiving. Explore the different types of table cloning, each with its pros and cons.
