Cleaning Data with Spark

Cleaning data is a very common task for data professionals. The data we read from source systems is often corrupt, duplicated, or in need of some other transformation before it fits our needs.

I thought I’d create a simple demonstration of running some common cleaning tasks in Spark, in both Python and SQL. This code was created on Databricks but should work on any flavor of Spark (e.g., Microsoft Fabric).

The sample file and notebooks I use are in our GitHub repository; see the link at the end of the post.

Let’s look at the sample orders.csv file:
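Here’s how I load it for a first look. This is a minimal sketch: the /FileStore/orders.csv path is just a placeholder for wherever you uploaded the file, and spark is the session that Databricks notebooks provide out of the box.

```python
# Read the raw file with a header row; leave schema inference off so
# bad values like AAA stay visible as strings.
df = spark.read.option("header", True).csv("/FileStore/orders.csv")
df.show()
```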

What’s wrong with it? Oh, so many things…

  • There is a duplicate row (order_id = 4)
  • order_date – there are missing values, the date format is not standard, and there is an invalid date (32/10/2024)
  • customer_id – we should remove orders with no customer (who placed the order?)
  • amount – we should remove the value AAA and change the data type to integer
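Before fixing anything, it helps to confirm these issues programmatically. A quick sketch, using the column names from the list above:

```python
from pyspark.sql import functions as F

# Duplicate rows: compare the total count with the distinct count
print(df.count() - df.dropDuplicates().count(), "duplicate row(s)")

# Orders with no customer
df.filter(F.col("customer_id").isNull()).show()

# Non-numeric amounts: casting a value like AAA to int yields null
df.filter(F.col("amount").isNotNull() & F.col("amount").cast("int").isNull()).show()
```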

How do we fix these issues with Spark?

| Task | Spark Python | Spark SQL |
| --- | --- | --- |
| Remove duplicates | `df.dropDuplicates()` | `SELECT DISTINCT` |
| Convert date from a specific format | `df.withColumn("date_column", to_date("date_column", "dd/MM/yyyy"))` | `to_date(date_column, 'dd/MM/yyyy')` |
| Fill null values with a default | `df.fillna({"column": 0})` | `coalesce(column, 0)` |
| Remove rows by criteria | `df.filter(df.column.isNotNull())` | `WHERE column IS NOT NULL` |
| Change column data type | `df.withColumn("column", col("column").cast("int"))` | `column::int` |
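Putting it all together, here’s one way to chain these calls into a single PySpark pipeline over the orders DataFrame. This is a minimal sketch: the column names come from the bullet list above, filling missing amounts with 0 is my choice for the example, and the notebooks on GitHub walk through each step individually.

```python
from pyspark.sql import functions as F

clean_df = (
    df
    # Remove the exact duplicate row (order_id = 4)
    .dropDuplicates()
    # Parse order_date from dd/MM/yyyy; unparseable values such as
    # 32/10/2024 become null (assuming ANSI mode is off, the default)
    .withColumn("order_date", F.to_date(F.col("order_date"), "dd/MM/yyyy"))
    # Drop orders with no customer
    .filter(F.col("customer_id").isNotNull())
    # Cast amount to int; non-numeric values such as AAA become null...
    .withColumn("amount", F.col("amount").cast("int"))
    # ...then replace those nulls with a default of 0
    .fillna({"amount": 0})
)

clean_df.show()
```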

And here is the orders data after cleaning:

See the full code step by step in the notebooks on GitHub: https://github.com/MadeiraData/microsoft-data-engineers-club/tree/main/Databricks/clean%20data%20with%20spark

What other common cleaning tasks do you use? Let me know in a comment! Happy cleaning!
