Cleaning Data with Spark

Cleaning data is a very common task for data professionals. The data we read from source systems is often corrupt, duplicated, or in need of some other transformation before it fits our needs.

I thought I’d create a simple demonstration of running some common cleaning tasks in Spark, in both Python and SQL. This code was created on Databricks but should work on any flavor of Spark (e.g., Microsoft Fabric).

The sample file and notebooks I use are in our GitHub repository; see the link at the end of the post.

Let’s look at the sample orders.csv file:
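Here’s how I load it for a first look. This is a minimal sketch: the /FileStore/orders.csv path is just a placeholder for wherever you uploaded the file, and spark is the session that Databricks notebooks provide out of the box.

```python
# Read the raw file with a header row; leave schema inference off so
# bad values like AAA stay visible as strings.
df = spark.read.option("header", True).csv("/FileStore/orders.csv")
df.show()
```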

What’s wrong with it? Oh, so many things…

  • There is a duplicate row (order_id = 4)
  • order_date – there are missing values, the date format is not standard, and there is an invalid date (32/10/2024)
  • customer_id – we should remove orders with no customer (who placed the order?)
  • amount – we should remove the value AAA and change the data type to integer
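Before fixing anything, it helps to confirm these issues programmatically. A quick sketch, using the column names from the list above:

```python
from pyspark.sql import functions as F

# Duplicate rows: compare the total count with the distinct count
print(df.count() - df.dropDuplicates().count(), "duplicate row(s)")

# Orders with no customer
df.filter(F.col("customer_id").isNull()).show()

# Non-numeric amounts: casting a value like AAA to int yields null
df.filter(F.col("amount").isNotNull() & F.col("amount").cast("int").isNull()).show()
```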

How do we fix these issues with Spark?

| Task | Spark Python | Spark SQL |
| --- | --- | --- |
| Remove duplicates | `df.dropDuplicates()` | `SELECT DISTINCT` |
| Convert date from a specific format | `df.withColumn("date_column", to_date("date_column", "dd/MM/yyyy"))` | `to_date(date_column, 'dd/MM/yyyy')` |
| Fill null values with a default | `df.fillna({"column": 0})` | `coalesce(column, 0)` |
| Remove rows by criteria | `df.filter(df.column.isNotNull())` | `WHERE column IS NOT NULL` |
| Change column data type | `df.withColumn("column", col("column").cast("int"))` | `column::int` |
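Putting it all together, here’s one way to chain these calls into a single PySpark pipeline over the orders DataFrame. This is a minimal sketch: the column names come from the bullet list above, filling missing amounts with 0 is my choice for the example, and the notebooks on GitHub walk through each step individually.

```python
from pyspark.sql import functions as F

clean_df = (
    df
    # Remove the exact duplicate row (order_id = 4)
    .dropDuplicates()
    # Parse order_date from dd/MM/yyyy; unparseable values such as
    # 32/10/2024 become null (assuming ANSI mode is off, the default)
    .withColumn("order_date", F.to_date(F.col("order_date"), "dd/MM/yyyy"))
    # Drop orders with no customer
    .filter(F.col("customer_id").isNotNull())
    # Cast amount to int; non-numeric values such as AAA become null...
    .withColumn("amount", F.col("amount").cast("int"))
    # ...then replace those nulls with a default of 0
    .fillna({"amount": 0})
)

clean_df.show()
```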

And here is the orders data after cleaning:

See the full code step by step in the notebooks on GitHub: https://github.com/MadeiraData/microsoft-data-engineers-club/tree/main/Databricks/clean%20data%20with%20spark

What other common cleaning tasks do you use? Let me know in a comment! Happy cleaning!
