Cleaning Data with Spark
Cleaning data is a very common task for data professionals. The data we read from source systems is often corrupt, duplicated, or in need of some other transformation to fit our needs.
I thought I’d create a simple demonstration of running some common cleaning tasks in Spark, in both Python and SQL. This code was created on Databricks but should work on any flavor of Spark (e.g., Microsoft Fabric).
The sample file and notebooks I use are in our GitHub repository; see the link at the end of the post.
Let’s look at the sample orders.csv file:
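Here’s a minimal sketch of loading the file into a DataFrame, assuming it was uploaded to /FileStore/orders.csv (a hypothetical path; adjust it to wherever the sample file lives in your workspace):

```python
# Read the raw file with everything as strings, so the bad values
# (AAA, 32/10/2024) survive the load and we can clean them ourselves.
orders_df = (
    spark.read
    .option("header", True)
    .csv("/FileStore/orders.csv")  # hypothetical path
)
orders_df.show()
```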
What’s wrong with it? Oh, so many things…
- There is a duplicate row (order_id = 4)
- order_date – there are missing values, the date format is not standard, and there is an invalid date (32/10/2024)
- customer_id – we should remove orders with no customer (who made the order?)
- amount – we should remove the value AAA and change the data type to integer
How do we fix these issues with Spark?
| Task | Spark Python | Spark SQL |
|------|--------------|-----------|
| Remove duplicates | `df.dropDuplicates()` | `select distinct` |
| Convert a date from a specific format | `df.withColumn("date_column", to_date("date_column", "dd/MM/yyyy"))` | `to_date(date_column, 'dd/MM/yyyy')` |
| Fill null values with a default | `df.fillna({"column": 0})` | `coalesce(column, 0)` |
| Remove rows by criteria | `df.filter(df.column.isNotNull())` | `where column is not null` |
| Change column data type | `df.withColumn("column", col("column").cast("int"))` | `column::int` |
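Putting it all together, here’s a sketch of the full clean as one chained PySpark pipeline, using the column names from the problem list above. It assumes the default (non-ANSI) Spark behavior, where an unparseable date or failed cast yields null rather than an error:

```python
from pyspark.sql.functions import col, to_date

cleaned_df = (
    orders_df
    .dropDuplicates()                                               # drop the duplicate order_id = 4 row
    .withColumn("order_date", to_date("order_date", "dd/MM/yyyy"))  # parse the dates; 32/10/2024 becomes null
    .filter(col("customer_id").isNotNull())                         # remove orders with no customer
    .withColumn("amount", col("amount").cast("int"))                # AAA fails the cast and becomes null...
    .filter(col("amount").isNotNull())                              # ...so we can filter it out here
)
cleaned_df.show()
```

And the same clean expressed through Spark SQL (again a sketch, assuming the raw data is registered as a temporary view named raw_orders):

```python
orders_df.createOrReplaceTempView("raw_orders")

cleaned_sql_df = spark.sql("""
    select distinct
        order_id,
        to_date(order_date, 'dd/MM/yyyy') as order_date,
        customer_id,
        amount::int as amount
    from raw_orders
    where customer_id is not null
      and amount::int is not null
""")
```

If your cluster runs with ANSI mode enabled, the casts and to_date raise errors on bad values instead of returning null; in that case try_cast (and, on recent runtimes, try_to_date) restores the null-on-failure behavior these sketches rely on.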
And the orders data after cleaning:
See the full code step by step in the notebooks on GitHub: https://github.com/MadeiraData/microsoft-data-engineers-club/tree/main/Databricks/clean%20data%20with%20spark
What other common cleaning tasks do you use? Let me know in a comment! Happy cleaning!