Write data to one CSV file in Databricks
Let’s say you need to export data to a CSV file. Maybe you have to share this file with some external system. In Databricks, you can use Spark to write a CSV file, but you may not like the output format.
Here is an example, I read some data into a dataframe, and then wrote it to CSV format:
destination_folder_path = "/Volumes/my_catalog/my_schema/export"
#read some data into a dataframe
df = spark.read.table("some_table")
df.write.mode("overwrite").csv("destination_folder_path", header=True)
The results in the folder:
There are 3 issues with this output:
Output data is split into multiple files
To write and read data faster, Spark splits the work between nodes in a cluster, each reading\writing part of the data. That’s why, in the screenshot above, there are 3 CSV files (That’s the files starting with “Part”, with a CSV extension), instead of 1. Note that this can also occur when working with a single node cluster since Spark splits the work into tasks.
This behavior is great if you intend to keep working with the CSV files in Databricks since reading will be faster. But if you want to share this file with someone outside of Databricks, this may be inconvenient.
File name is weird
Since Spark created multiple files per data directory, it names the files with some internal ID. If you want to share this file, you probably want to name it something reasonable.
There are other files in the folder
Spark also writes some additional files with some metadata about the writing operation. When you share the folder, you may want these unnecessary files removed.
So, what can we do?
Option 1 – Using Pandas
We can use the Pandas library to write to a CSV file:
#Convert the Spark DataFrame to a Pandas DataFrame
pd_df = df.toPandas()
# Use Pandas to write to Excel
destination_file_name = "my_one_exported_file.csv"
pd_df.to_csv(f"{destination_folder_path}/{file_name}", index=False)
And it works! we only got 1 file, and it has the name we want.
But –
Working with Pandas, you lose the Spark parallelism ability, and read and write operations will take longer.
Even if you are willing to compromise on performance (because you need the data to be in 1 file with your chosen name), unlike Spark, where the data is split between the executors in the cluster, Pandas will upload all the data to the driver node memory. This means that for a large dataset, you may have memory issues, or get the terrible “out of memory” error.
So, this solution is only good for small data frames, or when you use a cluster with a lot of memory.
Option 2 – Make Spark create 1 file (and then rename it)
def write_to_one_csv(df, destination_folder_path, destination_file_name):
# Make dateframe into a single partition and write to CSV
df.coalesce(1).write.mode("overwrite").csv(destination_folder_path, header=True)
# Remove unneeded files, and rename csv file to template
list_dir = dbutils.fs.ls(destination_folder_path)
for file in list_dir:
file_name = file.name
file_path = file.path
if file_name.startswith("part"):
source_path = f"{destination_folder_path}/{file_name}"
destination_path = f"{destination_folder_path}/{destination_file_name}"
dbutils.fs.mv(source_path, destination_path)
print(f"{file_name} renamed to {destination_file_name}")
else:
file_path = f"{destination_folder_path}/{file_name}"
dbutils.fs.rm(file_path)
print(f"{file_name} removed")
And running the function:
write_to_one_csv(df, destination_folder_path, destination_file_name)
Let’s explain what we did here. First of all, to make Spark write only one file, we need to make it convert the data from multiple partitions to one. This is achieved by using the Coalesce function.
Since that data is now in one partition, Spark will only write one CSV file.
# Make dateframe into a single partition and write to CSV
df.coalesce(1).write.mode("overwrite").csv(destination_folder_path, header=True)
Now we need to check all the files in the folder. We use dbutils.fs.ls to list the files in the folder and then loop on the list. If the file name starts with “Part”, we know it’s our data file and rename it to the name we want. Any other file (metadata files) is removed. The remove and rename operations are also done with the dbutils library.
This function gives us the required results, but don’t forget that we are also losing something. When we force Spark to only use one partition we actually make it hold the data on just one executor from the cluster, those making this executor work hard while the other executors are idle.
Conclusion
When exporting data to a single CSV in Databricks:
- Use Pandas for small datasets or when simplicity is key.
- Use Spark with
coalesce
for larger datasets, but be mindful of the trade-offs in performance and parallelism. Always choose the approach that aligns with your data size, performance needs, and sharing requirements.
The notebook with all the code is available here: https://github.com/MadeiraData/microsoft-data-engineers-club/tree/main/Databricks/Write%20data%20to%20one%20CSV%20file%20in%20Databricks
If you have another option to create one file in Databricks, I’d love to hear in the comments!