Working with Excel files in Databricks

Excel is one of the most common data file formats, and as data engineers, we are required to read data from it on almost every project. Excel is easy to use and quick to customize: you can add a column or change the data in seconds. But the same things that make it the go-to format for users make it hard for data platforms to read. Adding a column might break a pipeline, and changing data types, for example, adding text to a column that previously held only numeric data, might cause a nasty error downstream.

Working in Databricks, you can read and write Excel files, but you need to pay attention to some pitfalls. So let’s get started working with Excel files on Databricks!

Reading Excel files

First of all, you can’t read Excel files out of the box in Databricks. Yes, I know it’s weird, but that’s how it is. Databricks has built-in methods to read a lot of file formats: CSV, Parquet, ORC, etc., but there is no built-in reader for Excel, so we need to install a library. There are a number of options; I use this one:

%pip install openpyxl

Don’t forget that you can also install the library on the cluster, so it’s available in every notebook attached to that cluster. See the instructions here: https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries

Now let’s import the read_excel function, and read the Excel file:

from pyspark.pandas import read_excel
# Read the worksheet "Sheet1" into a pandas-on-Spark dataframe
pd_df = read_excel(mount_folder + "/orders dbx demo.xlsx", "Sheet1")

The mount_folder parameter holds the path to the folder of the Excel file. It can also be a Volume if you are using Unity Catalog.
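For example, mount_folder could point at a mount or at a Unity Catalog Volume. The paths below are placeholders; replace them with your own:

# A DBFS mount point (placeholder path)
mount_folder = "/mnt/raw/excel"
# Or, with Unity Catalog, a Volume path (placeholder path)
# mount_folder = "/Volumes/my_catalog/my_schema/my_volume"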

Here, read_excel takes two parameters: the file path and the worksheet name (or index) to read. There are other parameters; see the documentation for more details: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html
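For instance, if you only need some of the columns, or your sheet’s header row is not the first one, you could pass a few of those optional arguments. This is just a sketch; the options mirror the pandas read_excel options such as header and usecols:

pd_df = read_excel(
    mount_folder + "/orders dbx demo.xlsx",
    sheet_name="Sheet1",   # worksheet name or zero-based index
    header=0,              # row to use as the column names
    usecols="A:C",         # read only columns A through C
)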

It returns a pandas-on-Spark dataframe, and we can display it with display(pd_df), but it’s not a Spark dataframe and we cannot write it directly to a Delta table. We need to convert it to a Spark dataframe:

spark_df = pd_df.to_spark()
display(spark_df)

Dealing with problematic characters in column names

One of the issues that may arise with Excel files is that people often use characters in column names that are not acceptable in Delta table column names. Let’s try to write our current dataframe to a Delta table:

spark_df.write.mode("overwrite").saveAsTable("dwh.orders")
This fails with the following error:
[DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES] Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.

As the error indicates, there are several characters that cannot be used in a Delta table column name. In this example, we have a space in “Order ID”.

To deal with that, we can use Python’s str.translate method, which replaces characters in a string, together with str.maketrans, which builds the translation table of characters to replace or remove.

translate_table = str.maketrans(" ", "_", ",;{}()\n\t=")
new_cols = [col.translate(translate_table) for col in spark_df.columns]
spark_with_new_cols = spark_df.toDF(*new_cols)

The maketrans method accepts 3 arguments:

  • characters to replace
  • what to replace them with
  • characters to delete from the string altogether

In this example, I replace spaces with underscores and delete all the other unacceptable characters (,;{}()\n\t=).

new_cols is the list of column names after the replacement, and toDF then creates a new dataframe with those names.
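To see what the translation does, here is a quick standalone check, using the “Order ID” column from our file and a hypothetical column name with parentheses:

translate_table = str.maketrans(" ", "_", ",;{}()\n\t=")
print("Order ID".translate(translate_table))      # Order_ID
print("Amount (USD)".translate(translate_table))  # Amount_USD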

Now we can write the dataframe to a table:

spark_with_new_cols.write.mode("overwrite").saveAsTable("dwh.orders")

Writing to an Excel file

To write data from Databricks to an Excel file, we go the same way in the opposite direction: create a Spark dataframe that reads from a table, convert it to a pandas dataframe, and then use to_excel to write the dataframe to an Excel file.

# Read the Delta table into a Spark dataframe, then convert it to pandas
read_table_df = spark.read.table("dwh.orders")
pandas_df = read_table_df.toPandas()

But here is another funny thing: you can’t write the Excel file directly to your mount or volume; you’ll get an error message. I found a workaround in this answer on Stack Overflow – https://stackoverflow.com/questions/73155969/writing-pandas-dataframe-to-excel-in-dbfs-azure-databricks-oserror-errno-95 . You need to write the Excel file to the local file system (on the cluster driver) and then copy it to your mount/volume.

from shutil import move
# Write the Excel file to the driver's local disk, then move it to the mount/volume
pandas_df.to_excel('/tmp/pandas_df.xlsx')
move('/tmp/pandas_df.xlsx', f'/dbfs{mount_folder}/pandas_df.xlsx')
dbutils.fs.ls(mount_folder)
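As an alternative to shutil.move, you can copy the local file with dbutils.fs.cp, using the file:/ scheme to point at the driver’s local file system. A sketch of the same idea:

# Copy from the driver's local disk to the mount/volume
dbutils.fs.cp("file:/tmp/pandas_df.xlsx", mount_folder + "/pandas_df.xlsx")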

That’s it. You can (and probably will) work with Excel files in Databricks, just beware of the issues I mentioned above. Let me know in the comments if you find any more issues, or if you have any ideas to improve this method of working with Excel files!

The notebook and the Excel demo file can be found on our GitHub repository – https://github.com/MadeiraData/microsoft-data-engineers-club/tree/main/Databricks/Working%20with%20Excel%20files%20in%20Databricks
