Databricks job clusters are stricter than interactive clusters

Some errors in my code just drive me crazy, because they don’t make any sense. Last week I encountered such an error, and it took me a really long time to find a solution.

In my PySpark code in a Databricks notebook, I tried to convert a list of numbers to a dataframe. The code was something like this:

numbers_list = [1, 2, 3, 4]
numbers_df = spark.createDataFrame(numbers_list, "number int")
display(numbers_df)

The code worked as expected, returning a Spark dataframe with a single number column.

Happy and unsuspecting, I scheduled my notebook in a workflow (workflows are Databricks jobs) and ran it.

The workflow failed with the following error:

[FIELD_DATA_TYPE_UNACCEPTABLE] StructType([StructField('recipe_id', LongType(), True)]) can not accept object 1 in type <class 'int'>.

(The error above is from my real code, which used a recipe_id bigint column; with the simplified example in this post it would name number and IntegerType instead.)

I hope you agree that this error doesn't make any sense. How can the number 1 not fit into the int data type? I checked the documentation for the int data type and found that its range is -2,147,483,648 to 2,147,483,647. That range includes the number 1, right? And how come it worked when I ran the notebook directly?
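
(In hindsight, the same failure can be reproduced in plain open-source PySpark as well. Here is a minimal sketch, assuming a local Spark 3.4+ installation, where this error class exists; it shows the check is about the shape of each element, not the numeric range:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
try:
    # With a struct schema like "number int", each element must be a row
    # (list/tuple/Row), so the bare int 1 is rejected during schema
    # verification, regardless of its value.
    spark.createDataFrame([1, 2, 3, 4], "number int")
except TypeError as e:
    print(e)  # ... can not accept object 1 in type <class 'int'>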

After looking online and asking ChatGPT (it doesn't know everything, believe me), I was advised to check the Spark version on both clusters. That makes sense: different versions could lead to different behavior. I compared the Spark version on my interactive cluster (running the notebook directly) to the job cluster (used in the workflow), and they were the same. I also checked other Spark environment settings, but could not find any differences.
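
(For reference, these are the kinds of checks you can run in a notebook cell on each cluster. A sketch; the Databricks-specific config key is an assumption and may not be set on every runtime, hence the fallback value:)

import sys

print("Spark:", spark.version)   # e.g. "3.5.0"
print("Python:", sys.version)
# Databricks runtime tag; this key is Databricks-specific and may not be
# present on every runtime, hence the "n/a" default.
print("Runtime:", spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "n/a"))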

I tried to run the job with the interactive cluster, and the code finished successfully. But interactive clusters cost more, and it seemed like a waste to use one just because of this stupid error.

I went back online and finally found this answer on Stack Overflow.

When you give createDataFrame a struct schema, Spark expects each element of the input to be a row (a list, tuple, or Row object), not a bare scalar. So I needed to convert my flat list to a list of lists:

numbers_list_of_lists = [[x] for x in numbers_list]
print(numbers_list_of_lists)  # [[1], [2], [3], [4]]

And then use the list of lists to convert to a dataframe:

numbers_df = spark.createDataFrame(numbers_list_of_lists, "number int")
display(numbers_df)

On my interactive cluster, this code gives the same results as the simpler version, and in my job run, instead of the type error, I now get the actual results.
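
(If you want to verify the two variants are equivalent, a quick schema and content check works on either cluster type:)

numbers_df.printSchema()
# root
#  |-- number: integer (nullable = true)
print(numbers_df.collect())
# [Row(number=1), Row(number=2), Row(number=3), Row(number=4)]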

So why is there a difference between the job and interactive clusters?

I could not find any reference to this in the Databricks documentation, but apparently job clusters apply stricter type checking than interactive clusters. That's why the interactive cluster accepts the original code, while the job cluster fails on the same code.
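
(Until the difference is documented, the safe habit seems to be to always pass explicit rows, which should work on both cluster types. Two equivalent sketches; note that the Row variant lets Spark infer the column as bigint, so prefer the tuple form if the exact type matters:)

from pyspark.sql import Row

# Each element is an explicit one-field row, so strict verification passes.
numbers_df = spark.createDataFrame([(x,) for x in numbers_list], "number int")

# Same idea with Row objects; here Spark infers the column type (bigint).
numbers_df = spark.createDataFrame([Row(number=x) for x in numbers_list])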

So, I found a solution, but I would really like to find the explanation in the official documentation.

If you can find another reason for this behavior, or a documentation URL that explains it, please let me know in a comment!
