Databricks job clusters are stricter than interactive clusters

Some errors in my code just drive me crazy, because they don’t make any sense. Last week I encountered such an error, and it took me a really long time to find a solution.

In my PySpark code in a Databricks notebook, I tried to convert a list of numbers to a dataframe. The code was something like this:

numbers_list = [1, 2, 3, 4]
numbers_df = spark.createDataFrame(numbers_list, "number int")
display(numbers_df)

The code worked as expected, returning a Spark dataframe with a single number column.

Happy and unsuspecting, I scheduled my notebook in a workflow (workflows are Databricks jobs) and ran it.

The workflow failed with the following error:

[FIELD_DATA_TYPE_UNACCEPTABLE] StructType([StructField('recipe_id', LongType(), True)]) can not accept object 1 in type <class 'int'>.

(The error above is from my real code, which used a recipe_id bigint column; with the simplified example in this post it would name number and IntegerType instead.)

I hope you agree that this error doesn't make any sense. How can the number 1 not fit into the int data type? I checked the documentation for the int data type and found that its range is -2,147,483,648 to 2,147,483,647. That range includes the number 1, right? And how come it worked when I ran the notebook directly?
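
(In hindsight, the same failure can be reproduced in plain open-source PySpark as well. Here is a minimal sketch, assuming a local Spark 3.4+ installation, where this error class exists; it shows the check is about the shape of each element, not the numeric range:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
try:
    # With a struct schema like "number int", each element must be a row
    # (list/tuple/Row), so the bare int 1 is rejected during schema
    # verification, regardless of its value.
    spark.createDataFrame([1, 2, 3, 4], "number int")
except TypeError as e:
    print(e)  # ... can not accept object 1 in type <class 'int'>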

After looking online and asking ChatGPT (it doesn't know everything, believe me), I was advised to check the Spark version on both clusters. That makes sense: different versions could lead to different behavior. I compared the Spark version on my interactive cluster (running the notebook directly) to the job cluster (used in the workflow), and they were the same. I also checked other Spark environment settings, but could not find any differences.
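
(For reference, these are the kinds of checks you can run in a notebook cell on each cluster. A sketch; the Databricks-specific config key is an assumption and may not be set on every runtime, hence the fallback value:)

import sys

print("Spark:", spark.version)   # e.g. "3.5.0"
print("Python:", sys.version)
# Databricks runtime tag; this key is Databricks-specific and may not be
# present on every runtime, hence the "n/a" default.
print("Runtime:", spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "n/a"))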

I tried to run the job with the interactive cluster, and the code finished successfully. But interactive clusters cost more, and it seemed like a waste to use one just because of this stupid error.

I went back online and finally found this answer on Stack Overflow.

When you give createDataFrame a struct schema, Spark expects each element of the input to be a row (a list, tuple, or Row object), not a bare scalar. So I needed to convert my flat list to a list of lists:

numbers_list_of_lists = [[x] for x in numbers_list]
print(numbers_list_of_lists)  # [[1], [2], [3], [4]]

And then use the list of lists to convert to a dataframe:

numbers_df = spark.createDataFrame(numbers_list_of_lists, "number int")
display(numbers_df)

On my interactive cluster, this code gives the same results as the simpler version, and in my job run, instead of the type error, I now get the actual results.
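
(If you want to verify the two variants are equivalent, a quick schema and content check works on either cluster type:)

numbers_df.printSchema()
# root
#  |-- number: integer (nullable = true)
print(numbers_df.collect())
# [Row(number=1), Row(number=2), Row(number=3), Row(number=4)]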

So why is there a difference between the job and interactive clusters?

I could not find any reference to this in the Databricks documentation, but apparently job clusters apply stricter type checking than interactive clusters. That's why the interactive cluster accepts the original code, while the job cluster fails on the same code.
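
(Until the difference is documented, the safe habit seems to be to always pass explicit rows, which should work on both cluster types. Two equivalent sketches; note that the Row variant lets Spark infer the column as bigint, so prefer the tuple form if the exact type matters:)

from pyspark.sql import Row

# Each element is an explicit one-field row, so strict verification passes.
numbers_df = spark.createDataFrame([(x,) for x in numbers_list], "number int")

# Same idea with Row objects; here Spark infers the column type (bigint).
numbers_df = spark.createDataFrame([Row(number=x) for x in numbers_list])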

So, I found a solution, but I would really like to find the explanation in the official documentation.

If you can find another reason for this behavior, or a documentation URL that explains it, please let me know in a comment!
