Databricks workflows for-each – not quite there yet
Databricks recently added a for-each task to its workflow capability. Workflows are Databricks jobs, similar to Data Factory pipelines or SQL Server jobs: pipelines that you can schedule, made up of a number of tasks that together implement some business logic.
In theory, the long-awaited for-each task should make running multiple processes easier. For example, one of the things I often do is run a list of notebooks, each processing a different table, with no dependencies between them. At the moment I use parallel notebooks – https://docs.databricks.com/en/notebooks/notebook-workflows.html#run-multiple-notebooks-concurrently
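For context, the parallel-notebooks pattern from that link looks roughly like this when run from a driver notebook (a minimal sketch; the notebook paths are made-up examples, and dbutils is the utility object available in Databricks notebooks):

```python
# Minimal sketch of the "parallel notebooks" pattern linked above.
# Each notebook processes a different table and has no dependencies on the others.
from concurrent.futures import ThreadPoolExecutor

notebooks = [
    "/Workspace/etl/update_table_a",
    "/Workspace/etl/update_table_b",
    "/Workspace/etl/update_table_c",
]

def run_notebook(path):
    # dbutils.notebook.run(path, timeout_seconds, arguments) runs a notebook
    # as a separate job and returns its exit value.
    return dbutils.notebook.run(path, 3600, {})

# Run the independent notebooks concurrently and collect their results.
with ThreadPoolExecutor(max_workers=len(notebooks)) as executor:
    results = list(executor.map(run_notebook, notebooks))
```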
As you will see later, this use case is not supported yet. But before that, let’s see what we can do with the for-each task.
Using the for-each task
Setting the external task (for-each)
When creating a for-each task, you can pass a list of parameters:
In the example above, I passed a list containing the numbers 4, 7, and 9. This means that Databricks will run the internal task 3 times, each time passing a different number as the input.
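For reference, a for-each task set up this way might look roughly like the following in the job's JSON definition (shown here as a Python dict). The field names (for_each_task, inputs, task) are my assumptions based on the UI and the Jobs API docs, and the notebook path is a made-up example:

```python
# Rough sketch of a for-each task definition (field names are assumptions,
# not a verified Jobs API payload).
for_each_task_definition = {
    "task_key": "for_each_numbers",
    "for_each_task": {
        # The list of values to iterate over: 4, 7, and 9.
        "inputs": "[4, 7, 9]",
        # The internal task that runs once per value.
        "task": {
            "task_key": "update_rows_iteration",
            "notebook_task": {
                "notebook_path": "/Workspace/etl/update rows",  # hypothetical path
                "base_parameters": {"row_number": "{{input}}"},
            },
        },
    },
}
```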
You can also get the input from a preceding task in this format (where first_task is the preceding task, and parameter is the value it returns):
{{tasks.first_task.values.parameter}}
For details on how to return the parameter from the previous task, see here – https://docs.databricks.com/en/jobs/share-task-context.html
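For example, the preceding task (first_task in the template above) could be a notebook that publishes the list as a task value, roughly like this (the list itself is just the example values from above):

```python
# In the first_task notebook: expose the list under the key "parameter"
# so the for-each task can reference it as {{tasks.first_task.values.parameter}}.
numbers_to_process = [4, 7, 9]
dbutils.jobs.taskValues.set(key="parameter", value=numbers_to_process)
```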
Concurrency:
This is a big deal and one of the better features of the for-each task. If set, it runs multiple iterations in parallel, making the whole process finish faster. The default is 1 and the maximum is 100. Please note that this may also put more load on the cluster, since multiple processes will run in parallel and require more resources.
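Continuing the sketch from above, raising the concurrency would just mean setting that field on the for-each definition (again, the field name is my assumption):

```python
# Allow up to 10 iterations to run at the same time instead of the default 1.
for_each_task_definition["for_each_task"]["concurrency"] = 10
```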
Setting the internal task
Now let’s see the internal task configuration:
I’m running a notebook called “update rows” and passing the parameter “row_number”, which will contain the current iteration value (that’s what {{input}} means).
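Inside the “update rows” notebook, the iteration value arrives as a regular notebook parameter, so the receiving side might look something like this (the processing itself is just a placeholder):

```python
# Read the "row_number" parameter passed by the for-each task.
# Notebook parameters arrive as strings, so convert as needed.
row_number = int(dbutils.widgets.get("row_number"))

print(f"Processing iteration for row number {row_number}")  # placeholder logic
```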
On the result page, we can see the run of each notebook, including the passed parameters, run status, and duration. We can also click on the start time link to see the actual notebook run.
For-each task limitations
- You can only run one task inside the for-each loop, so flexibility is very limited. You can put all your logic in one notebook and run it, but then you move the management (precedence rules, error handling) from the workflow into the notebook (a sketch of this workaround appears after this list).
- You can only use the {{input}} reference in parameters. You cannot use it in the path of the notebook to run, so if you need to run multiple different notebooks (and for me that’s the main use case), this feature will not help you.
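To illustrate the single-notebook workaround from the first point: a wrapper notebook could receive the iteration value and dispatch internally, at the cost of handling errors and ordering itself rather than in the workflow. This is only a sketch, with made-up paths and parameter names:

```python
# Hypothetical wrapper notebook used as the for-each internal task.
# It receives the iteration value (here a table name) and decides which
# notebook to run, since {{input}} cannot be used in the notebook path itself.
table_name = dbutils.widgets.get("table_name")

notebook_for_table = {
    "customers": "/Workspace/etl/update_customers",
    "orders": "/Workspace/etl/update_orders",
}

# Any retry or error-handling logic now lives here instead of in the workflow.
result = dbutils.notebook.run(notebook_for_table[table_name], 3600, {})
```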
Conclusion
This is a feature I’ve been waiting for, hoping it would let me run different notebooks in parallel. But that use case isn’t supported yet, and I think the other use cases for this feature are less common. Databricks surprises us with new features and improvements all the time, though, so maybe we’ll see this functionality too sometime soon.
I would love to hear from you about what use cases you think are available for this feature. Drop a comment below!