New Video: Optimizing Azure Data Factory ForEach Parallel Execution
I just published a new video to our YouTube channel explaining how parallel execution works in Azure Data Factory (ADF) ForEach activities. By default, the inner (child) activities of a ForEach execute in parallel. You can make them execute sequentially, but you often want the parallelism. You can set a batch count that tells ADF the maximum number of simultaneous executions (the default is 20 and the maximum is 50), but ADF doesn’t guarantee it will actually run that many at once.
One thing that many data engineers don’t realize is that when the ForEach activity receives the collection of items to iterate over, it essentially assigns those items to internal queues up front and never rebalances them. ADF has no information about the resources required or the expected duration of each activity, so we sometimes end up with very uneven queues. Items can sit waiting in one queue while another queue has finished and has an open slot for execution, but the waiting work won’t get moved to it.
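To make the cost of fixed queues concrete, here is a minimal Python sketch that compares two scheduling models. It does not call ADF, and it assumes a simple round-robin assignment purely for illustration; the exact way ADF distributes items to its internal queues is not something this post specifies. The function names and the durations are hypothetical.

```python
import heapq


def static_queue_makespan(durations, batch_count):
    """Elapsed time when items are pre-assigned to fixed queues (round-robin
    here, as an assumption) and no queue ever takes work from another."""
    queues = [[] for _ in range(batch_count)]
    for index, duration in enumerate(durations):
        queues[index % batch_count].append(duration)
    # Each queue runs its items one after another; the slowest queue decides.
    return max(sum(queue) for queue in queues)


def shared_queue_makespan(durations, batch_count):
    """Elapsed time in a hypothetical model where any free slot pulls the
    next waiting item. Shown only for comparison."""
    slot_free_at = [0] * batch_count
    heapq.heapify(slot_free_at)
    for duration in durations:
        earliest = heapq.heappop(slot_free_at)
        heapq.heappush(slot_free_at, earliest + duration)
    return max(slot_free_at)


# Hypothetical durations in minutes: three long items mixed with short ones.
durations = [60, 5, 5, 60, 5, 5, 60, 5, 5]
static_total = static_queue_makespan(durations, batch_count=3)
shared_total = shared_queue_makespan(durations, batch_count=3)
print(static_total, shared_total)  # prints: 180 75
```

With these made-up numbers, all three long items land in the same fixed queue, so two of the three slots sit idle for most of the run.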
If you need to reduce the total duration and/or the compute needed for the inner activities, you can create explicit queues and pre-assign the inner activities to those queues in a way that better balances the work. Check out the video below for more details about the problem and the solution.
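As a rough sketch of what pre-assigning work to balanced queues could look like, the Python below uses a greedy longest-first heuristic: sort items by estimated duration and always hand the next item to the queue with the least work so far. This is one common balancing technique, not necessarily the method shown in the video. The item names, the estimates, and the `balance_into_queues` helper are all hypothetical; in practice the estimates might come from previous run history.

```python
import heapq


def balance_into_queues(estimates, queue_count):
    """Greedy longest-first assignment: give each item, from longest to
    shortest estimate, to the queue that currently has the least total work."""
    heap = [(0, q) for q in range(queue_count)]  # (total load, queue index)
    heapq.heapify(heap)
    queues = [[] for _ in range(queue_count)]
    longest_first = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    for name, estimate in longest_first:
        load, q = heapq.heappop(heap)
        queues[q].append(name)
        heapq.heappush(heap, (load + estimate, q))
    return queues


# Hypothetical items with estimated durations in minutes.
estimates = {
    "sales": 60, "orders": 60, "inventory": 60,
    "dim_a": 5, "dim_b": 5, "dim_c": 5,
    "dim_d": 5, "dim_e": 5, "dim_f": 5,
}
queues = balance_into_queues(estimates, queue_count=3)
loads = [sum(estimates[name] for name in queue) for queue in queues]
print(loads)  # prints: [70, 70, 70]
```

One way to consume a result like this (again an assumption) is an outer ForEach that runs in parallel over the queues and calls a child pipeline through an Execute Pipeline activity, where a sequential ForEach works through that queue's items. The child pipeline is needed because ADF does not allow a ForEach directly inside another ForEach.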