
One trick that works surprisingly often when you don't understand the root cause: scale the number of batches rather than the batch size (think merge sort or map/reduce).

If 10% of your batches flake out, you rerun just those and expect about 1% to need a third run, and so on. The slowest stragglers take roughly log(number of batches) rounds to all succeed, giving you an O((batch run time) * log(number of batches)) expected wait until you can merge/reduce. Since the log factor grows slowly, making the batches smaller (and more numerous) shortens that wait, though it can also mean worse merge time, so it doesn't always pay off; YMMV, etc.
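Here's a minimal sketch of the retry loop in Python. Everything in it is illustrative: run_batch stands in for your real (flaky) batch job, the merge step is just a sum, and the 10% failure rate is an assumption, treated as independent per batch per attempt.

    import random

    FLAKE_RATE = 0.10  # assumed independent failure probability per attempt

    def run_batch(batch):
        # Stand-in for the real flaky job; returns None on failure.
        if random.random() < FLAKE_RATE:
            return None
        return sum(batch)  # placeholder for the actual per-batch work

    def run_all(batches):
        # Run every batch, rerunning only the failures each round,
        # until all batches have succeeded.
        results = {}
        pending = dict(enumerate(batches))
        rounds = 0
        while pending:
            rounds += 1
            still_pending = {}
            for i, batch in pending.items():
                out = run_batch(batch)
                if out is None:
                    still_pending[i] = batch  # flaked; rerun next round
                else:
                    results[i] = out
            pending = still_pending
        return [results[i] for i in range(len(batches))], rounds

    data = list(range(1000))
    batches = [data[i:i + 10] for i in range(0, len(data), 10)]
    partials, rounds = run_all(batches)
    total = sum(partials)  # the merge/reduce step
    print(total, rounds)   # rounds is ~log10(number of batches) in expectation

With a 10% flake rate the number of rounds grows like log base 10 of the batch count, so even thousands of batches typically finish in three or four rounds, which is where the log factor in the bound above comes from.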



