Parallelizing Jobs with xargs (spinellis.gr)
41 points by r11t on March 13, 2009 | 10 comments



I'm afraid I'm going against the Unix idiom of combining simple tools to do more advanced stuff, but I can't resist here ;-)

While it is idiomatic in Unix to use xargs for parallelising batch runs, I find it pretty cumbersome: you have to be really careful with quoting, file names, and command lines containing spaces to make sure the assembled command line is "nice" and doesn't screw up something serious.

Moreover, xargs does have its uses, but I mostly use it for trivial things where I can be sure it works. The typical xargs idiom is to feed it a list of files, most often from the find command, as in "find . -name _out | xargs rm -r". That's why xargs has -0 and find has the matching -print0.
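
To spell that out, the whitespace-safe form of the idiom (a sketch; adjust the name pattern to taste) is:

    # find emits NUL-terminated names, xargs splits on NUL, so file names
    # containing spaces or newlines survive intact
    find . -name _out -print0 | xargs -0 rm -r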

I wrote a small utility myself (http://code.google.com/p/spawntool/) that reads from stdin, treats each line as a complete command line to be passed directly to system(), and manages the parallelisation, running up to N processes at a time.

This is pretty useful for feeding in _any_ batch of commands, even unrelated ones (not derived from a list of files). You can also feed the same input stream or file straight to 'sh' (for compatibility cross-checking), or verify the command lines in plain text before running them with either sh or spawntool. It is like ... | xargs sh without the whitespace and expansion headaches.

It's pretty easy to generate complete command lines yourself and much safer than letting xargs join stuff together.
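
As a rough illustration (the file names and conversion command here are made up, and the flag that sets spawntool's process limit isn't shown):

    # build complete, fully quoted command lines yourself
    for f in *.jpg; do
        printf 'convert %q %q\n' "$f" "${f%.jpg}.png"
    done > cmds.txt

    less cmds.txt           # eyeball the commands in plain text first
    sh cmds.txt             # serial run, for cross-checking
    spawntool < cmds.txt    # parallel run, up to N processes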


Running these two commands in series likely vastly overstates the performance gains: almost all your time is going to be spent in I/O, and the second time around a good chunk (if not all) of it will be in the disk cache. Try running both a few times each and see if you still enjoy the same gains (I'm on my iPhone right now, so I can't do this myself).


You are correct, sir!

     $ time find . -newerct '10 hours ago' -print
     real	0m4.262s
     user	0m0.090s
     sys	0m1.014s

     $ time find . -newerct '10 hours ago' -print     
     real	0m0.516s
     user	0m0.057s
     sys	0m0.251s

     $ time find . -newerct '10 hours ago' -print     
     real	0m0.302s
     user	0m0.056s
     sys	0m0.244s

     $ time find . -newerct '10 hours ago' -print     
     real	0m0.311s
     user	0m0.057s
     sys	0m0.251s

     $ find . | wc -l
     13167
Is there a way to reliably flush the disk cache besides restarting?


Speaking of Linux, yes. You can echo mysterious numbers into /proc/sys/vm/drop_caches to flush various caches. See http://jim.studt.net/depository/index.php/flushing-caches-fo... for a one-liner, and http://linux-mm.org/Drop_Caches for the info from the horse's mouth.
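
The numbers select which caches to drop (1 = page cache, 2 = dentries and inodes, 3 = both); a minimal sketch, needing root:

    sync                                        # flush dirty pages to disk first
    echo 3 | sudo tee /proc/sys/vm/drop_caches  # drop page cache plus dentries/inodes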

As an example, consider running md5sum on a directory of 1436 JPEG files of about 50k each on an Atom 330 (2 dies, 2 hyperthreads/die). The 8way numbers use xargs with -n8 and -P8. The "flushed cache" entries indicate that all three caches were flushed immediately before the command:

                    elapsed   user   system
     Cold cache:       11.98   0.56   1.75  ========================
     second run:        2.71   1.38   1.76  =====
      third run:        3.17   1.40   1.75  ======
  flushed cache:       12.11   0.58   1.80  ========================
  flushed cache, 8way: 12.64   1.00   2.27  =========================
     second run, 8way:  1.22   1.15   2.52  ==
      third run, 8way:  1.17   1.26   2.26  ==
I can't explain the oddity that user time seems abnormally low with a flushed cache, or perhaps abnormally high with cached data present, particularly in the single-threaded version.
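
For concreteness, the 8way run was along these lines (a sketch; the actual directory and options may have differed slightly):

    # single-threaded baseline
    find . -name '*.jpg' -print0 | xargs -0 md5sum > /dev/null

    # 8way: 8 file names per md5sum invocation, 8 invocations in parallel
    find . -name '*.jpg' -print0 | xargs -0 -n8 -P8 md5sum > /dev/null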


Great, thanks.

What did you use to generate that output?


'time', and typing. The graph bars are my right index finger while I count in my head. I should probably patent that before it gets out. Wait, I just published. I'm screwed.


You have a year after first publication to patent. But you have successfully prevented anyone else from patenting (unless they already applied).


dtach is great for long-running jobs, too. If you pipe the output to a file, you can even log out and check back later.

I use this function to pass stuff off to a detached process:

    # usage:
    # headless "some_long_job" "long_job"
    # go get some tea, and come back
    # headless "" "long_job" (to join that session)
    # still not done, so ^\ to detach from it again
    # Usually, I pipe the output of some_long_job to
    # a file, so I can peek in on it easily
    headless () {
      if [ -z "$2" ]; then
        hash=$(md5 -qs "$1")
      else
        hash="$2"
      fi
      if [ -n "$1" ]; then
        dtach -n "/tmp/headless-$hash" bash -l -c "$1"
      else
        dtach -A "/tmp/headless-$hash" bash -l
      fi
    }


When using this sort of trick, I also find it useful to throw in GNU screen, nohup, or the bash 'disown' command.
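
For example, something along these lines (with a made-up job name) keeps a background job alive after you log out:

    some_long_job > long_job.log 2>&1 &   # run in the background, capture output
    disown -h %+                          # don't send it SIGHUP when the shell exits

    # or start it under nohup in the first place:
    nohup some_long_job > long_job.log 2>&1 &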


I find that xargs is the most convenient way to achieve parallelism for quick and easy batch work. Just write your script to receive its unit of work as a command argument (or as multiple args if starting a process is a heavy operation). Use any language. Utilize all your cores.
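
A minimal sketch of that pattern (process_one stands in for your worker script; -P and nproc assume GNU userland):

    # up to one worker per core, four units of work per invocation
    find . -name '*.csv' -print0 | xargs -0 -n4 -P "$(nproc)" ./process_one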



