I'm afraid I'm going against the Unix idiom of combining simple tools to do more advanced stuff; I can't resist here ;-)
While it is idiomatic in Unix to use xargs for parallelising batch runs, I find it pretty cumbersome: you have to be really careful with quoting, file names, and command lines containing spaces to make sure the assembled command line is safe, so you don't screw up something serious.
Moreover, xargs has its uses, but I mostly use it for trivial things where I can be sure it works. The typical xargs idiom is to feed it a list of files, most often from the find command, as in "find . -name _out | xargs rm -r". That's why xargs has -0 and find has the matching -print0.
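For illustration, the null-delimited pairing looks like this (the _out pattern is taken from the example above):

    # Without -print0/-0, any name containing spaces or quotes would be
    # split or mangled when xargs assembles the rm command line.
    find . -name '_out' -print0 | xargs -0 rm -r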
I wrote a small utility myself (http://code.google.com/p/spawntool/) that reads stdin, treats each line as a complete command line to be passed directly to system(), and manages the parallelisation up to N processes.
This is pretty useful for feeding in _any_ batch of commands, even unrelated ones (not derived from a list of files). You could also feed the same input stream or file straight to 'sh' (for compatibility cross-checking), or verify the input command lines in plain text before daring to run them with either sh or spawntool. This is like ... | xargs sh without the whitespace and expansion headaches.
It's pretty easy to generate complete command lines yourself and much safer than letting xargs join stuff together.
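A hedged sketch of that workflow (the mp3-encoding job is made up for illustration, and spawntool's exact invocation is whatever its docs say; the point is that the same stream works for both sh and spawntool):

    # Generate one complete, fully-quoted command line per input file.
    # printf %q is bash's quoting operator, so spaces in names survive.
    for f in *.wav; do
        printf 'lame -V2 %q %q\n' "$f" "${f%.wav}.mp3"
    done > cmds.txt
    less cmds.txt   # eyeball the command lines in plain text first
    sh < cmds.txt   # serial cross-check; spawntool reads the same stream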
Running these two commands in series likely vastly overstates the performance gains: almost all of your time is going to be spent in I/O, and the second time around a good chunk (if not all) of the data will be in the disk cache. Try running both a few times each and see if you get the same gains (I'm on my iPhone right now, so I can't test this myself).
As an example, consider running md5sum on a directory of 1436 JPEG files of about 50k each on an Atom 330 (2 dies, 2 hyperthreads/die). The 8-way numbers use xargs with -n8 and -P8. The flushed-cache entries indicate that all three caches were flushed immediately before the command ...
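Roughly, the runs look like this (the file pattern is illustrative, and I'm assuming the Linux drop_caches mechanism for the flush, which clears the page cache plus dentries and inodes):

    # Flush all three caches (Linux, needs root):
    sync; echo 3 > /proc/sys/vm/drop_caches
    # Single-threaded:
    time find . -name '*.jpg' | xargs md5sum > /dev/null
    # 8-way, 8 files per md5sum invocation:
    time find . -name '*.jpg' | xargs -n8 -P8 md5sum > /dev/null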
                           elapsed   user  system  (seconds)
    Cold cache:              11.98   0.56    1.75  ========================
    second run:               2.71   1.38    1.76  =====
    third run:                3.17   1.40    1.75  ======
    flushed cache:           12.11   0.58    1.80  ========================
    flushed cache, 8-way:    12.64   1.00    2.27  =========================
    second run, 8-way:        1.22   1.15    2.52  ==
    third run, 8-way:         1.17   1.26    2.26  ==
I can't explain the oddity that user time seems abnormally low with a flushed cache (or perhaps abnormally high when the data is already cached), particularly in the single-threaded version.
'time', and typing. The graph bars are my right index finger while I count in my head. I should probably patent that before it gets out. Wait, I just published. I'm screwed.
dtach is great for long-running jobs, too. If you pipe the output to a file, you can even log out and check back later.
I use this function to pass stuff off to a detached process:
# usage:
# headless "some_long_job" "long_job"
# go get some tea, and come back
# headless "" "long_job" (to join that session)
# still not done, so ^\ to detach from it again
# Usually, I pipe the output of some_long_job to
# a file, so I can peek in on it easily
headless () {
    # Session name: the explicit second arg, or an md5 hash of the command.
    if [ -z "$2" ]; then
        hash=$(md5 -qs "$1")   # macOS md5; on Linux use md5sum instead
    else
        hash="$2"
    fi
    if [ -n "$1" ]; then
        # Start the command in a new, detached dtach session.
        dtach -n "/tmp/headless-$hash" bash -l -c "$1"
    else
        # No command given: attach to (or create) the named session.
        dtach -A "/tmp/headless-$hash" bash -l
    fi
}
I find that xargs is the most convenient way to get parallelism for quick and easy batch work. Just write your script to receive its unit of work as a command argument (or as multiple arguments, if starting a process is a heavy operation). Use any language. Utilize all your cores.
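A minimal sketch of that pattern (process_one.sh is a hypothetical worker script taking one unit of work per argument):

    # -P runs up to $(nproc) workers at once; -n1 passes one argument each.
    # The -print0/-0 pairing keeps odd file names intact, as discussed above.
    find . -name '*.csv' -print0 | xargs -0 -n1 -P"$(nproc)" ./process_one.sh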