I wasn't aware the Unix philosophy was to not use multithreading on large jobs that can be parallelized.
You can complain about philosophies, but this is just using the wrong tool for the job. Complain about bzip if you feel the bzip authors should have made a multithreaded implementation for you.
Unix generally favors processes over threads, at least old-school Unix did. Threads are a more recent innovation.
The old approach was that programs don't need internal parallelism because you can get it by just piping stuff and relying on the kernel's buffering to keep multiple processes busy.
E.g., tar runs on one core dealing with the filesystem while gzip runs on another compressing the stream.
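In shell terms, a minimal sketch of that implicit parallelism (the paths are just placeholders):

    # tar walks the filesystem on one core while gzip compresses on another;
    # the kernel's pipe buffering keeps both processes busy
    tar cf - /some/dir | gzip > dir.tar.gz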
In the early days, Windows would have a single program doing everything (e.g., WinZip) on a single core, while Unix would split the task into a pipeline of multiple programs, which allowed for this implicit parallelism and performed noticeably better.
Today this is all rather dated and doesn't cut it anymore on 128-core setups.
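The modern workaround is to push the parallelism into the compressor itself; one hedged example, assuming zstd (or pigz for gzip output) is installed:

    # zstd -T0 spawns one compression worker per core
    tar cf - /some/dir | zstd -T0 > dir.tar.zst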
I'm not sure I understand. I've typically thought of threads as being formalized with POSIX threads; before that, though my memory is vague on this, they were just lumped into multiprocessing under various names.
I agree with most of these points except blaming this on processes vs threads. The only difference is all memory being shared by default versus explicitly deciding what memory to share. With all the emphasis on memory safety on HN, you'd think this point would be appreciated.
That's fairly new, with threads and processes becoming basically the same. Historically, threads didn't exist; then they were horrifically implemented and non-standard; then they were standardized and still horrifically implemented; then they were better implemented but the APIs were difficult to use safely, and so on. Threads were also much more lightweight than a process, which shifted with lightweight processes, etc.
I'm using bzip2 to compress a specific type of backup. In my case I cannot afford to steal CPU time from other running processes, so the backup process runs with a severely limited CPU percentage. By crude testing I found that bzip2 used 10x less memory and finished several times faster than xz, while being very close on compression ratio.
Other algorithms like zstd and gzip resulted in much lower compression ratios.
I'm sure there is a more efficient solution, but changing three letters in a script was pretty much the maximum amount of effort I was going to put in.
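For concreteness, something of this shape (the paths are placeholders, and nice/ionice here just stand in for whatever CPU limiter is actually in place):

    # throttled backup compression with bzip2; -9 is the (default) largest block size
    nice -n 19 ionice -c3 tar cf - /var/lib/myapp | bzip2 -9 > /backups/myapp.tar.bz2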
On an unrelated note, has someone already made a meta-compression algorithm which simply picks the best performing compression algorithm for each input?
I've not seen one that picks the best compression algorithm, but I've seen ones that test whether the data is worth compressing at all. For example, the borg backup software can be configured to try a light and fast compression algorithm on each chunk of data; if the chunk compresses well, it then uses a more computationally expensive algorithm to really squash it down.
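Not quite a meta-compressor, but you can do the picking by hand on a sample, assuming gzip, bzip2, xz and zstd are all installed (the file name is a placeholder):

    # compress the same sample with each tool and report the smallest output
    for z in gzip bzip2 xz zstd; do
      printf '%s\t%s\n' "$z" "$("$z" -c sample.dat | wc -c)"
    done | sort -k2 -n | head -n 1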
I've also noticed that for some text documents (was it JSON? I don't remember) bzip2 compresses significantly better than xz (and of course gzip/pigz). Not sure if I tested zstd with high/extreme settings at that time.
It uses the Burrows-Wheeler transform to place similar strings next to each other before using other compression tricks, so it usually does a bit better.