
I wasn't aware the Unix philosophy was to not use multithreading on large jobs that can be parallelized.

You can complain about philosophies, but this is just using the wrong tool for the job. Complain about bzip if you feel the bzip authors should have made a multithreaded implementation for you.




Unix generally favors processes over threads, at least the old school Unix. Threads are a more recent innovation.

The old approach was that programs don't need internal parallelism because you can get it by just piping stuff and relying on the kernel's buffering to keep multiple processes busy.

E.g., tar runs on one core dealing with the filesystem, while gzip runs on another core compressing the stream.
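A minimal sketch of that pipeline driven from Python (the directory and output names are placeholders): tar and gzip run as two separate processes, and the kernel's pipe buffer keeps both busy, the same effect as tar cf - some_dir | gzip > some_dir.tar.gz.

    import subprocess

    # tar writes the archive into a pipe; gzip reads from it and writes the
    # compressed stream to a file. Two processes, two cores, kernel buffering
    # in between.
    tar = subprocess.Popen(["tar", "cf", "-", "some_dir"],
                           stdout=subprocess.PIPE)
    with open("some_dir.tar.gz", "wb") as out:
        gz = subprocess.Popen(["gzip", "-c"], stdin=tar.stdout, stdout=out)
        tar.stdout.close()  # drop the parent's copy so tar gets SIGPIPE if gzip dies
        gz.wait()
    tar.wait()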

In the early days, Windows would have a single program (e.g., WinZip) doing everything on a single core, while Unix would split the task into a pipeline of multiple programs, allowing this implicit parallelism and performing noticeably better.

Today this approach is showing its age and doesn't cut it anymore on 128-core setups.


“Recent” as in the 80s or 90s, sure. Threads are now older than Unix was when threads were introduced.


I’m not sure I understand. I’ve typically thought of threads, as in things actually called “threads”, as being formalized with POSIX threads. Before that, while my memory is vague on this, it was all just lumped into multiprocessing under various names.


Even if you ignore threading models prior to pthreads, pthreads itself dates to 1995.


And Unix dates to 1973.


So: Unix was ~22 when Pthreads were introduced in 1995, and Pthreads are 28 years old now.


I agree with most of these points, except blaming this on processes vs. threads. The only difference is all memory being shared by default vs. explicitly deciding what memory to share. With all the emphasis on memory safety on HN, you'd think this point would be appreciated.
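A tiny sketch of that difference (toy code, names made up): with threads, every module-level object is implicitly shared between them; with processes, nothing is shared except what you explicitly hand over, e.g. a multiprocessing.Value.

    import threading
    import multiprocessing as mp

    counter = 0                    # implicitly visible to every thread

    def bump_in_thread():
        global counter
        counter += 1               # shared by default; needs a lock to be safe

    shared = mp.Value("i", 0)      # the only memory the child process will share

    def bump_in_process(v):
        with v.get_lock():
            v.value += 1

    if __name__ == "__main__":
        t = threading.Thread(target=bump_in_thread)
        t.start(); t.join()
        p = mp.Process(target=bump_in_process, args=(shared,))
        p.start(); p.join()
        print(counter, shared.value)   # prints: 1 1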


That’s fairly new, with threads and processes becoming basically the same. Historically, threads didn’t exist; then they were horrifically implemented and non-standard; then they were standardized and still horrifically implemented; then they were better implemented but the APIs were difficult to use safely; and so on. Threads were also much more lightweight than a process, which shifted with lightweight processes, etc.


That's true, but if you're looking that far back, multicore is new too.


I guess “that far back” becomes different as you get older :-) it doesn’t seem that long ago to me :-)


Well, if you split the file into chunks you could fan the work out across cores by compressing each chunk individually.
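That's basically what pbzip2 does. A toy sketch of the idea (file names and chunk size are arbitrary), relying on the fact that independent bz2 streams can be concatenated and still decompress as a single file:

    import bz2
    from multiprocessing import Pool

    CHUNK = 16 * 1024 * 1024   # arbitrary chunk size

    def chunks(path):
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    return
                yield block

    if __name__ == "__main__":
        # compress chunks on all cores and append each finished bz2 stream
        with Pool() as pool, open("big.dat.bz2", "wb") as out:
            for compressed in pool.imap(bz2.compress, chunks("big.dat")):
                out.write(compressed)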


Having to do things like this is exactly the problem.


Processes can still be fine. If you're lucky you can be debating over whether you'd like something to be 99x or 100x faster.


With all due respect to Carmack, he’s using bzip in 2023; that’s pretty outdated on every front.


You'd be surprised. There are some workloads - for me, it's geospatial data - where bzip2 clobbers all of the alternatives.


I'm using bzip2 to compress a specific type of backup. In my case I cannot afford to steal CPU time from other running processes, so the backup process runs with a severely limited CPU percentage. By crude testing I found that bzip2 used 10x less memory and finished several times faster than xz, while coming very close on compression ratio.

Other algorithms like zstd and gzip resulted in much lower compression ratios.

I'm sure there is a more efficient solution, but changing three letters in a script was pretty much the maximum amount of effort I was going to put in.

On an unrelated note, has someone already made a meta-compression algorithm which simply picks the best performing compression algorithm for each input?
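The naive version of what I mean would be something like this toy sketch (a real format would also need a header recording which codec won):

    import bz2, gzip, lzma

    def pick_best(data: bytes):
        # compress with a few stdlib codecs and keep whichever comes out smallest
        candidates = {
            "gzip": gzip.compress(data),
            "bzip2": bz2.compress(data),
            "xz": lzma.compress(data),
        }
        name = min(candidates, key=lambda k: len(candidates[k]))
        return name, candidates[name]

    if __name__ == "__main__":
        with open("sample.dat", "rb") as f:   # placeholder input file
            name, blob = pick_best(f.read())
        print(name, len(blob))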


I've not seen one that picks the best compression algorithm, but I've seen ones that run a quick test to determine whether the data is worth compressing at all. For example, the borg-backup software can be configured to try a light, fast compression algorithm on each chunk of data; if the chunk turns out to be compressible, it then uses a more computationally expensive algorithm to really squash it down.


I've also noticed for some text documents (was it JSON? I don't remember) that bzip compresses significantly better than xz (and of course gzip/pigz). I'm not sure if I tested zstd with high/extreme settings at that time.


For some reason, bzip compresses text incredibly well, and it has for years; I remember noticing this almost 20 years ago.


It uses the Burrows-Wheeler transform to place similar strings next to each other before using other compression tricks, so it usually does a bit better.
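A naive sketch of the transform itself (the real bzip2 works block by block and layers run-length, move-to-front and Huffman coding on top):

    def bwt(s: str) -> str:
        # naive Burrows-Wheeler transform: sort every rotation of the string
        # and read off the last column; '$' marks the end so it stays reversible
        s += "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    # bwt("banana") == "annb$aa": the n's and most of the a's end up adjacent,
    # which the later run-length and Huffman stages can exploit
    print(bwt("banana"))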


Oh, that's interesting. I stopped using bzip2 around the time the kernel sources started shipping as xz.

Do you know if there are any tests showing which compressor is better (compression-wise) for which kind of data?



