
I wasn't aware the Unix philosophy was to not use multithreading on large jobs that can be parallelized.

You can complain about philosophies, but this is just using the wrong tool for the job. Complain about bzip if you feel the bzip authors should have made a multithreaded implementation for you.




Unix generally favors processes over threads, at least the old school Unix. Threads are a more recent innovation.

The old approach was that programs don't need internal parallelism because you can get it by just piping stuff and relying on the kernel's buffering to keep multiple processes busy.

E.g., tar runs on one core dealing with the filesystem, while gzip runs on another core compressing the stream.
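A minimal sketch of that pipeline driven from Python (the directory and output names are placeholders): tar and gzip run as two separate processes, and the kernel's pipe buffer keeps both busy, the same effect as tar cf - some_dir | gzip > some_dir.tar.gz.

    import subprocess

    # tar writes the archive into a pipe; gzip reads from it and writes the
    # compressed stream to a file. Two processes, two cores, kernel buffering
    # in between.
    tar = subprocess.Popen(["tar", "cf", "-", "some_dir"],
                           stdout=subprocess.PIPE)
    with open("some_dir.tar.gz", "wb") as out:
        gz = subprocess.Popen(["gzip", "-c"], stdin=tar.stdout, stdout=out)
        tar.stdout.close()  # drop the parent's copy so tar gets SIGPIPE if gzip dies
        gz.wait()
    tar.wait()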

In the early days, Windows would have a single program (e.g., WinZip) doing everything on a single core, while Unix would split the task into a pipeline of multiple programs, allowing this implicit parallelism and performing noticeably better.

Today this approach is showing its age and doesn't cut it anymore on 128-core setups.


“Recent” as in the 80s or 90s, sure. Threads are now older than Unix was when threads were introduced.


I’m not sure I understand. I’ve typically thought of threads, as in things actually called “threads”, as being formalized with POSIX threads. Before that, while my memory is vague on this, it was all just lumped into multiprocessing under various names.


Even if you ignore threading models prior to pthreads, pthreads itself dates to 1995.


And Unix dates to 1973.


So: Unix was ~22 when Pthreads were introduced in 1995, and Pthreads are 28 years old now.


I agree with most of these points, except blaming this on processes vs. threads. The only difference is all memory being shared by default vs. explicitly deciding what memory to share. With all the emphasis on memory safety on HN, you'd think this point would be appreciated.
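A tiny sketch of that difference (toy code, names made up): with threads, every module-level object is implicitly shared between them; with processes, nothing is shared except what you explicitly hand over, e.g. a multiprocessing.Value.

    import threading
    import multiprocessing as mp

    counter = 0                    # implicitly visible to every thread

    def bump_in_thread():
        global counter
        counter += 1               # shared by default; needs a lock to be safe

    shared = mp.Value("i", 0)      # the only memory the child process will share

    def bump_in_process(v):
        with v.get_lock():
            v.value += 1

    if __name__ == "__main__":
        t = threading.Thread(target=bump_in_thread)
        t.start(); t.join()
        p = mp.Process(target=bump_in_process, args=(shared,))
        p.start(); p.join()
        print(counter, shared.value)   # prints: 1 1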


That’s fairly new, with threads and processes becoming basically the same. Historically, threads didn’t exist; then they were horrifically implemented and non-standard; then they were standardized and still horrifically implemented; then they were better implemented but the APIs were difficult to use safely; and so on. Threads were also much more lightweight than a process, which shifted with lightweight processes, etc.


That's true, but if you're looking that far back, multicore is new too.


I guess “that far back” becomes different as you get older :-) it doesn’t seem that long ago to me :-)


Well, if you split the file into chunks you could fan the work out across cores by compressing each chunk individually.
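That's basically what pbzip2 does. A toy sketch of the idea (file names and chunk size are arbitrary), relying on the fact that independent bz2 streams can be concatenated and still decompress as a single file:

    import bz2
    from multiprocessing import Pool

    CHUNK = 16 * 1024 * 1024   # arbitrary chunk size

    def chunks(path):
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    return
                yield block

    if __name__ == "__main__":
        # compress chunks on all cores and append each finished bz2 stream
        with Pool() as pool, open("big.dat.bz2", "wb") as out:
            for compressed in pool.imap(bz2.compress, chunks("big.dat")):
                out.write(compressed)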


Having to do things like this is exactly the problem.


Processes can still be fine. If you're lucky you can be debating over whether you'd like something to be 99x or 100x faster.


With all due respect to Carmack, he’s using bzip in 2023; that’s pretty outdated on every front.


You'd be surprised. There are some workloads - for me, it's geospatial data - where bzip2 clobbers all of the alternatives.


I'm using bzip2 to compress a specific type of backup. In my case I cannot afford to steal CPU time from other running processes, so the backup process runs with a severely limited CPU percentage. By crude testing I found that bzip2 used 10x less memory and finished several times faster than xz, while coming very close on compression ratio.

Other algorithms like zstd and gzip resulted in much lower compression ratios.

I'm sure there is a more efficient solution, but changing three letters in a script was pretty much the maximum amount of effort I was going to put in.

On an unrelated note, has someone already made a meta-compression algorithm which simply picks the best performing compression algorithm for each input?
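The naive version of what I mean would be something like this toy sketch (a real format would also need a header recording which codec won):

    import bz2, gzip, lzma

    def pick_best(data: bytes):
        # compress with a few stdlib codecs and keep whichever comes out smallest
        candidates = {
            "gzip": gzip.compress(data),
            "bzip2": bz2.compress(data),
            "xz": lzma.compress(data),
        }
        name = min(candidates, key=lambda k: len(candidates[k]))
        return name, candidates[name]

    if __name__ == "__main__":
        with open("sample.dat", "rb") as f:   # placeholder input file
            name, blob = pick_best(f.read())
        print(name, len(blob))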


I've not seen one that picks the best compression algorithm, but I've seen ones that run a quick test to determine whether the data is worth compressing at all. For example, the borg-backup software can be configured to try a light, fast compression algorithm on each chunk of data; if the chunk turns out to be compressible, it then uses a more computationally expensive algorithm to really squash it down.


I've also noticed for some text documents (was it JSON? I don't remember) that bzip compresses significantly better than xz (and of course gzip/pigz). I'm not sure if I tested zstd with high/extreme settings at that time.


For some reason, bzip compresses text incredibly well, and it has for years; I remember noticing this almost 20 years ago.


It uses the Burrows-Wheeler transform to place similar strings next to each other before using other compression tricks, so it usually does a bit better.
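A naive sketch of the transform itself (the real bzip2 works block by block and layers run-length, move-to-front and Huffman coding on top):

    def bwt(s: str) -> str:
        # naive Burrows-Wheeler transform: sort every rotation of the string
        # and read off the last column; '$' marks the end so it stays reversible
        s += "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    # bwt("banana") == "annb$aa": the n's and most of the a's end up adjacent,
    # which the later run-length and Huffman stages can exploit
    print(bwt("banana"))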


Oh, that's interesting. I stopped using bzip2 around the time the kernel sources started shipping as xz.

Do you know if there are any tests showing which compressor is better (compression-wise) for which kind of data?



