
But pigz shows that the Unix pipeline philosophy works just fine. (Of course, compressing before tarring is probably better than compressing the tarred file, but that should be pipelinable as well.)



For many small files, compressing first will compress worse, because each file has its own dictionary; compressing last means you can take advantage of similarities in files to improve compression ratio.

Compressing first can also be slower if the average file size is smaller than the block size, because the main thread cannot queue new jobs as fast as cores complete them (this happens e.g. with 7zip at fast compression settings with solid archive turned off). Tarring then compressing means small files can be aggregated into a single block, giving both good speed and compression ratio.
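To make the ratio argument concrete, here is a rough sketch using Python's zlib (DEFLATE, the same algorithm gzip and pigz use). The 200 small JSON-like "files" are made-up sample data, and a plain concatenation stands in for tar, so treat the numbers as illustrative only.

```python
import zlib

# Hypothetical sample data: 200 small, similar "files" (assumed for illustration).
files = [
    f'{{"id": {i}, "name": "user{i}", "status": "active", "role": "member"}}\n'.encode()
    for i in range(200)
]

# Compress first: every file starts with an empty history window.
separate_total = sum(len(zlib.compress(f)) for f in files)

# Aggregate first (concatenation stands in for tar), then compress once,
# so later files can reference byte sequences seen in earlier ones.
combined_total = len(zlib.compress(b"".join(files)))

print("compressed separately:", separate_total, "bytes")
print("compressed together:  ", combined_total, "bytes")
```

With data this repetitive the combined stream should come out far smaller; with dissimilar files the gap narrows.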


Zstandard has a dictionary functionality which allows you to pre-train on sample data to achieve higher compression ratios and faster compression on large numbers of small files.


Don't you then need to store that dictionary somewhere out of band? It seems like you would still need a tar-like format to manage that, at which point as an archive format it seems more complicated with a worse separation of concerns.
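A small sketch of the preset-dictionary idea, using zlib's `zdict` parameter from the Python standard library rather than zstd itself (zstd trains its dictionary from samples, while zlib just takes raw bytes, so this is an analogy and not the zstd API). The dictionary and the sample record are invented for illustration. It also shows the out-of-band concern: the decompressor has to be handed the same dictionary.

```python
import zlib

# Hypothetical dictionary: a hand-picked record resembling the data.
# (zstd would train this from samples; zlib uses the bytes as given.)
dictionary = b'{"id": 0, "name": "user0", "status": "active", "role": "member"}\n'

def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    # The same dictionary must be supplied here, i.e. stored or shipped separately.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

small_file = b'{"id": 42, "name": "user42", "status": "active", "role": "member"}\n'

plain = zlib.compress(small_file)
with_dict = compress_with_dict(small_file, dictionary)

print("no dictionary:  ", len(plain), "bytes")
print("with dictionary:", len(with_dict), "bytes")
print(decompress_with_dict(with_dict, dictionary) == small_file)
```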


Interesting - does it fit that automatically or is there a manual step?


> compressing before tarring is probably better

Not if the files are similar. If you're compressing the files separately you'll start with a clean state rather than reusing previous fragments. Compressing a BMP after a TXT may not be beneficial, but compressing 3 tar'ed TXTs is definitely better than doing them separately.


That is true, and it can easily halve the total size with small, similar files, but it also means you have to unpack the whole archive when you need the last file in a tarball.

AFAIK the gzip command still cannot compress directory information and therefore needs tar in front of it if you want to retain a folder structure.
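As a sketch of that division of labour, Python's tarfile module can write a tar stream through gzip: tar records the paths, gzip only sees one byte stream. The file names and contents below are hypothetical.

```python
import io
import tarfile

# Hypothetical in-memory files with a folder structure (names are illustrative).
entries = {
    "project/README.txt": b"hello\n",
    "project/src/main.txt": b"main body\n",
    "project/src/util.txt": b"util body\n",
}

buf = io.BytesIO()
# "w:gz" writes a tar stream through gzip: tar keeps the paths, gzip compresses.
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in entries.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
    # There is no index: reaching the last member means decompressing
    # everything that precedes it in the stream.
    last = tar.extractfile("project/src/util.txt").read()

print(names)
print(last)
```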


Sometimes you need fast indexed access to a specific file in the compressed content without decompressing the entire archive (for example JARs, which are just ZIPs).

TIL: you can use method 93 - Zstandard (zstd) Compression - with ZIPs
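A minimal sketch of that indexed access with Python's zipfile module. It uses plain deflate rather than method 93, and the entry names are invented. Each member is compressed on its own and located through the central directory, so reading one entry does not touch the others.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    # Hypothetical entries; each one is compressed independently.
    for i in range(100):
        zf.writestr(f"entry{i:03d}.txt", f"contents of entry {i}\n" * 20)

with zipfile.ZipFile(buf) as zf:
    # The central directory gives the offset of every member, so the
    # reader can seek straight to the last one.
    info = zf.getinfo("entry099.txt")
    data = zf.read("entry099.txt")

print(info.header_offset, info.compress_size, info.file_size)
print(data[:22])
```

This is the trade-off discussed above: per-member compression gives random access but cannot share redundancy between files.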


The only point of using .zip is maximum compatibility, and you lose that if you use anything other than deflate or no compression. If you are going to use something else, you might as well use a less wasteful and better-defined archive container, or something entirely custom like most games do.


93?




