> One of the unique attributes of the (in-progress) Vortex file format is that i...

robert3005 · 2024-10-14T21:34:55.000000Z

The thing we are trying to achieve is to be able to experiment with and tune the way data is grouped on disk. Parquet has one way of laying data out, CSV is another (though it's a text format so a bit moot), ORC is another, and Lance has yet another method. The file format itself stores how it's physically laid out on disk, so you can tune and tweak physical layouts to match the specific storage needs of your system (this is the toolkit part, where you can take Vortex and use it to implement your own file format). Having said that, we will have an implementation of the file format that follows a particular layout.

infogulch · 2024-10-15T05:16:35.000000Z

Wow, I think this is the thing I wished existed for years! Most file formats leave a huge compression opportunity on the table just because of their choice of physical layout. (I call the simple case "striding order", idk) But getting it right takes a lot of experimentation, which becomes too much churn for applications and can result in storage layouts that are great for compression but annoying to code against. So the obvious answer (to me at least) is that you need to decouple physical and logical layouts. I'm glad someone is finally trying it!

marginalia_nu · 2024-10-15T10:15:47.000000Z

I've been experimenting with taking this self-description paradigm even farther, for a file format I've cooked up for ephemeral data in my search engine.

Basically, since I ended up building a custom library for this, I wanted to solve the portability problem by making it stupidly simple to reverse engineer, so I cooked up a convention where each column (and supporting column) is a file, with a file name that describes its format and role.

So a real-world production table looks like this if you ls in the directory (omitting a few columns for brevity):

  combinedId.0.dat.s64le.bin
  documentMeta.0.dat.s64le.bin
  features.0.dat.s32le.bin
  size.0.dat.s32le.bin
  termIds.0.dat-len.varint.bin
  termIds.0.dat.s64le[].zstd
  termMetadata.0.dat-len.varint.bin
  termMetadata.0.dat.s8[].zstd

The design goal is that just based on an ls output, someone who has never seen the code of the library producing the files should be able to trivially write code that reads it.
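
As a rough illustration of that goal, here is a minimal Python sketch that decodes a column using nothing but its file name. Reading the name as `<column>.<part>.<role>.<type>.<container>`, treating `s8`/`s32le`/`s64le` as signed little-endian integers, and the helper names `parse_name` and `decode_fixed_width` are all guesses made from the listing above, not the library's documented convention. The varint and zstd files are left out because their exact encoding isn't stated here.

```python
import struct

# Assumed mapping from type tag to struct format code (guessed from the names).
TYPE_CODES = {"s8": "b", "s32le": "i", "s64le": "q"}

def parse_name(filename):
    """Split e.g. 'size.0.dat.s32le.bin' into its (assumed) components."""
    column, part, role, type_tag, container = filename.split(".")
    is_array = type_tag.endswith("[]")
    return {
        "column": column,
        "part": int(part),
        "role": role,                                   # 'dat' or 'dat-len'
        "type": type_tag[:-2] if is_array else type_tag,
        "is_array": is_array,
        "container": container,                         # 'bin' or 'zstd'
    }

def decode_fixed_width(filename, raw):
    """Decode the bytes of an uncompressed, fixed-width column file."""
    info = parse_name(filename)
    if info["container"] != "bin" or info["type"] not in TYPE_CODES:
        raise ValueError(f"not handled in this sketch: {filename}")
    code = TYPE_CODES[info["type"]]
    count = len(raw) // struct.calcsize("<" + code)
    # '<' forces little-endian with no padding, matching the 'le' suffix.
    return list(struct.unpack(f"<{count}{code}", raw))
```

For example, `decode_fixed_width("size.0.dat.s32le.bin", struct.pack("<3i", 1, -2, 3))` returns `[1, -2, 3]`.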

gatesn · 2024-10-15T12:19:28.000000Z

Internally the design of Vortex is very similar. The file consists of a whole bunch of "messages" (your files), which then have some metadata attached, and the read logic decides which messages it needs when.

hiatus · 2024-10-15T11:38:16.000000Z

Do you have a deeper writeup of this anywhere?

marginalia_nu · 2024-10-15T12:34:51.000000Z

Not yet, but I will compile one at some point. I'm in the middle of moving right now so I don't quite have the time to sit down and finish the write-up...

sa46 · 2024-10-14T23:21:37.000000Z

Parquet also encodes the physical layout using footers [1], as does ORC [2]. Perhaps the author meant support for semi-structured data, like the spans you mention.

[1]: https://parquet.apache.org/docs/file-format/

[2]: https://orc.apache.org/specification/ORCv2/#file-tail

danking00 · 2024-10-15T02:37:55.000000Z

Yeah, we should be clearer in our description about how our footers differ from Parquet's. Parquet is a bit more prescriptive; for example, it requires row groups, which are not required by Vortex. If you have a column with huge values and another column of 8-bit ints, they can be paged separately, if you like.
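
To make the "paged separately" point concrete, here is a small Python sketch in which each column picks its own page boundaries from a byte budget. This is not Vortex's writer or API; `page_boundaries`, the 1 MB budget, and the column sizes are hypothetical and only illustrate the idea.

```python
def page_boundaries(value_sizes, page_bytes):
    """Return the row indices at which a new page starts for one column."""
    starts, used = [0], 0
    for row, size in enumerate(value_sizes):
        # Start a new page when the next value would overflow the budget.
        if used and used + size > page_bytes:
            starts.append(row)
            used = 0
        used += size
    return starts

rows = 1000
blobs = [100_000] * rows   # hypothetical column of ~100 KB values
flags = [1] * rows         # column of 8-bit ints, 1 byte each

blob_pages = page_boundaries(blobs, page_bytes=1_000_000)
flag_pages = page_boundaries(flags, page_bytes=1_000_000)
```

With these made-up sizes, the wide column is split every 10 rows while the narrow column fits in a single page, so neither column's split points are dictated by the other.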

gigatexal · 2024-10-14T18:57:11.000000Z

As someone who works in data, schema-on-read formats like Parquet are amazing. I hate having to guess schemas with CSVs.

physicsguy · 2024-10-14T20:37:00.000000Z

Pandera is quite nice for at least forcing validation in Pandas for this

yawnxyz · 2024-10-15T03:40:30.000000Z

I think it was in a blog post or a podcast (a16z with motherduck?) where they said Snowflake apparently largely solved this problem, but since it's proprietary and locked away, most people won't get a chance to use or implement it?

I have no idea since I've never had access to Snowflake...

jnordwick · 2024-10-14T21:44:57.000000Z

If it's in the footer, then appending to the columns is out of the question, it seems, without moving the footer.

cle · 2024-10-14T21:08:47.000000Z

Isn't this what the Arrow IPC File format does too? Is there something unique about this?

_willmanning · 2024-10-14T21:34:07.000000Z

Compression! Vortex can easily be 10x smaller than the equivalent Arrow representation (and decompresses very quickly into Arrow)

cle · 2024-10-14T23:37:28.000000Z

Nice!