
> One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.

That is quite interesting. One challenge in general with Parquet and Arrow in the OTel / observability ecosystem is that the shape of span data is not known up front. Spans carry arbitrary attributes, and those can change. To the best of my knowledge, no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.






The thing we are trying to achieve is to be able to experiment with and tune the way data is grouped on disk. Parquet has one way of laying data out, CSV is another (though it's a text format, so a bit moot), ORC is another, and Lance has yet another. The file format itself stores how it's physically laid out on disk, so you can tune and tweak physical layouts to match the specific storage needs of your system (this is the toolkit part, where you can take Vortex and use it to implement your own file format). Having said that, we will have an implementation of a file format that follows a particular layout.

Wow, I think this is the thing I wished existed for years! Most file formats leave a huge compression opportunity on the table just because of their choice of physical layout. (I call the simple case "striding order", idk.) But getting it right takes a lot of experimentation, which becomes too much churn for applications and can result in storage layouts that are great for compression but annoying to code against. So the obvious answer (to me at least) is that you need to decouple physical and logical layouts. I'm glad someone is finally trying it!

I've been experimenting with taking this self-description paradigm even further, for a file format I've cooked up for ephemeral data in my search engine.

Basically, since I ended up building a custom library for this, I wanted to solve the portability problem by making it stupidly simple to reverse engineer, so I cooked up a convention where each column (and supporting column) is a file, with a file name that describes its format and role.

So a real-world production table looks like this if you ls in the directory (omitting a few columns for brevity):

  combinedId.0.dat.s64le.bin
  documentMeta.0.dat.s64le.bin
  features.0.dat.s32le.bin
  size.0.dat.s32le.bin
  termIds.0.dat-len.varint.bin
  termIds.0.dat.s64le[].zstd
  termMetadata.0.dat-len.varint.bin
  termMetadata.0.dat.s8[].zstd

The design goal is that just based on an ls output, someone who has never seen the code of the library producing the files should be able to trivially write code that reads it.
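As a rough illustration of that goal, here is a minimal Python sketch of a reader written only from the listing above. Everything in it is a guess from the file names, not the library's actual code: it assumes the pattern is column.part.role.type.container, that s64le and s32le mean signed little-endian integers, that s8 means signed bytes, and that a plain .bin file is just the raw values back to back. The names parse_name and read_fixed_width are made up for this example, and the varint and zstd files are deliberately left out.

```python
import re
import struct
from pathlib import Path

# Assumed meaning of the type tags, guessed from the file names alone.
FIXED_WIDTH = {"s8": "b", "s32le": "<i", "s64le": "<q"}

# Assumed pattern: <column>.<part>.<role>.<type>[optional "[]"].<container>
NAME_RE = re.compile(
    r"^(?P<column>[^.]+)\.(?P<part>\d+)\.(?P<role>dat(?:-len)?)"
    r"\.(?P<type>[^.\[\]]+)(?P<array>\[\])?\.(?P<container>bin|zstd)$"
)


def parse_name(filename):
    """Split a column file name into its (assumed) descriptive parts."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognised column file name: {filename}")
    return {
        "column": m["column"],
        "part": int(m["part"]),
        "role": m["role"],
        "type": m["type"],
        "is_array": m["array"] is not None,
        "container": m["container"],
    }


def read_fixed_width(path):
    """Read an uncompressed, fixed-width scalar column into a list of ints."""
    info = parse_name(Path(path).name)
    if info["is_array"] or info["container"] != "bin":
        raise NotImplementedError("only plain scalar .bin columns are handled")
    if info["type"] not in FIXED_WIDTH:
        raise NotImplementedError(f"unhandled type tag: {info['type']}")
    data = Path(path).read_bytes()
    # iter_unpack walks the buffer one fixed-size value at a time.
    return [v for (v,) in struct.iter_unpack(FIXED_WIDTH[info["type"]], data)]
```

Under these assumptions, read_fixed_width("size.0.dat.s32le.bin") would return the column as a list of Python ints. The array columns would need the matching dat-len file to split the decompressed values into rows.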

Internally the design of Vortex is very similar. The file consists of a whole bunch of "messages" (your files), which then have some metadata attached, and the read logic decides which messages it needs when.

Do you have a deeper writeup of this anywhere?

Not yet, but I will compile one at some point. I'm in the middle of moving right now so I don't quite have the time to sit down and finish the write-up...

Parquet also encodes the physical layout using footers [1], as does ORC [2]. Perhaps the author meant support for semi-structured data, like the spans you mention.

[1]: https://parquet.apache.org/docs/file-format/

[2]: https://orc.apache.org/specification/ORCv2/#file-tail


Yeah, we should be clearer in our description about how our footers differ from Parquet's. Parquet is a bit more prescriptive; for example, it requires row groups, which are not required by Vortex. If you have a column with huge values and another column of 8-bit ints, they can be paged separately, if you like.
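To make the "paged separately" idea concrete, here is a toy Python sketch of a file whose footer records where each column's chunks live, so each column can pick its own chunk size and no shared row groups are needed. This is not Vortex's actual format or API: the JSON footer, the trailing 8-byte footer length, the s64le value encoding, and the names write_file and read_column are all assumptions made up for illustration.

```python
import io
import json
import struct


def write_file(columns, chunk_rows):
    """Write columns of ints as s64le chunks, then a footer describing them.

    columns:    column name -> list of ints
    chunk_rows: column name -> rows per chunk (chosen per column)
    """
    buf = io.BytesIO()
    layout = {}
    for name, values in columns.items():
        step = chunk_rows[name]
        chunks = []
        for start in range(0, len(values), step):
            part = values[start:start + step]
            chunks.append({"offset": buf.tell(), "rows": len(part)})
            buf.write(struct.pack(f"<{len(part)}q", *part))
        layout[name] = chunks
    footer = json.dumps(layout).encode("utf-8")
    buf.write(footer)
    # The last 8 bytes give the footer length, so a reader can find it.
    buf.write(struct.pack("<Q", len(footer)))
    return buf.getvalue()


def read_column(blob, name):
    """Read one column by consulting only the footer and that column's chunks."""
    (footer_len,) = struct.unpack("<Q", blob[-8:])
    layout = json.loads(blob[-8 - footer_len:-8])
    values = []
    for chunk in layout[name]:
        values.extend(
            struct.unpack_from(f"<{chunk['rows']}q", blob, chunk["offset"])
        )
    return values
```

For example, write_file({"big": [1, 2, 3, 4, 5], "small": [10, 20]}, {"big": 2, "small": 5}) stores "big" in three chunks and "small" in one, and read_column recovers either column without touching the other's bytes. Changing a column's chunk size changes only what the footer says, not the reader.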

As someone who works in data, schema-on-read formats like Parquet are amazing. I hate having to guess schemas with CSVs.

Pandera is quite nice for at least forcing validation in Pandas for this

I think it was in a blog post or a podcast (a16z with motherduck?) where they said Snowflake apparently largely solved this problem, but since it's proprietary and locked away, most people won't get a chance to use or implement it?

I have no idea since I've never had access to Snowflake...


If it's in the footer, then appending to the columns is out of the question, it seems, without moving the footer.

Isn't this what the Arrow IPC File format does too? Is there something unique about this?

Compression! Vortex can easily be 10x smaller than the equivalent Arrow representation (and decompresses very quickly into Arrow)

Nice!


