Nimble, the newest optimized big-data storage format

Interesting talk about an upcoming open-source data format from Meta called Nimble. It has some nice optimizations. Julia is pretty good with data and with optimizations, so it might be interesting to support this format when it gets going.

2 Likes

what’s the tldr of it vs Arrow?

4 Likes

My quick TL;DR understanding is: Arrow is memory-centered, so it has very low CPU overhead and can be mmapped nicely. Parquet is more disk-centered, so it takes less storage (and can be more efficient overall because less data has to be moved). Nimble is an optimized version of Parquet. But I’m not an expert on these formats, and Nimble is not yet open-sourced, though it’s promised to be in a week or so.

My biggest TLDR from the YouTube video is that they do not have RowGroup metadata. Instead, each column contains metadata at the top. Nimble is competing with Parquet, not Arrow.

The read optimization here is that all of the file-structure metadata can be stored in the file footer, which can be read in a single IO, and you only have to parse the encoding information for the set of columns you actually want.
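
As a rough sketch of that read path (a hypothetical layout for illustration; the actual Nimble format isn’t published yet): one IO fetches the footer, and then only the requested columns get seeked to and decoded.

```julia
# Hypothetical footer entry: where each column's data (including its own
# metadata header) lives in the file. Not Nimble's real structures.
struct ColumnMeta
    name::String
    offset::Int
    length::Int
end

# After the single IO that fetched the footer, read only the columns the
# query actually touches: one seek + read per requested column.
function read_columns(io::IO, footer::Vector{ColumnMeta}, wanted::Vector{String})
    out = Dict{String,Vector{UInt8}}()
    for meta in footer
        meta.name in wanted || continue   # skip columns we don't need
        seek(io, meta.offset)
        out[meta.name] = read(io, meta.length)
    end
    return out
end
```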

It also allows for nested encoding within a single row group. So if you have a long stretch of a repeated value, you can use run length encoding, and then if in the next chunk there’s a lot of heterogeneity, you can switch to an array of lookup keys.
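
To make those two encodings concrete, here is a toy sketch in Julia (my own illustration, not Nimble code): run-length encoding for a repetitive chunk, and a dictionary of lookup keys for a heterogeneous one.

```julia
# Run-length encode a chunk: store (value, run_length) pairs.
function rle_encode(chunk::Vector{T}) where T
    runs = Tuple{T,Int}[]
    for v in chunk
        if !isempty(runs) && runs[end][1] == v
            runs[end] = (v, runs[end][2] + 1)
        else
            push!(runs, (v, 1))
        end
    end
    return runs
end

# Dictionary encode a chunk: store the unique values plus one lookup key per row.
function dict_encode(chunk::Vector{T}) where T
    dict = unique(chunk)
    index = Dict(v => i for (i, v) in enumerate(dict))
    codes = [index[v] for v in chunk]
    return dict, codes
end

rle_encode(fill("ad_click", 6))                    # [("ad_click", 6)]
dict_encode(["us", "de", "us", "fr", "de", "jp"])  # (["us", "de", "fr", "jp"], [1, 2, 1, 3, 2, 4])
```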

Having all these knobs increases the difficulty of deciding on a write strategy. So much so that they are employing machine learning models over their data to pre-compute ideal encoding schemas.
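
Just to illustrate the decision problem (the talk describes ML models for this; a real writer would weigh far more than size), a naive writer could compare rough size estimates per chunk, reusing the toy encoders above:

```julia
# Naive per-chunk encoding choice based on crude size proxies.
function choose_encoding(chunk)
    n_runs   = length(rle_encode(chunk))   # RLE stores one (value, length) pair per run
    n_unique = length(unique(chunk))       # dictionary stores uniques plus one key per row
    rle_cost  = 2 * n_runs
    dict_cost = n_unique + length(chunk)
    return rle_cost <= dict_cost ? :rle : :dictionary
end

choose_encoding(fill(0, 1000))      # :rle  (one long run)
choose_encoding(rand(1:50, 1000))   # :dictionary  (many short runs)
```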

However, those knobs also allow you to tightly tailor your encoding and compression strategy to your read use case. Do you want quick access to a subset of columns? Do you want lots of metadata at the top of certain columns so the reader can use it as a filter? Do you have a custom encoding for a special column, but still want most of your file to be readable by older versions of the reader?

Edit: I think some important context here is that Facebook is creating this format to support their ML workloads, where they have tables with thousands and thousands of features. At that width the row-group metadata really explodes, which means you basically have to do one IO to read the footer and then a second IO to seek back to the beginning of the metadata.
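
To put made-up numbers on that scale: with, say, 10,000 feature columns and 500 row groups, a Parquet-style footer carries on the order of 10,000 × 500 = 5,000,000 per-column, per-row-group metadata entries, whereas column-level metadata is closer to 10,000 entries, each of which can be skipped unless that column is actually read.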

1 Like