Nimble, the newest optimized big-data storage format

Interesting talk about an upcoming open-source data format from Meta called Nimble. It has some nice optimizations. Julia is pretty good with data and with optimizations, so it might be interesting to support this format when it gets going.

2 Likes

what’s the tldr of it vs Arrow?

4 Likes

My quick TL;DR understanding is: Arrow is memory-centered, so it has very low CPU overhead and can be mmapped nicely. Parquet is more disk-centered, so it takes less storage (and can be more efficient overall because less data has to be moved). Nimble is an optimized version of Parquet. But I’m not an expert on these formats, and Nimble is not yet open-sourced, though it’s promised to be in a week or so.

My biggest TLDR from the YouTube video is that they do not have RowGroup metadata. Instead, each column contains metadata at the top. Nimble is competing with Parquet, not Arrow.

The read optimization here is that all of the file-structure metadata can be stored in the file footer, which can be read in a single IO, and you only have to parse the encoding information for the set of columns you actually want.
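
As a rough sketch of that read path (a hypothetical layout for illustration; the actual Nimble format isn’t published yet): one IO fetches the footer, and then only the requested columns get seeked to and decoded.

```julia
# Hypothetical footer entry: where each column's data (including its own
# metadata header) lives in the file. Not Nimble's real structures.
struct ColumnMeta
    name::String
    offset::Int
    length::Int
end

# After the single IO that fetched the footer, read only the columns the
# query actually touches: one seek + read per requested column.
function read_columns(io::IO, footer::Vector{ColumnMeta}, wanted::Vector{String})
    out = Dict{String,Vector{UInt8}}()
    for meta in footer
        meta.name in wanted || continue   # skip columns we don't need
        seek(io, meta.offset)
        out[meta.name] = read(io, meta.length)
    end
    return out
end
```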

It also allows for nested encoding within a single row group. So if you have a long stretch of a repeated value, you can use run length encoding, and then if in the next chunk there’s a lot of heterogeneity, you can switch to an array of lookup keys.
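
To make those two encodings concrete, here is a toy sketch in Julia (my own illustration, not Nimble code): run-length encoding for a repetitive chunk, and a dictionary of lookup keys for a heterogeneous one.

```julia
# Run-length encode a chunk: store (value, run_length) pairs.
function rle_encode(chunk::Vector{T}) where T
    runs = Tuple{T,Int}[]
    for v in chunk
        if !isempty(runs) && runs[end][1] == v
            runs[end] = (v, runs[end][2] + 1)
        else
            push!(runs, (v, 1))
        end
    end
    return runs
end

# Dictionary encode a chunk: store the unique values plus one lookup key per row.
function dict_encode(chunk::Vector{T}) where T
    dict = unique(chunk)
    index = Dict(v => i for (i, v) in enumerate(dict))
    codes = [index[v] for v in chunk]
    return dict, codes
end

rle_encode(fill("ad_click", 6))                    # [("ad_click", 6)]
dict_encode(["us", "de", "us", "fr", "de", "jp"])  # (["us", "de", "fr", "jp"], [1, 2, 1, 3, 2, 4])
```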

Having all these knobs increases the difficulty of deciding on a write strategy. So much so that they are employing machine learning models over their data to pre-compute ideal encoding schemas.
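
Just to illustrate the decision problem (the talk describes ML models for this; a real writer would weigh far more than size), a naive writer could compare rough size estimates per chunk, reusing the toy encoders above:

```julia
# Naive per-chunk encoding choice based on crude size proxies.
function choose_encoding(chunk)
    n_runs   = length(rle_encode(chunk))   # RLE stores one (value, length) pair per run
    n_unique = length(unique(chunk))       # dictionary stores uniques plus one key per row
    rle_cost  = 2 * n_runs
    dict_cost = n_unique + length(chunk)
    return rle_cost <= dict_cost ? :rle : :dictionary
end

choose_encoding(fill(0, 1000))      # :rle  (one long run)
choose_encoding(rand(1:50, 1000))   # :dictionary  (many short runs)
```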

However, those knobs also allow you to tightly tailor your encoding and compression strategy to your read use case. Do you want quick access to a subset of columns? Do you want lots of metadata at the top of certain columns so the reader can use it as a filter? Do you have a custom encoding for a special column, but still want most of your file to be readable by older versions of the reader?

Edit: I think some important context here is that Facebook is creating this format to support their ML workloads, where they have tables with thousands and thousands of features. At that width the row-group metadata really explodes, which means you basically have to do one IO to read the footer and then a second IO to seek back to the beginning of the metadata.
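
To put made-up numbers on that scale: with, say, 10,000 feature columns and 500 row groups, a Parquet-style footer carries on the order of 10,000 × 500 = 5,000,000 per-column, per-row-group metadata entries, whereas column-level metadata is closer to 10,000 entries, each of which can be skipped unless that column is actually read.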

1 Like