Suggestion for sometimes faster data serialization

Hey Julians,

So I had this idea, and I was going to open an issue on a relevant package for it, but then I realized the context is potentially broader than any single package.

So in the geospatial domain, a well-known trick for reducing file size is to cleave off the floating-point noise beyond the significant digits in all stored files. Now, in many other domains this is also true! Some data stores only need 4-8 bytes of precision, but after, say, manipulation, they end up with all this floating-point junk.
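A minimal sketch of the trick in base Julia (the `sigdigits` keyword of `round` does the truncation; the choice of 6 digits here is an arbitrary assumption for illustration):

```julia
# Rounding to a fixed number of significant digits strips the
# floating-point "noise" and shortens the printed representation,
# which is what ends up on disk in a plain-text format.
x = 1 / 3                   # prints as 0.3333333333333333
y = round(x; sigdigits=6)   # prints as 0.333333
println(length(string(x)), " chars vs ", length(string(y)), " chars")
```

The same value costs 18 characters raw but only 8 after truncation, so for a large numeric column the savings add up quickly.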

Is there a foreseeable benefit to a widespread pattern, implemented in, say, CSV.jl, JSON.jl, XLSX.jl, JuliaDB.jl, Parquet.jl, etc., for allowing a “significant digit” truncation? I realize this is one of those semi-dangerous things: if an end user “thinks they know” the truncation level and messes up, well, that sucks for them, but if you genuinely know, it’s kind of a godsend for large swaths of I/O.

I could imagine some type wrappers for some of these columnar I/O libraries specifying the significant digits for each column to save on bytes. Maybe some formats are already handling this to some extent?

Anyways, just a passing thought I had while trying to optimize my own ZMQ streams…

Can’t you just transform to Float32 or similar, and serialize that?
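For reference, the Float32 route is a one-liner in Julia; it halves the raw byte size and caps you at roughly 7 significant digits (a sketch using only base Julia, nothing package-specific):

```julia
# Broadcasting the Float32 constructor over a Float64 vector halves
# the binary footprint; text serializers then also emit shorter strings.
v64 = [1 / 3, 2 / 7, sqrt(2)]
v32 = Float32.(v64)
println(sizeof(v64), " bytes vs ", sizeof(v32), " bytes")  # 24 vs 12
```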

That’s one possibility, definitely. Maybe this is a moot point, actually. I don’t know the guts of how JSON/CSV readers turn plain text into native types; I suppose they probably aren’t doing anything at the byte level, so probably no performance gains there. Then, for binary file stores, I suppose skipping the first N bytes or trailing N bytes isn’t a huge gain either. Maybe someone with more knowledge of the internals knows whether this is pointless?

For parsing string types this is nice, though: fewer char reads, and a smaller disk & memory footprint.

I think it could help reduce file size for plain-text formats (such as JSON). Maybe one could just convert (or round) to a fixed-point type like https://github.com/JuliaMath/FixedPointDecimals.jl and then write it to a file from there?
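As a hedged sketch of that idea (not using FixedPointDecimals itself, just base Julia's `round` with a fixed number of decimal places, and 4 places as an arbitrary choice), rounding before writing plain text already shrinks the payload:

```julia
# Round to 4 decimal places before serializing to a CSV-style line;
# values like 0.1 + 0.2 carry floating-point noise that costs bytes.
xs = [0.1 + 0.2, 1 / 3, sqrt(2)]   # 0.30000000000000004, 0.3333..., 1.4142...
raw     = join(string.(xs), ",")
trimmed = join(string.(round.(xs; digits=4)), ",")
println(length(raw), " chars -> ", length(trimmed), " chars")
```

Here the line shrinks from 57 characters to 17, and the noise in 0.1 + 0.2 collapses back to "0.3".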

I have some rotatable 3D plots at the bottom of this post https://ericphanson.com/blog/2018/locally-maximizing-the-renyi-entropies/ that I made a couple of years ago with Plots.jl and the plotlyjs backend (one of my first experiments with Julia!). When that page loads, the browser downloads some (plaintext) .js files with hardcoded data points, each specified to 16 digits, which seems pretty wasteful. I’m sure there’s a better way by writing some JavaScript to load the data from a binary format, but this was the easiest, and anyway, I just wanted to show a collaborator some 3D plots. I think truncating to 5 or 6 digits would cut down on file size without affecting the plot quality at all. So to my mind there’s definitely a use case for the technique.

Beautiful plots! For the case of 2-D plots, we could easily determine the pixel resolution and use that as a cutoff; 3-D would be harder. I can foresee a use case here for some file types and applications. I’m just not exposed to the internals of all the I/O libraries like some people are. Nor am I a computer scientist, and I never tried to play one on TV.
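A rough sketch of that pixel-resolution cutoff (everything here is a hypothetical helper, not from any plotting library): compute the smallest on-screen step from the axis span and pixel count, and round to that many decimal places, since anything finer is invisible anyway:

```julia
# Round values to the resolution actually visible on screen.
# `lo`/`hi` are the axis limits, `npx` the rendered pixel width;
# differences smaller than (hi - lo) / npx land in the same pixel.
function pixel_round(xs, lo, hi, npx)
    step = (hi - lo) / npx
    digits = max(0, ceil(Int, -log10(step)))
    return round.(xs; digits=digits)
end

pixel_round([0.123456789, 0.987654321], 0.0, 1.0, 800)
# step = 0.00125, so 3 decimal places suffice: [0.123, 0.988]
```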

Hmmm… Fixed point decimals could be handy for this.

If you care about efficiency that much, you should maybe be using binary formats, not text formats like CSV or JSON. Of course, transforming to a lower precision for storage is often a good option for binary formats as well.

The LASzip compression uses this principle to attain impressive compression ratios.

@stevengj - I agree with that, and that’s kind of where I am looking, once I find a nice route to do it. Maybe BSON for me personally, but we’ll see. But for some people, CSVs and JSON are nice because they don’t require any special libraries to open them up and see what’s inside. Hundreds of thousands of people use those files daily to perform rudimentary analyses, and it’s a drain on the world to store noise/null values. The main niche, I guess, is that someone’s boss could open them in Excel or Notepad and see something, and corrupting plain text is pretty difficult. For archival purposes that’s kinda nice, although I think many would disagree that a storage format should be chosen based on ease of technical access.

Try JDF.type_compress!(df, compress_float=true)
