Suggestion for sometimes faster data serialization

Hey Julians,

So I had this idea, and I was going to open an issue on a relevant package for it, but then I realized the context is potentially broader than any single package.

So in the geospatial domain, a well-known trick for reducing file size is to cleave off the floating-point noise beyond the significant digits in all stored files. Now, in many other domains this is also true! Some data stores only need 4-8 bytes of precision, but after, say, manipulation, they end up with all this floating-point junk.
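A minimal sketch of the trick in base Julia (the `sigdigits` keyword of `round` does the truncation; the choice of 6 digits here is an arbitrary assumption for illustration):

```julia
# Rounding to a fixed number of significant digits strips the
# floating-point "noise" and shortens the printed representation,
# which is what ends up on disk in a plain-text format.
x = 1 / 3                   # prints as 0.3333333333333333
y = round(x; sigdigits=6)   # prints as 0.333333
println(length(string(x)), " chars vs ", length(string(y)), " chars")
```

The same value costs 18 characters raw but only 8 after truncation, so for a large numeric column the savings add up quickly.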

Is there a foreseeable benefit to a widespread pattern, implemented in, say, CSV.jl, JSON.jl, XLSX.jl, JuliaDB.jl, Parquet.jl, etc., for allowing a “significant digit” truncation? I realize this is one of those semi-dangerous things: if an end user “thinks they know” the truncation level and messes up, well, that sucks for them, but if you genuinely know, it’s kind of a godsend for large swaths of I/O.

I could imagine some type wrappers for some of these columnar I/O libraries specifying the significant digits for each column to save on bytes. Maybe some formats are already handling this to some extent?

Anyways, just a passing thought I had while trying to optimize my own ZMQ streams…

Can’t you just transform to Float32 or similar, and serialize that?
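For reference, the Float32 route is a one-liner in Julia; it halves the raw byte size and caps you at roughly 7 significant digits (a sketch using only base Julia, nothing package-specific):

```julia
# Broadcasting the Float32 constructor over a Float64 vector halves
# the binary footprint; text serializers then also emit shorter strings.
v64 = [1 / 3, 2 / 7, sqrt(2)]
v32 = Float32.(v64)
println(sizeof(v64), " bytes vs ", sizeof(v32), " bytes")  # 24 vs 12
```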

That’s one possibility, definitely. Maybe this is a moot point, actually. I don’t know the guts of how JSON/CSV readers turn plain text into native types; I suppose they probably aren’t doing anything at the byte level, so probably no performance gains there. Then, for binary file stores, I suppose skipping the first N bytes or trailing N bytes isn’t a huge gain either. Maybe someone with more knowledge of the internals knows whether this is pointless?

For parsing string types this is nice, though: fewer char reads, and a smaller disk & memory footprint.

I think it could help reduce file size for plain-text formats (such as JSON). Maybe one could just convert (or round) to a fixed-point type like https://github.com/JuliaMath/FixedPointDecimals.jl and then write it to a file from there?
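As a hedged sketch of that idea (not using FixedPointDecimals itself, just base Julia's `round` with a fixed number of decimal places, and 4 places as an arbitrary choice), rounding before writing plain text already shrinks the payload:

```julia
# Round to 4 decimal places before serializing to a CSV-style line;
# values like 0.1 + 0.2 carry floating-point noise that costs bytes.
xs = [0.1 + 0.2, 1 / 3, sqrt(2)]   # 0.30000000000000004, 0.3333..., 1.4142...
raw     = join(string.(xs), ",")
trimmed = join(string.(round.(xs; digits=4)), ",")
println(length(raw), " chars -> ", length(trimmed), " chars")
```

Here the line shrinks from 57 characters to 17, and the noise in 0.1 + 0.2 collapses back to "0.3".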

I have some rotatable 3D plots at the bottom of this post https://ericphanson.com/blog/2018/locally-maximizing-the-renyi-entropies/ that I made a couple of years ago with Plots.jl and the plotlyjs backend (one of my first experiments with Julia!). When that page loads, the browser downloads some (plaintext) .js files with hardcoded data points, each specified to 16 digits, which seems pretty wasteful. I’m sure there’s a better way by writing some JavaScript to load the data from a binary format, but this was the easiest, and anyway, I just wanted to show a collaborator some 3D plots. I think truncating to 5 or 6 digits would cut down on file size without affecting the plot quality at all. So to my mind there’s definitely a use case for the technique.

Beautiful plots! For the case of 2-D plots, we could easily determine the pixel resolution and use that as a cutoff; 3-D would be harder. I can foresee a use case here for some file types and applications. I’m just not exposed to the internals of all the I/O libraries like some people are. Nor am I a computer scientist, and I never tried to play one on TV.
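A rough sketch of that pixel-resolution cutoff (everything here is a hypothetical helper, not from any plotting library): compute the smallest on-screen step from the axis span and pixel count, and round to that many decimal places, since anything finer is invisible anyway:

```julia
# Round values to the resolution actually visible on screen.
# `lo`/`hi` are the axis limits, `npx` the rendered pixel width;
# differences smaller than (hi - lo) / npx land in the same pixel.
function pixel_round(xs, lo, hi, npx)
    step = (hi - lo) / npx
    digits = max(0, ceil(Int, -log10(step)))
    return round.(xs; digits=digits)
end

pixel_round([0.123456789, 0.987654321], 0.0, 1.0, 800)
# step = 0.00125, so 3 decimal places suffice: [0.123, 0.988]
```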

Hmmm… Fixed point decimals could be handy for this.

If you care about efficiency that much, you should maybe be using binary formats, not text formats like CSV or JSON. Of course, transforming to a lower precision for storage is often a good option for binary formats as well.

The LASzip compression uses this principle to attain impressive compression ratios.

@stevengj - I agree with that, and that’s kind of where I am looking, once I find a nice route to do it. Maybe BSON for me personally, but we’ll see. But for some people, CSVs and JSON are nice because they don’t require any special libraries to open them up and see what’s inside. Hundreds of thousands of people use those files daily to perform rudimentary analyses, and it’s a drain on the world to store noise/null values. The main niche, I guess, is that someone’s boss could open them in Excel or Notepad and see something, and corrupting plain text is pretty difficult. For archival purposes that’s kinda nice, although I think many would disagree that a storage format should be chosen based on ease of technical access.

Try JDF.type_compress!(df, compress_float=true)
