So I had this idea and was going to open an issue on one of the relevant packages, but then I realized the context is potentially broader than any single package.
In the geospatial domain, a well-known trick for reducing file size is to cleave off floating-point noise beyond the significant digits before storing values. This is true in many other domains too! Some datastores only need 4–8 bytes of precision, but after manipulation the values end up carrying all this floating-point junk.
Is there a foreseeable benefit to a widespread pattern, implemented in say CSV.jl, JSON.jl, XLSX.jl, JuliaDB.jl, Parquet.jl, etc., that allows "significant digit" truncation? I realize this is one of those semi-dangerous things: if an end user "thinks they know" the right truncation level and messes up, well, that sucks for them. But if you genuinely know, it's kind of a godsend for large swaths of I/O.
I could imagine type wrappers for some of these columnar I/O libraries that specify the significant digits for each column to save on bytes. Maybe some formats already handle this to some extent?
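To make the idea concrete, here is a minimal sketch of what such a truncation hook might look like using only Base's `round(x; sigdigits = n)`. The wrapper name `truncate_sigdigits` is hypothetical, not part of any of the packages mentioned above:

```julia
# Hypothetical helper: round floats to n significant digits before a
# writer (CSV.jl, JSON.jl, ...) serializes them; pass everything else
# through untouched so mixed-type columns still work.
truncate_sigdigits(x::AbstractFloat, n::Integer) = round(x; sigdigits = n)
truncate_sigdigits(x, n) = x  # non-float values are left alone

vals = [0.1 + 0.2, 1/3, 123456.789]
rounded = truncate_sigdigits.(vals, 4)
# 0.30000000000000004 becomes 0.3, 1/3 becomes 0.3333,
# and 123456.789 becomes 123500.0
```

A columnar wrapper could then simply broadcast this over each column, with a per-column `n`, before handing the table to the writer.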
Anyways, just a passing thought I had while trying to optimize my own ZMQ streams…
Can't you just transform to Float32 or similar, and serialize that?
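For a binary store, the narrowing route really is just one broadcast; a minimal sketch:

```julia
# Narrow Float64 data to Float32 before serialization: half the
# binary footprint, at the cost of ~7 decimal digits of precision.
data = rand(Float64, 1_000)
narrow = Float32.(data)

sizeof(data)    # 8000 bytes
sizeof(narrow)  # 4000 bytes
```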
That's one possibility, definitely. Maybe this is a moot point, actually. I don't know the guts of how JSON/CSV readers turn plain text into native types; I suppose they probably aren't doing anything at the byte level, so there are probably no performance gains there. Then, for binary file stores, I suppose skipping the first N or trailing N bytes isn't a huge gain either. Maybe someone with more knowledge of the internals can say whether this is pointless?
For parsing string-based formats this is nice, though: fewer character reads, and a smaller disk and memory footprint.
I think it could help reduce file size for plain-text formats (such as JSON). Maybe one could just convert (or round) to a fixed-point type like https://github.com/JuliaMath/FixedPointDecimals.jl and then write it to a file from there?
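The plain-text size win is easy to see even without the package, using only Base's decimal rounding (FixedPointDecimals.jl would additionally store the value as a scaled integer; this is just the rounding half of the idea):

```julia
# Rounding to a fixed number of decimal digits shrinks the printed
# text, which is all that matters for CSV/JSON file size.
x = 0.1 + 0.2             # prints as 0.30000000000000004
y = round(x; digits = 4)  # prints as 0.3

length(string(x))  # 19 characters on disk
length(string(y))  # 3 characters on disk
```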
Beautiful plots! For 2-D plots, we could easily determine the pixel resolution and use that as a cutoff; 3-D would be harder. I can foresee a use case here for some file types and applications. I'm just not exposed to the internals of all the I/O libraries like some people are. Nor am I a computer scientist, and I never tried to play one on TV.
Hmmm… Fixed point decimals could be handy for this.
If you care about efficiency that much, you should maybe be using binary formats, not text formats like CSV or JSON. Of course, transforming to a lower precision for storage is often a good option for binary formats as well.
LASzip compression uses this principle to attain impressive compression ratios.
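The core of that principle is scale/offset quantization: store coordinates as integers on a fixed grid, which compressors then handle far better than raw floats. A hedged sketch (the struct, field names, and 0.01 grid here are illustrative, not the actual LAS/LASzip layout):

```julia
# Quantize floats onto a fixed grid: each value is stored as an Int32
# number of grid ticks from a common offset. Reconstruction error is
# bounded by scale / 2.
struct Quantized
    scale::Float64
    offset::Float64
    ticks::Vector{Int32}
end

function quantize(xs::Vector{Float64}; scale = 0.01)
    offset = minimum(xs)
    Quantized(scale, offset, Int32.(round.((xs .- offset) ./ scale)))
end

dequantize(q::Quantized) = q.offset .+ q.scale .* Float64.(q.ticks)

xs = [12.345, 12.347, 12.351]
q = quantize(xs)
maximum(abs.(dequantize(q) .- xs))  # at most scale / 2 = 0.005
```

Runs of nearby integers compress extremely well, which is where the impressive ratios come from.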
@stevengj - I agree with that, and that's kind of where I'm looking, once I find a nice route to it. Maybe BSON for me personally, but we'll see. For some people, though, CSVs and JSON are nice because they don't require any special libraries to open them up and see what's inside. Hundreds of thousands of people use those files daily to perform rudimentary analyses, and it's a drain on the world to store noise/null values. The main niche, I guess, is that someone's boss could open them in Excel or Notepad and see something, and corrupting plain text is pretty difficult. For archival purposes that's kind of nice, although I think many would disagree that a storage format should be chosen based on ease of technical access.