Hello Everybody
Parquet files are great, and we happily read and write them in Julia on Linux.
But not on a Mac M1. There are two competing libraries, Parquet2.jl (Expanding Man, GitLab) and Parquet.jl (JuliaIO, GitHub), both of which depend on Snappy, which does not work on an M1. This was previously discussed in "M1 Mac Faile to install BinaryProvider Package" (reply #2 by giordano).
We solved the problem by using RCall to get R to read parquet files and then send them to Julia, but that is very clumsy and slow.
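For anyone wanting to reproduce the workaround, here is a minimal sketch of what it might look like. It assumes a working R installation with the R package `arrow` installed (we did not say above which R package does the reading, so treat that as an assumption), and `read_parquet_via_r` is just a hypothetical helper name.

```julia
using RCall        # needs a working R installation
using DataFrames

# Hypothetical helper: let R read the Parquet file, then copy the result to Julia.
# Assumes the R package `arrow` is installed; any R Parquet reader would do.
function read_parquet_via_r(path::AbstractString)
    # `$path` interpolates the Julia string into the R call.
    # `rcopy` converts the resulting R data.frame into a Julia DataFrame.
    return rcopy(R"as.data.frame(arrow::read_parquet($path))")
end
```

Every read goes through R and copies the whole table across, which is where the clumsiness and the slowdown come from.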
So I was just wondering whether there is some other Parquet implementation in Julia that works on M1?
best, Jack.
Also because if I dev Snappy.jl to allow snappy_jll 1.1.9 and dev Parquet to allow Snappy 0.4.0, then tests of both Snappy and Parquet are successful for me on M1:
Testing Running tests...
Test Summary: | Pass Total Time
parquet tests | 3910 3910 21.3s
Testing Parquet tests passed
[...]
Testing Running tests...
Test Summary: | Pass Total Time
Low-level Interfaces | 54 54 0.1s
Test Summary: | Pass Total Time
High-level Interfaces | 113 113 0.7s
Testing Snappy tests passed
So I have no idea what was so catastrophically wrong.
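The dev workflow described above can be sketched roughly as follows. This is only an illustration, not the exact steps that were run: `widen_compat` and `dev_with_compat` are hypothetical helper names, and the existing `[compat]` values in the two packages are not assumed here. Editing the `Project.toml` of the dev checkout by hand achieves the same thing.

```julia
using Pkg
using TOML

# Return a copy of a parsed Project.toml with `version` added to the
# [compat] entry for `dep` (comma-separated entries are a union in Pkg).
function widen_compat(project::AbstractDict, dep::AbstractString, version::AbstractString)
    out = deepcopy(project)
    compat = get!(out, "compat", Dict{String,Any}())
    old = get(compat, dep, "")
    compat[dep] = isempty(old) ? version : string(old, ", ", version)
    return out
end

# Hypothetical helper: check out `pkg` for development, widen its compat
# bound on `dep`, and run its tests. Not called here; needs network access.
function dev_with_compat(pkg::AbstractString, dep::AbstractString, version::AbstractString)
    Pkg.develop(pkg)                                  # checkout lands in Pkg.devdir()
    path = joinpath(Pkg.devdir(), pkg, "Project.toml")
    project = widen_compat(TOML.parsefile(path), dep, version)
    # Rewrites the whole file, so key order and comments may change.
    open(io -> TOML.print(io, project), path, "w")
    Pkg.test(pkg)
end
```

Usage would then be something like `dev_with_compat("Snappy", "snappy_jll", "1.1.9")` followed by `dev_with_compat("Parquet", "Snappy", "0.4.0")`.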
Thanks @giordano
Yes, I take it one can dev it, but it is a bit worrying that one needs to resort to such tricks, applied manually, for a production data flow.
But we will try it on one machine.
best, jack
I opened a PR to use snappy_jll 1.1.9: Allow installation of `snappy_jll` 1.1.9 by giordano · Pull Request #35 · JuliaIO/Snappy.jl · GitHub. Tests are indeed failing on Linux, but only on that platform. And only a single test. I also have zero knowledge of the package and less than zero interest in pursuing a fix, I never used it and have no need for it.
@giordano thanks!
I can see that. It seems unfortunate that functionality as important to data science as reading and writing Parquet files depends on packages that are so poorly maintained. I wish we knew more Julia and were able to rectify this.
@bkamins What do you think?
Despite my zero interest in snappy, I’m digging into it. I believe it’s an upstream bug in google/snappy (GitHub), which has been fixed in the development version of the library.
Yes, I can confirm using this upstream patch solves the issue for me on Linux.
It seems strange that any open source project that wants to work across multiple platforms would make Snappy a required dependency, given this statement in the Snappy library's GitHub repository:
We are unlikely to accept contributions to the build configuration files, such as CMakeLists.txt. We are focused on maintaining a build configuration that allows us to test that the project works in a few supported configurations inside Google. We are not currently interested in supporting other requirements, such as different operating systems, compilers, or build systems.
That seems pretty explicit that there is going to be limited support for this compression library for non-Google users…
@ExpandingMan - do we need to depend on Snappy?
In short: yes.
The compressed buffers are buried pretty deep in the Parquet format; it’s not just a matter of compressing an entire file, or even the entire file minus a header.
This is great, thanks!
Tests are passing for me now on Parquet2 with snappy 1.1.9. Presumably you can now use it on any architecture that 1.1.9 is built for.