Parquet on M1

Hello everybody,

Parquet files are great, and we happily read and write them in Julia on Linux.

But not on an M1 Mac. There are two competing libraries, Parquet2.jl (Expanding Man / Parquet2.jl on GitLab) and Parquet.jl (JuliaIO/Parquet.jl on GitHub, a Julia implementation of a Parquet columnar file format reader), both of which depend on Snappy, which does not work on an M1. This was previously discussed in "M1 Mac Faile to install BinaryProvider Package" (#2 by giordano).

We solved the problem by using RCall to get R to read parquet files and then send them to Julia, but that is very clumsy and slow.
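For readers wondering what such a workaround can look like, here is a minimal sketch. It assumes a working R installation with the arrow R package, plus RCall.jl and DataFrames.jl on the Julia side. The helper name `read_parquet_via_r` is made up for illustration, and this is not necessarily the exact code used above.

```julia
using RCall        # assumes R is installed and has the `arrow` package
using DataFrames

# Hypothetical helper: let R read the Parquet file, then copy the result to Julia.
function read_parquet_via_r(path::AbstractString)
    p = String(path)
    # `$p` interpolates the Julia string into the R expression
    robj = R"as.data.frame(arrow::read_parquet($p))"
    return rcopy(DataFrame, robj)   # R data.frame -> Julia DataFrame
end
```

Each call reads the file in R and then copies the table into Julia, which is one reason this route feels clumsy and slow.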

So I was just wondering whether there is some other Parquet implementation in Julia that works on M1?

best, Jack.

You’ll have to ask @ExpandingMan what the problem with libsnappy 1.1.9 is; see "updated to use Clang.jl... still baffling errors" (Pull Request #33 by ExpandingMan in JuliaIO/Snappy.jl on GitHub). Also, Parquet.jl should allow Snappy 0.4 in its compat bounds, at Parquet.jl/Project.toml (commit ab6d68278713b400465b9d294736c89dfbc30f35 in JuliaIO/Parquet.jl on GitHub).

I say this because if I dev Snappy.jl to allow snappy_jll 1.1.9 and dev Parquet to allow Snappy 0.4.0, then the tests of both Snappy and Parquet pass for me on M1:

     Testing Running tests...
Test Summary: | Pass  Total   Time
parquet tests | 3910   3910  21.3s
     Testing Parquet tests passed

[...]

     Testing Running tests...
Test Summary:        | Pass  Total  Time
Low-level Interfaces |   54     54  0.1s
Test Summary:         | Pass  Total  Time
High-level Interfaces |  113    113  0.7s
     Testing Snappy tests passed

So I have no idea what was so catastrophically wrong.
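As a rough sketch of the compat edit: after `] dev Snappy` and `] dev Parquet`, the `[compat]` entries in the checked-out Project.toml files have to be relaxed. That can be done by hand in an editor; the helper below does the same with the TOML standard library. `set_compat!` is a hypothetical name, not part of Pkg, and the bounds in the usage note are assumptions based on the versions named above.

```julia
using TOML

# Hypothetical helper (not part of Pkg): set one [compat] entry in a Project.toml.
# TOML.print rewrites the whole file, so comments and key order are not preserved.
function set_compat!(project_file::AbstractString, dep::AbstractString, bound::AbstractString)
    project = TOML.parsefile(project_file)
    compat = get!(project, "compat", Dict{String,Any}())  # create [compat] if missing
    compat[dep] = bound
    open(io -> TOML.print(io, project), project_file, "w")
    return project
end
```

With the default dev directory this would be called as, for example, `set_compat!(joinpath(homedir(), ".julia", "dev", "Snappy", "Project.toml"), "snappy_jll", "1.1.9")`, and likewise with `"Snappy"` and `"0.4"` for Parquet's Project.toml, followed by `] test Snappy` and `] test Parquet`.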

Thanks @giordano

Yes, I take it one can dev it, but it is a bit worrying that one needs to resort to such tricks, applied by hand, for a production data flow.
But we will try it on one machine.

best, jack

I opened a PR to use snappy_jll 1.1.9: "Allow installation of `snappy_jll` 1.1.9" (Pull Request #35 by giordano in JuliaIO/Snappy.jl on GitHub). Tests are indeed failing on Linux, but only on that platform, and only a single test. I also have zero knowledge of the package and less than zero interest in pursuing a fix; I have never used it and have no need for it.

@giordano thanks!

I can see that. It seems unfortunate that data science functionality as important as reading and writing Parquet files depends on packages that are so poorly maintained. I wish we knew more Julia so that we could help rectify this.

@bkamins What do you think?

Despite my zero interest in snappy, I’m digging into it. I believe it’s an upstream bug in google/snappy (the "fast compressor/decompressor" library on GitHub), which has been fixed in the development version of the library.

Yes, I can confirm using this upstream patch solves the issue for me on Linux.

It seems strange that any open source project that wants to work across multiple platforms would make Snappy a required dependency, given this statement on the Snappy GitHub page:

We are unlikely to accept contributions to the build configuration files, such as CMakeLists.txt. We are focused on maintaining a build configuration that allows us to test that the project works in a few supported configurations inside Google. We are not currently interested in supporting other requirements, such as different operating systems, compilers, or build systems.

That seems pretty explicit that there is going to be limited support for this compression library for non-Google users…

@ExpandingMan - do we need to depend on Snappy?

In short: yes.

The compressed buffers are buried pretty deep in the parquet format; it’s not just a matter of compressing an entire file, or even the entire file minus a header.
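To illustrate the nesting, here is a toy model. It is not the Parquet2.jl or Parquet.jl API, and all type and function names are made up. In the Parquet format the codec is recorded per column chunk and applied to each page's data separately, so a reader needs every codec that appears anywhere in the file.

```julia
# Toy model of the file layout: file -> row groups -> column chunks -> pages.
struct ToyPage
    compressed_bytes::Vector{UInt8}   # page payload, compressed on its own
end

struct ToyColumnChunk
    codec::Symbol                     # e.g. :snappy, :gzip, :uncompressed
    pages::Vector{ToyPage}
end

struct ToyRowGroup
    column_chunks::Vector{ToyColumnChunk}
end

struct ToyFile
    row_groups::Vector{ToyRowGroup}
end

# Every codec a reader must support in order to read the whole file.
required_codecs(file::ToyFile) =
    Set(chunk.codec for group in file.row_groups for chunk in group.column_chunks)

# Example: one row group with a Snappy column and an uncompressed column.
file = ToyFile([ToyRowGroup([
    ToyColumnChunk(:snappy, [ToyPage(UInt8[0x01, 0x02])]),
    ToyColumnChunk(:uncompressed, [ToyPage(UInt8[0x03])]),
])])

required_codecs(file)   # a Set containing :snappy and :uncompressed
```

Snappy is the default codec in several common Parquet writers, so a reader without it would fail on many files it encounters in practice.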

This is great, thanks!

Tests are passing for me now on Parquet2 with snappy 1.1.9. Presumably you can now use it on any architecture that 1.1.9 is built for.

Fixed by
