Parquet on M1

JackStrauss · June 24, 2022, 10:00pm

Hello Everybody

Parquet files are great, and we happily read and write them in Julia on Linux.

But not on a Mac M1. There are two competing libraries, Parquet2.jl (Expanding Man / Parquet2.jl on GitLab) and Parquet.jl (JuliaIO/Parquet.jl on GitHub, a Julia implementation of the Parquet columnar file format reader), both of which depend on Snappy, which does not work on an M1. This was previously discussed in "M1 Mac Faile to install BinaryProvider Package" (#2 by giordano).

We solved the problem by using RCall to get R to read parquet files and then send them to Julia, but that is very clumsy and slow.

So I was just wondering whether there is some other Parquet implementation in Julia that works on M1?

best, Jack.

giordano · June 24, 2022, 10:20pm

You’ll have to ask @ExpandingMan what the problem with libsnappy 1.1.9 is; see "updated to use Clang.jl... still baffling errors" (Pull Request #33 on JuliaIO/Snappy.jl, by ExpandingMan). Also, Parquet.jl should allow Snappy 0.4 in its Project.toml (Parquet.jl/Project.toml at ab6d68278713b400465b9d294736c89dfbc30f35, JuliaIO/Parquet.jl on GitHub).

giordano · June 24, 2022, 10:28pm

I say that also because, if I dev Snappy.jl to allow snappy_jll 1.1.9 and dev Parquet to allow Snappy 0.4.0, the tests of both Snappy and Parquet pass for me on M1:

     Testing Running tests...
Test Summary: | Pass  Total   Time
parquet tests | 3910   3910  21.3s
     Testing Parquet tests passed

[...]

     Testing Running tests...
Test Summary:        | Pass  Total  Time
Low-level Interfaces |   54     54  0.1s
Test Summary:         | Pass  Total  Time
High-level Interfaces |  113    113  0.7s
     Testing Snappy tests passed

So I have no idea what was so catastrophically wrong.
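For anyone who wants to reproduce the dev approach described in this post, here is a minimal sketch. The helper `set_compat!` is hypothetical (it is not part of Pkg), and the version specs passed to it are assumptions; check the existing `[compat]` entries of each package before overwriting them. Note that `TOML.print` rewrites the whole file and may reorder its keys.

```julia
using Pkg, TOML

# Hypothetical helper (not part of Pkg): set one [compat] entry in a Project.toml.
function set_compat!(project_file::AbstractString, dep::AbstractString, spec::AbstractString)
    project = TOML.parsefile(project_file)
    compat = get!(project, "compat", Dict{String,Any}())
    compat[dep] = spec
    # Rewrites the file; key order may differ from the original.
    open(io -> TOML.print(io, project), project_file, "w")
    return compat
end

# Not called automatically: needs network access and modifies the dev directory.
function dev_with_loosened_compat()
    Pkg.develop("Snappy")    # clones the package into Pkg.devdir()
    Pkg.develop("Parquet")
    # The version specs below are assumptions, not copied from the actual PRs.
    set_compat!(joinpath(Pkg.devdir(), "Snappy", "Project.toml"), "snappy_jll", "1.1.9")
    set_compat!(joinpath(Pkg.devdir(), "Parquet", "Project.toml"), "Snappy", "0.3, 0.4")
    Pkg.resolve()
    Pkg.test("Snappy")
    Pkg.test("Parquet")
end
```

Calling `dev_with_loosened_compat()` in a scratch environment keeps the experiment away from a production project.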

JackStrauss · June 25, 2022, 8:55am

Thanks @giordano

Yes, I take it one can dev it, but it is a bit worrying that one needs to resort to such manual tricks for a production data flow.
But we will try it on one machine.

best, jack

giordano · June 25, 2022, 9:05am

I opened a PR to use snappy_jll 1.1.9: Allow installation of `snappy_jll` 1.1.9 by giordano · Pull Request #35 · JuliaIO/Snappy.jl · GitHub. Tests are indeed failing on Linux, but only on that platform. And only a single test. I also have zero knowledge of the package and less than zero interest in pursuing a fix, I never used it and have no need for it.

JackStrauss · June 25, 2022, 11:28am

@giordano thanks!

I can see that. It seems unfortunate that something as important to data science as reading and writing Parquet files depends on packages that are so poorly maintained. I wish we knew more Julia and were able to rectify this.

@bkamins What do you think?

giordano · June 25, 2022, 11:36am

Despite my zero interest in snappy, I’m digging into it. I believe it’s an upstream bug in google/snappy (GitHub - google/snappy: A fast compressor/decompressor), which has been fixed in the development version of the library.

giordano · June 25, 2022, 11:44am

Yes, I can confirm using this upstream patch solves the issue for me on Linux.

isaacsas · June 25, 2022, 11:47am

It seems strange that any open source project that wants to work across multiple platforms would make Snappy a required dependency, given this statement in the Snappy library's GitHub repository:

We are unlikely to accept contributions to the build configuration files, such as CMakeLists.txt. We are focused on maintaining a build configuration that allows us to test that the project works in a few supported configurations inside Google. We are not currently interested in supporting other requirements, such as different operating systems, compilers, or build systems.

That seems pretty explicit that there is going to be limited support for this compression library for non-Google users…

bkamins · June 25, 2022, 12:55pm

@ExpandingMan - do we need to depend on Snappy?

ExpandingMan · June 25, 2022, 1:48pm

In short: yes.

The compressed buffers are buried pretty deep in the parquet format; it’s not just a matter of compressing an entire file or even compressing the entire file minus a header.
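To illustrate the point, here is a toy sketch of the nesting (file, row groups, column chunks, pages) in which each page payload is compressed on its own. All type and function names are hypothetical, the layout is heavily simplified compared with the real Thrift-encoded metadata, and a run-length encoder stands in for Snappy.

```julia
# Stand-in codec (NOT Snappy): encode bytes as (run length, value) pairs.
function toy_compress(data::Vector{UInt8})
    out = UInt8[]
    i = 1
    while i <= length(data)
        runlen = 1
        while i + runlen <= length(data) && data[i + runlen] == data[i] && runlen < 255
            runlen += 1
        end
        push!(out, UInt8(runlen), data[i])
        i += runlen
    end
    return out
end

function toy_decompress(data::Vector{UInt8})
    out = UInt8[]
    for k in 1:2:length(data)
        append!(out, fill(data[k + 1], Int(data[k])))
    end
    return out
end

# A page keeps uncompressed bookkeeping next to its compressed payload.
struct Page
    uncompressed_size::Int
    payload::Vector{UInt8}   # compressed independently of every other page
end

struct ColumnChunk
    name::String
    codec::Symbol            # the codec is recorded per column chunk
    pages::Vector{Page}
end

make_page(values::Vector{UInt8}) = Page(length(values), toy_compress(values))

# Reading a column means decompressing page after page.
function read_column(chunk::ColumnChunk)
    values = UInt8[]
    for page in chunk.pages
        decoded = toy_decompress(page.payload)
        @assert length(decoded) == page.uncompressed_size
        append!(values, decoded)
    end
    return values
end

chunk = ColumnChunk("flags", :toy_rle,
                    [make_page(UInt8[1, 1, 1, 2]), make_page(UInt8[2, 2, 3])])
column = read_column(chunk)
```

Because the reader has to call the codec for every page it touches, a file written with Snappy cannot be read without a working Snappy decompressor.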

This is great, thanks!

Tests are passing for me now on Parquet2 with snappy 1.1.9. Presumably you can now use it on any architecture that 1.1.9 is built for.

giordano · June 25, 2022, 1:52pm

Fixed by

[snappy] Add upstream patch to fix compilation of asm statements by giordano · Pull Request #5067 · JuliaPackaging/Yggdrasil · GitHub
Allow installation of `snappy_jll` 1.1.9 by giordano · Pull Request #35 · JuliaIO/Snappy.jl · GitHub
Add compat for `Snappy` v0.4 by giordano · Pull Request #166 · JuliaIO/Parquet.jl · GitHub
