Unable to write DataFrame to Parquet or Arrow?

Hi, I’m trying to save a large DataFrame. Long story short, only the CSV and Feather packages seem to work at all, but I would prefer Arrow or Parquet. I’ve installed Parquet.jl and Arrow.jl, but they fail to work. I assume I messed something up.

It can’t even find Parquet.write_parquet, while Arrow.write(path, df) gives the following error: MethodError: no method matching write(::IOStream, ::DataFrame)

Here’s the Project.toml and the notebook that I’m trying to run (see the end).

This is a new Julia installation, and I’ve definitely activated the environment - what am I doing wrong?

I suspect some of your packages are holding back the Parquet.jl and Arrow.jl versions.

Run ]status to see which version of Parquet.jl you have. Something like Feather.jl may be preventing you from getting the latest version of Parquet.jl.
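If you'd rather check from the notebook than from the Pkg REPL, here is a minimal sketch using Pkg.dependencies(). The helper name installed_version is made up for illustration, and the three package names are just the ones from this thread.

```julia
using Pkg

# Hypothetical helper: return the resolved version of `name` in the active
# environment, or `nothing` if the package is not part of it.
function installed_version(name::AbstractString)
    for (_, info) in Pkg.dependencies()
        info.name == name && return info.version
    end
    return nothing
end

# Print what the active environment actually resolved for each package.
for name in ("Parquet", "Arrow", "Feather")
    v = installed_version(name)
    println(rpad(name, 8), " => ", v === nothing ? "not installed" : string(v))
end
```

This only reports what was resolved; it does not say why an old version was picked.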

Remove Feather.jl first with ]rm Feather.

Also, if you only want to use the data from Julia and aren't worried about future breaking changes, you can try JDF.jl, which saves the data quite quickly.


Yes, looks like a version problem! Removing just Feather didn’t help, but I’ll try to figure it out. I’ll probably try from scratch…

(m5-competition) pkg> status
      Status `~/repos/personal/m5-competition/Project.toml`
  [b5ca4192] AdvancedVI v0.1.3
  [69666777] Arrow v0.2.4
  [76274a88] Bijectors v0.9.7
  [336ed68f] CSV v0.8.5
  [052768ef] CUDA v2.6.3
  [324d7699] CategoricalArrays v0.8.3
  [8be319e6] Chain v0.4.7
  [a93c6f00] DataFrames v1.2.1
  [b4f34e82] Distances v0.10.3
  [31c24e10] Distributions v0.23.12
  [bbc10e6e] DynamicHMC v3.1.0
  [becb17da] Feather v0.5.9
  [5789e2e9] FileIO v1.10.1
  [8fc22ac5] FilePaths v0.8.3
  [48062228] FilePathsBase v0.9.10
  [587475ba] Flux v0.12.1
  [38e38edf] GLM v1.5.1
  [7073ff75] IJulia v1.23.2
  [c7f686f2] MCMCChains v4.13.1
  [cc2ba9b6] MLDataUtils v0.5.4
  [add582a8] MLJ v0.16.5
  [6fafb56a] Memoization v0.1.13
  [429524aa] Optim v1.3.0
  [626c502c] Parquet v0.4.0
  [58dd65bb] Plotly v0.3.0
  [91a5bcdd] Plots v0.29.9
  [438e738f] PyCall v1.92.3
  [612083be] Queryverse v0.7.0
  [ce6b1742] RDatasets v0.7.5
  [37e2e3b7] ReverseDiff v1.9.0
  [3646fa90] ScikitLearn v0.6.4
  [60ddc479] StatPlots v0.9.2
  [2913bbd2] StatsBase v0.33.8
  [4c63d2b9] StatsFuns v0.9.8
  [f3b207a7] StatsPlots v0.14.26
  [bd369af6] Tables v1.4.4
  [fce5fe82] Turing v0.16.6
  [e88e6eb3] Zygote v0.6.17

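Just run ]add Arrow@1.4 to see what’s holding it back. If the version can’t be installed, the resolver error names the packages whose compatibility bounds conflict with it.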
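The same experiment can be run through the Pkg API if you want to capture the message. This is only a sketch: try_add is a made-up helper name, and note that if the add succeeds it really does change the active environment.

```julia
using Pkg

# Hypothetical helper: try to add `name` at `version`.
# Returns `nothing` on success, or the error message as a String on failure.
# On success the active environment is modified, exactly like ]add.
function try_add(name::AbstractString, version::AbstractString)
    try
        Pkg.add(name=String(name), version=String(version))
        return nothing
    catch err
        return sprint(showerror, err)
    end
end

# Example call, commented out because it needs registry access and
# modifies the environment:
# msg = try_add("Arrow", "1.4")
# msg === nothing || println(msg)
```

When the resolver cannot satisfy the request, the returned message should be the same text that ]add prints.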

Turns out there is StatsPlots and StatPlots, with the second one being an abandoned imposter… Thanks for the help!

Actually not an imposter, just the old name of the package - there were quite a few discussions around this, see e.g. here: Don't fix now: Eventually unregister StatPlots (when 0.6 is completely out of use) · Issue #225 · JuliaPlots/StatsPlots.jl · GitHub

Although as of this PR (Cap the compatibility of all versions of StatPlots to Julia ≤ 1.6 by DilumAluthge · Pull Request #39292 · JuliaRegistries/General · GitHub), you shouldn’t be able to install StatPlots on Julia 1.7+, so hopefully this won’t be an issue for future users.


Turns out it was actually Queryverse that breaks it! ]add Queryverse resulted in lots of changes, including:

↓ Parquet v0.8.3 ⇒ v0.4.0

Ah, that could be because Queryverse still uses ParquetFiles.jl, which depends on an old version of Parquet.jl.
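To see which packages in the environment pull in Parquet.jl (directly, one level up), something like the following should work. It is a rough sketch: dependents_of is a made-up name, and it relies on the dependencies field of the entries returned by Pkg.dependencies().

```julia
using Pkg

# Hypothetical helper: names of all packages in the active environment
# that list `target` among their direct dependencies.
function dependents_of(target::AbstractString)
    deps = Pkg.dependencies()
    names = String[info.name for info in values(deps)
                   if haskey(info.dependencies, target)]
    return sort!(names)
end

# Which packages depend on Parquet.jl in this environment?
println(dependents_of("Parquet"))
```

Calling dependents_of on each name it returns walks one more level up the chain, which is how you would get from Parquet to ParquetFiles to Queryverse if that is indeed the dependency path.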