Neither Parquet.jl nor Parquet2.jl can read my .parquet file

I am trying to load a parquet file (about 1 GB) on Julia 1.8, and neither Parquet.read_parquet nor Parquet2.readfile is able to open it. The former just hangs until I kill it, and the latter gives me a very long error dump. Meanwhile, pandas .read_parquet() can load it in seconds. Is there another package I can use instead of these two?

You can use Pandas! I see Pandas.jl has read_parquet, so it should work. It uses PyCall. It might be better to use pandas through PythonCall.jl.
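A minimal sketch of the PythonCall.jl route, in case it helps. It assumes pandas and a parquet engine such as pyarrow are installed in the Python environment that PythonCall uses; the helper name read_parquet_via_pandas is made up for illustration.

```julia
using PythonCall   # assumption: its Python environment has pandas + pyarrow installed
using DataFrames

# Hypothetical helper (not part of any package): let pandas read the file,
# then copy the result into a Julia DataFrame via PythonCall's Tables.jl wrapper.
function read_parquet_via_pandas(path::AbstractString)
    pd = pyimport("pandas")
    pydf = pd.read_parquet(path)        # pandas does the actual parquet decoding
    return DataFrame(PyTable(pydf))     # PyTable wraps the pandas DataFrame as a table
end
```

Calling read_parquet_via_pandas("myfile.parquet") (placeholder path) should then give an ordinary DataFrame, at the cost of one copy of the data.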

I doubt there’s a different (pure) Julia solution, so I’m pointing out the above as a workaround until @ExpandingMan’s Parquet2.jl is fixed. You could file an issue and/or try to debug. There might be other non-Julia solutions, likely in e.g. C or C++, but it’s unclear whether they would be better or faster to use than going through Python (whose parquet readers are likely implemented in one of those languages anyway).

Can you share a minimal sample of the file that is known to break? It is usually very hard to fix these things without that, though the full stack trace might give me some insight.

Yes, pyarrow is C++ and fastparquet is mostly Cython.


I’m not sure how to share an MWE, since the file is pretty large, and if I read it into pandas to subset rows and rewrite it, Parquet2 can open the result.

Here is the full stack trace:

Stacktrace:
  [1] getcompressor(#unused#::Val{:lz4_raw})
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/compression.jl:26
  [2] getcompressor(s::Symbol)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/compression.jl:14
  [3] getcompressor(n::Int32)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/compression.jl:15
  [4] columndata(v::Parquet2.CacheVector{Parquet2.VectorFetcher}, mc::Parquet2.Metadata.Column, elsize::Int64; read_opts::Parquet2.ReadOptions)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/schema.jl:926
  [5] Parquet2.Column(v::Parquet2.CacheVector{Parquet2.VectorFetcher}, r::Parquet2.SchemaNode{Parquet2.ParqTree}, coldict::Dict{Vector{String}, Parquet2.Metadata.Column}, p::FilePathsBase.PosixPath, schp::Vector{String}, nvals::Int64; read_opts::Parquet2.ReadOptions)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/schema.jl:964
  [6] _construct_column(nm::String, r::Parquet2.SchemaNode{Parquet2.ParqTree}, coldict::Dict{Vector{String}, Parquet2.Metadata.Column}, fm::Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}, nvals::Int64, current_file::FilePathsBase.PosixPath; read_opts::Parquet2.ReadOptions)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/schema.jl:1232
  [7] #116
    @ ~/.julia/packages/Parquet2/xyD4G/src/schema.jl:1256 [inlined]
  [8] next (repeats 2 times)
    @ ~/.julia/packages/Transducers/HBMTc/src/library.jl:54 [inlined]
  [9] macro expansion
    @ ~/.julia/packages/Transducers/HBMTc/src/core.jl:181 [inlined]
 [10] __foldl__(rf::Transducers.Reduction{Transducers.Map{typeof(Parquet2.name)}, Transducers.Reduction{Transducers.Map{Parquet2.var"#116#118"{FilePathsBase.PosixPath, Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}, Parquet2.SchemaNode{Parquet2.ParqTree}, Int64, Dict{Vector{String}, Parquet2.Metadata.Column}}}, Transducers.Reduction{Transducers.Map{Type{BangBang.NoBang.SingletonVector}}, Transducers.BottomRF{Transducers.AdHocRF{typeof(BangBang.collector), typeof(identity), typeof(BangBang.append!!), typeof(identity), typeof(identity), Nothing}}}}}, init::BangBang.SafeCollector{BangBang.NoBang.Empty{Vector{Union{}}}}, coll::Base.ValueIterator{OrderedCollections.OrderedDict{String, Parquet2.SchemaNode}})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:157
 [11] #transduce#142
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:523 [inlined]
 [12] transduce
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:512 [inlined]
 [13] transduce(xform::Transducers.Composition{Transducers.Map{typeof(Parquet2.name)}, Transducers.Composition{Transducers.Map{Parquet2.var"#116#118"{FilePathsBase.PosixPath, Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}, Parquet2.SchemaNode{Parquet2.ParqTree}, Int64, Dict{Vector{String}, Parquet2.Metadata.Column}}}, Transducers.Map{Type{BangBang.NoBang.SingletonVector}}}}, f::Transducers.AdHocRF{typeof(BangBang.collector), typeof(identity), typeof(BangBang.append!!), typeof(identity), typeof(identity), Nothing}, init::BangBang.SafeCollector{BangBang.NoBang.Empty{Vector{Union{}}}}, coll::Base.ValueIterator{OrderedCollections.OrderedDict{String, Parquet2.SchemaNode}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:506
 [14] transduce
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:504 [inlined]
 [15] collect(xf::Transducers.Composition{Transducers.Map{typeof(Parquet2.name)}, Transducers.Map{Parquet2.var"#116#118"{FilePathsBase.PosixPath, Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}, Parquet2.SchemaNode{Parquet2.ParqTree}, Int64, Dict{Vector{String}, Parquet2.Metadata.Column}}}}, coll::Base.ValueIterator{OrderedCollections.OrderedDict{String, Parquet2.SchemaNode}})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:807
 [16] collect
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:821 [inlined]
 [17] |>
    @ ./operators.jl:911 [inlined]
 [18] Parquet2.RowGroup(fm::Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}, r::Parquet2.SchemaNode{Parquet2.ParqTree}, rg::Parquet2.Metadata.RowGroup, ptree::Parquet2.PartitionNode{FilePathsBase.PosixPath}; current_file::FilePathsBase.PosixPath, parallel_column_loading::Nothing)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/schema.jl:1255
 [19] (::Parquet2.var"#126#127"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, FilePathsBase.PosixPath})(rg::Parquet2.Metadata.RowGroup)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:162
 [20] next
    @ ~/.julia/packages/Transducers/HBMTc/src/library.jl:54 [inlined]
 [21] macro expansion
    @ ~/.julia/packages/Transducers/HBMTc/src/core.jl:181 [inlined]
 [22] _foldl_array
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:187 [inlined]
 [23] __foldl__
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:182 [inlined]
 [24] #transduce#142
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:523 [inlined]
 [25] transduce
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:512 [inlined]
 [26] transduce(xform::Transducers.Composition{Transducers.Map{Parquet2.var"#126#127"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, FilePathsBase.PosixPath}}, Transducers.Map{Type{BangBang.NoBang.SingletonVector}}}, f::Transducers.AdHocRF{typeof(BangBang.collector), typeof(identity), typeof(BangBang.append!!), typeof(identity), typeof(identity), Nothing}, init::BangBang.SafeCollector{BangBang.NoBang.Empty{Vector{Union{}}}}, coll::Vector{Parquet2.Metadata.RowGroup}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:506
 [27] transduce(xform::Transducers.Composition{Transducers.Map{Parquet2.var"#126#127"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, FilePathsBase.PosixPath}}, Transducers.Map{Type{BangBang.NoBang.SingletonVector}}}, f::Transducers.AdHocRF{typeof(BangBang.collector), typeof(identity), typeof(BangBang.append!!), typeof(identity), typeof(identity), Nothing}, init::BangBang.SafeCollector{BangBang.NoBang.Empty{Vector{Union{}}}}, coll::Vector{Parquet2.Metadata.RowGroup})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:504
 [28] collect(xf::Transducers.Map{Parquet2.var"#126#127"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, FilePathsBase.PosixPath}}, coll::Vector{Parquet2.Metadata.RowGroup})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:807
 [29] collect(foldable::Transducers.Eduction{Transducers.Reduction{Transducers.Map{Parquet2.var"#126#127"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, FilePathsBase.PosixPath}}, Transducers.BottomRF{Transducers.Completing{typeof(BangBang.push!!)}}}, Vector{Parquet2.Metadata.RowGroup}})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:821
 [30] |>(x::Transducers.Eduction{Transducers.Reduction{Transducers.Map{Parquet2.var"#126#127"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, FilePathsBase.PosixPath}}, Transducers.BottomRF{Transducers.Completing{typeof(BangBang.push!!)}}}, Vector{Parquet2.Metadata.RowGroup}}, f::typeof(collect))
    @ Base ./operators.jl:911
 [31] append!(ds::Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}, p::FilePathsBase.PosixPath; verbose::Bool, check::Bool)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:163
 [32] append!
    @ ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:158 [inlined]
 [33] #144
    @ ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:229 [inlined]
 [34] next
    @ ~/.julia/packages/Transducers/HBMTc/src/library.jl:54 [inlined]
 [35] next
    @ ~/.julia/packages/Transducers/HBMTc/src/library.jl:290 [inlined]
 [36] macro expansion
    @ ~/.julia/packages/Transducers/HBMTc/src/core.jl:181 [inlined]
 [37] _foldl_array
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:187 [inlined]
 [38] __foldl__
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:182 [inlined]
 [39] #transduce#142
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:523 [inlined]
 [40] transduce
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:512 [inlined]
 [41] transduce(xform::Transducers.Composition{Transducers.Filter{Parquet2.var"#143#145"{Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}}}, Transducers.Map{Parquet2.var"#144#146"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}}}}, f::Transducers.Completing{typeof(Transducers.right)}, init::Vector{Any}, coll::Vector{FilePathsBase.PosixPath}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Transducers ~/.julia/packages/Transducers/HBMTc/src/processes.jl:506
 [42] transduce
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:504 [inlined]
 [43] #foldxl#147
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:972 [inlined]
 [44] #foldl#148
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:975 [inlined]
 [45] #foldxl#140
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:446 [inlined]
 [46] #138
    @ ~/.julia/packages/Transducers/HBMTc/src/processes.jl:445 [inlined]
 [47] |>
    @ ./operators.jl:911 [inlined]
 [48] appendall!(ds::Parquet2.Dataset{Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher}}; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:229
 [49] appendall!
    @ ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:228 [inlined]
 [50] Parquet2.Dataset(fm::Parquet2.FileManager{FilePathsBase.PosixPath, Parquet2.VectorFetcher})
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:110
 [51] Parquet2.Dataset(p::FilePathsBase.PosixPath; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:115
 [52] Parquet2.Dataset(p::FilePathsBase.PosixPath)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:113
 [53] Parquet2.Dataset(p::String; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:119
 [54] Dataset
    @ ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:119 [inlined]
 [55] #readfile#150
    @ ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:363 [inlined]
 [56] readfile(a::String)
    @ Parquet2 ~/.julia/packages/Parquet2/xyD4G/src/dataset.jl:363

Ah! This is trying to use a variant of the lz4 compression codec (lz4_raw, per the first frame of the trace) which is not yet supported. Both lz4 codecs seem pretty rare, and the original one appears to be deprecated.

I’ll look into adding LZ4; it shouldn’t take long unless there are unforeseen complications.


That would be great. In the meantime, could there be a user-friendly one-line “ERROR: lz4-variant codec not supported” (maybe even pointing to the Pandas.jl alternative, though I would understand if you do not want to do that in an error message…) rather than an unfriendly long stack trace?

There’s already a dedicated error for unsupported compression formats, but it was not displaying correctly; this has been fixed.
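For readers curious how a one-line error of this kind can be produced, here is a minimal sketch. It is not Parquet2’s actual code: the names UnsupportedCodecError and decompressor are hypothetical, and identity stands in for a real decompression routine. It only mirrors the Val-based dispatch visible in the stack trace, with a fallback method that throws an exception with a custom showerror.

```julia
# Hypothetical exception type carrying the offending codec name.
struct UnsupportedCodecError <: Exception
    codec::Symbol
end

# Custom one-line message shown instead of a generic MethodError.
Base.showerror(io::IO, e::UnsupportedCodecError) =
    print(io, "UnsupportedCodecError: compression codec `", e.codec, "` is not supported")

# Toy registry: one "supported" codec; identity is a stand-in decompressor.
decompressor(::Val{:uncompressed}) = identity

# Fallback for every other codec: throw the dedicated error.
decompressor(::Val{C}) where {C} = throw(UnsupportedCodecError(C))

# Convenience entry point taking the codec as a Symbol.
decompressor(codec::Symbol) = decompressor(Val(codec))
```

With this, decompressor(:lz4_raw) raises an error whose message is a single line naming the codec; the stack trace is still printed at the REPL, but the first line tells the user what went wrong.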

Ok, I’ve implemented this now in 0.1.12 (you can clone the latest master if you don’t want to wait for it to be tagged).

A warning: for whatever reason, I was not able to get tests to pass for files written with LZ4 via fastparquet, though files written by pyarrow seem to work fine, and LZ4 tests for them are now included. The error is an opaque decompression error, so it’s hard to know what’s happening. There may be something fishy going on with the LZ4 codec in general: the original (deprecated) version is undocumented, and its replacement claims to use the LZ4 “block format”, but it’s not entirely clear to me that this corresponds to the default format the library expects when decompressing with no additional framing.
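One cheap check when debugging the framing question: an LZ4 frame begins with the magic number 0x184D2204 stored little-endian, whereas a raw LZ4 block has no header at all. The sketch below uses a hypothetical helper name and only tests for that magic. A false result does not prove the buffer is a valid raw block, and any other wrapping a writer might add is not detected.

```julia
# Magic number that opens an LZ4 *frame* (0x184D2204, little-endian on disk).
const LZ4_FRAME_MAGIC = UInt8[0x04, 0x22, 0x4d, 0x18]

# Hypothetical helper: true if the buffer starts with the LZ4 frame magic.
function starts_with_lz4_frame_magic(bytes::AbstractVector{UInt8})
    length(bytes) >= 4 || return false          # too short to hold the magic
    first4 = view(bytes, firstindex(bytes):firstindex(bytes) + 3)
    return first4 == LZ4_FRAME_MAGIC
end
```

Running this on the compressed page bytes from a failing file would at least show whether the data is frame-wrapped rather than a bare block.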

Anyway, let me know if this works; if not, I might have to dig into why fastparquet output can’t be read.


Wow, quick response! Thank you, I will try it out.

edit: works!
