Reading parquet very slow

freeman · June 14, 2020, 12:10am

I have a parquet file, about 100mb compressed.

In Python, reading it with pd.read_parquet(path) takes about 1 second. The file is 12 GB in memory.

In Julia this:

    using DataFrames, ParquetFiles
    df = DataFrame(load(path))

takes I don’t even know how long (about 15 minutes?)

There are no columns with strings or other strange things. Running

    eltype.(eachcol(df))

shows that every column has element type Int64 or Float64. There are 22 columns.

Suggestions on how to go about identifying what’s wrong?

    julia> versioninfo()
    Julia Version 1.4.2
    Commit 44fa15b150* (2020-05-23 18:35 UTC)
    Platform Info:
      OS: Linux (x86_64-pc-linux-gnu)
      CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-8.0.1 (ORCJIT, haswell)
    Environment:
      JULIA_EDITOR = atom  -a
      JULIA_NUM_THREADS = 4

xiaodai · June 14, 2020, 3:24am

Yeah. ParquetFiles is still using the slow method. There is a WIP PR for a faster reader: "WIP: Column-based faster reader of unnested columns" (Pull Request #79, JuliaIO/Parquet.jl).

Or you can use the BatchedColumnsCursor in JuliaIO/Parquet.jl, but it's not as user friendly as a straight read_parquet.

One issue is funding. I am happy to prioritise the PR for a faster reader if you are able to help secure some funding, e.g. from your company. Please DM me if it's a possibility; my rates are very reasonable.

If not, I will get to it once I am done with my other OS work, which is JLBoostMLJ.jl and other loose ends at the moment.

freeman · June 14, 2020, 8:39am

Sadly, I have no pull with my organization.

In any case, thanks for coming back on this. Below are some observations.

ParquetFiles is compatible with Parquet 0.4.0, whereas the latest Parquet release is 0.6.0. So I got rid of ParquetFiles and upgraded Parquet to 0.6.0. After this I tried your suggestion of manually instantiating the cursors. Using

    p = ParFile(path)
    cc = Parquet.BatchedColumnsCursor(p)
    batchvals, state = iterate(cc)

I can now feed batchvals into the DataFrame constructor. Loading this file now takes about 3 seconds. Still about 5x slower than pd.read_parquet, but down from 15 minutes.
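Note that a single iterate(cc) call returns only the first batch. If the cursor yields more than one batch, the remaining ones have to be read as well. A minimal sketch of collecting every batch, assuming each batch behaves like a NamedTuple of column vectors with the same names and element types; concat_batches is a hypothetical helper, not part of Parquet.jl:

```julia
# Hypothetical helper: concatenate column batches into one set of columns.
# Assumes every batch is a NamedTuple of vectors with identical names and eltypes.
function concat_batches(batches)
    result = nothing
    for batch in batches
        if result === nothing
            # Copy the first batch so the caller's vectors are never mutated.
            result = map(copy, batch)
        else
            for name in keys(result)
                append!(result[name], batch[name])
            end
        end
    end
    return result
end

# Stand-in batches for illustration; with Parquet.jl 0.6 one would pass the cursor:
#     cols = concat_batches(Parquet.BatchedColumnsCursor(ParFile(path)))
#     df = DataFrame(cols)
batches = [(a = [1, 2], b = [1.0, 2.0]), (a = [3], b = [3.0])]
cols = concat_batches(batches)
```

The helper returns nothing for an empty iterator, so a caller would need to handle that case before building a DataFrame.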

One thing I noticed is that all columns are detected as either Union{Missing,Int64} or Union{Missing,Float64}, whereas in reality there are no missing values. The method using ParquetFiles detected this correctly. Do you think it would make sense to open an issue on this?
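In the meantime, columns that turn out to contain no missing values can be narrowed after loading, at the cost of one pass over the data. A minimal sketch using only Base; narrow is a hypothetical helper name, and DataFrames.jl's disallowmissing! offers a similar conversion for a whole data frame:

```julia
# Hypothetical helper: drop `Missing` from a column's element type
# when the column holds no missing values; otherwise return it unchanged.
function narrow(col::AbstractVector)
    any(ismissing, col) && return col
    T = nonmissingtype(eltype(col))
    return convert(Vector{T}, col)
end

ints = Union{Missing,Int64}[1, 2, 3]
floats = Union{Missing,Float64}[1.5, missing]

narrowed = narrow(ints)      # element type becomes Int64
unchanged = narrow(floats)   # still Union{Missing,Float64}
```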

xiaodai · June 14, 2020, 8:46am

Not sure what’s been done, but my impression was that ParquetFiles.jl can’t deal with missing. I might be wrong. But best to check.

This is very interesting. R and Python write information into the parquet file that lets you determine whether a column has missing values, e.g. by recording how many missing values are in the column. However, Parquet has no natural way of telling you that a column contains missing values, so the safest option is Union{Missing, T}.

We can make Parquet.jl do the same thing as Python and R: look for the missing count in the metadata, and use Vector{T} if no missing values are recorded. This would be ideal and can be implemented at some point in the future.

The key issue is that Parquet doesn’t require the missing count to be put into the parquet file so the missing count may not be there even if you try to look. Just interesting tidbits.
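The decision described above can be sketched as follows. This is not Parquet.jl's code; column_eltype and its null_count argument are hypothetical, with nothing standing for a file whose metadata does not record a missing count:

```julia
# Hypothetical sketch: choose a column's element type from an optional null count.
# `null_count === nothing` models metadata that does not record the count.
function column_eltype(::Type{T}, null_count::Union{Nothing,Integer}) where {T}
    if null_count !== nothing && null_count == 0
        return T                    # recorded as free of missing values
    end
    return Union{Missing,T}         # count absent or nonzero: stay safe
end

column_eltype(Int64, 0)          # Int64
column_eltype(Float64, nothing)  # Union{Missing, Float64}
column_eltype(Int64, 5)          # Union{Missing, Int64}
```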

Yeah. I have offered to help. Either the maintainer is too busy or has decided to not take up my help.

xiaodai · June 14, 2020, 8:48am

Helping to raise awareness also helps. OS funding is a big issue in general.
