Reading parquet very slow

I have a parquet file, about 100 MB compressed.

In Python, reading it with pd.read_parquet(path) takes about 1 second. The data is 12GB in memory.

In Julia this:

    using DataFrames, ParquetFiles
    df = DataFrame(load(path))

takes I don’t even know how long (about 15 minutes?)

There are no columns with strings or other strange things. Running

    eltype.(eachcol(df))

shows that all columns are either Vector{Int64} or Vector{Float64}. There are 22 columns.

Suggestions on how to go about identifying what’s wrong?

    julia> versioninfo()
    Julia Version 1.4.2
    Commit 44fa15b150* (2020-05-23 18:35 UTC)
    Platform Info:
      OS: Linux (x86_64-pc-linux-gnu)
      CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-8.0.1 (ORCJIT, haswell)
    Environment:
      JULIA_EDITOR = atom  -a
      JULIA_NUM_THREADS = 4

Yeah. ParquetFiles is still using the slow method. There is a WIP PR for a faster reader: "WIP: Column-based faster reader of unnested columns" by xiaodaigh (Pull Request #79, JuliaIO/Parquet.jl).

Or you can use the BatchedColumnsCursor in JuliaIO/Parquet.jl (the Julia implementation of a reader for the parquet columnar file format), but it's not as user friendly as a straight read_parquet.

One issue is funding. I am happy to prioritise the PR for a faster reader if you are able to help secure some funding, e.g. from your company. Please DM me if it's a possibility; my rates are very reasonable.

If not, I will get to it once I am done with my other open-source work, which is JLBoostMLJ.jl and some other loose ends at the moment.


Sadly, I have no pull with my organization.

In any case, thanks for coming back on this. Below are some observations.

ParquetFiles is compatible with Parquet 0.4.0 whereas Parquet latest is 0.6.0. So I got rid of ParquetFiles and upgraded Parquet to 0.6.0. After this I tried your suggestion: manually instantiating the cursors. Using

    using Parquet

    p = ParFile(path)
    cc = Parquet.BatchedColumnsCursor(p)
    batchvals, state = iterate(cc)  # first batch of columns, plus the iteration state

I can now feed batchvals into the DataFrame constructor. This takes about 3 seconds to load this file. Still about 5x slower than pd.read_parquet, but down from 15 minutes :)

One thing that I noticed is that all columns are detected to be of type either Union{Missing,Int64} or Union{Missing,Float64}, whereas in reality there are no missing values. The method using ParquetFiles detected this correctly. Do you think it would make sense to open an issue on this?
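As a stopgap, the columns can be narrowed after loading. The following is only a minimal sketch using Base Julia: `narrow` is a hypothetical helper name, not something provided by Parquet.jl, and it assumes each column is an ordinary vector. DataFrames also has `disallowmissing!`, which should do the equivalent for a whole data frame.

```julia
# Narrow a column typed Union{Missing,T} to Vector{T} when it holds no missing values.
# `narrow` is a hypothetical helper, not part of Parquet.jl or DataFrames.
function narrow(col::AbstractVector)
    T = nonmissingtype(eltype(col))      # e.g. Union{Missing,Int64} -> Int64
    any(ismissing, col) && return col    # real missings present: keep the Union type
    return convert(Vector{T}, col)       # build a concretely typed vector
end

col = Union{Missing,Int64}[1, 2, 3]
narrowed = narrow(col)                   # element type is now Int64

with_gap = Union{Missing,Float64}[1.0, missing]
unchanged = narrow(with_gap)             # still allows missing
```

This scans every column once, so it costs an extra pass over the data, but it does not depend on anything recorded in the file.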


Not sure what’s been done, but my impression was that ParquetFiles.jl can’t deal with missing. I might be wrong. But best to check.

This is very interesting. R and Python write information into the parquet file that lets you determine whether a column has missing values, e.g. by recording how many missing values are in the column. However, Parquet has no guaranteed way of telling you whether something is missing, so the safest option is Union{Missing, T}.

We can make Parquet.jl do the same thing as Python and R: look for the missing count in the metadata, and use Vector{T} if no missing values are recorded. This would be ideal and can be implemented at some point in the future.

The key issue is that Parquet doesn’t require the missing count to be put into the parquet file so the missing count may not be there even if you try to look. Just interesting tidbits.
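The decision rule described here could look roughly like the sketch below. It is not how Parquet.jl works today: `column_eltype` and `null_count` are hypothetical names, with `null_count` standing in for whatever missing count the file's metadata reports and `nothing` meaning the writer did not record one.

```julia
# Pick a column's element type from an optional missing count.
# `column_eltype` and `null_count` are hypothetical names for illustration;
# `nothing` means the file does not record a count for this column.
function column_eltype(::Type{T}, null_count::Union{Nothing,Integer}) where {T}
    if null_count !== nothing && null_count == 0
        return T                     # count recorded and zero: plain Vector{T} is safe
    end
    return Union{Missing,T}          # count absent or nonzero: stay conservative
end

t_known_clean = column_eltype(Float64, 0)        # Float64
t_unknown     = column_eltype(Float64, nothing)  # Union{Missing,Float64}
t_has_missing = column_eltype(Int64, 5)          # Union{Missing,Int64}
```

Falling back to Union{Missing,T} whenever the count is absent keeps the current safe behaviour for files that do not record it.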

Yeah. I have offered to help. Either the maintainer is too busy or has decided to not take up my help.

Helping to raise awareness also helps. OS funding is a big issue in general.