Reading Parquet file into Apache Arrow?

Hey hey, what’s the easiest way to read a Parquet file into an Apache Arrow dataset? In Python one can simply do pyarrow.parquet.read_table(..). I know Parquet.jl and Arrow.jl exist, but I haven’t found a way to make the two work together.

My goal is to read the Parquet file into a DataFrame and run some benchmarks comparing regular DataFrames with DataFrames backed by Apache Arrow vectors.

The only way I see to convert Parquet files to a DataFrame is by using the RecordCursor, which doesn’t seem ideal: Parquet is a columnar format, and so are DataFrames and Apache Arrow, so having to iterate over rows is really slow.

I think you can use ParquetFiles.jl.

I used https://github.com/lungben/TableIO.jl for a similar task.

TableIO uses Parquet.jl internally for this task.
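
For reference, here is a rough sketch of doing the same thing with Parquet.jl and Arrow.jl directly, without going through rows. It assumes a Parquet.jl version recent enough to export read_parquet and write_parquet, plus Arrow.jl and DataFrames.jl; the file names and the sample data are made up for illustration, and I have not checked it against every package version.

```julia
using Parquet, Arrow, DataFrames

# Hypothetical paths in a temporary directory, used only for this sketch.
dir = mktempdir()
parquet_path = joinpath(dir, "data.parquet")
arrow_path = joinpath(dir, "data.arrow")

# Write a small sample Parquet file so the example is self-contained.
sample = DataFrame(id = collect(1:3), value = [1.5, 2.5, 3.5])
write_parquet(parquet_path, sample)

# read_parquet returns a Tables.jl-compatible table, so columns are
# handed over as whole columns rather than row by row.
tbl = read_parquet(parquet_path)

# Regular DataFrame: copycols=true materializes ordinary Julia vectors.
df_plain = DataFrame(tbl; copycols = true)

# Write the same data in Arrow format, then load it back.
# With copycols=false the DataFrame should keep Arrow's own column types
# instead of copying them into plain vectors.
Arrow.write(arrow_path, df_plain)
df_arrow = DataFrame(Arrow.Table(arrow_path); copycols = false)
```

df_plain and df_arrow can then be passed to the same benchmark code. Note that this goes through an intermediate Arrow file, so it is a conversion step rather than a direct Parquet-to-Arrow reader.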

ParquetFiles was last updated about 2 years ago and doesn’t work anymore :(

Thanks!
This looks great!