Reading Parquet file into Apache Arrow?

s-kap · November 27, 2020, 8:37am

Hey hey, what’s the easiest way to read a Parquet file into an Apache Arrow dataset? In Python one can simply do pyarrow.parquet.read_table(..). I know Parquet.jl and Arrow.jl exist, but I haven’t found a way to make the two work together.

My goal is to read the Parquet file into a DataFrame and run some benchmarks comparing regular DataFrames with DataFrames backed by Apache Arrow vectors.

The only way I see to convert Parquet files to a DataFrame is by using the RecordCursor, which doesn’t seem ideal: Parquet is a columnar format, and so are DataFrames and Apache Arrow, so having to iterate over rows is really slow.

nalimilan · November 27, 2020, 8:47am

I think you can use ParquetFiles.jl.

Rudi79 · November 27, 2020, 9:00am

I used https://github.com/lungben/TableIO.jl for a similar task.

lungben · November 27, 2020, 10:15am

TableIO uses Parquet.jl internally for this task.
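
For reference, here is a minimal sketch of going through Parquet.jl directly and then handing the data to Arrow.jl. It assumes a Parquet.jl version that provides `read_parquet` and `write_parquet`; the file names and the sample columns are made up for illustration, and this is not necessarily how TableIO does it internally.

```julia
using Parquet, Arrow, DataFrames

# Hypothetical paths, used only for this illustration.
dir = mktempdir()
parquet_path = joinpath(dir, "example.parquet")
arrow_path = joinpath(dir, "example.arrow")

# Write a small Parquet file so the sketch is self-contained.
write_parquet(parquet_path, DataFrame(id = [1, 2, 3], value = [0.5, 1.5, 2.5]))

# Read the file as a Tables.jl-compatible table (no manual row iteration).
tbl = read_parquet(parquet_path)

# Regular DataFrame: columns are copied into ordinary Julia vectors.
df_regular = DataFrame(tbl; copycols = true)

# Arrow-backed DataFrame: write the data in Arrow format, then wrap the
# Arrow columns without copying them.
Arrow.write(arrow_path, df_regular)
df_arrow = DataFrame(Arrow.Table(arrow_path); copycols = false)
```

The two DataFrames hold the same data, so they can be benchmarked side by side. Note that the Arrow-backed columns are read-only.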

s-kap · November 27, 2020, 11:16am

ParquetFiles was last updated about 2 years ago and doesn’t work anymore.

s-kap · November 27, 2020, 11:17am

Thanks!
This looks great!
