How to read a Parquet file while applying a filter?

In Python, we can do:

import pyarrow.parquet as pq
filter = ('some_column', '=', 'some_value')
pq.read_table(file_path, filters=[filter]).to_pandas()

How can I apply a filter similarly when reading a parquet file in Julia?

I’ve tried:

using Parquet
filter = row -> row.some_column == "some_value"
Parquet.read_parquet(filepath, filter=filter)

But I get an error saying that there is no such method:

Closest candidates are:
Parquet.Table(::Any, ::Parquet.File, ::Tables.Schema; rows, batchsize, column_generator, use_threads) got unsupported keyword argument “filter”

But, according to the Parquet package's source, the filter option should be supported. The docstring says:

filter: Filter function to apply while loading only a subset of partitions from a dataset. The path to the partition is provided as a parameter.
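
If I read that correctly, filter selects whole partitions by their directory path (hive-style some_column=some_value folders) rather than filtering rows, so it only applies when the path points at a partitioned dataset directory. Something like this, perhaps (untested sketch, assuming such a layout):

using Parquet

# Untested sketch: the filter receives the path of each partition and decides
# whether that partition gets loaded at all.
tbl = Parquet.read_parquet("path/to/dataset_dir";
    filter = p -> occursin("some_column=some_value", p))

That doesn't seem to help for a single, unpartitioned file, though.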

The only thing that I’ve got working is:

using Parquet2, TableOperations, Tables, DataFrames

Parquet2.Dataset(file_path) |>
    TableOperations.filter(r -> Tables.getcolumn(r, :some_column) == "some_value") |>
    DataFrames.DataFrame

But this is much slower than the Python solution.
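
Materializing the whole file into a DataFrame and then filtering column-wise might beat the row-by-row TableOperations pipeline, assuming the unfiltered table still fits in memory (untested sketch, same placeholder column and value as above):

using Parquet2, DataFrames

# Untested sketch: Parquet2.Dataset implements the Tables.jl interface,
# so it can be materialized into a DataFrame and filtered column-wise.
df = DataFrame(Parquet2.Dataset(file_path))
filter!(:some_column => ==("some_value"), df)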

Maybe try something like this:

using DuckDB, DataFrames
db = DuckDB.DB()
DuckDB.query(db, "select * from '/path/*.parquet' where some_column = 'some_value'") |> DataFrames.DataFrame
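
If DuckDB.query is not defined in your version of DuckDB.jl, the generic DBInterface route should do the same thing (path, column, and value are placeholders):

using DuckDB, DBInterface, DataFrames

# Same query through the DBInterface API documented by DuckDB.jl.
con = DBInterface.connect(DuckDB.DB, ":memory:")
df = DBInterface.execute(con,
    "select * from '/path/*.parquet' where some_column = 'some_value'") |>
    DataFrames.DataFrame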

Yeah, there is a slight but noticeable speedup. Thanks!
:duck: to the rescue.

Do you know how I could read in chunks using DuckDB?
Some parquet files don’t fit in my RAM even after applying the WHERE clause.

Yes, it is supported by the DuckDB library and the Julia client API. You can see an example here in the tests.

The issue is that the Tables.jl partitions interface is not yet supported in the client API. If someone knows the Tables.partitions interface well enough to contribute, it would help.
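
In the meantime, one workaround is to chunk at the SQL level with LIMIT/OFFSET so that only one slice of the filtered result is materialized at a time. This is only a sketch: the chunk size is arbitrary, DuckDB re-runs the query for every slice, and the ORDER BY column (id here) is a placeholder you would replace with something that gives a stable ordering.

using DuckDB, DBInterface, DataFrames

# Workaround sketch: page through the filtered result with LIMIT/OFFSET so
# only `chunksize` rows need to be in memory at once. `id` is a placeholder
# column used only to keep the row order stable between queries.
function each_chunk(f, con; chunksize = 1_000_000)
    offset = 0
    while true
        chunk = DataFrame(DBInterface.execute(con,
            "select * from '/path/*.parquet' where some_column = 'some_value' " *
            "order by id limit $chunksize offset $offset"))
        isempty(chunk) && break
        f(chunk)                # process one chunk at a time
        offset += chunksize
    end
end

con = DBInterface.connect(DuckDB.DB, ":memory:")
each_chunk(chunk -> println(nrow(chunk)), con)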


I appreciate the example, but I’m not getting the expected behavior.
I ran the test and it exhausted my RAM: instead of streaming the result in chunks, it tried to return all 1000000000000 rows in a single chunk.