How to read a Parquet file while applying a filter?

In Python, we can do:

import pyarrow.parquet as pq
filter = ('some_column', '=', 'some_value')
pq.read_table(file_path, filters=[filter]).to_pandas()

How can I apply a filter similarly when reading a parquet file in Julia?

I’ve tried:

using Parquet
filter = row -> row.some_column == "some_value"
Parquet.read_parquet(filepath, filter=filter)

But I get an error saying that there is no such method:

Closest candidates are:
Parquet.Table(::Any, ::Parquet.File, ::Tables.Schema; rows, batchsize, column_generator, use_threads) got unsupported keyword argument “filter”

But, according to the Parquet package's source, the filter option should be supported. The docstring says:

filter: Filter function to apply while loading only a subset of partitions from a dataset. The path to the partition is provided as a parameter.
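
If I read that correctly, filter selects whole partitions by their directory path (hive-style some_column=some_value folders) rather than filtering rows, so it only applies when the path points at a partitioned dataset directory. Something like this, perhaps (untested sketch, assuming such a layout):

using Parquet

# Untested sketch: the filter receives the path of each partition and decides
# whether that partition gets loaded at all.
tbl = Parquet.read_parquet("path/to/dataset_dir";
    filter = p -> occursin("some_column=some_value", p))

That doesn't seem to help for a single, unpartitioned file, though.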

The only thing that I’ve got working is:

using Parquet2, TableOperations, Tables, DataFrames

Parquet2.Dataset(file_path) |>
    TableOperations.filter(r -> Tables.getcolumn(r, :some_column) == "some_value") |>
    DataFrames.DataFrame

But this is much slower than the Python solution.
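
Materializing the whole file into a DataFrame and then filtering column-wise might beat the row-by-row TableOperations pipeline, assuming the unfiltered table still fits in memory (untested sketch, same placeholder column and value as above):

using Parquet2, DataFrames

# Untested sketch: Parquet2.Dataset implements the Tables.jl interface,
# so it can be materialized into a DataFrame and filtered column-wise.
df = DataFrame(Parquet2.Dataset(file_path))
filter!(:some_column => ==("some_value"), df)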

Maybe try something like this:

using DuckDB, DataFrames
db = DuckDB.DB()
DuckDB.query(db, "select * from '/path/*.parquet' where some_column = 'some_value'") |> DataFrames.DataFrame
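
If DuckDB.query is not defined in your version of DuckDB.jl, the generic DBInterface route should do the same thing (path, column, and value are placeholders):

using DuckDB, DBInterface, DataFrames

# Same query through the DBInterface API documented by DuckDB.jl.
con = DBInterface.connect(DuckDB.DB, ":memory:")
df = DBInterface.execute(con,
    "select * from '/path/*.parquet' where some_column = 'some_value'") |>
    DataFrames.DataFrame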

Yeah, there is a slight but noticeable speedup. Thanks!
:duck: to the rescue.

Do you know how I could read in chunks using DuckDB?
Some parquet files don’t fit in my RAM even after applying the WHERE clause.

Yes, it is supported by the DuckDB library and the Julia client API. You can see an example here in the tests.

The issue is that the Tables.jl partitions interface is not yet supported in the client API. If someone knows the Tables.partitions interface well enough to contribute, it would help.
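
In the meantime, one workaround is to chunk at the SQL level with LIMIT/OFFSET so that only one slice of the filtered result is materialized at a time. This is only a sketch: the chunk size is arbitrary, DuckDB re-runs the query for every slice, and the ORDER BY column (id here) is a placeholder you would replace with something that gives a stable ordering.

using DuckDB, DBInterface, DataFrames

# Workaround sketch: page through the filtered result with LIMIT/OFFSET so
# only `chunksize` rows need to be in memory at once. `id` is a placeholder
# column used only to keep the row order stable between queries.
function each_chunk(f, con; chunksize = 1_000_000)
    offset = 0
    while true
        chunk = DataFrame(DBInterface.execute(con,
            "select * from '/path/*.parquet' where some_column = 'some_value' " *
            "order by id limit $chunksize offset $offset"))
        isempty(chunk) && break
        f(chunk)                # process one chunk at a time
        offset += chunksize
    end
end

con = DBInterface.connect(DuckDB.DB, ":memory:")
each_chunk(chunk -> println(nrow(chunk)), con)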


I appreciate the example, but I’m not getting the expected behavior.
I ran the test and it exhausted my RAM: instead of streaming the result in chunks, it tried to return all 1000000000000 rows in a single chunk.