I’ve also had issues on the other side, where I could read Arrow data in Julia but not in pyarrow. Overall I’ve come to the conclusion that Arrow is just not a very mature format yet, even though it looks promising in some ways.
I experience similar problems with Parquet, not just Arrow. See e.g. LZ4 decompression failed (#48) · Issues · Expanding Man / Parquet2.jl · GitLab
DuckDB directly supports Parquet, so I tried your last file and it can be read directly without any issues:
using QuackIO, Tables, BenchmarkTools  # BenchmarkTools provides @btime
read_parquet(columntable, "mwe2.parquet")
@btime read_parquet(columntable, "mwe2.parquet").x0 |> sum
# 683.894 ms (1810544 allocations: 396.49 MiB)
As pointed out above, Parquet + DuckDB work, which means you could use TidierDB.jl to work on the file directly without reading it into memory (a template follows, with a filled-in sketch after it):
using TidierDB
db = connect(duckdb())
@chain db_table(db, "path_to_mwe2.parquet") begin
    # your work here
    @collect  # to bring the result back as a local df
end
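As a concrete sketch of what might go in the "your work here" slot (the column name x0 matches the file above; the filter threshold is made up), a filter-and-aggregate chain gets pushed down to DuckDB and only the small result is collected:

using TidierDB

db = connect(duckdb())

@chain db_table(db, "path_to_mwe2.parquet") begin
    @filter(x0 > 0)              # evaluated by DuckDB, never loaded into Julia
    @summarize(total = sum(x0))  # aggregation also happens inside DuckDB
    @collect                     # only the one-row result comes back as a DataFrame
end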
Here is an example comparing performance against OhMyThreads.jl and Mmap.jl:
julia> using QuackIO, Tables, OhMyThreads, Mmap, BenchmarkTools
julia> @btime read_parquet(columntable, "mwe2.parquet").x0 |> sum
893.883 ms (1810628 allocations: 396.49 MiB)
-8513239905662102228
julia> write("x0.bin", collect(Int64, read_parquet(columntable, "mwe2.parquet").x0));
julia> @btime sum(mmap("x0.bin", Vector{Int64}))
120.969 μs (16 allocations: 1.20 KiB)
-8513239905662102228
julia> @btime treduce(+, mmap("x0.bin", Vector{Int64}))
59.583 μs (362 allocations: 30.05 KiB)
-8513239905662102228
I love the explicitness of Julia, DataFrames, and DataFramesMeta and have no plans to go back to Tidy type stuff, but glad people have options.
DuckDB seems great and I like just writing SQL code for more complex queries. I used to use sqldf all the time in R.
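For example, with DuckDB.jl you can run SQL directly against the Parquet file from earlier in the thread; this is just a sketch of the standard DBInterface usage, assuming the same mwe2.parquet file:

using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")
# DuckDB scans the Parquet file directly; only the query result is materialized in Julia
result = DBInterface.execute(con, "SELECT sum(x0) AS total FROM 'mwe2.parquet'")
DataFrame(result)
DBInterface.close!(con)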
I’ve only used Arrow stuff occasionally but never had problems with it in Julia. But it was relatively simple stuff, like streaming a CSV from the census through a Julia filter and then streaming it out to an Arrow file where I could re-access the filtered data quickly later.
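For anyone curious, here is roughly what that CSV → filter → Arrow round trip can look like; a minimal, non-streaming sketch, where the file names, column name, and filter condition are all made up:

using CSV, DataFrames, Arrow

df = CSV.read("census_extract.csv", DataFrame)   # parse the CSV once
keep = filter(:state => ==("CA"), df)            # whatever filter you need
Arrow.write("census_filtered.arrow", keep)       # persist the filtered subset

# Later sessions can reopen the Arrow file almost instantly (memory-mapped):
tbl = Arrow.Table("census_filtered.arrow")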
The DuckDB.jl repository seems to be archived already. Is it working nicely anyway?
The autolinked DuckDB.jl points to the old version. It’s now here: duckdb/tools/juliapkg at 6e4f15a3d6cee050ddc94b4bde5d27af4d83097a · duckdb/duckdb · GitHub
What’s the plan for JDF.jl, especially issues with large datasets like Cannot write DataFrames with more than 2^31 rows · Issue #87 · xiaodaigh/JDF.jl (github.com)?
I have not mentally recovered sufficiently to continue working. I still want to write something, but chances are I won’t be looking at GitHub much soon.