What's the latest and greatest in data in Julia

I’ve also had issues in the other direction, where I could read Arrow data in Julia but not with pyarrow. Overall I’ve come to the conclusion that Arrow is just not a very mature format yet, even though it looks promising in some ways.

I experience similar problems with Parquet, not just Arrow. See e.g. LZ4 decompression failed (#48) · Issues · Expanding Man / Parquet2.jl · GitLab


DuckDB directly supports Parquet, so I tried your last file, and it can be read directly without any issues:

using QuackIO, Tables, BenchmarkTools

read_parquet(columntable, "mwe2.parquet")

@btime read_parquet(columntable, "mwe2.parquet").x0 |> sum
#  683.894 ms (1810544 allocations: 396.49 MiB)

As pointed out above, Parquet + DuckDB work, which means you could use TidierDB.jl to work on the file directly without reading it into memory:

using TidierDB
db = connect(duckdb())

@chain db_table(db, "path_to_mwe2.parquet") begin
    ### your work here
    @collect  # to bring the result into a local DataFrame
end

Here is an example comparing performance with OhMyThreads.jl and Mmap.jl:

julia> using QuackIO, Tables, OhMyThreads, Mmap, BenchmarkTools

julia> @btime read_parquet(columntable, "mwe2.parquet").x0 |> sum
  893.883 ms (1810628 allocations: 396.49 MiB)
-8513239905662102228

julia> write("x0.bin", collect(Int64, read_parquet(columntable, "mwe2.parquet").x0));

julia> @btime sum(mmap("x0.bin", Vector{Int64}))
  120.969 μs (16 allocations: 1.20 KiB)
-8513239905662102228

julia> @btime treduce(+, mmap("x0.bin", Vector{Int64}))
  59.583 μs (362 allocations: 30.05 KiB)
-8513239905662102228

I love the explicitness of Julia, DataFrames, and DataFramesMeta and have no plans to go back to Tidyverse-style tools, but I’m glad people have options.

DuckDB seems great and I like just writing SQL code for more complex queries. I used to use sqldf all the time in R.
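For anyone wanting to try the SQL route from Julia, here is a minimal sketch using DuckDB.jl’s DBInterface API. The file name `mwe2.parquet` is taken from the thread above; the column `x0` appears in the earlier benchmarks, but the exact query is just illustrative.

```julia
using DuckDB, DataFrames

# Open an in-memory DuckDB database.
con = DBInterface.connect(DuckDB.DB, ":memory:")

# DuckDB can query a Parquet file directly from SQL, no explicit load step:
df = DBInterface.execute(con,
    "SELECT sum(x0) AS total FROM 'mwe2.parquet'") |> DataFrame

DBInterface.close!(con)
```

This keeps the heavy lifting inside DuckDB’s engine; only the final result is materialized as a DataFrame.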

I’ve only used Arrow stuff occasionally, but I never had problems with it in Julia. It was relatively simple stuff, though, like streaming a CSV from the Census through a Julia filter and then streaming it out to an Arrow file, where I could re-access the filtered data quickly later.
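That streaming pattern can be sketched roughly as below, assuming CSV.jl, Arrow.jl, and TableOperations.jl; the file names and the `state` column are hypothetical placeholders, not from the original post.

```julia
using CSV, Arrow, TableOperations

# Stream rows from a (hypothetical) CSV, filter them lazily, and write the
# survivors to an Arrow file without materializing the whole table in memory.
CSV.Rows("census.csv") |>
    TableOperations.filter(r -> r.state == "06") |>
    Arrow.write("filtered.arrow")

# Later, the filtered data can be reopened cheaply; Arrow.Table memory-maps
# the file rather than parsing it again.
tbl = Arrow.Table("filtered.arrow")
```

`CSV.Rows` is the row-iterating (streaming) reader, and both `TableOperations.filter` and `Arrow.write` have curried forms that compose with `|>`, which is what makes the one-pass pipeline possible.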


The DuckDB.jl repository seems to be archived already. Does it still work well anyway?


The autolinked DuckDB.jl points to the old version. It now lives here: duckdb/tools/juliapkg at 6e4f15a3d6cee050ddc94b4bde5d27af4d83097a · duckdb/duckdb · GitHub


What’s the plan for JDF.jl, especially regarding issues with large datasets like Cannot write DataFrames with more than 2^31 rows · Issue #87 · xiaodaigh/JDF.jl (github.com)?


I have not mentally recovered sufficiently to continue working. I still want to write something, but chances are I won’t be looking at GitHub much soon.
