I’ve also had issues on the other side, where I could read Arrow data in Julia but not in pyarrow. Overall I’ve come to the conclusion that Arrow is just not a very mature format yet, even though it looks promising in some ways.
I experience similar problems with Parquet, not just Arrow. See e.g. LZ4 decompression failed (#48) · Issues · Expanding Man / Parquet2.jl · GitLab
DuckDB directly supports Parquet, so I tried your last file and it can be read directly without any issues:
using QuackIO, Tables, BenchmarkTools  # BenchmarkTools provides @btime
read_parquet(columntable, "mwe2.parquet")
@btime read_parquet(columntable, "mwe2.parquet").x0 |> sum
# 683.894 ms (1810544 allocations: 396.49 MiB)
As pointed out above, Parquet + DuckDB work, which means you could use TidierDB.jl to work on the file directly without reading it into memory (a template follows, with a filled-in sketch after it):
using TidierDB
db = connect(duckdb())
@chain db_table(db, "path_to_mwe2.parquet") begin
    # your work here
    @collect  # to bring the result back as a local df
end
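As a concrete sketch of what might go in the "your work here" slot (the column name x0 matches the file above; the filter threshold is made up), a filter-and-aggregate chain gets pushed down to DuckDB and only the small result is collected:

using TidierDB

db = connect(duckdb())

@chain db_table(db, "path_to_mwe2.parquet") begin
    @filter(x0 > 0)              # evaluated by DuckDB, never loaded into Julia
    @summarize(total = sum(x0))  # aggregation also happens inside DuckDB
    @collect                     # only the one-row result comes back as a DataFrame
end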
Here is an example comparing performance against OhMyThreads.jl and Mmap.jl:
julia> using QuackIO, Tables, OhMyThreads, Mmap, BenchmarkTools
julia> @btime read_parquet(columntable, "mwe2.parquet").x0 |> sum
893.883 ms (1810628 allocations: 396.49 MiB)
-8513239905662102228
julia> write("x0.bin", collect(Int64, read_parquet(columntable, "mwe2.parquet").x0));
julia> @btime sum(mmap("x0.bin", Vector{Int64}))
120.969 μs (16 allocations: 1.20 KiB)
-8513239905662102228
julia> @btime treduce(+, mmap("x0.bin", Vector{Int64}))
59.583 μs (362 allocations: 30.05 KiB)
-8513239905662102228
I love the explicitness of Julia, DataFrames, and DataFramesMeta and have no plans to go back to Tidy type stuff, but glad people have options.
DuckDB seems great and I like just writing SQL code for more complex queries. I used to use sqldf all the time in R.
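For example, with DuckDB.jl you can run SQL directly against the Parquet file from earlier in the thread; this is just a sketch of the standard DBInterface usage, assuming the same mwe2.parquet file:

using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")
# DuckDB scans the Parquet file directly; only the query result is materialized in Julia
result = DBInterface.execute(con, "SELECT sum(x0) AS total FROM 'mwe2.parquet'")
DataFrame(result)
DBInterface.close!(con)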
I’ve only used Arrow stuff occasionally but never had problems with it in Julia. But it was relatively simple stuff, like streaming a CSV from the census through a Julia filter and then streaming it out to an Arrow file where I could re-access the filtered data quickly later.
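For anyone curious, here is roughly what that CSV → filter → Arrow round trip can look like; a minimal, non-streaming sketch, where the file names, column name, and filter condition are all made up:

using CSV, DataFrames, Arrow

df = CSV.read("census_extract.csv", DataFrame)   # parse the CSV once
keep = filter(:state => ==("CA"), df)            # whatever filter you need
Arrow.write("census_filtered.arrow", keep)       # persist the filtered subset

# Later sessions can reopen the Arrow file almost instantly (memory-mapped):
tbl = Arrow.Table("census_filtered.arrow")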
The DuckDB.jl repository seems to be archived already. Is it working nicely anyway?
The autolinked DuckDB.jl points to the old version. It’s now here: duckdb/tools/juliapkg at 6e4f15a3d6cee050ddc94b4bde5d27af4d83097a · duckdb/duckdb · GitHub
What’s the plan for JDF.jl, especially issues with large datasets like Cannot write DataFrames with more than 2^31 rows · Issue #87 · xiaodaigh/JDF.jl (github.com)?
I have not mentally recovered sufficiently to continue working. I still want to write something, but chances are I won’t be looking at GitHub much soon.