What's the latest and greatest in data in Julia

xiaodai · July 20, 2024, 12:17pm

Due to personal circumstances, I’ve been away from Julia (actually open-source in general) for a couple of years so haven’t been catching up with what’s going on in data.

I have seen some TidierData info on X but am otherwise not up to date.

Could you write down what you think is the latest and greatest in the Julia data scene? What about handling large datasets?

ChrisRackauckas · July 20, 2024, 12:19pm

The Tidier stuff is all really good and should be the go-to these days for most users.

mrufsvold · July 20, 2024, 1:08pm

DuckDB(.jl) seems to be growing in popularity for medium-sized data. TidierDB.jl has an excellent interface for it.

Guillawme · July 25, 2024, 7:10am

Have you used the other data manipulation packages? DataFramesMeta.jl and Queryverse.jl come to mind. If you feel inspired to write a short comparison with pros and cons, similar to the one you wrote about plotting packages, I would love to read it!

era127 · July 25, 2024, 1:59pm

DuckDB.jl now also implements the Tables partitions interface for streaming query results on large datasets.

drizk1 · July 25, 2024, 3:31pm

I would love to add this functionality to TidierDB, is there documentation around it?

edit: TidierDB v.3.1 now supports streaming with DuckDB

era127 · July 25, 2024, 4:10pm

Maybe the test cases would be best?

ChrisRackauckas · July 27, 2024, 1:37pm

I’ve used them all at some point, mostly in customer/company contexts. I should write one, but I can tell you the tl;dr is that when I finally tried the Tidier stuff it was just clean. I didn’t check performance or try to parallelize it; it was more about getting the job done in this context. It was just really well documented and easy to use, so now it has become my go-to.

xiaodai · July 30, 2024, 1:37am

Sounds like the Tidier system is the one to check out. I will check it out.

Nathan_Boyer · July 30, 2024, 2:51pm

I am surprised to see so much love for Tidier from those deeply ingrained in Julia, given that it explicitly avoids Julia conventions for R conventions. I thought there was a lot of strong pushback by Julians for the amount of “magic” in R, such that you don’t ever fully know what an expression is going to do?

I don’t have any experience with R or Tidier, so I don’t have any of my own insight to give. However, I thought people were moving from R to Julia to gain speed and explicitness? Why then go back and make Julia act more like R?

Enlighten me please.

From the top of the TidierData docs:

Stick as closely to tidyverse syntax as possible.

Whereas other meta-packages introduce Julia-centric idioms for working with DataFrames, this package’s goal is to reimplement parts of tidyverse in Julia. This means that TidierData.jl uses tidy expressions as opposed to idiomatic Julia expressions. An example of a tidy expression is a = mean(b) . In Julia, a and b are variables and are thus “eagerly” evaluated. This means that if b is merely referring to a column in a data frame and not an object in the global namespace, then an error will be generated because b was not found. In idiomatic Julia, b would need to be expressed as a symbol, or :b . Even then, a = mean(:b) would generate an error because it’s not possible to calculate the mean value of a symbol. To handle this using idiomatic Julia, DataFrames.jl introduces a mini-language that relies heavily on the creation of anonymous functions, with explicit directional pairs syntax using a source => function => destination syntax. While this is quite elegant, it can be verbose. TidierData.jl aims to reduce this complexity by exposing an R-like syntax, which is then converted into valid DataFrames.jl code. The reason that tidy expressions are considered valid by Julia in TidierData.jl is because they are implemented using macros. Macros “capture” the expressions they are given, and then they can modify those expressions before evaluating them. For consistency, all top-level dplyr functions are implemented as macros (whether or not a macro is truly needed), and all “helper” functions (used inside of those top-level functions) are implemented as functions or pseudo-functions (functions which only exist through modification of the abstract syntax tree).

Make broadcasting mostly invisible.

Broadcasting trips up many R users switching to Julia because R users are used to most functions being vectorized. TidierData.jl currently uses a lookup table to decide which functions not to vectorize; all other functions are automatically vectorized. Read the documentation page on “Autovectorization” to read about how this works, and how to override the defaults. An example of where this issue commonly causes errors is when centering a variable. To create a new column a that centers the column b , TidierData.jl lets you simply write a = b - mean(b) exactly as you would in R. This works because TidierData.jl knows to not vectorize mean() while also recognizing that - should be vectorized such that this expression is rewritten in DataFrames.jl as :b => (b -> b .- mean(b)) => :a . For any user-defined function that you want to “mark” as being non-vectorized, you can prefix it with a ~ . For example, a function new_mean() , if it had the same functionality as mean() would normally get vectorized by TidierData.jl unless you write it as ~new_mean() .
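To make the quoted centering example concrete, here is a minimal sketch in plain Julia (standard library only, no TidierData or DataFrames.jl). The vector `b` and its values are made up for illustration. It shows why `mean` has to stay un-vectorized while `-` is broadcast, which is the distinction the docs describe:

```julia
using Statistics

# Hypothetical column values, chosen only for illustration.
b = [1.0, 2.0, 6.0]

# What the docs describe: `mean` sees the whole column, `.-` is broadcast.
centered = b .- mean(b)

# If `mean` were vectorized too, it would be applied to each scalar,
# and the mean of a single number is that number, so everything cancels.
over_vectorized = b .- mean.(b)

println(centered)
println(over_vectorized)
```

The function inside the DataFrames.jl pair quoted above, `b -> b .- mean(b)`, has the same body as the `centered` line here.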

DataFrames.jl itself has a short comparison between manipulation frameworks. Tidier probably needs to be added there too.

adienes · July 30, 2024, 2:59pm

to be completely blunt, my impression of “Julia in data” is that sample size is pretty tiny across the board compared to other data ecosystems, so I’m not sure that there really is some bigger insight to draw besides that a few users decided to try the Cool New Thing. On my part I’ve repeatedly struggled even just to open my files (Arrow and Parquet) in Julia, and now pretty much exclusively use polars in Python for my dataframe needs.

I’m not saying this to criticize Tidier — I’ve never used it and I’m sure it’s a fine package.

aplavin · July 30, 2024, 3:04pm

Note that in Julia one doesn’t require custom data structures and heavy packages for data processing and analysis. A lot of data manipulation tasks can be achieved with arrays, map and filter that are in Base.

So a direct comparison with python/R that basically require using dataframes for anything is complicated. It’s a bit similar to the question “what is the analogue to numpy in Julia?” – “just use the base language” (:
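As a rough sketch of what this can look like, here is a tiny table stored as a plain `Vector` of named tuples and processed with `filter`, `map` and `sum` from Base. The rows and field names are invented for the example:

```julia
# A small "table" as a vector of NamedTuples; the data is made up.
people = [
    (name = "Ana", age = 43, income = 100),
    (name = "Bo",  age = 23, income = 200),
    (name = "Cy",  age = 54, income = 300),
]

# Row filtering with Base.filter.
over_40 = filter(p -> p.age > 40, people)

# Column selection with Base.map.
names_over_40 = map(p -> p.name, over_40)

# Aggregation with Base.sum and a function argument.
total_income = sum(p -> p.income, over_40)

println(names_over_40, " ", total_income)
```

No packages are loaded here. Whether this scales to a given workload is a separate question from whether it is possible.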

adienes · July 30, 2024, 3:05pm

manipulation, sure. where Julia significantly struggles is data pipelines, loading, IO, etc.

and unsurprisingly — this is very unsexy stuff to work on so people usually don’t do it for fun. the most robust ecosystems for this functionality are usually built for corporate needs

pdeffebach · July 30, 2024, 3:36pm

One thing that’s new in the past 2 years is metadata on columns for DataFrames (and any tabular object that implements the API).

@bkamins did all the work to implement this, but DataFramesMeta.jl added some macros which (I think) make the feature very usable.

julia> using DataFramesMeta  # also brings DataFrames and @chain into scope

julia> df = DataFrame(age = [43, 23, 54], income = [100, 200, 300]);

julia> @chain df begin
           @label! :age = "Age in years"
           @label! :income = "Annual household income (2020 USD)"
       end;

julia> printlabels(df)
┌────────┬────────────────────────────────────┐
│ Column │                              Label │
├────────┼────────────────────────────────────┤
│    age │                       Age in years │
│ income │ Annual household income (2020 USD) │
└────────┴────────────────────────────────────┘

julia> # Notes for longer information
       @chain df begin
           @note! :income = """
               Income was generated from the 2010 ACS survey, question B4 
           """
       end;

julia> printnotes(df)
Column: age
───────────
Label: Age in years

Column: income
──────────────
Label: Annual household income (2020 USD)
    Income was generated from the 2010 ACS survey, question B4

This brings a lot of Stata-like features to DataFrames.jl and DataFramesMeta.jl.

There is obviously a lot of work to do to incorporate this feature into the rest of the ecosystem, but it’s an excellent foundation.

nhz2 · July 30, 2024, 6:36pm

The new OhMyThreads.jl package combined with the Mmap.jl standard library should be great for handling large amounts of data in a file.
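A minimal sketch of the idea, with one loud substitution: to keep it dependency-free it uses `Threads.@spawn` from Base in place of OhMyThreads.jl, so it is not that package's API. The temporary file, its contents and the chunk size are all invented for the example:

```julia
using Mmap

# Write a throwaway binary file of Float64 values (illustrative data only).
n = 1_000_000
path, io = mktemp()
write(io, collect(1.0:Float64(n)))
close(io)

total = open(path) do f
    # Map the file into memory as a vector without reading it all up front.
    data = Mmap.mmap(f, Vector{Float64}, (n,))

    # Split the index range into chunks and sum each chunk on its own task.
    chunks = Iterators.partition(eachindex(data), 100_000)
    tasks = map(chunks) do r
        Threads.@spawn sum(view(data, r))
    end

    # Combine the per-chunk partial sums.
    sum(fetch, tasks)
end

println(total)
```

Start Julia with several threads (for example `julia -t auto`) for the tasks to actually run in parallel. With a single thread the code still runs, just sequentially.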

It’s still experimental, but I’m also excited about the potential of meggart/DiskArrayEngine.jl for parallel computations on large compressed or cloud datasets.

kdpsingh · July 30, 2024, 7:41pm

I love the addition of metadata to DataFrames.jl and DataFramesMeta.jl.

As one of the Tidier.jl authors, I’ll respond briefly to a couple of broad comments above about the Tidier ecosystem:

While Tidier is inspired by tidyverse, it’s important to note that most of the underlying R tidyverse packages aren’t idiomatic R code. In other words, we are borrowing data transformation conventions from tidyverse but not necessarily from R as a language. This isn’t all that different from Python’s polars, which has also borrowed some conventions from tidyverse, except that Tidier has borrowed much more liberally (made possible by Julia’s flexibility and macros).
Even the “auto-vectorization” (which abstracts away vectorization) is done for a good reason. It means that TidierData code works on TidierDB without major modifications. This would be much harder to do without handling vectorization in TidierData. As an aside, TidierData provides a mechanism to be explicit – nearly all of the “magic” can be overridden.
IMO, the reason some people are moving from R to Julia is that Julia lets you write code at multiple levels of abstraction without having to sacrifice speed. Explicitness is a means to an end in that it helps achieve compiler-friendly Julia code. In R, you’d have to resort to wrapping C++ code to get speedups. What I love about Julia is that you can write both the lower-level and higher-level code in the same language. Tidier is largely aimed at higher-level abstraction users - it’s not antithetical to Julia in that way (at least in my opinion).
The Tidier ecosystem is dealing with some of the data I/O issues head-on by abstracting away the backend components. For example, while TidierData.jl works on data frames, TidierDB.jl works directly on databases and provides good support for DuckDB, which itself has great file I/O support. If you’re dealing with a file type that isn’t well-supported in Julia, it’s usually possible to work with the data format using DuckDB via TidierDB – the data is kept in its original format, Julia code is converted to DuckDB-compatible SQL, and that SQL code is run directly on the file. TidierDB also supports nearly a dozen other database backends.

Just wanted to share our perspective on Tidier.jl. Also, we are working on adding TidierData to the DataFrames.jl frameworks page.

CameronBieganek · July 30, 2024, 8:37pm

Somewhat off-topic, but my impression is that polars has borrowed a lot more from SQL and relational algebra than from the tidyverse. (I use polars a lot at work. I used to use R and the tidyverse at a previous job.)

aplavin · July 30, 2024, 9:10pm

While I wouldn’t call it “significantly struggles”, I have noticed such things a few times as well. Julia-native file readers do sometimes have issues in less common edge cases. This is not specific to Julia, by the way; I have had issues with pandas as well.

I also noticed that it’s easy to wrap external readers that are presumably more widely used and more robust in edge cases. Just recently, I was playing with “DuckDB as a tabular reader/writer”, and a simplistic implementation is available at JuliaAPlavin/QuackIO.jl.
It provides functions like write_table(file, tbl), read_parquet(Tbl, file), …, so the user doesn’t need to know anything about DuckDB.
It turns out that the DuckDB-Julia integration is performant enough that, e.g., QuackIO.read_csv can even be faster than CSV.read.

I’ll probably clean QuackIO up (the main issue is that now it blindly interpolates SQL strings) and register at some point…
UPD: QuackIO.jl is registered!

adienes · July 30, 2024, 9:21pm

can I use DuckDB.jl or QuackIO.jl to read this file?

github.com/apache/arrow-julia

Failure to read valid file

opened 06:57PM - 26 Jul 24 UTC

adienes

both `pyarrow` and `polars` can read this table, but [mwe.arrow.zip](https://gi…thub.com/user-attachments/files/16395769/mwe.arrow.zip)

```
julia> Arrow.Table("mwe.arrow")
1-element ExceptionStack:
TaskFailedException

    nested task error: MethodError: no method matching init(::Nothing, ::Vector{UInt8}, ::Int64)

    Closest candidates are:
      init(::Type{T}, ::Vector{UInt8}, ::Integer) where T<:Union{Arrow.FlatBuffers.Struct, Arrow.FlatBuffers.Table}
       @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/FlatBuffers/table.jl:43

    Stacktrace:
     [1] getproperty(x::Arrow.Flatbuf.Field, field::Symbol)
       @ Arrow.Flatbuf ~/.julia/packages/Arrow/5pHqZ/src/metadata/Schema.jl:542
     [2] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
       @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/table.jl:668
     [3] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
       @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/table.jl:629
     [4] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
       @ Base ./abstractarray.jl:948
     [5] _collect
       @ ./array.jl:765 [inlined]
     [6] collect
       @ ./array.jl:759 [inlined]
     [7] macro expansion
       @ ~/.julia/packages/Arrow/5pHqZ/src/table.jl:526 [inlined]
     [8] (::Arrow.var"#102#108"{Bool, Channel{Any}, ConcurrentUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch, Int64})()
       @ Arrow ~/.julia/packages/ConcurrentUtilities/QOkoO/src/ConcurrentUtilities.jl:48
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ ./task.jl:480 [inlined]
 [3] Arrow.Table(blobs::Vector{Arrow.ArrowBlob}; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/table.jl:441
 [4] Table
   @ ~/.julia/packages/Arrow/5pHqZ/src/table.jl:415 [inlined]
 [5] Table
   @ ~/.julia/packages/Arrow/5pHqZ/src/table.jl:407 [inlined]
 [6] Arrow.Table(input::String)
   @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/table.jl:407
 [7] top-level scope
   @ REPL[4]:1
```

I would definitely call it a “significant struggle.” This is the third or fourth time over the past year I’ve just been fully unable to read my tables and had to give up until so-and-so fix is in, and I refuse to believe that somehow I am unlucky enough to only encounter “less common edgecases”

I am not saying it is specific to Julia, but there is a reason I mostly use polars now.

Satvik · July 30, 2024, 9:25pm

I don’t have any experience with R or Tidier, so I don’t have any of my own insight to give. However, I thought people were moving from R to Julia to gain speed and explicitness? Why then go back and make Julia act more like R?

In terms of explicitness… if you’re doing a lot of ad-hoc queries in a notebook, you want really concise syntax. If you’re adding a feature to a 30,000-line codebase, you want explicitness. So when I’m querying, I use DataFramesMeta, but when I’m checking in code I usually use raw DataFrames.
