Due to personal circumstances, I’ve been away from Julia (and open source in general) for a couple of years, so I haven’t kept up with what’s going on in data.
I have seen some TidierData info on X but am otherwise not up to date.
Could you write down what you think is the latest and greatest in the Julia data scene? What about handling large datasets?
Have you used the other data manipulation packages? DataFramesMeta.jl and Queryverse.jl come to mind. If you feel inspired to write a short comparison with pros and cons, similar to the one you wrote about plotting packages, I would love to read it!
I’ve used them all at some point, mostly in customer/company contexts. I should write one, but the tl;dr is that when I finally tried the Tidier stuff, it was just clean. I didn’t check performance or try to parallelize it; it was more about getting the job done in this context. It was really well documented and easy to use, so it has become my go-to.
I am surprised to see so much love for Tidier from those deeply ingrained in Julia, given that it explicitly favors R conventions over Julia conventions. I thought there was a lot of strong pushback from Julians against the amount of “magic” in R, where you never fully know what an expression is going to do?
I don’t have any experience with R or Tidier, so I don’t have any of my own insight to give. However, I thought people were moving from R to Julia to gain speed and explicitness? Why then go back and make Julia act more like R?
Whereas other meta-packages introduce Julia-centric idioms for working with DataFrames, this package’s goal is to reimplement parts of tidyverse in Julia. This means that TidierData.jl uses tidy expressions as opposed to idiomatic Julia expressions. An example of a tidy expression is a = mean(b). In Julia, a and b are variables and are thus “eagerly” evaluated. This means that if b is merely referring to a column in a data frame and not an object in the global namespace, then an error will be generated because b was not found. In idiomatic Julia, b would need to be expressed as a symbol, or :b. Even then, a = mean(:b) would generate an error because it’s not possible to calculate the mean value of a symbol.
To handle this using idiomatic Julia, DataFrames.jl introduces a mini-language that relies heavily on the creation of anonymous functions, with explicit directional pairs written as source => function => destination. While this is quite elegant, it can be verbose. TidierData.jl aims to reduce this complexity by exposing an R-like syntax, which is then converted into valid DataFrames.jl code.
Tidy expressions are considered valid by Julia in TidierData.jl because they are implemented using macros. Macros “capture” the expressions they are given, and then they can modify those expressions before evaluating them. For consistency, all top-level dplyr functions are implemented as macros (whether or not a macro is truly needed), and all “helper” functions (used inside of those top-level functions) are implemented as functions or pseudo-functions (functions which only exist through modification of the abstract syntax tree).
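To make the macro idea in the quoted documentation concrete, here is a minimal sketch in plain Julia. The macro name `@tidy_pair` is hypothetical and is not part of TidierData.jl; it handles only the single pattern `dest = f(src)` and shows how a macro can capture an expression and rewrite it into a source => function => destination pair without ever evaluating `b`.

```julia
using Statistics  # standard library; provides mean

# Hypothetical toy macro (an assumption for illustration, not TidierData.jl code).
macro tidy_pair(ex)
    # Expect an expression of the form `dest = f(src)`.
    (ex isa Expr && ex.head == :(=)) || error("expected `dest = f(src)`")
    dest, call = ex.args
    (call isa Expr && call.head == :call && length(call.args) == 2) ||
        error("expected a one-argument function call on the right-hand side")
    f, src = call.args
    # Build `:src => f => :dest`. Only `f` is looked up in the caller's scope;
    # `src` and `dest` are turned into symbols and never evaluated.
    return :($(QuoteNode(src)) => $(esc(f)) => $(QuoteNode(dest)))
end

# Neither `a` nor `b` is defined anywhere, yet this does not error.
p = @tidy_pair a = mean(b)
println(p.first, " -> ", p.second.second)
```

In DataFrames.jl, a pair of this shape is what you would pass to functions such as `transform` or `combine`.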
Make broadcasting mostly invisible.
Broadcasting trips up many R users switching to Julia because R users are used to most functions being vectorized. TidierData.jl currently uses a lookup table to decide which functions not to vectorize; all other functions are automatically vectorized. Read the documentation page on “Autovectorization” to read about how this works, and how to override the defaults. An example of where this issue commonly causes errors is when centering a variable. To create a new column a that centers the column b, TidierData.jl lets you simply write a = b - mean(b) exactly as you would in R. This works because TidierData.jl knows to not vectorize mean() while also recognizing that - should be vectorized, such that this expression is rewritten in DataFrames.jl as :b => (b -> b .- mean(b)) => :a. For any user-defined function that you want to “mark” as being non-vectorized, you can prefix it with a ~. For example, a function new_mean(), if it had the same functionality as mean(), would normally get vectorized by TidierData.jl unless you write it as ~new_mean().
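The centering example can be reproduced in plain Julia to see why the rewrite matters. This is only a sketch of the anonymous function that appears in the rewritten pair, run on a made-up vector; it does not use TidierData.jl or DataFrames.jl.

```julia
using Statistics  # standard library; provides mean

b = [1.0, 2.0, 6.0]  # made-up column values

# Without broadcasting, "vector minus scalar" is not defined in Julia.
threw = try
    b - mean(b)
    false
catch err
    err isa MethodError
end

# The function from the rewritten pair: `-` is broadcast, `mean` is not.
center = b -> b .- mean(b)
a = center(b)
println(a)
```

Broadcasting `mean` as well (`mean.(b)`) would compute the mean of each element separately, which is why a reducing function like this has to be excluded from automatic vectorization.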
to be completely blunt, my impression of “Julia in data” is that sample size is pretty tiny across the board compared to other data ecosystems, so I’m not sure that there really is some bigger insight to draw besides that a few users decided to try the Cool New Thing. On my part I’ve repeatedly struggled even just to open my files (Arrow and Parquet) in Julia, and now pretty much exclusively use polars in Python for my dataframe needs.
I’m not saying this to criticize Tidier — I’ve never used it and I’m sure it’s a fine package.
Note that in Julia one doesn’t require custom data structures and heavy packages for data processing and analysis. A lot of data manipulation tasks can be achieved with arrays, map and filter that are in Base.
So a direct comparison with python/R that basically require using dataframes for anything is complicated. It’s a bit similar to the question “what is the analogue to numpy in Julia?” – “just use the base language” (:
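As a small sketch of that point, a vector of named tuples plus `filter` and `map` from Base already covers a typical "filter rows, pick a column, summarize" task. The names and values below are made up for illustration.

```julia
using Statistics  # standard library; provides mean

# A "table" as a plain vector of named tuples (made-up data).
people = [
    (name = "Ann", age = 43, income = 100),
    (name = "Bob", age = 23, income = 200),
    (name = "Cy",  age = 54, income = 300),
]

over_40 = filter(p -> p.age > 40, people)   # keep rows matching a condition
incomes = map(p -> p.income, over_40)       # extract one field
mean_income = mean(incomes)                 # summarize

println(mean_income)
```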
manipulation, sure. where Julia significantly struggles is data pipelines, loading, IO, etc.
and unsurprisingly — this is very unsexy stuff to work on so people usually don’t do it for fun. the most robust ecosystems for this functionality are usually built for corporate needs
One thing that’s new in the past 2 years is column metadata for DataFrames (and any tabular object that implements the API).
@bkamins did all the work to implement this, but DataFramesMeta.jl added some macros which (I think) make the feature very usable.
julia> df = DataFrame(age = [43, 23, 54], income = [100, 200, 300]);
julia> @chain df begin
@label! :age = "Age in years"
@label! :income = "Annual household income (2020 USD)"
end;
julia> printlabels(df)
┌────────┬────────────────────────────────────┐
│ Column │ Label                              │
├────────┼────────────────────────────────────┤
│ age    │ Age in years                       │
│ income │ Annual household income (2020 USD) │
└────────┴────────────────────────────────────┘
julia> # Notes for longer information
@chain df begin
@note! :income = """
Income was generated from the 2010 ACS survey, question B4
"""
end;
julia> printnotes(df)
Column: age
───────────
Label: Age in years
Column: income
──────────────
Label: Annual household income (2020 USD)
Income was generated from the 2010 ACS survey, question B4
This brings a lot of Stata-like features to DataFrames.jl and DataFramesMeta.jl.
There is obviously a lot of work to do to incorporate this feature into the rest of the ecosystem, but it’s an excellent foundation.
The new OhMyThreads.jl package combined with the Mmap standard library should be great for handling large amounts of data in a file.
It’s still experimental, but I’m also excited about the potential of meggart/DiskArrayEngine.jl for parallel computations on large compressed or cloud datasets.
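Here is a rough sketch of the memory-mapping idea mentioned above. To stay dependency-free it uses `Threads.@spawn` from Base rather than OhMyThreads.jl, so treat it as a stand-in for the pattern and not as a demonstration of that package. The temporary file and the chunk size of 250 are arbitrary choices for illustration.

```julia
using Mmap  # standard library

# Write a small binary file of Float64 values (stand-in for a large dataset).
path, io = mktemp()
data = collect(1.0:1000.0)
write(io, data)
close(io)

# Memory-map the file: the OS pages data in on demand instead of
# reading the whole file into memory up front.
total = open(path) do f
    x = Mmap.mmap(f, Vector{Float64}, length(data))
    # Split the index range into chunks and sum each chunk in its own task.
    chunks = Iterators.partition(eachindex(x), 250)
    tasks = [Threads.@spawn sum(@view x[r]) for r in chunks]
    sum(fetch.(tasks))
end

println(total)
```

Start Julia with several threads (for example `julia -t 4`) for the chunks to actually run in parallel; with a single thread the code still works, just sequentially.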
I love the addition of metadata to DataFrames.jl and DataFramesMeta.jl.
As one of the Tidier.jl authors, I’ll respond briefly to a couple of broad comments above about the Tidier ecosystem:
While Tidier is inspired by tidyverse, it’s important to note that most of the underlying R tidyverse packages aren’t idiomatic R code. In other words, we are borrowing data transformation conventions from tidyverse but not necessarily from R as a language. This isn’t all that different from Python’s polars, which has also borrowed some conventions from tidyverse, except that Tidier has borrowed much more liberally (made possible by Julia’s flexibility and macros).
Even the “auto-vectorization” (which abstracts away vectorization) is done for a good reason. It means that TidierData code works on TidierDB without major modifications. This would be much harder to do without handling vectorization in TidierData. As an aside, TidierData provides a mechanism to be explicit – nearly all of the “magic” can be overridden.
IMO, the reason some people are moving from R to Julia is that Julia lets you write code at multiple levels of abstraction without having to sacrifice speed. Explicitness is a means to an end in that it helps achieve compiler-friendly Julia code. In R, you’d have to resort to wrapping C++ code to get speedups. What I love about Julia is that you can write both the lower-level and higher-level code in the same language. Tidier is largely aimed at higher-level abstraction users - it’s not antithetical to Julia in that way (at least in my opinion).
The Tidier ecosystem is dealing with some of the data I/O issues head-on by abstracting away the backend components. For example, while TidierData.jl works on data frames, TidierDB.jl works directly on databases and provides good support for DuckDB, which itself has great file I/O support. If you’re dealing with a file type that isn’t well-supported in Julia, it’s usually possible to work with the data format using DuckDB via TidierDB – the data is kept in its original format, Julia code is converted to DuckDB-compatible SQL, and that SQL code is run directly on the file. TidierDB also supports nearly a dozen other database backends.
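To give a feel for what "Julia code is converted to SQL" can mean, here is a deliberately tiny sketch that turns a Julia comparison expression into a SQL WHERE clause. The function `to_sql` and the file name `data.parquet` are hypothetical; this is not how TidierDB.jl is implemented, which handles far more than this. `read_parquet` is DuckDB's SQL function for querying a Parquet file in place.

```julia
# Hypothetical toy translator (an assumption for illustration only).
SQL_OPS = Dict(:(==) => "=", :(!=) => "<>", :(>) => ">",
               :(<) => "<", :(>=) => ">=", :(<=) => "<=")
AND_HEAD = Symbol("&&")
OR_HEAD = Symbol("||")

to_sql(x::Symbol) = string(x)   # bare names are treated as column names
to_sql(x::Number) = string(x)   # numeric literal
to_sql(x::AbstractString) = "'" * replace(x, "'" => "''") * "'"  # quoted literal

function to_sql(ex::Expr)
    if ex.head === AND_HEAD || ex.head === OR_HEAD
        joiner = ex.head === AND_HEAD ? " AND " : " OR "
        return "(" * to_sql(ex.args[1]) * joiner * to_sql(ex.args[2]) * ")"
    elseif ex.head === :call && length(ex.args) == 3 && haskey(SQL_OPS, ex.args[1])
        op, lhs, rhs = ex.args
        return to_sql(lhs) * " " * SQL_OPS[op] * " " * to_sql(rhs)
    end
    error("unsupported expression: $ex")
end

# The quoted expression is never evaluated in Julia; it is only translated.
cond = to_sql(:(income > 150 && age < 50))
query = "SELECT * FROM read_parquet('data.parquet') WHERE " * cond
println(query)
```

The resulting string could then be handed to a database engine, which reads the file itself, so Julia never has to parse the file format.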
Just wanted to share our perspective on Tidier.jl. Also, we are working on adding TidierData to the DataFrames.jl frameworks page.
Somewhat off-topic, but my impression is that polars has borrowed a lot more from SQL and relational algebra than from the tidyverse. (I use polars a lot at work. I used to use R and the tidyverse at a previous job.)
While I wouldn’t call it “significantly struggles”, I have noticed such things a few times as well. Sometimes, Julia-native file readers do have issues in less common edge cases. This is not specific to Julia, by the way; I have had issues with pandas as well.
I also noticed it’s easy to wrap external readers that are presumably more widely used and more robust in edge cases. Just recently, I was playing with “duckdb as tabular reader/writer”, and a simplistic implementation is available at JuliaAPlavin/QuackIO.jl.
It provides functions like write_table(file, tbl), read_parquet(Tbl, file), … – the user doesn’t need to know anything about DuckDB.
It turns out that DuckDB-Julia integration is performant enough that, e.g., QuackIO.read_csv can even be faster than CSV.read.
I’ll probably clean QuackIO up (the main issue is that right now it blindly interpolates SQL strings) and register it at some point… UPD: QuackIO.jl is registered!
can I use DuckDB.jl or QuackIO.jl to read this file?
I would definitely call it a “significant struggle.” This is the third or fourth time over the past year I’ve just been fully unable to read my tables and had to give up until so-and-so fix is in, and I refuse to believe that somehow I am unlucky enough to only encounter “less common edge cases.”
I am not saying it is specific to Julia, but there is a reason I mostly use polars now.
I don’t have any experience with R or Tidier, so I don’t have any of my own insight to give. However, I thought people were moving from R to Julia to gain speed and explicitness? Why then go back and make Julia act more like R?
In terms of explicitness… if you’re doing a lot of ad-hoc queries in a notebook, you want really concise syntax. If you’re adding a feature to a 30,000-line codebase, you want explicitness. So when I’m querying, I use DataFramesMeta, but when I’m checking in code, I usually use raw DataFrames.