What's the current (spring 2024) canonical approach to data science in Julia?

It is not a wrapper around the R package; it is a reimplementation of the R tidyverse in Julia:

You could use DuckDB for data analytics, which would also make it easy to switch between languages.

That will probably be a thing of the past in Makie 0.21, with "Unit support for Axes & Recipes, a.k.a. axis converts" (Pull Request #3226 by SimonDanisch, MakieOrg/Makie.jl):

using CairoMakie, Dates

lines(now() .+ Millisecond.(1:100), cumsum(randn(100)))

(The ticks aren’t great but that’s a different issue)

:exploding_head:

Personally, I use ggplot2 for plotting in Jupyter notebooks via RCall.jl (JuliaInterop/RCall.jl: Call R from Julia).

Also note that, unlike in Python and other languages, you don’t really need any fancy specialized data structures for tabular data in Julia. So using something like DataFrames is completely optional!

For basic tabular operations, you don’t need any dependencies at all: a vector of NamedTuples is already a table, and functions like map and filter are available in Julia Base.
Such a table is fully featured, convenient to work with, and can be read from or written to a file with any tabular IO package (e.g. CSV).
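
A minimal sketch of that idea, using only Julia Base. The table contents and field names (`name`, `age`, `city`) are made up for illustration:

```julia
# A table as a plain Vector of NamedTuples; no packages needed.
people = [
    (name = "Ada",   age = 36, city = "London"),
    (name = "Grace", age = 45, city = "New York"),
    (name = "Linus", age = 28, city = "Helsinki"),
]

# Filter rows with Base.filter.
over30 = filter(r -> r.age > 30, people)

# Select and transform columns with Base.map; each result row is a new NamedTuple.
next_year = map(r -> (name = r.name, age_next_year = r.age + 1), over30)

# Pull out a single column, or aggregate over it.
ages = [r.age for r in people]
total_age = sum(r -> r.age, people)

# With CSV.jl installed (an extra dependency, not needed for the code above),
# the same vector can be written out directly: CSV.write("people.csv", people)
```

The point of the sketch is only that rows are ordinary values, so everything that works on a vector works on the table.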

For more advanced operations, like groups or joins, there are packages to help. See SplitApplyCombine.jl (older & more popular), DataManipulation.jl (newer, faster, more extensive), FlexiJoins.jl (the most versatile join function to my knowledge, not just among Julia alternatives).

As a bonus, these data structures and functions take you much further than flat tables, with the exact same familiar interface.
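
As a rough illustration of going beyond flat tables, here is a sketch in plain Julia Base where each row holds a vector and a nested NamedTuple. The data and field names (`id`, `samples`, `meta`) are hypothetical:

```julia
# Rows with nested fields: a vector of samples and a nested metadata tuple.
runs = [
    (id = 1, samples = [0.5, 1.5, 2.5], meta = (instrument = "A",)),
    (id = 2, samples = Float64[],       meta = (instrument = "B",)),
    (id = 3, samples = [4.0, 6.0],      meta = (instrument = "A",)),
]

# The same filter/map used for flat tables works unchanged on nested rows.
nonempty = filter(r -> !isempty(r.samples), runs)

means = map(nonempty) do r
    (id = r.id,
     instrument = r.meta.instrument,
     mean = sum(r.samples) / length(r.samples))
end
```

Nothing here is specific to tables; the result `means` is again a vector of NamedTuples and can be processed further in the same way.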

I just want to add InteractiveViz.jl, which is based on Makie and is very useful when you want to visualize a large number of points.

In addition to DataFrames.jl and Makie.jl, which several people have already mentioned, I like to use Transducers.jl for data wrangling. It takes some time to get used to, but it makes it very easy to write reusable code for data transformations. I find the MapCat transducer particularly useful.
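
A small sketch of what MapCat does, assuming Transducers.jl is installed. The helper `explode_tags` and the example records are made up for illustration: MapCat applies a function that returns a collection to each input and concatenates the results, and the transducer itself is a reusable value independent of any particular input.

```julia
using Transducers: MapCat

# Hypothetical helper: turn one record into several (id, tag) rows.
explode_tags(r) = [(id = r.id, tag = t) for t in r.tags]

# The transducer is defined once, without reference to any data...
xf = MapCat(explode_tags)

posts_a = [(id = 1, tags = ["julia", "plots"]), (id = 2, tags = ["julia"])]
posts_b = [(id = 7, tags = String[]), (id = 8, tags = ["pluto"])]

# ...and then applied to different inputs.
rows_a = collect(xf, posts_a)
rows_b = collect(xf, posts_b)
```

Here `rows_a` holds one row per (post, tag) pair, and posts without tags simply contribute no rows.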

Super exciting thread!

DataFrames is enough for me; I also use RCall for functionality missing from Julia libraries (like a well-tested batch correction algorithm!). I use Pluto + WGLMakie for data interaction. It sometimes needs a reload, but it is LOVELY when it works: you can figure out what your outliers are by mouse hover (DataInspector)!

Despite the downsides, I’ve been using Pluto more and more, because the reactivity means that once the script is right, you have a rock-solid step in your pipeline with minimal statefulness. You MUST split off the expensive calculations and save the results to avoid getting stuck rerunning the big steps, but this is good hygiene anyway.

I’d love to have slightly more flexible reactivity and mutation, but Pluto is STILL worth it to me.

For everyone using Pluto and bummed about long-running operations: it is possible to disable cells so that they don’t run with every change. Then, when you need to, just enable the cell and reactivity will be turned back on. I’m sure many of you are aware of that already, but I thought I should mention it just in case.

I work in a large corporation and use Julia daily for complex data manipulation. I use the usual DataFrames.jl, Makie.jl, etc. for everyday tasks, but something perhaps unique is my use of ACSets.jl (AlgebraicJulia/ACSets.jl: algebraic databases as in-memory data structures) and sometimes the additional power of Catlab.jl (AlgebraicJulia/Catlab.jl: a framework for applied category theory in the Julia language) for manipulating data that would otherwise be stored in lots of individual data frames. Being able to define a schema and write algorithms at the schema level, knowing they will work with any particular data instance, has been quite helpful.

As the author of Tidier.jl, I’ll just add that I generally agree with the comments above. However, we are aiming to make the package useful for Julia users at large and more than just former R users. I’ve personally used many other Julia and non-Julia frameworks and believe that tidyverse strikes a good balance between simplicity, functionality, and consistency across data analysis and plotting. Even though we have adopted some tidyverse defaults in Tidier, we allow for some of that to be tailored, and we take advantage of Julia functionality that couldn’t be as easily accomplished in the R version.

One of our current directions is to make Tidier code work on multiple backends through the TidierDB package. It will work with SQLite, PostgreSQL, DuckDB, and more. The goal is that, fairly soon, TidierData code will work nearly identically across data frames and databases.

So I wouldn’t write off Tidier as only geared towards R users, but I have used and benefited from DataFrames, DataFramesMeta, and Query.jl, and I understand the appeal of each.

Our README addresses some of this: TidierOrg/Tidier.jl (meta-package for data analysis in Julia, modeled after the R tidyverse).

And I hope to put more work into our course consisting of Pluto notebooks.

Interesting. Do you have an example that you can share of how you solve a data science task with those tools?

These sound interesting, but the docs are quite involved. Do you happen to have simple examples showcasing this algebraic approach?

@aplavin and @simsurace, yes, I’m trying to write a blog post after filtering out the sensitive stuff. For now, it’s been very helpful to have what is essentially an in-memory database in which I can store JuMP VariableRefs, bringing my data and optimization models together in a very natural way. Being able to migrate data between schemas is also essential for processing the output.

@aplavin I’d start with these 2 blog posts: "Graphs and C-sets I: What is a graph?" (AlgebraicJulia blog), and then, for some examples of conjunctive queries, "C-sets for data analysis: relational data and conjunctive queries" (AlgebraicJulia blog).
