What's the current (spring 2024) canonical approach to data science in Julia?

I have been using Julia for building, solving, and simulating computational models for a few years now, but my empirical (“data science”) work has remained in legacy languages – Python for data wrangling and plotting, Stata for proper econometrics.

I want to see if I can transition to Julia for those tasks as well, partly in search of one-language elegance and partly because there are things I dislike about both of those alternatives (pandas syntax and constant changes are infuriating, while Stata is proprietary and essentially antithetical to well-written modular code).

But while I’ve gotten generally familiar with DataFrames.jl, I’m not sure what kind of workflow I should develop and what auxiliary packages I should invest my time in learning.

Should I use base DataFrames.jl? DataFramesMeta.jl? Tidier.jl? What do most people use these days?

My current Python workflow is usually – play around with stuff in a Jupyter notebook, then move backend-type code into py files while continuing to use Jupyter as a frontend.

Should I replicate this workflow exactly with Julia or is there a better alternative? Should I use Jupyter or Pluto.jl?

I understand that my questions are asking for opinions, and one may just be tempted to answer, “try out all approaches and see what you’re most comfortable with.” But I’d like to speed the process along by learning the workflow that most others use. I also know that there are similar threads on here from a few years ago, but my understanding is that the toolkit has evolved substantially, so those threads may be outdated.

Thanks in advance!

7 Likes

I don’t think things have evolved massively. I’d say Tidier.jl is great if you come from R and want to ease the transition but that doesn’t seem to apply to you. DataFramesMeta is quite popular I believe, although I stick to plain vanilla DataFrames pretty much exclusively.

On Jupyter/Pluto I find the reactivity of Pluto often a hindrance for Data Sciency workflows (where I often re-use variable names, and accidental recomputation of a computationally expensive step is quite painful).

8 Likes

I do plenty of “data science” with DataFramesMeta.jl (I’m biased, I’m a maintainer of DataFramesMeta.jl), a vanilla Terminal and Sublime Text. I think I’m decently productive. Scripts are fine.

For graphing I use AlgebraOfGraphics.jl. I output all results into LaTeX, and use PrettyTables.jl extensively.
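For anyone unfamiliar with that last piece, here is a minimal sketch of the PrettyTables.jl-to-LaTeX step (the table contents are made up for illustration):

```julia
using DataFrames, PrettyTables

# A toy regression-style summary table (made-up numbers)
df = DataFrame(variable = ["intercept", "slope"],
               estimate = [0.12, 1.98],
               se       = [0.05, 0.11])

# Render the table as LaTeX source; pretty_table(String, ...) returns the
# output as a string instead of printing it to stdout.
latex = pretty_table(String, df; backend = Val(:latex))
println(latex)
```

The resulting string contains a standard tabular environment that can be \input into a paper.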

For exploring data, I rely on TerminalPager.jl as well as FloatingTableView.jl (which I maintain).

I think it’s a decent stack and it gets the job done well.

I also use a few helper packages extensively for data cleaning (many of which I maintain, and I don’t think have that much use outside of me)

  1. Missings.jl, for making missing data easier to handle
  2. MissingsAsFalse.jl, also for handling missing data
  3. AddToField.jl, for making named tuples easier, as well as miscellaneous data cleaning tasks
  4. ClipData.jl, for copying things in from Excel (really useful for interactive work)
13 Likes

I’ve come to rely on DataFramesMeta.jl heavily… it’s almost embarrassing at this point because I can’t get anything done without it :grin:

I’ve also been using Pluto.jl more and more as time has gone by because I find it very quick and easy to fire up a notebook and start writing code, and I really like its Live Docs feature. You do have to be careful, though, as Nils mentioned, about the reactivity and computation times. If I know up front that parts of my code are going to take a while to run, I will generally go the VS Code route from the start. I do think it’s fairly common for Julia folks to use Jupyter notebooks, though, so if you like them and are used to them, I would say stick with it.

For plotting, I mostly stick to StatsPlots.jl, and then I might use Makie if I need more control. But I have a JavaScript background, so I also still turn to JS plotting solutions when I want something really custom or really fancy.

3 Likes

Like @nilshg, I tend to use vanilla DataFrames, but there’s a good reason that the tidyverse in R has such strong adherents, and I think the work that’s going into Tidier.jl is remarkable. I keep thinking I’m going to try to take it up, but since it will damage my productivity in the short term, I haven’t yet.

I tend to use “notebooks”, but in the form of “literate” code (see eg Literate.jl). In practice, this often just amounts to a linear script that has block markers that I use to navigate around in. In principle, I could convert these to proper Jupyter notebooks if I wanted to, but I rarely do.

One thing that people haven’t mentioned is Projects - managing your package environment is really important to get started on early. If you use conda or virtual envs in python, it’s similar (but much much much much nicer in Julia). I usually set up my projects like a Julia package, and stick code I reuse a lot into the package module, but this can be overly complicated. At base, you really just need a directory, and then do ] activate . before you get started. Then, when you add packages, they get saved to Project.toml and Manifest.toml that make your environment reproducible.
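For anyone new to this, the setup described above amounts to something like the following (the package names are just examples):

```julia
# In the Julia REPL, from your project directory:
#   julia> ]                      # enter Pkg mode
#   (@v1.10) pkg> activate .
#   (myproject) pkg> add DataFrames CSV
#
# Or equivalently, via the Pkg API from a script:
using Pkg
Pkg.activate(".")                 # creates/uses Project.toml in this directory
Pkg.add(["DataFrames", "CSV"])    # recorded in Project.toml + Manifest.toml
```

Anyone who later clones the directory can run `Pkg.activate("."); Pkg.instantiate()` to reproduce the exact same environment from the Manifest.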

Finally, welcome to the community! It’s a great place to be :blush:

13 Likes

I also find myself relying on DataFramesMeta.jl; I did not come from R, so Tidier.jl and others did not feel familiar.

For tools, I almost exclusively use VS Code since it has built-in data exploration tools and I use GitHub heavily for syncing code across computers, so standard scripts seem to flow more nicely. My “standard workflow” is I create a new project with a “src” and “scripts” folder and activate the project (as @kevbonham mentions). In the “scripts” folder I will have my main script, which also uses Revise.jl and includes (via includet) files in the “src” folder. For example, I will have “src/functions.jl” which I add to over time and using Revise.jl means I can make changes or add functions anytime without having to rerun anything in the main script.
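Sketching that layout (the file names are the ones mentioned above; my_cleaning_step is a hypothetical function for illustration):

```julia
# scripts/main.jl -- the "driver" script
using Revise                      # tracks source files for live reloading
using DataFrames

includet("../src/functions.jl")   # includet (from Revise) re-evaluates the
                                  # file automatically whenever it changes

# Functions defined in src/functions.jl can now be edited there and
# called again here without restarting Julia:
df = DataFrame(x = 1:3)
result = my_cleaning_step(df)     # hypothetical function from functions.jl
```

This is a workflow sketch rather than a runnable snippet, since it assumes the src/functions.jl file exists alongside it.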

For plotting, I rely on Gadfly.jl, largely because I like exporting it to PGF and LaTeX, though I occasionally look at AlgebraOfGraphics.jl as well.

1 Like

Thanks, all. Very helpful. I’m familiar with basic Julia workflows (i.e., project, environment, Revise) but I appreciate that this thread may be useful to those new to both data science in Julia as well as Julia itself.

A few follow up questions, if I may.

@pdeffebach – the missing-related packages that you use: do they somehow allow you to avoid typing skipmissing in every data-wrangling operation? I find syntax like this kind of ugly and tedious:

using DataFrames, DataFramesMeta, Statistics

dfc = @chain df begin
	groupby(:date)
	combine(:price .=> x -> mean(skipmissing(x)), renamecols=false)
end

I’ve also noticed that Pluto.jl’s reactivity gets annoying with large datasets and the restrictions on one command per cell and no variable name reuse make it better for developing a polished front-end for an existing application than for exploring data.

I’m used to VS Code for both Julia and my Python+Jupyter workflows, so maybe I’ll just stick to that. IJulia doesn’t play as nicely with VS Code’s Jupyter extension as Python does, but as long as I can get plotting to work well when using a REPL + scratch file workflow, that’s fine.

@mthelm85 Your preference for Makie over Plots.jl seems to be a common one these days. I’ve used Plots in the past for non-data science applications. Do you prefer Makie because it’s better for data science or because it’s better overall (and therefore I should start using it in my other work too)?

I appreciate the recommendations of others for Gadfly and AlgebraOfGraphics but these “graphics languages” seem really counter-intuitive to me. I can see how they’re useful for building complex visualizations but “linear” plotting makes more sense to me for exploration.

Finally, when working with data, I find needing to interact with plots quite often – sometimes by zooming and panning, other times by quickly creating interactive interfaces. The latter is very easy in Python with ipywidgets. Is there something similar in Julia? And for displaying and interacting with plots in VS Code, what’s the best backend? Plotly?

1 Like

Two points:

  1. The Julia VSCode extension natively sports Jupyter notebooks these days, so no need for IJulia (although imo the experience is still not as nice as just using IJulia in the browser)

  2. I’ve been trying to transition to Makie for years, but in practice nearly all the data I use has a time dimension, and the fact that Makie doesn’t have date support (so I can’t just do scatter(df.date, df.price) or whatever) turns out to be severely limiting.

4 Likes

With regards to skipmissing… yeah, it’s a PITA.

using DataFrames, DataFramesMeta, Statistics

dfc = @chain df begin
	groupby(:date)
	@combine :price = mean(skipmissing(:price))
end

is a bummer to do all the time. There have been endless online discussions about this with no real consensus reached. I don’t have much to say beyond that. DataFramesMeta.jl may introduce features to automatically skip missings in the future, but even that will still require some typing on the user’s part.
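One small mitigation (just a pattern, not a DataFramesMeta.jl feature; passmean is a made-up local name) is to define the composition once and reuse it:

```julia
using DataFrames, DataFramesMeta, Statistics

# Compose once, reuse in every @combine/@transform call
passmean = mean ∘ skipmissing

df = DataFrame(date = [1, 1, 2], price = [1.0, missing, 3.0])

dfc = @chain df begin
    groupby(:date)
    @combine :price = passmean(:price)
end
# dfc.price == [1.0, 3.0]
```

Any callable in scope can be used inside the DataFramesMeta macros this way, so the skipmissing boilerplate only has to be written once.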

2 Likes

I agree that it is ugly and tedious.

There have been many discussions related to missing handling, and it seems, unfortunately, that “the people” writ large like it that way, so I would not expect any changes there anytime soon.

Regarding Plots vs Makie: I think Makie is more or less just “better”, with the only caveat that it has a steeper learning curve. But once you get some familiarity, it’s a very nice plotting library with lots of power.

3 Likes

On a similar note, AlgebraOfGraphics.jl basically lacks any kind of missing-values support. Part of this is bad argument handling on AlgebraOfGraphics.jl’s side of things, since Makie doesn’t have the same issues, but a lot of it is also upstream: histograms, loess, etc. A concerted effort needs to be made by various parties to get it up to speed.

I am pretty happy with TableTransforms.jl + PairPlots.jl. It is really easy to create data science pipelines and visualize the multivariate distributions. The scripts are clean and readable, and you can easily reuse pipelines elsewhere later.

2 Likes

I know a lot of these packages have already been mentioned, but here are some considerations for my various workflows in data analysis:

Otherwise, getting into the weeds of workflows is tricky in the sense that one can do them in multiple different ways here in Julia. Plus, what workflows you want to create is a very open-ended thing (dashboards? plotting recipes? accessing databases? data harmonization?) that lends itself well to a bigger discussion. The Julia Data Science text by Jose Storopoli might also be interesting. I hope this helps give some additional thoughts!

6 Likes

I didn’t know about this package. I’m curious to know:
Does this package allow you to do all the operations you can do with DataFrames, for example?
What are the main differences (besides the reversibility of the transformations) in terms of functions and performance compared with DataFrames?

To ask a more precise question, is there a function equivalent to groupby()?

It has most, if not all, operations. We use it daily to handle arbitrary Tables.jl tables, including DataFrames.jl.

You create composable pipelines that work with normal tables, but also with other more sophisticated table types such as GeoTable. See Part II of Geospatial Data Science with Julia

Other packages provide the split-apply-combine pattern for general Tables.jl tables, for example JuliaData/SplitApplyCombine.jl (split-apply-combine strategies for Julia).

After you get used to TableTransforms.jl it is very hard to come back to other approaches. The main challenge right now is that we perform operations by copying the results. If you need to mutate extremely large (GBs of data) dataframes, then a custom solution with DataFrames.jl may be more appropriate.
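A minimal sketch of what such a pipeline looks like (a sketch based on the TableTransforms.jl docs; Select and Center are two of its standard transforms):

```julia
using DataFrames, TableTransforms

df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])

# Compose transforms into a reusable pipeline with the → operator (\to)
pipe = Select(:a, :b) → Center()

# Apply it: returns the transformed table plus a cache for reverting
newdf, cache = apply(pipe, df)

# Reversibility: recover the original (pre-transform) values
orig = revert(pipe, newdf, cache)
```

The same pipe object can then be applied to other Tables.jl-compatible tables, which is what makes the pipelines easy to reuse.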

1 Like

A few other tools for reproducible data science are listed here:

https://modernjuliaworkflows.github.io/sharing/#reproducibility

2 Likes

You’ll find many packages mentioned on every site, and I imagine you could easily feel overwhelmed if you’re new. In case you want to know which ones are the “standard” packages (in the sense of being used by more people, rather than my personal preferences), here’s a quick rundown.

FOR DATA MANIPULATION:
DataFrames → the standard package for working with tabular data. It’s by far the most used.

DataFramesMeta → note this is an extension of DataFrames. The main goal is to provide a simpler syntax for DataFrames. Nonetheless, this also implies that the syntax doesn’t resemble Julia’s style, which can be inconvenient if you’re new to Julia and learning its style.

Tidier → a w̶r̶a̶p̶p̶e̶r̶ reimplementation of the popular R package. Notice that this is relatively new to the Julia ecosystem. I don’t think it has yet reached the point of being considered “standard”.

FOR PLOTTING
Plots and StatsPlots → both are standard packages for drawing plots. Note that StatsPlots is actually an extension of Plots: it’s identical to Plots, but it also incorporates other types of graphs that are more common in statistics. Both packages provide a Julia-like syntax for creating plots through various backends (e.g., GR, or PGFPlotsX for LaTeX output).

Makie → From what I’ve been noticing, this is slowly becoming the new standard for drawing graphs. I got used to StatsPlots, but I’d be torn between StatsPlots and Makie these days if I had to learn one from scratch.

1 Like

While I do agree that “no real consensus was reached” (for a strong definition of consensus), I think you and I got different impressions from the discussion in this thread. I think the likes tell a story of the silent majority agreeing with the current behaviour.

majority of existing users

which are self-selected by those not put off entirely by the missing behavior

1 Like

Yes, as such is true for any feature of any language.