Please recommend a Julia ecosystem for Statistics

I am thinking about using Julia for statistics; I am used to R with the tidyverse. So I thought I would check out Julia and the packages that look most like R and the tidyverse, and maybe migrate. For now, as I play around with it, I use DataFrames, DataFramesMeta, Gadfly, CSV, Statistics (standard package), and RDatasets. (I also tried JLD for saving workspaces, but somehow JLD did not install correctly, so I left it alone for now. I also tried Cairo and Fontconfig, but somehow it looks like they interfere with Gadfly.plot().)

What is the combination of Julia and packages that most resembles R and tidyverse, or is there a better recommendation?

I read about Plots.jl and thought about using it with GR or PlotlyJS (and maybe even with UnicodePlots), but please advise me. The idea of having Plots.jl as a master package that controls other packages sounds very good to me, but two questions come to mind:
(1) can you control Gadfly with it?
(2) is it stable?

Maybe even forget about Gadfly?

Are you aware of Queryverse.jl?

I think DataFrames and DataFramesMeta together give a very good user experience, and things map very well to R. I use that combo when I do data analysis in Julia.

Feather.jl is a nice package for saving and loading datasets, plus you get great interoperability with R for free.

Queryverse, crstnbr, is something I did come across, I think, but somehow I was drawn to the packages I mentioned earlier. It looks interesting. What exactly does Queryverse pull in? I suppose you would say that Queryverse is the most direct equivalent to R and tidyverse? Is it?

I like DataFrames and DataFramesMeta too, genauguy, and have gotten used to it in the short time I’ve been using Julia. Feather sounds interesting. What’s the catch?

I think I am drowning in options. Someone like me in the case of Julia shouldn’t have all these options. I need one good, one best, option that defeats all other options, if there are other options. I want to get cracking. So what is that option, and is there any good recent book to go with it?

EDIT
Maybe I should also ask what will most likely be the most common Julia ecosystem in the future for these kinds of things. If you look at the way things are, is it reasonable to think that Queryverse is going to be it?

I doubt things will settle on a single ecosystem of packages, since different people have different needs, but surely Queryverse is a safe bet in the long-term for a full-featured, very interoperable data science ecosystem.

That’s almost impossible to know. The Tidyverse became the de facto standard in R because Hadley is great at promoting his packages, wrote blogs, books, etc., and created an ecosystem that made sense. Julia’s ecosystem is still young and a lot of things are in flux, so you should try things out and see what you like. And, ideally, help improve packages or documentation with PRs. For graphs I’ve used Plots, Makie, Vega-Lite, Gadfly and others, depending on the situation. I really like the Distributions package and hope a PR to use keywords gets merged. I like GLM and MixedModels. And I haven’t touched R for a while now. If you want to make the jump, commit to it and don’t try to replicate in Julia what you had in R, because that won’t work.

Unfortunately, as others have said, this is unlikely. Or rather, it is unlikely that what is best for you will be best for everyone. Lots of people love the queryverse, and it does a ton of things well. I used it for a little while, but it doesn’t quite mesh with the way my brain works, so it’s not best for me.

I actually use mostly the methods built into DataFrames rather than using Query or DataFramesMeta. For statistics, I’d look at the JuliaStats organization; most of the packages in there are likely to be well supported, though it depends on your needs. And Plots.jl continues to be awesome, though a lot of the new development is going into Makie (I still use Plots for everything). To answer one of your questions, I think Gadfly used to be supported as one of the back ends, but I don’t think that’s true anymore.
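For anyone coming from dplyr, here is a rough sketch of what using only the built-in DataFrames methods can look like. The data and column names are made up for illustration, and the dplyr comparisons in the comments are approximate analogies, not exact equivalents.

```julia
using DataFrames, Statistics

# Toy data, invented for this example.
df = DataFrame(group = ["a", "a", "b", "b"], x = [1.0, 2.0, 3.0, 5.0])

# Roughly dplyr's filter(x > 1): keep rows where x exceeds 1.
big = filter(:x => >(1), df)

# Roughly dplyr's mutate(y = 2 * x): add a column computed row by row.
with_y = transform(df, :x => ByRow(v -> 2v) => :y)

# Roughly group_by(group) followed by summarise(mean_x = mean(x)).
by_group = combine(groupby(df, :group), :x => mean => :mean_x)
```

DataFramesMeta wraps the same operations in macros, so trying the plain functions first does not lock you out of the macro style later.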

One of the things that’s great in this community is that there are lots of things being explored. And, where possible, people put in a bunch of work to make things compatible. So Tables.jl, for example, tries to provide a common interface that lets you move between different table representations, so you’re never locked into one way of doing something.
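A small sketch of what that interoperability can look like in practice, assuming both DataFrames and Tables are installed in the active environment (the data is invented for illustration):

```julia
using DataFrames, Tables

# A row table: a plain Vector of NamedTuples, no DataFrames involved yet.
rows = [(id = 1, score = 2.5), (id = 2, score = 4.0)]

# DataFrame accepts a Tables.jl-compatible source such as this one.
df = DataFrame(rows)

# Convert to a column table: a NamedTuple of column vectors.
cols = Tables.columntable(df)

# And back to a row table (a Vector of NamedTuples).
back = Tables.rowtable(df)
```

The same idea is what lets packages such as CSV and Query hand tables to each other without knowing about each other's concrete types.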

Yes, it’s a bit harder as a new user to know what’s going to work best for you personally, but keep coming here and asking questions. There are plenty of people interested in being helpful :wink:

What I wrote about the best option was not meant as a personal thing. It was meant as: “Look, there are things out there and they have real properties. There are also what you might, for the moment, call ideals, and how near something comes to that ideal or those ideals does not depend on human subjectivity. There is also what you might describe as ‘Simply put, and according to how this particular package works, this will give you the greatest results with the least effort and covers a whole lot,’ or ‘This particular combination is all you need to get cracking, and to get cracking quickly.’” I meant it in the sense of things being objective. This also removes any idea of wanting things the R way or shoehorning things.

I don’t necessarily want to have things the R way. After all, in this case I’ve used and am using Julia. But if its packages were more like R’s, not that they have to be, that might help speed things up. The above paragraph is pretty much what my question comes down to, at least for how things are right now (not the matter of which ecosystem for statistics will likely become common, like tidyverse is common, which is another question about prediction).

So then, according to what I just wrote and explained, what is the best option right now for a new Julia end user, disregarding personal preferences, needs and all that subjective stuff? What will give a person the most meat? If I go to a restaurant, I will look at the menu. I will ask the waiter, “Waiter, please, using your knowledge, think and advise me. What is the best option on the menu for someone new here? What will get me the most in this situation right now, seeing that we know about this factor here, and that factor there?”

All of these criteria are subjective to a certain extent.

Perhaps you missed the explanations above: the ecosystem is in the process of maturing, and there may not be a “best” solution at the moment. In some cases, you are lucky if there is a packaged and well-maintained solution for something.

Also, “statistics” is a very wide field, and related software ranges from data wrangling (data cleaning, database management, large data, handling external formats) to estimation (frequentist and variations, ML, Bayesian, nonparametric) and descriptive stats.

As @kevbonham suggested, it is best to start from a concrete problem you need help with.

Plots.jl is indeed stable.
Gadfly is not one of the back-ends.
It really comes down to preference for most things.
Plots is less grammar-of-graphics than Gadfly, which some like and some don’t.

Coming from R too, I did the opposite: I just threw away what I knew and what I was expecting, and started fresh with Julia at version 0.3 back in the day.
My Julia experience was so refreshing because I was in fact suffering with R. My programming experience in general is much broader (C, C++, C#, Perl, Java, …), but in data science R is some kind of standard (actually, R with C packages is some kind of standard), and I was forced to use R.

I did not throw away R, but Julia gave me an option whenever I need or like to use it (which is “always” for new tasks).

But I recommend: don’t stick with R thinking, start new.

I actually understood. (I was actually sometimes wondering if someone did not understand me.) The question still stands, of course. I know very well that the packages are maturing. That is not the issue. Obviously there is something such as the best, which you can ask about almost anything, and obviously you can ask what will give the greatest results with the least effort (taken as it is, if you know what I mean), regardless of your personal issues. If there are 10 things of some sort, and they all have verily existing qualities, then you can rightly ask such questions. I suppose I’m a serious objectivist and realist :stuck_out_tongue: Maybe there are at the same time (personal) relative considerations, but my question already rules that out.

I don’t think this has anything to do with being an objectivist, realist, or fantasist - as others in this thread have attempted to point out, different people have different needs, and “statistics” is a very large and diverse field, served by tons of Julia packages.

It is for these reasons that asking after a “best” package or ecosystem does not make sense - it will be entirely dependent on what you are trying to do.

oxinabox, I was looking at the Plots.jl GitHub page and there seemed to be a lot of issues open. I didn’t check them all, but seeing so many open issues of course alerted me and influenced my considerations.

I wanted to try Plots with UnicodePlots (just for fun), and GR was automatically installed. I thought I would check out GR, but when the standalone window opened to plot a graph, a VIRUS was detected and removed (from memory, it was the executable file gksqt.exe, which also tried to establish an internet connection)! It also looked as though my Julia installation was immediately broken! Julia.exe was suddenly not in its original location anymore, probably removed by the virus checker. I took this seriously and was gravely alarmed. It was probably a false alarm and no big deal, and I had to deal with it somehow… Well, come to think of it, it is a big deal :stuck_out_tongue:

oheil, no problem about the Julia way. I actually prefer it, but, as mentioned, a more R-ish feel in its packages might help me get cracking more quickly. As a language, Julia in general seems greatly superior to R in different ways, which is why I prefer it.

I have decided to check out Queryverse for now. I probably won’t use all of its packages. Is it possible to remove packages I don’t care for at the moment without breaking Queryverse? (For example, I saw PyCall and Conda, but for now I have no use for them. At the moment they are a waste of space to me.)

Or, possibly, that you are not understanding what multiple people are trying to tell you here.

Just a thought :wink:

Yes, it has to do with it, because my initial question, along with its other elements, was posed that way, and I am the OP, and some seem to be drifting away from that question.

I’m sorry that you didn’t get the answers you’re looking for, but it looks to me like the consensus amongst experienced users is that this is because the question doesn’t have an answer.

It’s a bit like asking “what is the best colour” - you can of course specify that people aren’t allowed to answer “it depends on your objective”, but that won’t change the fact that there is no objectively “best” colour.

I’d like to remind you and everyone else reading this thread, that 9 times out of 10, “antivirus software” is absolute garbage and should not be relied on to actually tell safe software from malicious software.

Depending on what you are working on, you might not be able to abandon R completely. For data cleaning and descriptive statistics I use DataFrames and DataFramesMeta, and I save in JLD2 or RDS depending on the purpose of the data. For example, I cannot find an equivalent of R’s pivot_longer or Stata’s reshape long, or a way to describe the survey design of a dataset as with the survey library in R or svyset in Stata.

RCall works seamlessly for any function that you want to use but cannot find the equivalent in Julia. For example:

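using DataFrames, RCall

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
@rput df              # copy df into the R session

# Run R code; `library(...)` and `function(df)` are placeholders for the
# R package and function you actually need.
R"""
library(...)
x <- function(df)
"""

@rget x               # copy the R object x back into Julia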

For plotting I use Plots to get an idea of the data and then the excellent PGFPlotsX for plots that I use in LaTeX documents.

In general, I try to stick to Julia for the most part and as the ecosystem matures I try to rely less on other languages. It works for me because data analysis and statistics is a small part of my workflow which is done entirely in Julia.

Have you looked at https://juliadata.github.io/DataFrames.jl/stable/man/reshaping_and_pivoting.html? In DataFrames.jl, unstack does long to wide, and stack does wide to long.
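A minimal sketch of both directions, using a made-up table (the column names are purely illustrative; treat the comparison to pivot_longer and pivot_wider as a loose analogy):

```julia
using DataFrames

# Toy wide table, invented for this example.
wide = DataFrame(id = [1, 2], height = [1.7, 1.8], weight = [65.0, 80.0])

# Wide to long (roughly R's pivot_longer): one row per id/variable pair.
# The result has the columns id, variable, and value.
long = stack(wide, [:height, :weight])

# Long to wide (roughly R's pivot_wider): back to one column per variable.
wide2 = unstack(long, :id, :variable, :value)
```

Note that the columns produced by unstack allow missing values, since not every id/variable combination has to be present in the long table.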