Julia losing popularity among Data Science users (KDnuggets Software Poll)

Just to add a different perspective and not intended to step on toes. I come from a biology/genetic background without any formal programming. I primarily started with R and was greatly relieved when I stumbled across Julia a few years ago. I still use R because of collaborations and because of various packages that exist but I prefer Julia. I would argue that R is MUCH harder to deal with when you need to do something that is not “out of the box” and has some computational intensity to it and I am not referring to highly complex problems. There were numerous times where a package existed in R that did almost everything I needed except a few small but significant things. I beat my head against the wall trying to figure out how to wrangle it to do what I wanted but finally figured out that I would need to jump in way over my head with C++ and connecting it to R (i.e. the two language problem). It was impossible (for me) to understand the code when it mixed between R and some other language, let alone how I would go about changing it to do what I needed. In exasperation I decided I needed to learn a few more languages to get it done. I fortunately stumbled across Julia at the time and was amazed at how simple it was to write some basic code that actually ran sufficiently fast. I was able to write code to perform some genetic analyses that ran in a reasonable amount of time and to my surprise I was able to parallelize it with base Julia code. I also found Julia to be as simple to learn as R. As a language, it seems easier than R because it seems to make more sense and is less scattered (only my personal opinion). It may be that multiple dispatch seemed natural because I had not really been exposed to other paradigms in any significant way and R seems to hide (not intentionally) these ideas from basic users like myself.

While the manual is definitely more for individuals with programming experience, it has also been very helpful in helping me understand programming practices and paradigms that I had not been exposed to (without any formal training). The community has been exceptionally helpful and kind as well. Beyond the awesomeness of Julia (i.e. multiple dispatch), I think one of its greatest attributes is that anyone can contribute. While R is open source, most people cannot contribute to significant packages because they need to know multiple languages to do anything useful. In Julia, an inexperienced programmer can work their way through even base Julia code. I am not suggesting that it is a walk in the park to understand everything; I am saying that it is possible and that with just a little exposure one can understand a great deal when it comes to packages that have been written that may pertain to one’s field (i.e. not base Julia).

6 Likes

Also a biologist here, working with other biologists moving things from R to Julia (although more numerical, with less data science).

My take is that base R is harder than Julia but more people know it, and they have forgotten the pain of learning a new language.

I personally learned Julia far faster than R, as knowing Python/Ruby/Haskell/C makes Julia seem pretty obvious, while R is quite objectively a weird, inconsistent language.

I also see other people struggling with R. We have fortnightly workshops in Biosciences school basically so everyone can learn R. Julia will need similar things at some point, and I guess it would eventually be easier for people because the syntax is actually consistent.

@austin-putz it seems pretty common for biologists to not even realise how mediocre and confusing R is, as it’s all they know.

Edit: But I also have to agree that the data science ecosystem is way ahead in R. We still do data/GIS in R, then models in Julia for modularity and speed.

5 Likes

I think that most people complaining about the Julia manual are merely complaining about their own inertia they need to overcome to make a change in their lives and learn something new. The manual really is not that bad, it has been fantastic for years already. Some special parts of it are a bit vague, but all the language essentials should be crystal clear from it… and if not there is a community that helps.

However, I agree that currently Julia is best suited towards people who are willing to dive in and make contributions or write packages, but I think the 1.0 release should settle that issue.

In my opinion, Julia is the perfect language to be working with, since anytime you encounter an issue there exists an entire community of developers open to reviewing code or fixing bugs. You could never get this kind of an experience with a proprietary system, and you will also be learning so much in the process. If you’re not really able to spend time on it to learn, then you probably don’t really have room for anything new in your life, whether it be Julia or something else. Learning a new thing does require some time investment, no matter what it is you are learning to use.

Why is Julia special? It is special because of the paradigms of collaboration and computing it relies on. It should actually be very easy to get started with Julia; one merely has to overcome the inertia that is holding oneself back from reading the manual or starting a new package. Julia actually has very little friction in the way of making progress with goals and tasks. Of course, if people don’t have any free time to spend in the REPL and learn the manual and watch presentations online, then they are not going to be successful at learning the language, since they don’t have time to invest. That’s not an issue with Julia, since so many introduction presentations, blogs, documentation, and community discussions exist; the issue simply is with the individuals who are not able to spend time on it. To those people: just wait until you are ready to switch.

And if it turns out that Julia is really truly not good enough for you, then this is also an opportunity for you to create the package with the functionality you need, which will help you get better at using Julia. If you don’t have room for that in your life, then just wait until you are ready I guess.

4 Likes

Honestly though, @austin-putz makes a good point. The way he’s looking at the language isn’t what one would call “the language”, it’s the statistics stack. Julia has done well with people wanting to do scientific computing because these people are usually trained in C++, HPC, linear algebra, etc., so reading a manual and writing some code is what they do. People who are trained biologists are not looking to write their own packages on fancy new algorithms. It’s perfectly okay that many people want a fancy calculator. Julia can be the fast parallel fancy calculator, but if it doesn’t serve the function of a calculator it will alienate a lot of the not-so-technically-inclined scientists looking to do some data science (which is pretty much anyone who isn’t doing scientific computing or genomics/bioinformatics in my experience).

That explains my position here:

Systems biologists are the kind of biologists who write Gillespie algorithm simulations and solve differential equations. While a growing segment that has embraced Julia quite well, it’s still far from the typical biologist.

I think there is something to be said that the data science and statistics stack is pretty confusing. Do you want to do density-based clustering? Here you go:

http://clusteringjl.readthedocs.io/en/latest/dbscan.html

Oh, there’s no example code? Well there’s some example code for using the package on this page:

http://clusteringjl.readthedocs.io/en/latest/kmeans.html

Oh wait, if you use that on DBSCAN you don’t get the clusters back correctly since result.assignments isn’t what you should be using? The page “Overview — Clustering 0.3.0 documentation” says you should do assignments(result) instead, and that finally pieces together working code. That’s how I learned the package. If you think that the “Julia documentation” is simple, to people who are using Julia to do real data science, this is what you’re saying is simple and refined. It’s not, and for reference it’s not even well-maintained enough to merge PRs to fix these docs. Every biologist these days has to cluster RNA-seq, microarray, etc. data, and so I’m not surprised they don’t like our docs.

Another issue is an ANOVA. Most biologists and social scientists use this as their one and only test taught in their one semester statistics course. And Julia?

We point people over to

Yes, anyone who has taken mathematical statistics and has the appropriate background probably knows that all of those standard statistical tests fall into some regression framework known as generalized linear models and so there is a translation of those common terms to GLMs with a given link function. But seriously, “ANOVA” is a common term for a reason. And many other statistical algorithms are encoded in here as well. But this means that if you Google something as simple as logistic regression, here’s what I get:

https://www.google.com/search?q=julia+logistic+regression&oq=julia+logistic+&aqs=chrome.0.0j69i57j0l2j69i64l2.2079j0j4&sourceid=chrome&ie=UTF-8

No package for logistic regression pops up? Why not take the code from

and make that into a small and well-documented logistic regression package with a GLM dependency for the backend?

There is a big tendency to believe that other people should know what you know. All of the background is always important. Yeah I know. Compiler experts will scoff at how little I know about the compilation process: how could I ever write good code? Other mathematicians think I should study more functional analysis before doing anything more in modeling because without every proof of existence and uniqueness in mind, how could I ever know that my models make sense? Some biologists think I should do more of my own experiments because how could I ever possibly understand the data without running the assay? We’re all on a high horse of our own knowledge. Some people will need to use results from your discipline without knowing the full backstory. Get over it and give them a quick working way to use it.

That doesn’t mean that we need to get rid of GLM and all of Distributions.jl’s detailed glory. That just means that we should have a simple single function for each common method, a tutorial with some code that a user can copy/paste and then swap in their data, and some videos explaining it at a very high level for a statistics 101 student. Then it should be pieced together into one coherent tutorial story, so that way when someone takes Stats101 taught in R, they can know one document which has the 1-1 mapping of what they learned and Julia. We do not have anything like this, and no matter how much work we put into Julia Base we will never have this unless we specifically work on it.
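To make "a simple single function for each common method" concrete, here is a minimal sketch of what such an entry point for one-way ANOVA could look like. It is written in Python with only the standard library, simply to match the other snippets in this thread; it is not the API of GLM.jl or any existing package. The name `one_way_anova` and the sample numbers are made up for illustration. It returns only the F statistic and degrees of freedom, since a p-value would need an F-distribution CDF from a statistics library.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists of numbers, one inner list per group.
    Hypothetical helper for illustration, not an existing library function.
    """
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")

    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    if df_within <= 0 or ss_within == 0:
        raise ValueError("not enough within-group variation to compute F")

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up measurements for three treatment groups; swap in your own data.
f_stat, df_between, df_within = one_way_anova(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The point is the shape of the interface: data in, the familiar statistic out, with the GLM machinery free to live underneath for the people who want it.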

37 Likes

@ChrisRackauckas May I take your point about Anova and exaggerate for effect?
It seems to me like Julia is at a crucial stage here. Julia is a computer language. No-one berates FORTRAN for not having in-built Anova functions. Sure, you join the discussions on the particular library which supplies Anova tests, but these are independent of the language.
Julia is being compared to integrated packages with GUI IDEs such as Matlab. May we discuss how valid those comparisons are?

Secondly my response to the Anova point was “Well - Julia is a language. Go write your own Anova module”.
However that’s not the full story. Julia has a ‘blessed’ set of packages with JuliaPro and then the packages which are in the Julia registry. I think the JuliaPro approach is good - for the systems people out there it stops a Wild West of incompatible modules. All you need ask is - well, does your stuff work with JuliaPro? That’s all we can support.

I digress. I want to say - look at all the fuss we regularly have over Plots. “What do you mean there are multiple plot packages. What do you mean they can be slow”. There’s More Than One Way To Do It. And that is a healthy thing. Julia is a language, not a pre-packaged IDE with Meccano (Tinker-Toy) programming.

Discuss!

I try to never let absurd business-speak rub off on me, but I think this is one instance in which some may be appropriate: I think what we are seeing here is that some people are expecting “a product” while most of us currently using Julia are expecting just “a language with a bunch of code that some dudes wrote”.

When I first started coding basically my introduction consisted of a grad student giving me links to the GNU g++ and gfortran docs, the ROOT docs (which seemed mostly like automated doxygen output at the time with a few tutorials) and saying “go nuts”. Right when I first started it literally would not have even occurred to me to use another library. If I were instead presented with some slick IDE with a huge library of dependencies (e.g. a really nicely setup PyCharm with Conda) I imagine I’d probably have very different ideas about what to expect when I start using a language, and I’d probably think of it a little more as “one package” instead of just “some compilers and stuff”.

I think we need to be honest that Julia is just not in a state where people wanting to see “a product” are going to be satisfied, because that’s really an extremely high bar (I would argue a much higher bar than it takes to be truly productive in most contexts). It’ll be there in a few years, but it needs some time to mature. Hopefully the people who are really dissatisfied today will revisit in a few years and be much happier.

13 Likes

Yes, that’s what I meant by the two camps of people looking at Julia. Those of us who love to evangelize about how Julia has made our lives so much better :grinning:, need to be careful to try to discern the audience, and send the appropriate message.

7 Likes

I have an implementation for multinomial logistic regression (and other models) which is part of a chapter of my dissertation (Econometrics.jl, will be out around 0.7/1.0 release).

I would go even further and say that when people talk about R they often mean the whole social structure around it that makes using it easier for them. Blogs, colleagues with knowledge, etc.

But I think we need to be clear about the difference between the social support, the quality of the stack, and the technical merits of a language. The Julia stats stack does need a lot of improvement (I often still use R for stats+GIS). But it also can’t easily build the mass of blog posts and the embedded social support R and Python have. That’s going to take a whole lot more early adopters and semi-technical users to generate it.

And it’s not always the case that people use R for the stack. I work with people who also use it for numerical simulations. The quality of the stats stack and the social support bleed into their perceptions of the language as a whole. It makes everyone forget the number of quirks they are working around by googling for a solution every half hour, or waiting literally days for results.

3 Likes

I think the Quora article puts things very well

1 Like

R is extremely nice in a pedagogical setting for courses that are either not computationally intensive, or have canned code as R packages — it is very easy to set up and get started with, there is ample documentation/examples/textbooks, and you can just tell the students to install it on their own machine without the licensing hassle.

Consequently, R is a language that most people just encounter at some point in certain fields, and when they have to do simulations, it is naturally the first thing they reach for, only to find that it does not scale. This usually happens after they invest a lot in their code. From a psychological point of view, I can understand why at that point people try to look for a solution within R (maybe if I just vectorize this, rewrite that in C, …), instead of switching to another language, which can be daunting.

1 Like

As someone who teaches (or tries to teach) stats and simple programming to biology students, I’ve really got to emphasize how simple it has to be for them. ANOVA must be a simple function somewhere.

7 Likes

Languages gain massive popularity when they are easy to learn and use. Batteries-included languages are the ones that overtake. Here is “How to create a popular programming language?” to give you some insight.

Here, I quote from the Julia website:

Julia’s Base library, largely written in Julia itself, also integrates mature, best-of-breed open source C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing.

While I understand the reasons behind the decision of moving too central packages out of Base, and hope that this will accelerate the development of the language at the current stage, I guess it will very negatively affect the language adoption after 1.0 to leave such vital packages outside of Base.

The promises of Julia above simply indicate that I shouldn’t say using LinearAlgebra, Random, StatsBase, DSP, FFTW, Distributed, SpecialFunctions, or any core String packages, even if these are distributed within a stdlib. I imagine that, after 1.0, these should join Base again, and external packages with very good reputation to enter stdlib and if frequently used and time tested to even enter Base. I also know that Pkg3 relieves much of that pain, but the REPL is still a very vital component of Julia.

In 5 years or so, I see Julia becoming the de facto standard in many scientific computing areas if we focus more on usability and lower the barriers for new users by extending the documentation with lots of simple examples and a decent plotting package with seamless installation. I say 5 years with full confidence because Julia has the technical merits to be there. MATLAB, despite being a terrible numerical computing package, is extensively used in academia and outside for the very same reasons I mentioned; Julia has all the capabilities to overtake it.

2 Likes

Why? Being in Base is orthogonal to being loaded by default. For example, Base.Test is obviously in Base, but it’s not loaded by default. I think that loading some stdlib packages by default should be done by the default .juliarc, which would be a good way to make it “batteries-included” but make it very easy to turn those defaults off.
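The same distinction can be sketched in Python, purely as an analogy (Julia's Base, stdlib and startup file are different mechanisms): a module can ship with the language without being loaded, and a startup hook can load a chosen set of defaults. The helper names `is_shipped`, `is_loaded` and `load_defaults` are made up for this sketch, and `colorsys` is just an arbitrary standard-library module.

```python
import importlib
import importlib.util
import sys


def is_shipped(name):
    """True if the module can be found on this installation (not necessarily loaded)."""
    return importlib.util.find_spec(name) is not None


def is_loaded(name):
    """True if the module has already been imported into this process."""
    return name in sys.modules


def load_defaults(names):
    """Import a list of modules, the way a startup file might; returns them by name."""
    return {name: importlib.import_module(name) for name in names}


shipped = is_shipped("colorsys")       # it comes with the language
loaded_before = is_loaded("colorsys")  # but it may well not be loaded yet
defaults = load_defaults(["colorsys"])
loaded_after = is_loaded("colorsys")   # now it is

print(shipped, loaded_before, loaded_after)
```

Turning the defaults off is then just a matter of editing the list handed to the startup hook, which is the appeal of doing this in a startup file rather than in the language core.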

5 Likes

As I said, in the REPL you must say using ..., and if you add these to startup.jl there is a non-negligible delay when starting up the REPL. I personally examined this.

I am not convinced that this is true. In general, if we are honest with ourselves, it is very hard to predict which languages become widely popular.

Libraries of course do matter, but there are various trade-offs involved when you consider bundling them with the language core — most importantly, you have to sync the updates, which creates a burden on maintainers and slows down the propagation of improvements.

Consider, for example, DataFrames.jl, which is arguably an important library (eg R has this data structure built-in). Yet in the past year alone, it has seen two minor and numerous patch releases. Bundling this with the Julia distribution would have delayed these.

The Julia solution is to give you the benefits at very little cost — the new Pkg.jl is very fast and convenient, so any library is a few seconds away. You will (soon) get the best of both worlds.

7 Likes

In Python, almost everything is separated out into modules that are not loaded by default. Even just to do something as basic as Julia’s joinpath you have to do

import os
os.path.join(...)

Looking at the Python code I’m working on right now, it starts like this:

import argparse
import glob
import os
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as pp
import tkinter
import tkinter.filedialog
import numpy
import json
import warnings
from datetime import datetime
...
(Then, imports of several of my own modules, which again import equally many modules)

The only language I know where everything is dumped in “Base” is Matlab, and even they now have packages that must be imported.

Edit: Just because I’m on a roll, here are imports from some of my other open files, that do basic stuff, and that I doubt are needed in Julia:

import sys
import subprocess
import math
from enum import Enum

I’m not complaining about Python, but I think this is a strong argument that the statement you made (which I quoted) is inaccurate, or perhaps that you are misunderstanding what “batteries included” means.

6 Likes

“Batteries included” as Shef_Shebl proposes could mean “batteries drained” if you think about a small, simple mobile app written in Julia that wastes energy on linear algebra that is useless for the app’s purpose :stuck_out_tongue: (BTW, energy consumption is probably the biggest problem for Python in the mobile ecosystem.)

But from another point of view, not everything is as easy as Tamas wrote either! It could be fine for a lone wolf scientist to just do Pkg.add or Pkg.clone, but in an industrial setting everything outside stdlib could be prohibited (or difficult to get) for security reasons. Something that is in stdlib is also safer from future compatibility problems.

I can’t really imagine a scenario where packages aren’t allowed for security reasons, but Julia is. It is, after all, a general purpose programming language with an FFI, and depending on it being “secure” in any sense is questionable practice.

Assuming 24/7 network access could indeed be a problem in some cases, but there are ways to install packages offline.

Talk to @anon94023334 about this (although I think he may already be totally talked out on the subject).
Company or government security rules are frequently not subject to any common sense :grinning: