Julia losing popularity among Data Science users (KDnuggets Software Poll)

Honestly though, @austin-putz makes a good point. What he’s looking at isn’t what one would call “the language”; it’s the statistics stack. Julia has done well with people wanting to do scientific computing because these people are usually trained in C++, HPC, linear algebra, etc., so reading a manual and writing some code is what they do. People who are trained as biologists are not looking to write their own packages on fancy new algorithms. It’s perfectly okay that many people want a fancy calculator. Julia can be the fast parallel fancy calculator, but if it doesn’t serve the function of a calculator it will alienate a lot of the not-so-technically-inclined scientists looking to do some data science (which is pretty much anyone who isn’t doing scientific computing or genomics/bioinformatics in my experience).

That explains my position here:

Systems biologists are the kind of biologists who write Gillespie algorithm simulations and solve differential equations. They’re a growing segment that has embraced Julia quite well, but they’re still far from the typical biologist.

I think there is something to be said for the claim that the data science and statistics stack is pretty confusing. Do you want to do density-based clustering? Here you go:

http://clusteringjl.readthedocs.io/en/latest/dbscan.html

Oh, there’s no example code? Well there’s some example code for using the package on this page:

http://clusteringjl.readthedocs.io/en/latest/kmeans.html

Oh wait, if you use that on DBSCAN you don’t get the clusters back correctly, since result.assignments isn’t what you should be using? The Overview page of the Clustering 0.3.0 documentation says you should call assignments(result) instead, and that finally lets you piece together working code. That’s how I learned the package. If you think that the “Julia documentation” is simple, then to people who are using Julia to do real data science, this is what you’re calling simple and refined. It’s not, and for reference it’s not even well-maintained enough to merge PRs that fix these docs. Every biologist these days has to cluster RNA-seq, microarray, etc. data, so I’m not surprised they don’t like our docs.

Another issue is ANOVA. Most biologists and social scientists use this as their one and only test, the one taught in their one-semester statistics course. And Julia?

We point people over to

Yes, anyone who has taken mathematical statistics and has the appropriate background probably knows that all of those standard statistical tests fall into some regression framework known as generalized linear models and so there is a translation of those common terms to GLMs with a given link function. But seriously, “ANOVA” is a common term for a reason. And many other statistical algorithms are encoded in here as well. But this means that if you Google something as simple as logistic regression, here’s what I get:

https://www.google.com/search?q=julia+logistic+regression&oq=julia+logistic+&aqs=chrome.0.0j69i57j0l2j69i64l2.2079j0j4&sourceid=chrome&ie=UTF-8

No package for logistic regression pops up? Why not take the code from

and make that into a small and well-documented logistic regression package with a GLM dependency for the backend?
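To make the idea concrete, here is a minimal sketch of what such a thin wrapper could look like. It is not the code referred to above; it assumes GLM.jl and DataFrames.jl are installed, `logistic_regression` is a hypothetical name chosen for illustration (not an existing package function), and the data is made up.

```julia
using DataFrames, GLM

# Hypothetical one-line wrapper (not part of GLM.jl):
# logistic regression is a binomial GLM with a logit link.
logistic_regression(formula, data) = glm(formula, data, Binomial(), LogitLink())

# Made-up toy data: y tends toward 1 as x grows (deliberately not perfectly separable).
df = DataFrame(x = collect(1.0:10.0),
               y = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

model = logistic_regression(@formula(y ~ x), df)

# Intercept and slope on the log-odds scale.
println(coef(model))

# Predicted probabilities for new x values.
println(predict(model, DataFrame(x = [2.5, 8.0])))
```

The point of a wrapper like this is only discoverability: someone searching for “logistic regression” gets a function with that name and a copy/paste example, while GLM still does all the work underneath.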

There is a big tendency to believe that other people should know what you know. All of the background is always important. Yeah I know. Compiler experts will scoff at how little I know about the compilation process: how could I ever write good code? Other mathematicians think I should study more functional analysis before doing anything more in modeling because without every proof of existence and uniqueness in mind, how could I ever know that my models make sense? Some biologists think I should do more of my own experiments because how could I ever possibly understand the data without running the assay? We’re all on a high horse of our own knowledge. Some people will need to use results from your discipline without knowing the full backstory. Get over it and give them a quick working way to use it.

That doesn’t mean that we need to get rid of GLM and all of Distributions.jl’s detailed glory. It just means that we should have a simple single function for each common method, a tutorial with some code that a user can copy/paste and then swap in their data, and some videos explaining it at a very high level for a statistics 101 student. Then it should all be pieced together into one coherent tutorial story, so that when someone takes Stats101 taught in R, they can go to one document that has the 1-1 mapping between what they learned and Julia. We do not have anything like this, and no matter how much work we put into Julia Base we will never have this unless we specifically work on it.
