Julia losing popularity among Data Science users (KDnuggets Software Poll)


#1

https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html


#2

Given that it was only 1.2% to begin with, I suspect the result is prone to large statistical fluctuations. It’s also a much younger language than most of the others listed here, especially Python, which seems to be increasing its dominance.

I can tell you from experience that Julia is pretty unpopular in data science, but I suspect there’s still plenty of opportunity to change that.


#3

To put things into perspective, there have only been 14 votes for “I have used Julia in the last year for a real project”.
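For a rough sense of how noisy a count of 14 is, here is a minimal sketch of a Wilson score interval for the poll share. It assumes the 14 votes are out of the 2052 respondents reported for 2018 elsewhere in this thread, and it treats the poll as a simple random sample, which the thread itself doubts; the function name `wilson_interval` is made up for illustration.

```python
import math

def wilson_interval(votes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = votes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, center - half, center + half

# Assumption: 14 Julia votes out of 2052 respondents (both figures quoted in this thread).
share, low, high = wilson_interval(14, 2052)
print(f"share = {share:.2%}, interval = [{low:.2%}, {high:.2%}]")
```

The interval only captures sampling noise. It says nothing about self-selection or vote campaigning, which are likely the bigger problems with a web poll.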


#4

IMO this is, at least partly, classic clickbait: make a non-representative small sample poll, add some text on how some languages/toolkits are “losing” and others are “winning”, and it will be shared and discussed widely by both categories.


#5

‘losing popularity’ implies that Julia previously enjoyed popularity among data science users, but I don’t think it has ever occupied more than a small niche in data science.

For the majority of data scientists, I think this has made sense – the language has not reached stability. That will be changing soon, making Julia attractive to a broader community.


#6

Clickbait? I’ve been following discussions in this discourse pool. This is the kind of reply that makes this community look unfriendly to me.


#7

Generating weakly implemented polls about the popularity of various things is a common technique to generate traffic for sites. I am sorry if I offended you, but I don’t think that pointing this out can be construed as unfriendly. Also, I don’t think you are responsible for or affiliated with this site, so my comment was not about what you did.


#8

This post was flagged by the community and is temporarily hidden.


#9

Without considering the fact that this could be something without any statistical implications, I have to point out some facts:

  1. Python’s design dates back to the late 1980s, hence it is way older than Julia, which was first released in 2012.
  2. I have been using Julia since v0.2-beta (or something like that). From my humble point of view, the Julia language really became mature enough for starting small projects in v0.4 (why? because it was when I stopped compiling master to use in my projects :D). AFAIK, v0.4 was launched in 2015.
  3. In Brazil, we are starting to see quite good adoption given the language’s age. My wife was in a workshop where there was a special session called: “Tutorial: Julia language for Geophysics scientists.” Yesterday, I saw a full programming course (scientific programming) in another Brazilian university that was entirely prepared in Julia.
  4. We are starting to have toolboxes / packages that are unique to the Julia language. For example, I have been using differential equation solvers for my entire academic life. I have not yet seen an open-source package that can do things as easily as we have in DifferentialEquations.jl.

Hence, given all those facts, it is amazing that Julia is even mentioned this early. We are talking about a new language that has been production-ready for less than 3 years (again, from my point of view). Conclusion: the future is bright for Julia, believe me :wink:

EDIT:

  1. The students in my Rigid Body Dynamics course will receive 1 extra point if the final project is written in Julia. Hence, we have 7 new Julians :smile:
  2. I will do my part and offer a Julia course at my institution this year.

#10

I think the data this year might have been skewed by RapidMiner actually campaigning for votes.
Next time they do this, maybe a “get out the vote” campaign on Discourse, Slack and Gitter would skew it in our favor.

Also, when I’ve been at #ODSC (Open Data Science Conferences), I’ve talked to many people who were interested in learning about Julia as soon as v1.0 was released. So I expect a lot of people to start taking a new look at Julia in August, and I just hope that things are in a good state by then for people looking seriously at Julia for the first time. (That’s not a criticism; it’s just a matter of Pkg3 getting stable, everybody pitching in over the next two months to make sure all the packages that are still being maintained are updated for v1.0, and maybe getting a triaged list of packages that are well-tested on v1.0.)

Can you expand :grinning: on that, please?
What reasons have people given? Are these things that can be addressed easily (or are in the process of being addressed already)?
If I were to hazard a guess, I’d say 1) database access, 2) better handling of non-UTF-8 data from databases and CSV files, and 3) better parallel programming support (I’m looking forward to the PARTR stuff, after seeing the talk at the C.A.J.U.N. Meetup, see the video of it at: https://www.youtube.com/watch?v=YdiZa0Y3F3c).


#11

The equilibrium of these games is everyone focusing effort on campaigning, and these efforts more or less cancelling out, with a lot of effort spent on the whole thing as deadweight loss. I would be sad to see “please click here to inflate our votes” messages on forums I visit.

Are these results actually important for anything, besides generating visitors to sites?


#12

Yeah, I second that. Obsessing over these things is not going to help anyone.

Well, here’s my two cents on this. “Data science” is a recently made up term. As far as I can tell, the vast majority of us working in the field aren’t actually specialized in anything related to our jobs (although certainly some are, e.g. NLP people), though we may be very highly specialized in terms of our educational backgrounds. In my experience this has led to wildly divergent opinions on even practical matters such as tooling.

At the risk of over-generalizing, I think there is a prevailing attitude in data science that anything that Python isn’t “good enough” for is not worth doing, and that you should mostly write scripts and not worry too much about re-using code. If that’s your view, I can see why Python seems like the ultimate tool: it has a staggering amount of pre-existing code available for it, and it’s a very fine scripting tool. I have to admit that, because of the way I learned programming and computing, those attitudes are extremely frustrating to me, but I also have to admit that they are perfectly valid in the majority of data science roles, and that people with this attitude are getting a huge amount of valuable work done without caring one bit about what I think about how they’re doing it.

As data scientists we are also frequently asked to work on data that is presented to us in some truly nightmarish formats, and we are constantly having to deal with awful things like CSVs and SQL (awful mostly because it’s actually about 100 different things masquerading as 1 thing). Therefore, there is (very rightfully) also a huge emphasis on tools for doing things like querying databases, and I suspect that probably a majority of data scientists when approaching Julia would be primarily interested in situations like what @Liso described above, where it was very correctly pointed out that we’ve sort of abandoned the idea of having a universal database API for Julia. This is a real issue. Database support is hugely important (especially if you’re a data scientist but also if you’re not) and it really kind of sucks in Julia right now. We might have some great support for some specific databases, and they’ve done a great job on ODBC.jl, but frankly there’s nothing as clean, simple and easy as sqlalchemy in Python, which must have taken a monumental amount of work to get it into its current state. If you are new to Julia and don’t know where to look for things like ODBC.jl, LibPQ.jl, JDBC.jl or MySQL.jl, things look much worse than they really are. There are also some people who would come in, perhaps learn about the packages I’ve mentioned, but only see that there is no sqlalchemy equivalent and immediately dismiss the whole language.

My counter to all that is basically that it gets the priorities completely backwards thanks to the existence of things like PyCall and JavaCall. Yes, I need database support, but pulling data from a database isn’t really that complicated. If I have to do it through PyCall, and it’s a little slow, I don’t really care. You know what sometimes is that complicated? MILPs with millions of variables, stochastic constrained optimization problems, POMDPs, solving stochastic differential equations. When I first started using Julia, I had recently gotten really aggravated with the misery of trying to do large MILPs in Python. Everybody around me was using PuLP. It was really, really slow and even uglier than it was slow. Then I wrote up a problem in JuMP. It was this tiny little thing that fit on one screen, and it looked almost exactly like it did when I wrote it out algebraically in LaTeX. It took zero effort to convince the guy I work with, who had been working on these things for years longer than I have, that we needed to move everything over to JuMP. Now I’m getting ready to try a more general version of those problems that will require a stochastic method like simulated annealing (possibly using Hamiltonian updates like in @Tamas_Papp’s package?) and this would have been impossible in Python, because the updates are going to require a huge amount of custom code. In fact, the engineering group at my company attempted something like this once in Python using canned tools and, after being happy with a toy problem, wound up completely abandoning it because they couldn’t get it to work on real problems. Why? It was too slow, and too hard to modify. If I can’t get it to work in Julia, I can be confident that my tools aren’t the problem.

Lastly, as I’ve already talked about extensively elsewhere, some of the simple stuff in Julia is really so nice, and it can be so hard to convince people of that if it’s built into the core of their being that they should never write a million iteration loop. I don’t want to do everything with some complicated API, let me just write simple code using Base. Can’t think of a way to do something that’s not all database operations? Fine! Just write some code, use a Dict, use a Vector, create your own custom struct and put it in a million element array, just do whatever you want, that’s how writing code works. I’ve recently had a data manipulation task that started out very simple and, lo and behold, it turned out I had to do a whole bunch of stuff with quadratic time complexity that would have taken an hour in Python, or I would have had to go hunting for the right package to do it (if that were even possible). In Julia it was all really simple stuff. I use mostly Base for the vast majority of what I do, just like in Python you in principle could use the stdlib to do most of what you do, but you don’t because it’s too slow. I lose numpy the second I want to put a Python object into it. Python’s an OO language; it’s supposed to be all about writing objects, so then what good is numpy? (Well, lots of good, just not for Python objects.)

Nope, it’s never been an issue (perhaps surprisingly). So yay for that :smile:

The reason I’m so passionate about this that I wrote this huge rant on a Sunday afternoon, and why I am “scared” of Python, is that in a way Julia and I are in very similar situations in our careers in data science. We grew up doing physics and we want to do something interesting enough that it’s more than just feature selection and canned solutions, and if we can’t find that, there is no reason for us to be there. When that inevitably occurs, we will have to look elsewhere for jobs. I may not find one, but fortunately, Julia has several already.


#13

3 posts were split to a new topic: Re: moderator action on “Julia losing popularity … (KDnuggets Software Poll)”


#14

That website looks like crap.

Maybe it should be losing popularity…


#15

You’re lucky, I’ve seen issues on GitHub, posts on StackOverflow, questions on Gitter, etc. where people have run into such problems, fairly frequently. Often the people don’t even realize that’s the problem, they just think the data is corrupted.


#16

Maybe it does:

2018: 2,052 participants
2017: about 2,900 voters
2016: 2,895 voters

But that doesn’t mean we should be satisfied!


#17

I’ve replied there now. I do not generally browse Stack Overflow, but I do always respond to tags here on Discourse.

There was not very much interest in maintaining a generic database interface. For my work, it made much more sense to pick an interface (DataStreams.jl) which already supports many types of data output.

If anyone wished, it would be easy to create a database interface and support PostgreSQL using LibPQ.jl. It’s intentionally very easy for someone to do so. But unless it is useful for my work, I cannot devote time to building it from the ground up.

It’s also worth noting that the database interfaces which had been written were nothing more than sets of methods for connection, query, fetch, etc. One still needed to write custom connection and SQL for each database.


#18

DBAPI is more than a set of methods because it has to be a generally accepted API.

I am afraid that the version of DBAPI which we have now is obsolete because Nullable, as I understand it, is supported in a different way now.

So it needs more work. I understand that it is not your personal goal.

Your position is fully understandable. There are more people in the same situation. We need to work with data. We could glue in some C/C++ library, but working on API normalization is too much.


#19

Ah yes, what I meant was that it was an interface for packages to implement, but wasn’t a framework itself like SQLAlchemy is. It was more like Python’s DBAPI 2.0 + a shared namespace.
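For anyone who hasn’t used it, this is roughly what “a shared set of methods” looks like on the Python side. It is a minimal sketch using the standard library’s `sqlite3` driver, which follows DB-API 2.0 (connect, cursor, execute, fetchall). The helpers `fetch_rows` and `make_demo_connection` and the `langs` table are made up for illustration, not part of any library.

```python
import sqlite3

def fetch_rows(connect, query, params=()):
    """Run one query through any DB-API 2.0 style driver.

    `connect` is a zero-argument callable returning an open connection
    (a hypothetical convention for this sketch, not part of the spec).
    """
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        conn.close()

def make_demo_connection():
    # In-memory SQLite database with a tiny illustrative table.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE langs (name TEXT, year INTEGER)")
    cur.executemany("INSERT INTO langs VALUES (?, ?)",
                    [("Julia", 2012), ("Python", 1991)])
    conn.commit()
    return conn

rows = fetch_rows(make_demo_connection,
                  "SELECT name FROM langs WHERE year > ?", (2000,))
print(rows)
```

Note that `fetch_rows` never mentions SQLite, but the connection setup and the `?` placeholder style are still driver-specific, which is the “custom connection and SQL for each database” part.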


#20

I agree there is value to having a uniform database API. It just makes things a lot easier for the person who is using it and has to interact with a hundred different types of SQL databases. My solution for the time being is JDBC.jl, though I’m hoping that over time there will be less and less need for me to use anything other than Postgres and I’ll just use LibPQ.

It’s definitely defunct, the maintainers told me so when I overhauled JDBC.jl (which no longer uses it). I might be willing to undertake writing a uniform interface (which I think is really the most valuable aspect of sqlalchemy), but I’ll only attempt it if I know for sure that everyone is on board (i.e. at minimum ODBC, LibPQ, MySQL, SQLite, I can handle JDBC).