Nice work! I have been thinking about automating some kind of a trivial unit test (“things ran OK”) for such things, so that I update them when the language moves on.
That’s a good idea. For some of those posts I actually do not have a smart way to rerun them to check they work on Julia 0.7, so I’ll just hope such basic things do not change too much. Let me know if you find a smart way to implement unit tests for blog posts!
@xiaodai good point: I’ll edit the post (and possibly allow comments so that it’s easier to give feedback). Concerning the size of the dataset, the one I’m using is not super large, but also not trivial: 21 columns, over 200000 rows of different types with missing data.
“trivial” in the context that we have 3 GHz CPUs and 16 GB of RAM. Of course those things aren’t trivial but they are “trivial”.
I am not sure I understand what kind of tutorial you are asking for then. By that definition, “nontrivial” is something that does not have an established workflow for hardware that most people have access to, so it requires custom tools or tricks. Given that even the basic tools (eg DataFrames.jl) are under development and need some time to mature to a stable API, I am not sure that “nontrivial” stuff belongs in introductory tutorials.
Now I understand what you’re getting at: I’ve tried to only use performant solutions in the tutorial (so that they would scale for users with larger databases) except two examples where I state it explicitly:
- scatter plot of one column against another (this one just doesn’t make a lot of sense for very big data)
- an example of how not to do something (sorting a whole chunk just to get the two largest elements)
If you find obvious performance traps please signal them in the comment section of the blog posts so that I can fix them!
I didn’t sacrifice simplicity for performance (maybe you can group the data faster if you write your own super optimized sorting algorithm, but I wouldn’t recommend that to a beginner) but I agree that, when writing a tutorial, we should avoid solutions that don’t scale.
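The “how not to do it” example above (sorting a whole chunk just to get the two largest elements) can be sketched in a few lines. This is only an illustration, not the code from the blog post: the `scores` vector is made up, and it assumes Julia 1.0 or later, where `partialsort` is available in Base (on Julia 0.6 the equivalent function was called `select`).

```julia
# Hypothetical data standing in for one chunk of a column.
scores = [3.2, 9.1, 0.4, 7.7, 5.0]

# What not to do: sort the whole chunk, then keep only two elements.
top2_full = sort(scores, rev=true)[1:2]

# Cheaper alternative: only the first two positions of the
# descending order are guaranteed to be put in place.
top2_partial = partialsort(scores, 1:2, rev=true)

# Both approaches agree on the result.
@assert top2_full == top2_partial
println(top2_partial)
```

On a handful of values the difference is invisible; it only starts to matter when each chunk is large or there are many chunks.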
I just saw this and figured I’d comment as an outsider to the Julia community (many of you seem like regular Julia users). I completely agree with Mason’s original post. I’d say that 90% of my time is in R and the other 10% in bash scripts, so excuse my ignorance as a programmer.
Coming from R I even find the basic docs hard to understand (https://docs.julialang.org/en/stable/manual/introduction/). Especially when trying to learn how to use functions (can’t find the example I’m thinking of). There doesn’t seem to be much common ground for this. R is getting a lot more confusing for new students because of Hadley’s unquoted variables when inputting variables, but for the most part it’s very standard. For instance, I’ve read a lot of your docs for Julia and I was still trying to use double quotes to read in a data frame separator (instead of single quotes). Although you guys have a lot of documentation, simple things like this need to be explained instead of talking about the sweet JIT compiler for 5 pages. New users (from R) probably don’t care about why it’s fast.
You guys commented a lot about how Mason mentioned macros. I don’t think we even use or have these in R (?). And in the docs I see stuff like “Lisp inspired macros”. We don’t know or care what this means. And that seems to be the end of the explanation. We don’t know what Lisp is or why you guys care so much about macros. I use some like @time and @parallel, but like some of you said, these are a more advanced feature and probably not for beginners. In general, because you guys use stuff from so many different languages, I’ve found it hard to learn Julia. Instead of learning one new language it seems like I have to go back and learn all these different languages. I see stuff in data frames about split-apply-combine, which I think comes from Hadley’s stuff, but if you come from SAS or something, this doesn’t make sense. Just more explanation in general would be nice. Coming from R I know exactly what you mean without explanation, but it doesn’t help new users without this background.
Also, the error messages in Julia are more of what I get when I try C++ or Fortran (cryptic, although I’m not experienced in either). Not like in R where they usually make a little sense. This is scary for new programmers from R.
The biggest thing that has deterred me from using Julia is the package manager. I’ve had a lot of problems loading and using packages in the past (maybe not currently as I don’t use a lot). Not understanding I need to load the sub-packages in some cases (I don’t even know if I’m using the right terms here). I quit Python long ago after I couldn’t install a specific package for animal breeding (pedigrees). I guess you guys would call this dependency hell. It sounds like you guys are making strides at this, but not there yet. This will immediately deter new people like myself. We need the performance for genotype files (that can easily get into fairly large data, more than the RAM we have on a laptop). So there is a lot of need in the scientific community for it to be easy to learn as we don’t have 5 years to learn a new language and get it to load certain packages. Many people like me are probably refusing to update to the latest versions because we’re so worried that packages will break, our code won’t work, and we’ll have to use R or Python to fix it before figuring out why Julia broke again. I know part of this is the growing pains of a new language before version 1.0.
One big complaint I’ve had is conversions. I cannot figure out Symbols and why I can’t just put my column names into a character vector array or something. I tried for about a day and gave up and quit Julia. A few hundred google searches later and many manuals couldn’t explain what was going on. I’m sure I’m not the only one. Even to convert a 2-dimensional array that only had 1 column to a 1-dimensional array to plot with a package took me hours and hours and then finally an email to a developer who then got upset I emailed him directly. R doesn’t have these things (that seem non-sensical to us, but probably make a lot of sense to you guys). Seems like I can convert about anything to anything in R which makes things easy for us. I understand a lot has to do with the performance you guys want, but at least make some docs on how to convert stuff. Float to integer or Array to data frame type of stuff. This has annoyed me a lot about Julia. I’ve tried to use convert() so many times with errors I don’t even try it now. Have never got it to work except on simple examples.
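For readers who land here with the same questions, the conversions mentioned above can be sketched with functions from Base alone. This is a minimal illustration assuming Julia 1.0 or later; the data and the column names are made up, and no data frame package is involved.

```julia
# A 2-dimensional array that has only one column (a 5×1 Matrix) ...
A = reshape([1.0, 2.0, 3.0, 4.0, 5.0], 5, 1)
# ... turned into a 1-dimensional array (a Vector) for plotting.
v = vec(A)
@assert size(A) == (5, 1)
@assert size(v) == (5,)

# Float to integer: `convert` only works when no information is lost,
# so say how the value should be rounded instead.
@assert convert(Int, 2.0) == 2     # fine, 2.0 is a whole number
@assert round(Int, 2.7) == 3       # round to nearest
@assert trunc(Int, 2.7) == 2       # drop the fractional part
# convert(Int, 2.7) throws an InexactError rather than guessing.

# Column names: strings to Symbols and back (hypothetical names).
names_as_strings = ["id", "weight"]
names_as_symbols = Symbol.(names_as_strings)
@assert names_as_symbols == [:id, :weight]
@assert String.(names_as_symbols) == names_as_strings
```

The dot in `Symbol.(…)` and `String.(…)` applies the function to every element, which is roughly what R does implicitly for vectors.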
I’d suggest an R to Julia manual (I think someone commented they have one). Just like the SAS to R book that is quite good at this.
I agree with whoever stated that the code gets out of date fast. You guys are developing so fast that everything I see on Stack Overflow or wherever is so far out of date it hardly does me any good. I’ve basically stopped looking at it unless it’s within the last few months because the examples never work. So either have someone delete these posts or update them with a new answer from the newest Julia version.
A new IDE like RStudio would be nice. Atom is horrible and I stopped using it long ago. Especially the tab completion in Atom is awful and keeps annoying me. New users would like a better/easier interface than Atom.
All that being said I see a ton of potential with Julia, I apologize if I sound negative (trying to just be constructive). Many of you are doing some great things. But if it doesn’t get easier you won’t convert many R programmers like me full time. Probably will get a few Python users that are more programmers than R users. I only use Julia now to process genotype files and I use the basics (Arrays) because everything changes so fast you can’t rely on Julia packages yet (I’ve found). I know if I use R it will work, even if it’s slow…
I think that the main issue here is that the manual isn’t good introductory material. It’s a manual: it’s a reference to everything. It has tended to work as introductory material for those who already understand programming (because they know what to search for). Since the manual is the language reference, it cannot skip over things to simplify the material since sooner or later it does have to go over everything. Also, since it’s about the language and not about using the language, it doesn’t use essential packages which most users would make use of.
We really need a more tutorial-driven, from-scratch introductory source that is able to skip over the tough details and utilize packages.
Macros are quite common in SAS. AFAIK, Julia gets this macro idea from Lisp. I am also a heavy R user, but I like using macros in Julia.
Ross Ihaka, one of the R founders, used to think that R is not good enough for the statistics community and chose to work on a new statistical programming language based on Lisp (I have no idea what his progress is). See his presentation in 2008:
My point is: as an R user, let us embrace the macros in Julia.
There is Noteworthy differences from R in the manual (the very first thing is the single vs double quote thing), but it doesn’t seem to get much Google love, unfortunately. It also isn’t that clear: perhaps side-by-side code samples would be better.
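As a small illustration of the single vs double quote point, here is a sketch that uses only Base Julia. The sample line and the variable names are made up for the example.

```julia
# Single quotes make one character (Char); double quotes make a String.
sep_char = ','
sep_string = ","
@assert typeof(sep_char) == Char
@assert typeof(sep_string) == String

# Unlike in R, these are two different types and never compare equal.
@assert sep_char != sep_string

# Converting when a function insists on one or the other:
@assert string(sep_char) == sep_string    # Char -> String
@assert first(sep_string) == sep_char     # String -> its first Char

# `split` from Base accepts either form as the delimiter.
line = "id,weight,sire"                   # hypothetical line of a CSV file
@assert split(line, sep_char) == ["id", "weight", "sire"]
@assert split(line, sep_string) == ["id", "weight", "sire"]
```

Functions in packages may be stricter than `split` and accept only a `Char` for a separator, which is the situation described earlier in the thread.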
You are right that intro tutorials and intro documentation are needed. But there are some (Get started with Julia): did you find them, and if so, did you find them useful? Also, many Julia packages have examples on github. I find that a better learning material than the manuals.
In R, there is a very sharp divide between users, who run scripts of mostly linear computations (load data, perform some analysis, make plots/tables), and developers, who write packages and contribute to R. Thanks to the efforts of the latter, R’s data, plotting, and statistics packages are well polished, setting the standard in many ways. However, despite the maturity of its package ecosystem, R is not a programmer-friendly language, but you don’t encounter this until you try to do something nontrivial.
One of the reasons for creating Julia was to make it easier to write high-level but performant code. At this stage, most of Julia’s “users” would be among the package developers in almost all language communities. You are correct in recognizing that many things are under development, introductory material does not exist or becomes outdated very quickly, and in many cases there is nothing to document as some functionality is missing from the package ecosystem. It is best to adjust your expectations: it is unrealistic to compare the package ecosystem of Julia v0.6 to R, which has been around since 1995 (not counting S-Plus). It is very likely that it will take years for the APIs to stabilize.
At the same time, Julia offers great benefits. Most of these are apparent only if you program, ie write nontrivial code longer than a few hundred lines, that takes more than a few minutes or seconds to run. The price you pay for this is having to read source code, report bugs, and occasionally contribute fixes.
Only you can decide what is more important for you. That said, if you are willing to invest time into Julia, you could be the author of one of the great tutorials about topics that interest you. In my experience, contributions to documentation are always welcome and appreciated.
None of those tutorials have version labels though. Many people avoid them because they don’t want to accidentally learn the wrong version.
Just to reflect on some of the things you said. I personally came from R and found the shift quite easy. My introductory material was David Sanders’ introductory video on YouTube, and I recommend that to everyone. But yes, you are right, Julia needs an introductory book. When I learned R, in 2004, that was also very much true for R! But there are lots of good books now. R’s error messages have also really improved a lot in recent years (I still don’t like them). Julia’s manual is good, but it’s not a tutorial for beginners.
Julia is pre-1.0, and thus it is still mostly for the adventurous. You’ve started using it because you need the performance, which is great, but you are running into problems that people WILL run into when using beta software, especially for the average member of the R user base who are more interested in getting results than in developing code. This will change. And maybe you’ll help?
You write that you’re afraid of new versions breaking your code, but there’s nothing to be done about that at the moment. That doesn’t mean you can’t rely on Julia packages btw, just that you may have to update your code if you want to run it on a different version of Julia than you wrote it for. “Having a stable version that promises no breakages would be convenient for newcomers” is a matter of course, everybody realizes that. The reason it’s not there yet is that the language is still in development. But if you’ve followed here a little, you’d know that getting out a stable version that won’t have breaking updates for a long time has been THE core focus of all Julia’s key developers for more than a year - and they’re just on the cusp of having it.
Still, there are some of your complaints I don’t actually follow.
WRT error messages, I think it’s an acquired taste. I like that you can click them and be taken directly to the code that’s causing problems, and read it. That’s much more powerful than R’s.
I don’t understand what you mean with the issue with Julia’s package manager (though a new one is coming out in a few months). What do you mean you have to load subpackages? BioJulia had a package reorganization recently, that may have caused you some issues with packages, is that it? Still that has nothing to do with Julia’s package manager in general.
The main plotting packages in Julia should all support plotting a 1-column matrix as a vector out-of-the-box. What package were you trying to use for plotting? But, you’re right, explicit types (and conversions) are THE one thing you need to learn to make a successful transition from R to Julia.
It’s a good idea to update old Stack Overflow questions. I do sometimes search SO for questions marked plots.jl and for my own answers and update them, and I should do it more. It’s also a good idea to explicitly version Julia tutorials (that this isn’t done is really surprising to me).
“Atom is horrible” is just unconstructive. What do you mean? I find it much more powerful than RStudio myself. I’ve disabled automatic completion tooltips using the “tibber” package provided with Juno. (There’s also VSCode if you want to try something different).
“why I can’t just put my column names into a character array” is just a difference between R and Julia. I don’t see any reason that strings should be inherently simpler than symbols as names?
I just read this book on Python:
If there was a Julia equivalent, we would be smooth sailing
Well, you better get writing then!
In this economy?
Here are some other tutorials which explain a number of concepts: GitHub - PaulSoderlind/JuliaTutorial: Julia Tutorial for Finance and Econometrics Students
lol this sounds all too familiar, though I’ve just avoided using dataframes and their pesky :symbols till quite recently. Still not sure how to do matrix multiply on a vector of vectors, but I now know how to extract to a matrix (it’s hiding in the .columns field)
Kaggle was quite a harsh environment for Julia packages though: every run was essentially a start-up run, so doing as little as possible, where possible, seemed to work… 0 wins isn’t all bad against XGBoost and the like
That is why the new tutorial series on YouTube is so great. It’s a “let’s get started with …” series that’s very accessible to newcomers I think. I hope to do an Optim one soon, but basically any of the major projects should do one if possible.
edit: Alright, “but basically any of the major projects should do one if possible” was sort of harshly interpreted I see. What I meant to say was:
I think that the user-facing packages (so maybe not some package that provides a macro that all the developers use) can increase adoption rates if they produce a set of notebooks and post a to-the-point video about what the package does and how you do it.
I find it very difficult to follow video tutorials (compared to text/HTML, or interactive notebooks like Jupyter). If I am learning math, I do it with paper and pencil, if I am learning to code, I have a source code buffer and a REPL open, and follow along. This is just tedious with videos.
Given that they are also orders of magnitude harder to produce (recording equipment, aligning sound and video, making both the speaker and the slides show up OK), and impossible to unit test (eg set up a simple CI environment that alerts when the packages move on and the code breaks), I wonder if they are the best way to go.
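For text tutorials, the kind of check mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: it targets Julia 1.0 or later (where `include_string` takes a module as its first argument), it assumes the posts are Markdown with fenced Julia blocks, and `julia_blocks`, `failing_blocks` and the sample `post` are hypothetical names made up for the example.

```julia
# Three backticks, built this way so the sketch can itself sit in a fenced block.
const FENCE = "`"^3

# Collect the bodies of all fenced Julia blocks in a Markdown string.
function julia_blocks(markdown::AbstractString)
    blocks = String[]
    current = nothing                 # lines of the open block, or `nothing`
    for line in split(markdown, '\n')
        if current === nothing
            startswith(line, FENCE * "julia") && (current = String[])
        elseif startswith(line, FENCE)
            push!(blocks, join(current, "\n"))
            current = nothing
        else
            push!(current, line)
        end
    end
    return blocks
end

# Run the blocks in order inside one throwaway module, so later blocks
# can use names defined by earlier ones; return the indices that errored.
function failing_blocks(markdown::AbstractString)
    sandbox = Module()
    failures = Int[]
    for (i, block) in enumerate(julia_blocks(markdown))
        try
            include_string(sandbox, block)
        catch
            push!(failures, i)
        end
    end
    return failures
end

# A made-up post with one working and one broken snippet.
post = join([
    "Some prose.",
    FENCE * "julia",
    "x = 1 + 1",
    FENCE,
    "More prose.",
    FENCE * "julia",
    "error(\"this snippet is out of date\")",
    FENCE,
], "\n")

println(failing_blocks(post))
```

A CI job could run this over every post whenever a new Julia or package version comes out and flag the posts whose list of failures is not empty. It only checks that “things ran OK”, not that the output is still the one shown in the post.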