Review of presentation

I made this presentation focused on the data ecosystem in Julia. I just wanted to post it here for feedback so I can polish it.

Thanks

Nice! I especially enjoyed the dive into why group by is fast and the implementation of an appropriate algorithm. Maybe highlight at the end that Julia is now beating a highly optimized, C-backed R package.

Yeah, but it’s only beating data.table in a specific case. For the more general case, where we group by more than one column, we don’t even have code yet.

Also, I am using 4 threads and only beating data.table by 30%. Additionally, data.table is generally faster for smaller group-bys; only by fractions of a second, but it still counts.

Once Julia can comprehensively beat data.table at everything, then it’s time to celebrate.

Are you sure that the history of various missing value implementations (including those now obsolete) is relevant?

Also, I tend to prefer plain vanilla PDF slides (e.g. with beamer) to Prezi’s dizzying zooming around, but that’s a personal preference.

One of the slides has a misspelling (“doesn’t” is misspelled), but I don’t know how to refer to it.

A small example that does something one would do with e.g. the tidyverse in R may be enlightening, if the audience is familiar with that.

Thanks! That was enlightening.

Nice, but note that “Julia doesn’t have built-in missing value” is no longer true:
https://docs.julialang.org/en/latest/manual/missing

OK, so it will be part of the 0.7 release then.

Here are a few comments on the presentation (I thought the technical content was pretty good).

I like the general style, with a focus on simplicity.

  • What’s the D for?
  • Some of the slides look like they haven’t been given the graphical treatment. E.g. the “Fast” slide (about six slides in) could be made more consistent with the others. The next slide probably has too much information on it. Are there some templates that could provide some consistency?
  • The code samples need more careful formatting: those gray comments are difficult to read.
  • URLs are hard to read if they’re underlined, and if you can’t click on them (Prezi?) you’ll have to type them in. An idea might be to put a clickable link icon after the URL, or a footnote at the bottom of the slide.
  • Should the graphs have some units/labels? Not important, but the initial impression of the first bar chart was “Julia is wicked fast, it can do 160 per second…” :slight_smile:

Anyway, it was good!

Content is good, but the Prezi format makes it hard to focus on the value provided by the Julia data ecosystem. I’ve seen many technical presentations (pitches), and the ones that stick in my mind are the ones that answer one specific question very early on: “What can X do for me now?” Notice the emphasis on “me” and “now”. I don’t get that fuzzy feeling when I look at the first few slides of your presentation. Here are some suggestions.

If your audience consists of novices, they’ll want to know why to pick Julia as opposed to R or Python. For them, have a single slide that emphasizes the speed of the language, the ease of learning (comparable to Python) and the long-term potential. Mention that Julia is here to stay and that very soon it will be a valuable skill to have as a data scientist/developer. Sprinkle in some quotes about the two-language problem and the ability to do parallel stuff (in contrast to Python/R). Topics like multiple dispatch, Lisp-style macros and metaprogramming should not be mentioned at this stage; they will confuse everyone. In the next slide, show them how they can download Julia and start using it right away. There are many options now: JuliaBox, Docker, binaries, etc.

Use the next few slides to show them how to use their brand new toy: very concise and practical code snippets that each do one thing very well. Code should be formatted properly and set in a large font. Not more than a page per concept. If you can fit a graph, do so. Audience members in the back of the room should be able to read the code on the slide.

In the next few slides, wow them with some more advanced stuff. No theory. Just show them the code and explain what it does. They might not understand all the nuances. That’s fine. That’s why they call you for further information.

For the more seasoned audience members, focus on the DataFrames ecosystem and JuliaDB, and definitely mention all the new machine/deep learning frameworks that are coming online. Define a very specific problem and then solve it for them in 3 to 4 slides using DataFrames, pandas and data.table. Show realistic comparisons and mention that all of this is written in native Julia as opposed to C or C++ (the two-language problem). Then tell them how, in the near future, things will get even faster and better.

I wouldn’t bring up anything about the treatment of Nulls, missing values, etc. You can specify relevant links to all these topics on the second to last slide of your presentation. The last slide should be your contact information only. As the expert consultant, if they have any questions, you can answer them all for them.

Good feedback already. I had a quick look and have the following remarks:

  • don’t think that ‘inspired by Python’ is correct. I would have guessed Matlab, Lisp, Dylan (and Pascal b/c of the (begin)/end… :+1: ) and others

  • don’t think that pointing out interpreted (R/Python) vs. compiled C for speed is important. I’d say

    • Julia is fast because the language has been designed for it (incl. accepting compromises in the dynamic capabilities). This allows ‘the language to talk to the compiler’ more directly
    • there have been successful compilation attempts with R/Python, but it’s difficult b/c those languages are very dynamic. But JavaScript for example is fast
  • don’t know about DataFrame. Wouldn’t tell too much history, other than that finally there is hope that with 0.7/1.0 everything will fall into place and be fast

  • [for quite many things, I think, patience is still in order and I wouldn’t ‘overhype’ Julia; Julia is not yet completely polished for ‘lazy end-user use’. I wouldn’t raise expectations with e.g. julia dataverse: too much flux still, one cannot compare with hadleyverse, tidyverse, shiny, RStudio, … atm. But the important thing is, that the foundations are super-sound afaict and this is what will matter in future. R is great but imho has nowhere to go, it won’t ever be possible to fix the shortcomings (there was a reason one of the founders, Ihaka, said: start over and build something better…). I don’t have too much experience with Python but I don’t think Python ever will be able to offer the data scientists a nice concise syntax…]

If you have some time for background reading, I’d recommend Jeff Bezanson’s master’s thesis. As a non-computer scientist I found it more approachable than the PhD thesis.

Nice presentation. I think there’s a bit of fluff at the beginning, it could get to the meat of it faster. I think it can emphasize at the end that the Julia solution means packages written completely in the higher level language, which not only makes it easier to write/maintain, but also makes it easier to add unconventional features like support for weird number types, out-of-core, GPU acceleration of specific algorithms, etc.

I think it’s good to keep emphasizing that Julia itself doesn’t make things magically faster, since all this other stuff can just be written in C/C++. However, what it allows you to do is match C/C++ inside the same language that you’re scripting with. Since most of the speed comes from implementing intelligent algorithms, Julia’s advantage isn’t really its raw speed. Rather, Julia’s advantage is that it’s much easier to develop a package with a lot of complicated algorithms in it, and the hope is that over time the sheer productivity advantage without the performance disadvantage will win.

Is that true in the long run? For example, even without using TMP, many things in C++ generate faster code than C because it can inline better (e.g. passing function pointers to sorting algorithms relies on smart compilers). Also, isn’t the main reason Fortran can be faster than C that it doesn’t have aliasing problems with arrays, which leads to better compiler optimizations? It seems to me that there is no reason Julia can’t be faster than both C++ and Fortran! I realize this is about theoretical asymptotics of performance, but (if true!) these are worth disseminating (and encouraging big players to aid in the compiler back-end development).
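To make the inlining point concrete, here is a minimal C++ sketch of the two styles being compared. The function names (`compare_ints`, `sort_with_qsort`, `sort_with_lambda`) are made up for this illustration. Whether the second version is actually faster depends on the compiler and optimization flags, so treat it as a picture of the mechanism rather than a benchmark.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C-style comparator. qsort receives it as a function pointer, so every
// comparison is an indirect call unless the compiler can see through it.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// Sorts a copy of v using the C library's qsort and a function pointer.
std::vector<int> sort_with_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    return v;
}

// Sorts a copy of v using std::sort and a lambda. The lambda has its own
// unique type, so std::sort is instantiated for exactly this comparator,
// which makes the comparison a candidate for inlining.
std::vector<int> sort_with_lambda(std::vector<int> v) {
    std::sort(v.begin(), v.end(), [](int x, int y) { return x < y; });
    return v;
}
```

Both functions return the same sorted vector; the only difference is how the comparator reaches the sorting routine, which is the part the optimizer cares about.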

Julia inlines functions as well. I actually showed the other day:

that part of the reason we can get to the speeds we do with DiffEq is because of specialization on the functions and the inlining that tends to follow. So even though C++ can inline, it cannot inline functions which don’t exist at compile time, which then interrupts some of the optimizations when used in this “Python + C++” or “R + C++” setup. So you probably get those back when using C++ directly, but not when using C++ through a scripting language. Julia, however, does naturally get this boost. This is very helpful not just in optimization or DiffEq, but also for things like maps, find, search, etc.

Julia can implement higher optimizations in local scopes via macros. This has already been suggested:

Since this would be a feature addon, it’s not v1.0 material. But I hope to see this in a v1.x :slight_smile:.

There are a few cases where I have found it hard to get Julia to 1x with good C++ code. Usually this difference is due to optimizations turned off due to aliasing, and this is an example we found in the DiffEq chatroom:

This is one example among many that I would point to in support of the following heuristic: getting within 2x of C++ with Julia is easy, getting to 1x is possible but can take work in some cases. “Taking work” generally means avoiding things which cannot be optimized due to aliasing, and working around the fact that views do not stack-allocate. But both of these can be worked around, and both of these are fixable.
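For readers who haven’t run into aliasing before, here is a small C++ sketch of why it blocks optimizations. It is not the example from the chatroom, and the function names (`add_twice`, `add_twice_by_value`) are hypothetical and only for illustration.

```cpp
// Adds *src to *dst twice. Because dst and src may point to the same int,
// the compiler has to reload *src after the first store; it cannot
// simplify the body to *dst += 2 * (*src).
void add_twice(int* dst, const int* src) {
    *dst += *src;
    *dst += *src;
}

// Same operation with the source passed by value. No aliasing is possible
// here, so the compiler is free to keep src in a register.
void add_twice_by_value(int* dst, int src) {
    *dst += src;
    *dst += src;
}
```

If `x` starts at 1, `add_twice(&x, &x)` leaves it at 4 (1 + 1, then 2 + 2), while `add_twice_by_value(&x, x)` leaves it at 3. Since the two really are different computations when the pointers overlap, the compiler must assume the worst unless it can prove otherwise.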

Even if you don’t get that, the 2x performance gains come from comparing the same algorithm in both languages. The kicker is that it’s much easier to write a very complex algorithm in Julia, and in many cases the gains from that are >>2x.

That’s why I’ve been disseminating it. My current view of Julia is very package-forward, essentially saying “why Julia is amazing is because it gives package developers an insane amount of productivity without sacrificing performance, yet in the end it’s also an easy scripting language with a REPL that you can give to an undergrad and have them punch in numbers”. I think Julia’s “winning strategy” is thus not by arguing about whether language internals are helpful, but by using the productivity advantage to build comprehensive and performant packages. This idea is expanded upon in this post:

This is getting somewhat away from the OP, so if you want to continue, we should do so in another thread.
