I made this presentation focused on the data ecosystem in Julia. I just wanted to post it here for feedback so I can polish it.
Thanks
Nice! I especially enjoyed the part delving into why group-by is fast and implementing an appropriate algorithm. Maybe highlight at the end that Julia is now beating a highly optimized, C-backed R package.
Yeah, it's only beating data.table in a specific case. In the more general case where we group by more than one column, we don't even have code for that.
Also, I am using 4 threads and only beating data.table by 30%. Additionally, data.table is generally faster for smaller group-bys; only by fractions of a second, but still, it's something.
Once Julia can comprehensively beat data.table at everything, then it's time to celebrate.
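For reference, the kind of single-column group-by being compared looks roughly like this in DataFrames.jl (a minimal sketch using a recent DataFrames.jl API; the column names and sizes are invented for illustration):

```julia
using DataFrames

# Illustrative data only; ids and values are made up.
df = DataFrame(id = rand(1:1_000, 10^7), x = rand(10^7))

# Single-column group-by with a sum aggregation -- the "specific case"
# mentioned above where the Julia timing was compared against data.table.
combine(groupby(df, :id), :x => sum => :x_sum)
```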
Are you sure that the history of various missing value implementations (including those now obsolete) is relevant?
Also, I tend to prefer plain vanilla PDF slides (e.g. with beamer) to Prezi's dizzying zooming around, but that's a personal preference.
One of the slides has a misspelling ("doesn't" is misspelled), but I don't know how to refer to it.
A small example that does something one would do with e.g. the tidyverse in R may be enlightening, if the audience is familiar with that.
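For instance, a hypothetical DataFrames.jl sketch of a dplyr-style filter/group_by/summarise pipeline (columns invented for illustration; syntax is that of a recent DataFrames.jl version):

```julia
using DataFrames, Statistics

# Invented data, just to mirror a filter -> group_by -> summarise pipeline.
df = DataFrame(species = rand(["a", "b", "c"], 100),
               len     = rand(100),
               width   = rand(100))

result = combine(groupby(filter(:len => >(0.5), df), :species),
                 :width => mean => :mean_width)
```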
Thanks! That was enlightening.
Nice, but note that "Julia doesn't have built-in missing value" is no longer true:
https://docs.julialang.org/en/latest/manual/missing
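For example, the built-in missing value now propagates through most operations and can be skipped explicitly (standard behaviour from 0.7 on):

```julia
# Built-in missing values in action.
x = [1, 2, missing, 4]

sum(x)                # missing
sum(skipmissing(x))   # 7
ismissing(x[3])       # true
eltype(x)             # Union{Missing, Int64}
```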
OK, so it will be part of the 0.7 release then.
Here are a few comments on the presentation (I thought the technical content was pretty good).
I like the general style, with a focus on simplicity.
Anyway, it was good!
Content is good, but the Prezi format makes it hard to focus on the value provided by the Julia data ecosystem. I've seen many technical presentations (pitches), and the ones that stick in my mind are the ones that answer the specific question very early on: what can X do for me now? Notice the emphasis on the me and now. I don't get that fuzzy feeling when I look at the first few slides of your presentation. Here are some suggestions.
If your audience consists of novices, they'll want to know why to pick Julia as opposed to R or Python. For them, have a single slide that emphasizes the speed of the language, the ease of learning (comparable to Python), and the long-term potential. Mention that Julia is here to stay and that very soon it will be a valuable skill to have as a data scientist/developer. Sprinkle in some quotes about the two-language problem and the ability to do parallel stuff (in contrast to Python/R). Topics like multiple dispatch, Lisp-style macros, and metaprogramming should not be mentioned at this stage; it will confuse everyone. In the next slide, show them how they can download Julia and use it right away. There are many options now: JuliaBox, Docker, binaries, etc.
Use the next few slides to show them how to use their brand new toy. Very concise and practical code snippets that do one thing very well. Code should be formatted properly and in a large font, with no more than a page per concept. If you can fit a graph, do so. Audience members in the back of the room should be able to read the code on the slide.
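As a sketch of what such a one-idea snippet could look like, e.g. loading a CSV and getting a quick summary ("data.csv" is a placeholder file name):

```julia
using CSV, DataFrames

# One slide, one idea: read a CSV into a DataFrame and summarize it.
df = CSV.File("data.csv") |> DataFrame
describe(df)
first(df, 5)
```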
In the next few slides, wow them with some more advanced stuff. No theory. Just show them the code and explain what it does. They might not understand all the nuances. That's fine. That's why they call you for further information.
For the more seasoned audience members, focus on the DataFrames ecosystem and JuliaDB, and definitely mention all the new machine/deep learning frameworks that are coming online. Define a very specific problem and then solve it for them in 3 to 4 slides using DataFrames, pandas, and data.table. Show realistic comparisons and mention that all of this is written in native Julia as opposed to C or C++ (the two-language problem). Then tell them how, in the near future, things will get even faster and better.
I wouldn't bring up anything about the treatment of nulls, missing values, etc. You can put relevant links to all these topics on the second-to-last slide of your presentation. The last slide should be your contact information only. As the expert consultant, if they have any questions, you can answer them all for them.
Good feedback already. I had a quick look and have the following remarks:
I don't think that "inspired by Python" is correct. I would have guessed Matlab, Lisp, Dylan (and Pascal because of the begin/end…), and others.
I don't think that pointing out interpreted (R/Python) vs. compiled C for speed is that important, I'd say.
I don't know about DataFrames. I wouldn't tell too much history, other than that finally there is hope that with 0.7/1.0 everything will fall into place and be fast.
[For quite a lot of things, I think, patience is still in order and I wouldn't "overhype" Julia; Julia is not yet completely polished for "lazy end-user use". I wouldn't raise expectations with e.g. the Julia dataverse: there is too much flux still, and one cannot compare it with the hadleyverse, tidyverse, shiny, RStudio, … at the moment. But the important thing is that the foundations are super-sound afaict, and this is what will matter in the future. R is great but imho has nowhere to go; it won't ever be possible to fix the shortcomings (there was a reason one of the founders, Ihaka, said: start over and build something better…). I don't have much experience with Python, but I don't think Python will ever be able to offer data scientists a nice, concise syntax…]
If you have some time for background reading, I'd recommend Jeff Bezanson's master's thesis. As a non-computer-scientist, I found it more approachable than his PhD thesis.
Nice presentation. I think there's a bit of fluff at the beginning; it could get to the meat of it faster. I think it could emphasize at the end that the Julia solution means packages written completely in the higher-level language, which not only makes them easier to write and maintain, but also makes it easier to add unconventional features like support for weird number types, out-of-core computation, GPU acceleration of specific algorithms, etc.
I think it's good to keep emphasizing that Julia itself doesn't make things magically faster, since all this other stuff can just be written in C/C++. However, what it allows you to do is match C/C++ inside the same language that you're scripting with. Since most of the speed comes from implementing intelligent algorithms, Julia's advantage isn't really raw speed. Rather, Julia's advantage is that it's much easier to develop a package with a lot of complicated algorithms in it, and the hope is that over time the sheer productivity advantage, without a performance disadvantage, will win.
Is that true in the long run? For example, even without using TMP, many things in C++ generate faster code than C because the compiler can inline better (e.g. passing function pointers to sorting algorithms relies on smart compilers). Also, isn't the main reason Fortran can be faster than C that it doesn't have aliasing problems with arrays, which leads to better compiler optimizations? It seems to me that there is no reason Julia can't be faster than both C++ and Fortran! I realize this is about the theoretical asymptotics of performance, but (if true!) these points are worth disseminating (and encouraging big players to aid in compiler back-end development).
Julia inlines functions as well. I actually showed the other day:
that part of the reason we can get to the speeds we do with DiffEq is because of specialization on the functions and the inlining that tends to follow. So even though C++ can inline, it cannot inline functions which don't exist at compile time, which then interrupts some of the optimizations in this "Python + C++" or "R + C++" setup. So you probably get those back when using C++ directly, but not when using C++ through a scripting language. However, Julia does naturally get this boost. This is very helpful not just in optimization or DiffEq, but also for things like map, find, search, etc.
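As a rough illustration of what specialization on a function argument means here (a toy example, not the DiffEq code):

```julia
# Julia compiles a specialized method of the higher-order function for each
# function argument it is called with, so the anonymous function below can be
# inlined into the reduction kernel -- no function-pointer indirection.
f(xs) = sum(x -> 2x + 1, xs)

xs = rand(1000)
f(xs)

# To inspect the generated code and see the inlining (optional):
# @code_llvm f(xs)
```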
Julia can also enable further optimizations in local scopes via macros. This has already been suggested:
Since this would be a feature add-on, it's not v1.0 material, but I hope to see it in a v1.x release.
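For context, existing macros like @inbounds and @simd already follow this pattern of opting into extra optimizations only inside a local scope (a minimal sketch of the existing pattern, not the suggested feature itself):

```julia
# Optimizations opted into only for this one loop, not globally.
function mysum(xs)
    s = zero(eltype(xs))
    @inbounds @simd for i in eachindex(xs)
        s += xs[i]
    end
    return s
end

mysum(rand(1000))
```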
There are a few cases where I have found it hard to get Julia to 1x with good C++ code. Usually this difference comes from optimizations being turned off because of aliasing, and this is an example we found in the DiffEq chatroom:
This is one example among many that I would point to that gives me the following heuristic: getting within 2x of C++ with Julia is easy; getting to 1x is possible but can take work in some cases. "Taking work" generally means avoiding things which cannot be optimized due to aliasing, and avoiding the fact that views do not stack-allocate. But both of these can be worked around, and both of these are fixable.
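As a rough sketch of the second workaround (indexing the parent array directly instead of allocating views; a toy kernel, not the DiffEq example):

```julia
# The view-based version heap-allocates a SubArray per column (views do not
# stack-allocate, as mentioned above); the indexed version avoids that.
function colsum_views(A)
    s = zeros(size(A, 2))
    for j in 1:size(A, 2)
        s[j] = sum(view(A, :, j))
    end
    return s
end

function colsum_indexed(A)
    s = zeros(size(A, 2))
    @inbounds for j in 1:size(A, 2), i in 1:size(A, 1)
        s[j] += A[i, j]
    end
    return s
end
```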
Even if you don't get that, the 2x performance comparisons are between the same algorithm. The kicker is that it's much easier to write a very complex algorithm in Julia, and in many cases the gains from that are >>2x.
That's why I've been disseminating it. My current view of Julia is very package-forward, essentially saying "why Julia is amazing is that it gives package developers an insane amount of productivity without sacrificing performance, yet in the end it's also an easy scripting language with a REPL that you can give to an undergrad and have them punch in numbers". I think Julia's "winning strategy" is thus not arguing about whether language internals are helpful, but using the productivity advantage to build comprehensive and performant packages. This idea is expanded upon in this post:
This is getting somewhat away from the OP, so if you want to continue we should do so in another thread.