I understand that if you write type-stable code in Julia then it's easier to optimize, and hence it will result in fast-running code.
But I am thinking about this in the DataFrames context. Say I have a function that takes a DataFrame as argument and always outputs a DataFrame. Now the output is type-stable, since it's always outputting a DataFrame. But I can output DataFrames with different columns each time. That should be type-unstable, right?
I am a little confused here. Does it mean that, by design, Julia can't optimize code that outputs DataFrames? To do that it needs to know the type of each column, but that type information is not encoded in the definition of a DataFrame. So even if a function is type-stable in the sense that it always produces a DataFrame, it will still be slow, as the columns' types may change from run to run depending on the inputs?
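To make the question concrete, here is a minimal example (`make_table` is made up for illustration):

```julia
using DataFrames

# Both branches return a DataFrame, so the function is "type-stable" at the
# container level. But the element type of column :x differs between calls,
# so code consuming df.x cannot be specialized ahead of time.
make_table(flag::Bool) = flag ? DataFrame(x = [1, 2, 3]) : DataFrame(x = ["a", "b", "c"])

df = make_table(true)
eltype(df.x)  # Int64, but only knowable at runtime; statically df.x is just an AbstractVector
```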
Your question is relevant - it’s been one of the main issues with designing DataFrames in Julia. The short answer is that with the new versions of DataFrames coming out soon, they will be type-stable.
Yes, that’s indeed a significant issue. The TypedTables package provides an alternative to DataFrame which includes column types in the table type. That ensures that specialized code can be compiled for each set of column types. This is an advantage when you work with tables which all have the same shape.
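To illustrate the principle, here is a toy version of such a table type (this struct is purely illustrative and is not the actual TypedTables API):

```julia
# Columns live in a NamedTuple, so the column names and element types are
# part of the table's own type.
struct MyTypedTable{C<:NamedTuple}
    columns::C
end

# Specialized code is compiled for each concrete combination of column types;
# inside the method, the eltype of every column is known to the compiler.
colsums(t::MyTypedTable) = map(sum, t.columns)

t = MyTypedTable((x = [1, 2, 3], y = [1.0, 2.0, 3.0]))
colsums(t)  # (x = 6, y = 6.0), fully inferred
```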
But it can also be a problem if you work with lots of very different tables, as is often the case when you select subsets of variables during various operations. In that case, all functions have to be recompiled each time the columns change, which could slow down operations rather than speed them up (introducing a fixed cost).
An intermediate approach, used for example by DataFrames master in the joining and grouping code, is to separate out kernel functions which perform the intensive operations, and have them take a tuple of columns rather than a DataFrame. That way, the kernels are specialized, but the code wrapping them, which wouldn't benefit as much from specialization, isn't. We could probably apply this approach in other places. The question is always whether the time required for compilation is worth it. The tradeoff is clearly positive when an operation can be applied separately to each column, since the number of different column types is limited. It's less obvious when specializing on a combination of column types, since the number of combinations can be very high (and depends on the order of the columns). Only experimentation can tell.
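A sketch of that pattern (the function names are made up; this is not the actual DataFrames internals):

```julia
using DataFrames

# The outer function handles the type-unstable DataFrame; the kernel receives
# a tuple of concretely typed columns, so only the kernel gets specialized
# for each combination of column types.
col_sums(df::DataFrame) = _col_sums_kernel(Tuple(eachcol(df)))

# Compiled once per tuple type; inside, every column's eltype is concrete.
_col_sums_kernel(cols::Tuple) = map(sum, cols)

df = DataFrame(a = [1, 2, 3], b = [1.0, 2.5, 4.0])
col_sums(df)  # (6, 7.5)
```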
Actually, they won’t. DataFrames master isn’t very different from the current release in that regard. Use TypedTables if you need a fully type-stable approach.
It’s not easy to keep track… I believe IndexedTables is also a type-stable alternative. It is true that there is a bit of recompilation cost, but if you’re interested in working with a dataset with a lot of rows, it’s probably worth it.
There's no clear plan in that regard in the short term, AFAIK. We have enough work on our plate (too much, actually, given the size of the team) with the port to Nulls. Many people are interested in the possibilities opened up by NamedTuple (which is already used by Query to represent rows), though, so some experimentation will likely happen, e.g. via alternative data frame types.
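For context, this is what a NamedTuple row looks like (a generic example, not tied to Query's internals):

```julia
# The row's type records both the column names and the element types,
# which is what lets row-wise code specialize.
row = (name = "Ann", age = 42)
typeof(row)  # NamedTuple{(:name, :age), Tuple{String, Int64}}
```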
The people working on the different table implementations in Julia are truly moving mountains at the moment. This is really one of those areas in Julia where the user/developer ratio suffers, mainly because almost everybody uses tables. Thank God the group of developers is first-class.
I fully agree with @mkborregaard : the data developers are doing an insane amount of work!
Concerning “typed-ness”, I believe that one issue with the current DataFrame implementation (unless of course I'm missing something) is that it's difficult to do map and filter on a DataFrame performantly, even though those are basic manipulations. By map I mean something that takes a DataFrame and a function from named tuples to named tuples and outputs a DataFrame, and filter would take a DataFrame and a predicate on named tuples. One can of course resort to external packages (such as Query), but even there the @select statement, which is what I call map here, has some limitations, as it relies on type inference on NamedTuples to work.
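Something like this sketch (assuming a recent DataFrames and Tables.jl setup; `rowmap` and `rowfilter` are hypothetical helpers, not part of the DataFrames API):

```julia
using DataFrames, Tables

# Each row is materialized as a NamedTuple. The catch is that the NamedTuple
# type depends on column types known only at runtime, which is exactly where
# type inference gets into trouble.
rowmap(f, df::DataFrame) = DataFrame([f(r) for r in Tables.namedtupleiterator(df)])
rowfilter(p, df::DataFrame) = DataFrame([r for r in Tables.namedtupleiterator(df) if p(r)])

df = DataFrame(a = [1, 2, 3], b = [10.0, 20.0, 30.0])
rowmap(r -> (a2 = r.a^2, b = r.b), df)  # 3x2 DataFrame with columns a2 and b
rowfilter(r -> r.a > 1, df)             # the rows where a > 1
```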
Still, there has been a lot of discussion and interesting ideas on these topics and I’m curious to see what the outcome will be.
Coming from R and SAS, I think resolving the uncertainty around dataframe type instability, and doing it with a unified solution, is key. Currently it's hard for me to use Julia just for data manipulation like I can with R, partly because there are so many different dataframe packages out there.
One good thing is that there is a lot of experimentation going on in Julia, but for wide adoption of Julia in the data community there needs to be a stable and unified package for data manipulation.
I think the ideas in IndexedTables, DataFrames, etc. should not be mutually exclusive; e.g. I do want to be able to index any column in a dataframe.
I'm sure you will be a welcome addition to the team. You should be aware that no one working on DataFrames receives any salary to do so – it's purely voluntary work given as a gift to the Julia community.
Everyone is well aware of the current incomplete state of table support in comparison to R, and of its importance for the language. It's just in development. The final result might well end up even better than what R (Python, Matlab, SAS) offers, IMHO.
Even though I would love to be employed full time to work on dataframes and related tech, I understand that my level of Julia is not “employable” yet. Anyway, in my most recent benchmark, R's data.table was 3 times faster than Julia's IndexedTables for a task. And it's not clear how to accomplish certain tasks in Julia vs data.table, so clearly there is still a lot to do in Julia. I have to say, it's clear how Julia can make compute tasks faster, but it's not so clear how it can make dataframe tasks faster.
It would be ideal to have more resources put into Julia in terms of dataframe development; it's unrealistic to rely on unpaid volunteers for long-term development. Most open source development is done by paid developers, I would say. Anyway, hopefully one day I will be good enough and a position will open up so I can work on it full time and quit the consulting work, which still uses SAS (pain!). I am working 40 hours a week on a consulting engagement and 20+ hours on product development. That's 60+ hours a week. I am trying my best to contribute to the Julia ecosystem, as I already use Julia for a number of things, but the best I can do right now is file bug reports and make the occasional simple PR.
“If you aren’t happy with the documentation, please submit some patches. We aren’t getting paid to write documentation for you”
Can't blame either side. Most pandas devs are not being paid, and there are others benefiting from open source without contributing back.
The solution? Every party that benefits from open source directly should give back in some way: if not monetarily, then with well-researched and helpful bug reports.
One option to encourage development is one of the funding models Open Whisper Systems uses to develop their Signal app: BitHub. BitHub combines Bitcoin and GitHub to fund commits: donations go into a common pot, and each commit is apportioned a certain amount of Bitcoin.
I think you’d be interested in reading @ChrisRackauckas’s recent blog post. In particular, I think Chris hits the nail on the head by saying something along the lines of “don’t expect Julia to be faster than traditional C libraries used in other languages; while Julia certainly can achieve performance parity, it’s also much younger and offers many other benefits like more productive development times and abilities to scale long-term”.
In particular, data.table is an extremely well-established, C-based library that has been optimized for specific operations over years. It's a wonderful resource! It was my favorite R package (back when I used R). DataFrames is certainly not ready to compete with the hand-optimized code of data.table in its current form. But there are a lot of positive signs in the Julia data-processing ecosystem: JuliaDB/IndexedTables provide a DataFrame-like table that can automatically scale across Julia processes, DataStreams provides completely decoupled IO operations between various data formats (i.e. you can move data between CSV, feather files, databases, DataFrames, etc. extremely fast and efficiently without having to load data into memory as an intermediate step), and Query.jl provides a custom “data DSL” for doing SQL/dplyr-like operations directly on a variety of data “backends”/formats.
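For a flavor of the Query.jl DSL mentioned above (assuming its standard pipeline syntax):

```julia
using DataFrames, Query

df = DataFrame(name = ["Ann", "Bob", "Cem"], age = [42, 17, 29])

# SQL/dplyr-like operations expressed directly in Julia; _ refers to the
# current row, and {...} builds a NamedTuple of the selected columns.
adults = df |>
    @filter(_.age >= 18) |>
    @map({_.name, _.age}) |>
    DataFrame
```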
Those three examples alone are exciting and illustrate the power and productivity of Julia and how it can influence data operations as technical computing continues to evolve. The future is exciting, and I think optimizations will continue to roll in, but I'm even more excited about the kind of scale and modularity that's already being put in place in Julia (which, IMO, you can't find in other languages).
Since the subject has come up, I am curious: is there a long-term strategy to build funding for full-time Julia developers? Do we just hope that big finance companies become reliant on Julia, and that Julia Computing will hire full-time devs to work on important packages with new money from consulting fees?
Would companies who use Julia ever contribute to a larger pot for Julia development, like you see with the Linux kernel? Or is Julia too small a market to get that kind of support? I mean, if the pandas developers never got to work full time, is there any hope of Julia developers doing so?