Teaching data analysis with Julia - what to do about DataFrames and all that?

dmbates · November 18, 2016, 9:42pm

I will be teaching a couple of workshops on fitting mixed-effects models with Julia early in 2017. The audience will know some R but probably will not know any Julia. Some may know about dplyr and Hadley Wickham’s approach as described at http://tidyverse.org but most will be scientists who just want to get their analysis done quickly and efficiently.

The biggest problem I am facing in preparing this course is what to say about reading the data and performing elementary data manipulation. I am writing this comment partly out of frustration with the status quo so I may come off as more negative than I should if I want to be helpful. I know that many contributors have done a lot of wonderful work and I do appreciate it. Julia is a fantastic language and system. I feel privileged to benefit from the hard work of so many. I would like to convince others to use it and that is the basis of my comments. I don’t want things that are easy to do in R or Python to seem so unwieldy in Julia that users decide not to embark on learning Julia.

If I start with a CSV file and have a few manipulations to perform I can call readtable with makefactors=true and describe to the students how to work with DataArrays and PooledDataArrays. However, it is likely that this approach will be deprecated by the time I get around to the second workshop in April, 2017.

As I understand it, the preferred approach will be to use the CSV, NullableArrays and CategoricalArrays packages to do these tasks. At present, however, the process of doing so is roundabout and difficult to explain. I fear that the inability to do simple data manipulation tasks without jumping through a lot of hoops is going to turn everyone off, and they will leave having learned only that Julia is too complicated for them to use.

Let’s start with a simple case: I have a CSV file that in R would be read as columns of integers, floating point values and factors whose levels are strings.

If I use CSV.read I get

julia> using CSV, CategoricalArrays, DataFrames

julia> behavior = CSV.read("behavioral_task_data.csv", header=1);

julia> size(behavior)
(14994,12)

julia> for (n,v) in eachcol(behavior)
           println(rpad(n, 24), typeof(v))
       end
Trial                   NullableArrays.NullableArray{Int64,1}
GoNoGo_Group            NullableArrays.NullableArray{WeakRefString{UInt8},1}
HandDecision            NullableArrays.NullableArray{WeakRefString{UInt8},1}
GoNoGoDecision          NullableArrays.NullableArray{WeakRefString{UInt8},1}
Subject                 NullableArrays.NullableArray{Int64,1}
Accuracy                NullableArrays.NullableArray{Int64,1}
Gender                  NullableArrays.NullableArray{WeakRefString{UInt8},1}
InitialSound            NullableArrays.NullableArray{WeakRefString{UInt8},1}
SUBTLEX_LogFrequency    NullableArrays.NullableArray{Float64,1}
Syllables               NullableArrays.NullableArray{Int64,1}
Item                    NullableArrays.NullableArray{WeakRefString{UInt8},1}
GoNoGo                  NullableArrays.NullableArray{WeakRefString{UInt8},1}

(By the way, I seem to remember there being a function to do what I did in that loop but I can’t find it now. Can someone refresh my memory?)
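
As a minimal sketch of what such a summary function could look like, here is a Base-only version that reports each column's element type and whether it contains missing values. It assumes a Julia release of 1.0 or later (where `missing` is part of Base), not the 0.5 release used in this thread; the NamedTuple `columns` is a hypothetical stand-in for a data frame, and `column_summary` is a made-up helper, not a function from DataFrames or CSV.

```julia
# Hypothetical stand-in for a data frame: a NamedTuple of equal-length columns.
columns = (
    Trial = [1, 2, 3],
    Gender = ["F", missing, "M"],
    LogFrequency = [2.1, 3.4, 1.7],
)

# Hypothetical helper: one record per column with its name, element type,
# and a flag saying whether any entry is missing.
function column_summary(cols::NamedTuple)
    return [(name = n, elemtype = eltype(v), hasmissing = any(ismissing, v))
            for (n, v) in pairs(cols)]
end

# Print the summary in aligned columns, in the style of the loop above.
for row in column_summary(columns)
    println(rpad(string(row.name), 16), rpad(string(row.elemtype), 28), row.hasmissing)
end
```

The same check, `any(ismissing, v)`, answers the "are there any missing values?" question for a single column without having to re-read the file.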

So now I get to explain about NullableArrays.NullableArray and WeakRefString or I could just tell everyone not to pay attention to these names. I don’t know a priori if there are any missing data values in the CSV file. I can check using anynull or I can just try CSV.read with the additional argument nullable=false and see if it throws an error.

julia> behavior = CSV.read("behavioral_task_data.csv", header=1, nullable=false);

julia> for (n,v) in eachcol(behavior)
           println(rpad(n, 24), typeof(v))
       end
Trial                   Array{Int64,1}
GoNoGo_Group            Array{String,1}
HandDecision            Array{String,1}
GoNoGoDecision          Array{String,1}
Subject                 Array{Int64,1}
Accuracy                Array{Int64,1}
Gender                  Array{String,1}
InitialSound            Array{String,1}
SUBTLEX_LogFrequency    Array{Float64,1}
Syllables               Array{Int64,1}
Item                    Array{String,1}
GoNoGo                  Array{String,1}

Okay, we are good to go in this case but I want CategoricalVectors, not Vector{String}s and there is no makefactors argument to CSV.read. I could try to create such a column but I get an error

julia> behavior[:GoNoGof] = categorical(behavior[:GoNoGo_Group])
ERROR: MethodError: no method matching upgrade_vector(::CategoricalArrays.CategoricalArray{String,1,UInt32})
Closest candidates are:
  upgrade_vector(::BitArray{1}) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:349
  upgrade_vector(::Array{T,1}) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:347
  upgrade_vector(::Range{T}) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:348
  ...
 in setindex!(::DataFrames.DataFrame, ::CategoricalArrays.CategoricalArray{String,1,UInt32}, ::Symbol) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:364

The only way I know how to do this is to go directly to the columns member of the DataFrame but I don’t want to teach that because
a) It is not a good practice to go around manipulating the contents of a member of an instance of a type
b) It is going to be very confusing if I want to use names, not positions of columns

I could teach

julia> cols = behavior.columns;

julia> for i in eachindex(cols)
           icol = cols[i]
           if eltype(icol) == String
               cols[i] = categorical(icol)
           end
       end

julia> for (n, v) in eachcol(behavior)
           println(rpad(n, 25), typeof(v))
       end
Trial                    Array{Int64,1}
GoNoGo_Group             CategoricalArrays.CategoricalArray{String,1,UInt32}
HandDecision             CategoricalArrays.CategoricalArray{String,1,UInt32}
GoNoGoDecision           CategoricalArrays.CategoricalArray{String,1,UInt32}
Subject                  Array{Int64,1}
Accuracy                 Array{Int64,1}
Gender                   CategoricalArrays.CategoricalArray{String,1,UInt32}
InitialSound             CategoricalArrays.CategoricalArray{String,1,UInt32}
SUBTLEX_LogFrequency     Array{Float64,1}
Syllables                Array{Int64,1}
Item                     CategoricalArrays.CategoricalArray{String,1,UInt32}
GoNoGo                   CategoricalArrays.CategoricalArray{String,1,UInt32}

but I suspect that most students will start looking at their smartphones at that point, having decided that Julia is just too difficult to work with.
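
For readers who have not met categorical vectors before: conceptually, such a vector stores a pool of distinct levels plus one small integer code per element (the UInt32 in the type names above is the type of those codes). The following is a minimal Base-only sketch of that idea, assuming Julia 1.0 or later; `encode_levels` is a hypothetical helper written for illustration and is not how CategoricalArrays is implemented or called.

```julia
# Hypothetical helper: encode a vector of strings as integer codes
# plus a sorted pool of the distinct levels.
function encode_levels(v::AbstractVector{<:AbstractString})
    levels = sort(unique(v))                                   # distinct values, sorted
    index = Dict(l => UInt32(i) for (i, l) in enumerate(levels))  # level => code
    codes = [index[x] for x in v]                              # one code per element
    return codes, levels
end

group = ["go", "nogo", "go", "go", "nogo"]
codes, levels = encode_levels(group)

# Decoding is just indexing the pool with the codes.
decoded = levels[codes]
```

Storing codes instead of repeated strings is what makes a factor cheap to group by and to turn into model-matrix columns, which is why a String column is not a drop-in replacement.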

The other example I am working on is even worse because there are missing data values in the .csv file and the analysis in R used Z-scores of some of the covariates. What we get is

julia> perception = CSV.read("perception_study_data_NA.csv", header = 1, null = "NA");

julia> for (n, v) in eachcol(perception)
           println(rpad(n, 30), typeof(v))
       end
SYLL_NUM                      NullableArrays.NullableArray{Int64,1}
SENTENCE                      NullableArrays.NullableArray{WeakRefString{UInt8},1}
LABEL                         NullableArrays.NullableArray{WeakRefString{UInt8},1}
FUNCTION                      NullableArrays.NullableArray{WeakRefString{UInt8},1}
EXPERIMENT                    NullableArrays.NullableArray{WeakRefString{UInt8},1}
RIGHTEDGE                     NullableArrays.NullableArray{Int64,1}
PRIMARY                       NullableArrays.NullableArray{Int64,1}
PRIMARY_STRING                NullableArrays.NullableArray{WeakRefString{UInt8},1}
SYLL_MAXF0                    NullableArrays.NullableArray{Float64,1}
SYLL_MAXF0.in.Semitones       NullableArrays.NullableArray{Float64,1}
SYLL_MINF0                    NullableArrays.NullableArray{Float64,1}
SYLL_MINF0.in.Semitones       NullableArrays.NullableArray{Float64,1}
SYLL_EXCUR_SIZE               NullableArrays.NullableArray{Float64,1}
SYLL_MEANF0                   NullableArrays.NullableArray{Float64,1}
SYLL_MEANF0_ST                NullableArrays.NullableArray{Float64,1}
SYLL_MEAN_INT                 NullableArrays.NullableArray{Float64,1}
SYLL_DUR                      NullableArrays.NullableArray{Float64,1}
SYLL_DUR_SECS                 NullableArrays.NullableArray{Float64,1}
SYLL_F0_OVER_MEAN_SENT_F0     NullableArrays.NullableArray{Float64,1}
SYL_DUR_OVER_SENT_DUR         NullableArrays.NullableArray{Float64,1}
SYLL_INT_OVER_SENT_MEAN_INT   NullableArrays.NullableArray{Float64,1}
SUBJECT                       NullableArrays.NullableArray{WeakRefString{UInt8},1}
USER_RESP                     NullableArrays.NullableArray{Int64,1}
ITEM                          NullableArrays.NullableArray{Int64,1}
ONEBACK                       NullableArrays.NullableArray{WeakRefString{UInt8},1}

julia> zscore(perception[:SYLL_MAXF0])
ERROR: MethodError: no method matching zscore(::NullableArrays.NullableArray{Float64,1})
Closest candidates are:
  zscore{T<:Real}(::AbstractArray{T<:Real,N}, ::Int64) at /home/bates/.julia/v0.5/StatsBase/src/scalarstats.jl:396
  zscore{T<:Real}(::AbstractArray{T<:Real,N}, ::Real, ::Real) at /home/bates/.julia/v0.5/StatsBase/src/scalarstats.jl:385
  zscore{T<:Real,U<:Real,S<:Real}(::AbstractArray{T<:Real,N}, ::AbstractArray{U<:Real,N}, ::AbstractArray{S<:Real,N}) at /home/bates/.julia/v0.5/StatsBase/src/scalarstats.jl:390
  ...

I know that I can evaluate the Z-scores by first converting the column to an Array

julia> perception[:SYLL_MAXF0_Z] = zscore(Array(perception[:SYLL_MAXF0]))

which will then conveniently be converted to a DataArray, even though there are, directly as a result of the way it was created, no missing values.

julia> typeof(perception[:SYLL_MAXF0_Z])
DataArrays.DataArray{Float64,1}
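
For comparison, here is a minimal sketch of a Z-score that tolerates missing values directly. It assumes a Julia release of 1.0 or later, where `missing` and `skipmissing` are in Base and `mean` and `std` live in the Statistics standard library; none of this is available in the 0.5 setup shown above. The name `zscore_skipmissing` is hypothetical and is not part of StatsBase.

```julia
using Statistics  # standard library: mean, std

# Hypothetical helper: standardize x using the mean and standard deviation
# of its non-missing entries; missing entries stay missing in the result.
function zscore_skipmissing(x::AbstractVector)
    vals = skipmissing(x)
    mu = mean(vals)
    sigma = std(vals)
    return (x .- mu) ./ sigma   # arithmetic with `missing` propagates `missing`
end

x = [1.0, missing, 3.0]
z = zscore_skipmissing(x)
```

The design choice here is to compute the location and scale from the observed values only and let the missing entries propagate, rather than converting to a plain Array first, which fails as soon as a value is actually missing.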

It may be possible to use one of the DataFramesMeta, Query or StructuredQueries packages to phrase this as a transform, but the only one of these I have ever been able to use successfully is DataFramesMeta, and even that is going to be kind of complicated to teach.

I know we are in the middle of a transition but we have been for a long time. I believe that the first SoC project on Nullables, etc. was in 2015 and the grant from the Moore Foundation to enhance statistical computing capabilities was about a year ago.

It is good to have a long term vision but I think we have a “best is the enemy of the good” problem here. We can’t describe to potential users how to go about some pretty basic data manipulation tasks because we are still thinking about the optimal “Brave New World” kind of structure.

I think it would be good, in addition to formulating grand plans, to also do some case studies and see how convenient it is to use Julia for practical data input and data manipulation as compared to R or Python. I am well aware of the difficulties of trying to direct open-source development - I am frequently guilty of “If you want that capability why don’t you write your own damn software?” responses. However, my understanding is that some direction of statistical computing capabilities was part of the purpose of the Moore Foundation grant and it seems to me that some kind of overview of how capabilities mesh together would fall under that purpose.

tshort · November 18, 2016, 10:06pm

You can use dump(behavior) to get output like that first loop.

I’m also struggling with what to do about the ongoing transition or how to help, so I’m in wait-n-see mode.

nalimilan · November 18, 2016, 10:13pm

At this point I think you’d better stick to DataArray and readtable if you want to keep things simple. But I would use DataFramesMeta.jl and Query.jl (StructuredQueries.jl isn’t ready yet), since these high-level APIs are less likely to change than the indexing approach after the port to Nullable.

The next few months are really going to be frustrating, with all the new framework not being completely usable yet despite our will to use it. You’ve read the announcement, right?

akis · November 18, 2016, 10:52pm

Frustration is understandable (as long as it remains civilized, not directed against specific persons, and open to return). I trust that people more experienced than myself in data analysis with Julia can provide various ways to ease the pain. However, considering the suggestion at the end of the OP, I’d like to take the opportunity to clear up that there is nothing wrong with the Julia language in this case (except maybe the Nullable concept, but even that is an optional feature of the language).

Other languages don’t do better on this front. The difference in ease of use is clearly a matter of the available packages/libraries and their own maturity. Of course all libraries depend on the language they get written in, but they are the ones to adapt to the language, not the other way around. Even more so considering that Julia is still in beta stage.

Therefore, this is not a good enough reason to change the language’s high ambitions, and we should not expect any change from “best” to “good” happening anytime before Julia 1.0. This is a trade-off decision which necessarily leaves many people sad, but that’s the nature of trade-offs. Same with the choice between current Julia and some other language. We have to work with what is available at the time.

ChrisRackauckas · November 18, 2016, 11:06pm

I totally agree. I teach Julia as “it’s still not v1.0, there are rough edges” because there are many package ecosystem changes like this. While the language is pretty stable now, “the language” isn’t what most people consider “Julia”, it’s instead the language + dataframes, plotting, basic ODE solvers, etc. which are still in flux. For this reason I still tend to teach Julia at more of a “here’s how to build things” level instead of a “here’s how to use things” because things will inevitably change.

And I am very happy they will. I wouldn’t want anyone stopping early and calling it a day. The reason why we are learning/using Julia is to have “close to C, maybe faster” tools. I think stopping short because the language got some adoption pre-1.0 isn’t a good idea: we should still be willing to break things to make them better.

dmbates · November 18, 2016, 11:17pm

@nalimilan Yes, I have read the announcement. I know the situation is expected to improve in the future but, unfortunately, the timing of these workshops was not completely in my control.

For this particular task I may fall back on showing how to use the RCall package’s facilities (which are fantastic, thanks to the work of Simon and Randy). That is, read the data and perform the data manipulation in R then import the data frame into Julia. That might be less embarrassing if I hadn’t spent the last four years saying that, as a language, R is kind of clunky compared to Julia. It is as @ChrisRackauckas said, what people experience is the combination of language and packages. Developing both at a high level takes time and patience.

dmbates · November 18, 2016, 11:22pm

I thought I remembered a function with a name like coltypes but it doesn’t exist now.

amellnik · November 18, 2016, 11:59pm

I’ll echo @nalimilan and recommend that you pin to DataFrames 0.8.3 with the old DataArrays API. The plan is for the new framework to be released in February 2017, and I think it will take a few months beyond that for things to really start running smoothly. Even then, it’s still unclear how painful it will be to use it.

swissr · November 19, 2016, 12:01am

Hello Akis, do you realise that Doug Bates is/was a core R developer (and before S)? He is very very experienced. See e.g. the JuliaCon 2015 “Adventures with Statistical Models and Sparse Matrices” (7:55). I think it’s a huge problem that there is no reliable data frame. There are good old articles from John Myles White “The state of Statistics in Julia” and “What’s wrong with Statistics in Julia”. Nullable is not optional (imho) but a fundamental question to be solved (in statistics there are NA values and you need to account for them). It is a very difficult problem though, there are several blog posts about possible approaches.

I think Julia cannot compete on statistics (if it is not about millions of data rows which take hours) as long as the data frame / missing data issues are not solved. And it likely would be better to do the course in R. - Having watched the mentioned video (and heard: that’s why Julia is awesome), I could speculatively imagine that not being able to do the course in Julia would be a tremendously frustrating thought.

(Hope I was not too unfriendly/short/‘wrong-assuming’, it’s already (too) late here… )

akis · November 19, 2016, 12:37am

Let me rephrase to remove the possible confusion (I also edited my post):

I trust that people more experienced than myself in data analysis with Julia can provide various ways to ease the pain.

As you said, there is much discussion around Nullable, so I won’t repeat it here. The bottom line is that the problem Nullable tries to solve is not optional, only the specific solution is optional. But Julia is much more than a language for Statistics, it’s truly a general-purpose language. And we all have problems from incomplete or unstable or even not-yet-existing packages. I really miss an operating system written in Julia to begin with. Can I have it by the end of the year for my next “killer app”?

I predict that one day Julia will dominate Statistics too, with or without Nullable or any secondary feature. Still it’s up to package builders to do most of the work to that effect, even by simply translating existing packages from other languages and sacrificing the potential for huge gains over them. I wish them the best, but I won’t hide my objection to any suggestion of limiting Julia within R’s world.

nalimilan · November 19, 2016, 11:09am

It’s called eltypes now.

Going back to your concern about “the best is the enemy of the good”, I think your plan is not so long-term: the new framework should be ready in 2-3 months. What’s the enemy of the good is that you (just like most of us – we’re greedy) are trying to mix the current DataArray-based DataFrames 0.8 with NullableArray and CategoricalArray, which is the recipe for getting lots of errors (as you showed).

In particular, CSV.jl creates NullableArray columns, which requires a lot of care to work with correctly at the moment. That’s why I think you’d better use readtable to show a more consistent and simpler framework for beginners, which is quite similar to R. After all, we’ve lived with that for several years, so it’s not that bad.

Of course, any additional manpower to help improve the DataFrames ecosystem would always be welcome. No idea what the plans of those who control the money are in that regard.

I’m sorry to say this a bit bluntly, but this sounds like a completely uninformed comment. We are currently improving Nullable support in Julia Base, and some of these improvements will appear in 0.6.

Anyway, please keep threads focused on the original poster’s questions. @dmbates didn’t ask for a general discussion on what we should expect from a language like Julia.

Tamas_Papp · November 19, 2016, 3:06pm

Honestly, currently (and in early 2017) you might be better off teaching this course using R if you don’t want to risk your audience getting distracted by the API being in transition.

I am convinced that this part of the library ecosystem is evolving towards something that will be very nice to use eventually, but at the moment, I am afraid that explaining how to work around the rough edges just takes time away from the substance of the course. Also, the actual API you teach them will not be the same in a few months’ time, so you might as well teach the principles in R, which they can adapt to Julia later on.

johnmyleswhite · November 19, 2016, 4:38pm

I want to second Tamas’ suggestion. I would avoid teaching Julia as a general purpose data analysis environment until Julia has matured more. I might use Julia to teach students about writing efficient code for computationally intensive tasks, but most students won’t be doing that until they already know basic R quite well.

dmbates · November 19, 2016, 5:23pm

I should have been clearer in my first sentence of the OP. By “teaching a couple of workshops on fitting mixed-effects models with Julia” I meant that the name of the workshop is “Fitting Mixed-effects Models with Julia”. Teaching such a workshop using R would be, well, unusual.

akis · November 19, 2016, 5:29pm

@dmbates spent a good part of his post questioning Julia’s current philosophy. My answer was right on that and I wouldn’t have reposted, if people didn’t mention it again and again.

For a second time in a single thread, people misread my words against the Julia community’s rule of “interacting on the basis of good faith”. Improving Nullable support is not a change from “best” to “good”, so my comment is accurate, enough for @ChrisRackauckas to write:

Anyway, thanks for the hospitality, I’m out of here.

dmbates · November 19, 2016, 5:34pm

Of course, as I mentioned, I can use the RCall package and do the data input and manipulation in R then do the model fitting in Julia. I’ll get started on a “If you like your tidyverse, you get to keep your tidyverse” slide.

Tamas_Papp · November 19, 2016, 6:41pm

In that case, I would follow this suggestion:

and possibly use datasets without missing values. I don’t know how much this interferes with your course plan, though; if data preparation is emphasized then you may be in a difficult position for the moment.

mkborregaard · November 21, 2016, 8:05am

Thanks for bringing this question up, and for the discussion that has ensued. I am currently holding off with transitioning my “Ecological data analysis with R” course to Julia, though I have transitioned almost completely myself. It is good to see this decision validated by the Stats developers.
Another thing that makes me happy is to see that these concerns are taken seriously. I love Julia as a data-analytical language and really hope that the brave new world will be as great for data as one could hope.

nalimilan · November 21, 2016, 9:16am

When the new framework is starting to get ready, it will be really useful to get feedback about what works and what’s missing for this kind of course. That’s a good way of detecting API holes.
