I will be teaching a couple of workshops on fitting mixed-effects models with Julia early in 2017. The audience will know some R but probably not know any Julia. Some may know about dplyr and Hadley Wickham’s approach as described at http://tidyverse.org but most will be scientists who just want to get their analysis done quickly and efficiently.
The biggest problem I am facing in preparing this course is what to say about reading the data and performing elementary data manipulation. I am writing this comment partly out of frustration with the status quo so I may come off as more negative than I should if I want to be helpful. I know that many contributors have done a lot of wonderful work and I do appreciate it. Julia is a fantastic language and system. I feel privileged to benefit from the hard work of so many. I would like to convince others to use it and that is the basis of my comments. I don’t want things that are easy to do in R or Python to seem so unwieldy in Julia that users decide not to embark on learning Julia.
If I start with a CSV file and have a few manipulations to perform I can call readtable with makefactors=true and describe to the students how to work with DataArrays and PooledDataArrays. However, it is likely that this approach will be deprecated by the time I get around to the second workshop in April, 2017.
As I understand it the preferred approach will be to use the CSV, NullableArrays and CategoricalArrays packages to do these tasks. At present, however, the process of doing so is roundabout and difficult to explain. I fear that the inability to do simple data manipulation tasks without jumping through a lot of hoops is going to turn everyone off and they will leave only learning that Julia is too complicated for them to use.
Let’s start with a simple case: I have a CSV file that in R would be read as columns of integers, floating point values and factors whose levels are strings.
If I use CSV.read I get
julia> using CSV, CategoricalArrays, DataFrames
julia> behavior = CSV.read("behavioral_task_data.csv", header=1);
julia> size(behavior)
(14994,12)
julia> for (n,v) in eachcol(behavior)
           println(rpad(n, 24), typeof(v))
       end
Trial NullableArrays.NullableArray{Int64,1}
GoNoGo_Group NullableArrays.NullableArray{WeakRefString{UInt8},1}
HandDecision NullableArrays.NullableArray{WeakRefString{UInt8},1}
GoNoGoDecision NullableArrays.NullableArray{WeakRefString{UInt8},1}
Subject NullableArrays.NullableArray{Int64,1}
Accuracy NullableArrays.NullableArray{Int64,1}
Gender NullableArrays.NullableArray{WeakRefString{UInt8},1}
InitialSound NullableArrays.NullableArray{WeakRefString{UInt8},1}
SUBTLEX_LogFrequency NullableArrays.NullableArray{Float64,1}
Syllables NullableArrays.NullableArray{Int64,1}
Item NullableArrays.NullableArray{WeakRefString{UInt8},1}
GoNoGo NullableArrays.NullableArray{WeakRefString{UInt8},1}
(By the way, I seem to remember there being a function to do what I did in that loop but I can’t find it now. Can someone refresh my memory?)
So now I get to explain about NullableArrays.NullableArray and WeakRefString, or I could just tell everyone not to pay attention to these names. I don’t know a priori if there are any missing data values in the CSV file. I can check using anynull or I can just try CSV.read with the additional argument nullable=false and see if it throws an error.
julia> behavior = CSV.read("behavioral_task_data.csv", header=1, nullable=false);
julia> for (n,v) in eachcol(behavior)
           println(rpad(n, 24), typeof(v))
       end
Trial Array{Int64,1}
GoNoGo_Group Array{String,1}
HandDecision Array{String,1}
GoNoGoDecision Array{String,1}
Subject Array{Int64,1}
Accuracy Array{Int64,1}
Gender Array{String,1}
InitialSound Array{String,1}
SUBTLEX_LogFrequency Array{Float64,1}
Syllables Array{Int64,1}
Item Array{String,1}
GoNoGo Array{String,1}
Okay, we are good to go in this case but I want CategoricalVectors, not Vector{String}s, and there is no makefactors argument to CSV.read. I could try to create such a column but I get an error
julia> behavior[:GoNoGof] = categorical(behavior[:GoNoGo_Group])
ERROR: MethodError: no method matching upgrade_vector(::CategoricalArrays.CategoricalArray{String,1,UInt32})
Closest candidates are:
upgrade_vector(::BitArray{1}) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:349
upgrade_vector(::Array{T,1}) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:347
upgrade_vector(::Range{T}) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:348
...
in setindex!(::DataFrames.DataFrame, ::CategoricalArrays.CategoricalArray{String,1,UInt32}, ::Symbol) at /home/bates/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:364
The only way I know how to do this is to go directly to the columns member of the DataFrame but I don’t want to teach that because
a) It is not a good practice to go around manipulating the contents of a member of an instance of a type
b) It is going to be very confusing if I want to use names, not positions of columns
I could teach
julia> cols = behavior.columns;
julia> for i in eachindex(cols)
           icol = cols[i]
           if eltype(icol) == String
               cols[i] = categorical(icol)
           end
       end
julia> for (n, v) in eachcol(behavior)
           println(rpad(n, 25), typeof(v))
       end
Trial Array{Int64,1}
GoNoGo_Group CategoricalArrays.CategoricalArray{String,1,UInt32}
HandDecision CategoricalArrays.CategoricalArray{String,1,UInt32}
GoNoGoDecision CategoricalArrays.CategoricalArray{String,1,UInt32}
Subject Array{Int64,1}
Accuracy Array{Int64,1}
Gender CategoricalArrays.CategoricalArray{String,1,UInt32}
InitialSound CategoricalArrays.CategoricalArray{String,1,UInt32}
SUBTLEX_LogFrequency Array{Float64,1}
Syllables Array{Int64,1}
Item CategoricalArrays.CategoricalArray{String,1,UInt32}
GoNoGo CategoricalArrays.CategoricalArray{String,1,UInt32}
but I suspect that most students will start looking at their smartphones at that point, having decided that Julia is just too difficult to work with.
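To make concrete what a by-name version of that loop might look like, here is a minimal sketch that works on an ordinary dictionary of columns rather than on a DataFrame. It assumes the CategoricalArrays package is installed. The dictionary `cols` and its toy contents are hypothetical stand-ins, not the behavioral data above, and the sketch does not address the DataFrame `setindex!` error.

```julia
using CategoricalArrays  # provides categorical() and levels()

# Hypothetical toy columns keyed by name (not the real data set)
cols = Dict{Symbol,AbstractVector}(
    :Subject => [1, 1, 2, 2],
    :Gender  => ["F", "F", "M", "M"],
    :GoNoGo  => ["go", "nogo", "go", "nogo"]
)

# Replace every string column with a categorical one, addressing columns by name.
# collect(keys(cols)) takes a snapshot of the names before the values are modified.
for name in collect(keys(cols))
    col = cols[name]
    if eltype(col) == String
        cols[name] = categorical(col)
    end
end
```

After the loop, `levels(cols[:Gender])` lists the distinct values of that column, and `cols[:Subject]` is left as a plain integer vector.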
The other example I am working on is even worse because there are missing data values in the .csv file and the analysis in R used Z-scores of some of the covariates. What we get is
julia> perception = CSV.read("perception_study_data_NA.csv", header = 1, null = "NA");
julia> for (n, v) in eachcol(perception)
           println(rpad(n, 30), typeof(v))
       end
SYLL_NUM NullableArrays.NullableArray{Int64,1}
SENTENCE NullableArrays.NullableArray{WeakRefString{UInt8},1}
LABEL NullableArrays.NullableArray{WeakRefString{UInt8},1}
FUNCTION NullableArrays.NullableArray{WeakRefString{UInt8},1}
EXPERIMENT NullableArrays.NullableArray{WeakRefString{UInt8},1}
RIGHTEDGE NullableArrays.NullableArray{Int64,1}
PRIMARY NullableArrays.NullableArray{Int64,1}
PRIMARY_STRING NullableArrays.NullableArray{WeakRefString{UInt8},1}
SYLL_MAXF0 NullableArrays.NullableArray{Float64,1}
SYLL_MAXF0.in.Semitones NullableArrays.NullableArray{Float64,1}
SYLL_MINF0 NullableArrays.NullableArray{Float64,1}
SYLL_MINF0.in.Semitones NullableArrays.NullableArray{Float64,1}
SYLL_EXCUR_SIZE NullableArrays.NullableArray{Float64,1}
SYLL_MEANF0 NullableArrays.NullableArray{Float64,1}
SYLL_MEANF0_ST NullableArrays.NullableArray{Float64,1}
SYLL_MEAN_INT NullableArrays.NullableArray{Float64,1}
SYLL_DUR NullableArrays.NullableArray{Float64,1}
SYLL_DUR_SECS NullableArrays.NullableArray{Float64,1}
SYLL_F0_OVER_MEAN_SENT_F0 NullableArrays.NullableArray{Float64,1}
SYL_DUR_OVER_SENT_DUR NullableArrays.NullableArray{Float64,1}
SYLL_INT_OVER_SENT_MEAN_INT NullableArrays.NullableArray{Float64,1}
SUBJECT NullableArrays.NullableArray{WeakRefString{UInt8},1}
USER_RESP NullableArrays.NullableArray{Int64,1}
ITEM NullableArrays.NullableArray{Int64,1}
ONEBACK NullableArrays.NullableArray{WeakRefString{UInt8},1}
julia> zscore(perception[:SYLL_MAXF0])
ERROR: MethodError: no method matching zscore(::NullableArrays.NullableArray{Float64,1})
Closest candidates are:
zscore{T<:Real}(::AbstractArray{T<:Real,N}, ::Int64) at /home/bates/.julia/v0.5/StatsBase/src/scalarstats.jl:396
zscore{T<:Real}(::AbstractArray{T<:Real,N}, ::Real, ::Real) at /home/bates/.julia/v0.5/StatsBase/src/scalarstats.jl:385
zscore{T<:Real,U<:Real,S<:Real}(::AbstractArray{T<:Real,N}, ::AbstractArray{U<:Real,N}, ::AbstractArray{S<:Real,N}) at /home/bates/.julia/v0.5/StatsBase/src/scalarstats.jl:390
...
I know that I can evaluate the Z-scores by first converting the column to an Array
julia> perception[:SYLL_MAXF0_Z] = zscore(Array(perception[:SYLL_MAXF0]))
which will then conveniently be converted to a DataArray, even though there are, directly as a result of the way it was created, no missing values.
julia> typeof(perception[:SYLL_MAXF0_Z])
DataArrays.DataArray{Float64,1}
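For comparison, the computation itself is small when written out by hand. The sketch below is illustrative only and rests on assumptions that do not hold for the Julia 0.5 setup shown above: it assumes a later Julia release in which `missing`, `skipmissing` and the Statistics standard library are available. The function name `zscore_skipmissing` and the vector `x` are hypothetical, with `x` standing in for a column such as SYLL_MAXF0.

```julia
using Statistics  # standard library: mean, std

# Hypothetical stand-in for a numeric column containing a missing value
x = [1.0, 2.0, missing, 4.0, 5.0]

# Z-scores whose mean and standard deviation are estimated from the
# observed values only; missing entries stay missing in the result.
function zscore_skipmissing(v)
    obs = collect(skipmissing(v))      # observed values only
    m, s = mean(obs), std(obs)
    return [(xi - m) / s for xi in v]  # arithmetic on missing yields missing
end

z = zscore_skipmissing(x)
```

The result has the same length as the input, so it could be stored as a new column alongside the original one.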
It may be possible to use one of the DataFramesMeta, Query or StructuredQueries packages to phrase this as a transform but the only one of these I have ever been able to use successfully is DataFramesMeta, and even that is going to be kind of complicated to teach.
I know we are in the middle of a transition but we have been for a long time. I believe that the first SoC project on Nullables, etc. was in 2015 and the grant from the Moore Foundation to enhance statistical computing capabilities was about a year ago.
It is good to have a long term vision but I think we have a “best is the enemy of the good” problem here. We can’t describe to potential users how to go about some pretty basic data manipulation tasks because we are still thinking about the optimal “Brave New World” kind of structure.
I think it would be good, in addition to formulating grand plans, to also do some case studies and see how convenient it is to use Julia for practical data input and data manipulation as compared to R or Python. I am well aware of the difficulties of trying to direct open-source development - I am frequently guilty of “If you want that capability why don’t you write your own damn software?” responses. However, my understanding is that some direction of statistical computing capabilities was part of the purpose of the Moore Foundation grant and it seems to me that some kind of overview of how capabilities mesh together would fall under that purpose.