Julia: DataFramesMeta Transformation

This question is already posted at StackOverflow on this link: DataFramesMeta Transformation. @ChrisRackauckas introduced me to this forum and so I’m hoping someone might be able to help me about my problem. I’m trying to reproduce the following R codes in Julia

library(dplyr)

women_new <- rbind(women, c(NA, 1), c(NA, NA))
women_new %>% 
  filter(height %>% complete.cases) %>%
  mutate(sector = character(n()),
         sector = replace(sector, height >= 0 & height <= 60, "1"),
         sector = replace(sector, height >= 61 & height <= 67, "2"), 
         sector = replace(sector, height >= 68 & height <= 72, "3"))

My attempts in Julia are the following:

using DataFrames
using DataFramesMeta
using Lazy
using RDatasets

women = @> begin
  "datasets" 
  dataset("women")
  DataArray()
  vcat([[NA NA]; [NA NA]])
end

women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);
women_new[16, 2] = 1;

My first question here is that, is there a way to input 1 immediately on vcat([[NA 1]; [NA NA]]) just like in R? It returns the following error if I do so:

MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
 in macro expansion at multidimensional.jl:431 [inlined]
 in macro expansion at cartesian.jl:64 [inlined]
 in macro expansion at multidimensional.jl:429 [inlined]
 in _unsafe_batchsetindex!(::Array{Int64,2}, ::Base.Repeated{DataArrays.NAtype}, ::UnitRange{Int64}, ::UnitRange{Int64}) at multidimensional.jl:421
 in setindex!(::Array{Int64,2}, ::DataArrays.NAtype, ::UnitRange{Int64}, ::UnitRange{Int64}) at abstractarray.jl:832
 in cat_t(::Int64, ::Type{T}, ::DataArrays.NAtype, ::Vararg{Any,N}) at abstractarray.jl:1098
 in hcat(::DataArrays.NAtype, ::Int64) at abstractarray.jl:1180
 in include_string(::String, ::String) at loading.jl:441
 in include_string(::String, ::String, ::Int64) at eval.jl:30
 in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
 in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
 in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
 in withpath(::Function, ::String) at eval.jl:38
 in macro expansion at eval.jl:49 [inlined]
 in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60

My second question is, is there a way to convert DataArray to DataFrame? In this case the column names become X1, X2, ... or any default name in DataFrame since DataArray does not have column names. I think it is neater than typing the following:

women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);

I wish I could simply do convert(DataFrame, women) and simply rename the column names. But that conversion does not work. And the following is my attempt on transformation or mutation in the case of R.

@> begin
  women_new
  @where !isna(:Height)
  @transform(
    Sector = NA,
    Sector = ifelse(:Height .>=  0 & :Height .<= 60, 1,
             ifelse(:Height .>= 61 & :Height .<= 67, 2,
             ifelse(:Height .>= 68 & :Height .<= 72, 3, NA)))
    )
end

But this will return:

15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Sector│
├─────┼────────┼────────┼───────┤
│ 1   │ 58     │ 115    │ 1     │
│ 2   │ 59     │ 117    │ 1     │
│ 3   │ 60     │ 120    │ 1     │
│ 4   │ 61     │ 123    │ 1     │
│ 5   │ 62     │ 126    │ 1     │
│ 6   │ 63     │ 129    │ 1     │
│ 7   │ 64     │ 132    │ 1     │
│ 8   │ 65     │ 135    │ 1     │
│ 9   │ 66     │ 139    │ 1     │
│ 10  │ 67     │ 142    │ 1     │
│ 11  │ 68     │ 146    │ 1     │
│ 12  │ 69     │ 150    │ 1     │
│ 13  │ 70     │ 154    │ 1     │
│ 14  │ 71     │ 159    │ 1     │
│ 15  │ 72     │ 164    │ 1     │

which is not equivalent to R, I also tried the following:

@> begin
  women_new
  @where !isna(:Height)
  @transform(
    Sector = NA,
    Sector = :Height .>=  0 & :Height .<= 60 ? 1 :
             :Height .>= 61 & :Height .<= 67 ? 2 :
             :Height .>= 68 & :Height .<= 72 ? 3 :
            NA;
    )
end

But returns the following error:

TypeError: non-boolean (DataArrays.DataArray{Bool,1}) used in boolean context
 in (::###469#303)(::DataArrays.DataArray{Int64,1}) at DataFramesMeta.jl:55
 in (::##298#302)(::DataFrames.DataFrame) at DataFramesMeta.jl:295
 in #transform#38(::Array{Any,1}, ::Function, ::DataFrames.DataFrame) at DataFramesMeta.jl:270
 in (::DataFramesMeta.#kw##transform)(::Array{Any,1}, ::DataFramesMeta.#transform, ::DataFrames.DataFrame) at <missing>:0
 in include_string(::String, ::String) at loading.jl:441
 in include_string(::String, ::String, ::Int64) at eval.jl:30
 in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
 in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
 in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
 in withpath(::Function, ::String) at eval.jl:38
 in macro expansion at eval.jl:49 [inlined]
 in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60

I do appreciate if you can help me figure this out. Finally, my last question is that, is there a way to shorten my code like that in R but still elegant?

I’m copying my answer at SO:

I got it. There is an effect on operator precedence, I thought parentheses are not needed.

    using DataFrames
    using DataFramesMeta
    using Lazy
    using RDatasets
    
    women = dataset("datasets", "women");
    women_new = vcat(
                  women,
                  DataFrame(Height = [NA; NA], Weight = @data [1; NA])
                )
    
    @> begin
      women_new
      @where !isna(:Height)
      @transform(
        Class = NA,
        Class = ifelse((:Height .>=  0) & (:Height .<= 60), 1,
                ifelse((:Height .>= 61) & (:Height .<= 67), 2,
                ifelse((:Height .>= 68) & (:Height .<= 72), 3, NA)))
                )
    end

Update: The above code can be further simplified into:

@> begin
  women_new
  @where !isna(:Height)
  @transform(
    Class = @> begin
      function (x)
         0 <= x <= 60 ?  1 :
        61 <= x <= 67 ?  2 :
        68 <= x <= 72 ?  3 :
        NA
      end
      map(:Height)
    end
  )
end

The output is now correct:

15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1   │ 58     │ 115    │ 1     │
│ 2   │ 59     │ 117    │ 1     │
│ 3   │ 60     │ 120    │ 1     │
│ 4   │ 61     │ 123    │ 2     │
│ 5   │ 62     │ 126    │ 2     │
│ 6   │ 63     │ 129    │ 2     │
│ 7   │ 64     │ 132    │ 2     │
│ 8   │ 65     │ 135    │ 2     │
│ 9   │ 66     │ 139    │ 2     │
│ 10  │ 67     │ 142    │ 2     │
│ 11  │ 68     │ 146    │ 3     │
│ 12  │ 69     │ 150    │ 3     │
│ 13  │ 70     │ 154    │ 3     │
│ 14  │ 71     │ 159    │ 3     │
│ 15  │ 72     │ 164    │ 3     │

If we don’t want to filter NAs and work with the complete data, then the best I can is the following:

@> begin
  women_new
  @transform(
    Height_New = NA,
    Height_New = ifelse(isna(:Height), -1, :Height))
  @transform(
    Class = NA,
    Class = ifelse(:Height_New == -1, NA,
              ifelse((:Height_New .>=  0) & (:Height_New .<= 60), 1,
              ifelse((:Height_New .>= 61) & (:Height_New .<= 67), 2,
              ifelse((:Height_New .>= 68) & (:Height_New .<= 72), 3, NA))))
  )
  delete!(:Height_New)
end

Update: The above code can be simplified to:

@> begin
  women_new
  @transform(
    Class = @> begin
      function (x)
        isna(x)       ? NA :
         0 <= x <= 60 ?  1 :
        61 <= x <= 67 ?  2 :
        68 <= x <= 72 ?  3 :
        NA
      end
      map(:Height)
    end
  )
end

The output:

17×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1   │ 58     │ 115    │ 1     │
│ 2   │ 59     │ 117    │ 1     │
│ 3   │ 60     │ 120    │ 1     │
│ 4   │ 61     │ 123    │ 2     │
│ 5   │ 62     │ 126    │ 2     │
│ 6   │ 63     │ 129    │ 2     │
│ 7   │ 64     │ 132    │ 2     │
│ 8   │ 65     │ 135    │ 2     │
│ 9   │ 66     │ 139    │ 2     │
│ 10  │ 67     │ 142    │ 2     │
│ 11  │ 68     │ 146    │ 3     │
│ 12  │ 69     │ 150    │ 3     │
│ 13  │ 70     │ 154    │ 3     │
│ 14  │ 71     │ 159    │ 3     │
│ 15  │ 72     │ 164    │ 3     │
│ 16  │ NA     │ 1      │ NA    │
│ 17  │ NA     │ NA     │ NA    │

In this case, the code becomes messy because there is no method yet for handling NAs in ifelse’s first argument.

1 Like

You can also use Query.jl for this:

r = @from i in women_new begin
    @where !any(isnull,i)
    @select {i.height, i.weight, class = 0 <= i.height <=60 ? DataValue(1) : 60 < i.height <= 67 ? DataValue(1) : 67 < i.height <= 72 ? DataValue(2) : DataValue{Int64}()}
    @collect DataFrame
end
```
5 Likes

That is cool, I made a slight modification on your code.

@from i in women_new begin
    @where !any(isnull,i)
    @select {
      i.Height, i.Weight, 
      class = 0 <= i.Height <= 60 ?  1 :
             61 <= i.Height <= 67 ?  2 :
             68 <= i.Height <= 72 ?  3 :
              0
    }
    @collect DataFrame
end

The DataValue seems to be not working on my Julia, if I retain it I keep getting the following error:

UndefVarError: DataValue not defined
 in (::##54#56)(::NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}) at <missing>:0
 in (::FunctionWrappers.CallWrapper{NamedTuples._NT_HeightWeightclass})(::##54#56, ::NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}) at <missing>:0
 in macro expansion at FunctionWrappers.jl:72 [inlined]
 in do_ccall at FunctionWrappers.jl:63 [inlined]
 in FunctionWrapper at FunctionWrappers.jl:78 [inlined]
 in next(::Query.EnumerableSelect{NamedTuples._NT_HeightWeightclass,Query.EnumerableWhere{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Query.EnumerableDF{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Tuple{DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}}},FunctionWrappers.FunctionWrapper{Bool,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}},FunctionWrappers.FunctionWrapper{NamedTuples._NT_HeightWeightclass,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}}, ::Query.EnumerableWhereState{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Int64}) at enumerable_select.jl:31
 in macro expansion at sink_dataframe.jl:16 [inlined]
 in _filldf(::Tuple{Array{Any,1},Array{Any,1},Array{Any,1}}, ::Query.EnumerableSelect{NamedTuples._NT_HeightWeightclass,Query.EnumerableWhere{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Query.EnumerableDF{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Tuple{DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}}},FunctionWrappers.FunctionWrapper{Bool,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}},FunctionWrappers.FunctionWrapper{NamedTuples._NT_HeightWeightclass,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}}) at sink_dataframe.jl:4
 in collect(::Query.EnumerableSelect{NamedTuples._NT_HeightWeightclass,Query.EnumerableWhere{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Query.EnumerableDF{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Tuple{DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}}},FunctionWrappers.FunctionWrapper{Bool,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}},FunctionWrappers.FunctionWrapper{NamedTuples._NT_HeightWeightclass,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}}, ::Type{DataFrames.DataFrame}) at sink_dataframe.jl:34
 in include_string(::String, ::String) at loading.jl:441
 in include_string(::String, ::String, ::Int64) at eval.jl:30
 in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
 in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
 in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
 in withpath(::Function, ::String) at eval.jl:38
 in macro expansion at eval.jl:49 [inlined]
 in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60

I added the DataValue type in Query 0.4.0, I think you are on an older version.

But given that you filter the rows with missing values first, your code is better in any case.

DataValue is used for missing values in Query. It is almost identical to Nullable, with a few differences that make it easier to use.

Oh, one more thing: my code filters rows with a missing value in any column out. If you know which column might have a missing value you can use isnull(i.column_name) instead, which should be faster.