Julia: DataFramesMeta Transformation

alasaadstat · April 29, 2017, 5:27pm

This question was also posted on StackOverflow: DataFramesMeta Transformation. @ChrisRackauckas introduced me to this forum, so I’m hoping someone here can help me with my problem. I’m trying to reproduce the following R code in Julia:

library(dplyr)

women_new <- rbind(women, c(NA, 1), c(NA, NA))
women_new %>% 
  filter(height %>% complete.cases) %>%
  mutate(sector = character(n()),
         sector = replace(sector, height >= 0 & height <= 60, "1"),
         sector = replace(sector, height >= 61 & height <= 67, "2"), 
         sector = replace(sector, height >= 68 & height <= 72, "3"))

My attempts in Julia are the following:

using DataFrames
using DataFramesMeta
using Lazy
using RDatasets

women = @> begin
  "datasets" 
  dataset("women")
  DataArray()
  vcat([[NA NA]; [NA NA]])
end

women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);
women_new[16, 2] = 1;

My first question: is there a way to put the 1 directly into the vcat call, as in vcat([[NA 1]; [NA NA]]), just like in R? If I do so, it returns the following error:

MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
 in macro expansion at multidimensional.jl:431 [inlined]
 in macro expansion at cartesian.jl:64 [inlined]
 in macro expansion at multidimensional.jl:429 [inlined]
 in _unsafe_batchsetindex!(::Array{Int64,2}, ::Base.Repeated{DataArrays.NAtype}, ::UnitRange{Int64}, ::UnitRange{Int64}) at multidimensional.jl:421
 in setindex!(::Array{Int64,2}, ::DataArrays.NAtype, ::UnitRange{Int64}, ::UnitRange{Int64}) at abstractarray.jl:832
 in cat_t(::Int64, ::Type{T}, ::DataArrays.NAtype, ::Vararg{Any,N}) at abstractarray.jl:1098
 in hcat(::DataArrays.NAtype, ::Int64) at abstractarray.jl:1180
 in include_string(::String, ::String) at loading.jl:441
 in include_string(::String, ::String, ::Int64) at eval.jl:30
 in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
 in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
 in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
 in withpath(::Function, ::String) at eval.jl:38
 in macro expansion at eval.jl:49 [inlined]
 in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60

My second question: is there a way to convert a DataArray to a DataFrame? The column names would become X1, X2, ... or whatever the DataFrame default is, since a DataArray has no column names. I think that would be neater than typing the following:

women_new = DataFrame(Height = women[:, 1], Weight = women[:, 2]);

I wish I could simply call convert(DataFrame, women) and then rename the columns, but that conversion does not work. The following is my attempt at the transformation (mutate in R):

@> begin
  women_new
  @where !isna(:Height)
  @transform(
    Sector = NA,
    Sector = ifelse(:Height .>=  0 & :Height .<= 60, 1,
             ifelse(:Height .>= 61 & :Height .<= 67, 2,
             ifelse(:Height .>= 68 & :Height .<= 72, 3, NA)))
    )
end

But this will return:

15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Sector│
├─────┼────────┼────────┼───────┤
│ 1   │ 58     │ 115    │ 1     │
│ 2   │ 59     │ 117    │ 1     │
│ 3   │ 60     │ 120    │ 1     │
│ 4   │ 61     │ 123    │ 1     │
│ 5   │ 62     │ 126    │ 1     │
│ 6   │ 63     │ 129    │ 1     │
│ 7   │ 64     │ 132    │ 1     │
│ 8   │ 65     │ 135    │ 1     │
│ 9   │ 66     │ 139    │ 1     │
│ 10  │ 67     │ 142    │ 1     │
│ 11  │ 68     │ 146    │ 1     │
│ 12  │ 69     │ 150    │ 1     │
│ 13  │ 70     │ 154    │ 1     │
│ 14  │ 71     │ 159    │ 1     │
│ 15  │ 72     │ 164    │ 1     │

This is not equivalent to the R result. I also tried the following:

@> begin
  women_new
  @where !isna(:Height)
  @transform(
    Sector = NA,
    Sector = :Height .>=  0 & :Height .<= 60 ? 1 :
             :Height .>= 61 & :Height .<= 67 ? 2 :
             :Height .>= 68 & :Height .<= 72 ? 3 :
            NA;
    )
end

But that returns the following error:

TypeError: non-boolean (DataArrays.DataArray{Bool,1}) used in boolean context
 in (::###469#303)(::DataArrays.DataArray{Int64,1}) at DataFramesMeta.jl:55
 in (::##298#302)(::DataFrames.DataFrame) at DataFramesMeta.jl:295
 in #transform#38(::Array{Any,1}, ::Function, ::DataFrames.DataFrame) at DataFramesMeta.jl:270
 in (::DataFramesMeta.#kw##transform)(::Array{Any,1}, ::DataFramesMeta.#transform, ::DataFrames.DataFrame) at <missing>:0
 in include_string(::String, ::String) at loading.jl:441
 in include_string(::String, ::String, ::Int64) at eval.jl:30
 in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
 in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
 in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
 in withpath(::Function, ::String) at eval.jl:38
 in macro expansion at eval.jl:49 [inlined]
 in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60

I would appreciate any help figuring this out. Finally, my last question: is there a way to shorten my code, as in R, while keeping it elegant?

alasaadstat · April 29, 2017, 7:32pm

I’m copying my answer at SO:

I got it. The problem was operator precedence: & binds more tightly than the comparison operators, so each comparison needs its own parentheses. I thought they were not needed.
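A minimal sketch of the precedence issue in isolation, written for current Julia with a plain vector rather than the DataArrays used in this thread (so `.&` stands in for the array `&`, and the `heights` values are made up for illustration). Because `&` has the precedence of multiplication, the unparenthesized form groups as `heights .>= (0 .& heights) .<= 60`, a chained comparison that holds for every element, which is consistent with every row ending up in sector 1:

```julia
# Illustrative data only; not the full women dataset.
heights = [58, 61, 68]

# Groups as heights .>= (0 .& heights) .<= 60, because `.&` binds
# more tightly than the comparison operators.
unparenthesized = heights .>= 0 .& heights .<= 60

# Each comparison is evaluated first, then the results are combined.
parenthesized = (heights .>= 0) .& (heights .<= 60)

println(unparenthesized)  # every element is true
println(parenthesized)    # only the first element is true
```

With the parentheses in place, only heights in the 0 to 60 range satisfy the first condition.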

    using DataFrames
    using DataFramesMeta
    using Lazy
    using RDatasets
    
    women = dataset("datasets", "women");
    women_new = vcat(
                  women,
                  DataFrame(Height = [NA; NA], Weight = @data [1; NA])
                )
    
    @> begin
      women_new
      @where !isna(:Height)
      @transform(
        Class = NA,
        Class = ifelse((:Height .>=  0) & (:Height .<= 60), 1,
                ifelse((:Height .>= 61) & (:Height .<= 67), 2,
                ifelse((:Height .>= 68) & (:Height .<= 72), 3, NA)))
                )
    end

Update: The above code can be further simplified into:

@> begin
  women_new
  @where !isna(:Height)
  @transform(
    Class = @> begin
      function (x)
         0 <= x <= 60 ?  1 :
        61 <= x <= 67 ?  2 :
        68 <= x <= 72 ?  3 :
        NA
      end
      map(:Height)
    end
  )
end

The output is now correct:

15×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1   │ 58     │ 115    │ 1     │
│ 2   │ 59     │ 117    │ 1     │
│ 3   │ 60     │ 120    │ 1     │
│ 4   │ 61     │ 123    │ 2     │
│ 5   │ 62     │ 126    │ 2     │
│ 6   │ 63     │ 129    │ 2     │
│ 7   │ 64     │ 132    │ 2     │
│ 8   │ 65     │ 135    │ 2     │
│ 9   │ 66     │ 139    │ 2     │
│ 10  │ 67     │ 142    │ 2     │
│ 11  │ 68     │ 146    │ 3     │
│ 12  │ 69     │ 150    │ 3     │
│ 13  │ 70     │ 154    │ 3     │
│ 14  │ 71     │ 159    │ 3     │
│ 15  │ 72     │ 164    │ 3     │
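The map version works where the earlier ternary attempt failed because `?:` needs a single Bool as its condition, not an array of them. A small sketch in current Julia, assuming a made-up `heights` vector and a hypothetical helper named `sector` (with 0 as an arbitrary fallback instead of NA):

```julia
# Illustrative data only.
heights = [58, 64, 70]

# Hypothetical scalar helper: every condition here is a single Bool.
sector(h) =  0 <= h <= 60 ? 1 :
            61 <= h <= 67 ? 2 :
            68 <= h <= 72 ? 3 : 0

# Using an array of Bools as a ternary condition raises a TypeError.
err = try
    (heights .<= 60) ? 1 : 2
catch e
    e
end

# Applying the scalar helper element by element avoids the problem.
sectors = map(sector, heights)

println(typeof(err))
println(sectors)
```

The same helper could also be broadcast as `sector.(heights)`; `map` is used here only to mirror the code above.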

If we don’t want to filter NAs and work with the complete data, then the best I can do is the following:

@> begin
  women_new
  @transform(
    Height_New = NA,
    Height_New = ifelse(isna(:Height), -1, :Height))
  @transform(
    Class = NA,
    Class = ifelse(:Height_New == -1, NA,
              ifelse((:Height_New .>=  0) & (:Height_New .<= 60), 1,
              ifelse((:Height_New .>= 61) & (:Height_New .<= 67), 2,
              ifelse((:Height_New .>= 68) & (:Height_New .<= 72), 3, NA))))
  )
  delete!(:Height_New)
end

Update: The above code can be simplified to:

@> begin
  women_new
  @transform(
    Class = @> begin
      function (x)
        isna(x)       ? NA :
         0 <= x <= 60 ?  1 :
        61 <= x <= 67 ?  2 :
        68 <= x <= 72 ?  3 :
        NA
      end
      map(:Height)
    end
  )
end

The output:

17×3 DataFrames.DataFrame
│ Row │ Height │ Weight │ Class │
├─────┼────────┼────────┼───────┤
│ 1   │ 58     │ 115    │ 1     │
│ 2   │ 59     │ 117    │ 1     │
│ 3   │ 60     │ 120    │ 1     │
│ 4   │ 61     │ 123    │ 2     │
│ 5   │ 62     │ 126    │ 2     │
│ 6   │ 63     │ 129    │ 2     │
│ 7   │ 64     │ 132    │ 2     │
│ 8   │ 65     │ 135    │ 2     │
│ 9   │ 66     │ 139    │ 2     │
│ 10  │ 67     │ 142    │ 2     │
│ 11  │ 68     │ 146    │ 3     │
│ 12  │ 69     │ 150    │ 3     │
│ 13  │ 70     │ 154    │ 3     │
│ 14  │ 71     │ 159    │ 3     │
│ 15  │ 72     │ 164    │ 3     │
│ 16  │ NA     │ 1      │ NA    │
│ 17  │ NA     │ NA     │ NA    │

In this case, the ifelse version becomes messy because there is no method yet for handling NAs in ifelse’s first argument.
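For comparison, a rough sketch of the same idea in current Julia, where `missing` stands in for `NA`. This is not the DataArrays API used in this thread; `classify` and the `heights` values are hypothetical names and data for illustration. Checking for the missing value first inside a scalar function avoids the sentinel column entirely:

```julia
# Illustrative data only; `missing` stands in for NA.
heights = [58, 64, 70, missing]

# Hypothetical helper: handle the missing case before any comparison,
# so no comparison ever sees a missing value.
function classify(h)
    ismissing(h) && return missing
    0 <= h <= 60 && return 1
    61 <= h <= 67 && return 2
    68 <= h <= 72 && return 3
    return missing  # outside all ranges
end

classes = map(classify, heights)
println(classes)
```

Returning early on the missing case keeps the range checks simple, at the cost of writing a named function instead of a nested ifelse.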

davidanthoff · April 29, 2017, 11:22pm

You can also use Query.jl for this:

r = @from i in women_new begin
    @where !any(isnull,i)
    @select {
        i.Height, i.Weight,
        class = 0 <= i.Height <= 60 ? DataValue(1) :
                60 < i.Height <= 67 ? DataValue(2) :
                67 < i.Height <= 72 ? DataValue(3) :
                DataValue{Int64}()
    }
    @collect DataFrame
end

alasaadstat · April 30, 2017, 6:06pm

That is cool. I made a slight modification to your code:

@from i in women_new begin
    @where !any(isnull,i)
    @select {
      i.Height, i.Weight, 
      class = 0 <= i.Height <= 60 ?  1 :
             61 <= i.Height <= 67 ?  2 :
             68 <= i.Height <= 72 ?  3 :
              0
    }
    @collect DataFrame
end

DataValue does not seem to work on my Julia installation. If I keep it, I get the following error:

UndefVarError: DataValue not defined
 in (::##54#56)(::NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}) at <missing>:0
 in (::FunctionWrappers.CallWrapper{NamedTuples._NT_HeightWeightclass})(::##54#56, ::NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}) at <missing>:0
 in macro expansion at FunctionWrappers.jl:72 [inlined]
 in do_ccall at FunctionWrappers.jl:63 [inlined]
 in FunctionWrapper at FunctionWrappers.jl:78 [inlined]
 in next(::Query.EnumerableSelect{NamedTuples._NT_HeightWeightclass,Query.EnumerableWhere{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Query.EnumerableDF{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Tuple{DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}}},FunctionWrappers.FunctionWrapper{Bool,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}},FunctionWrappers.FunctionWrapper{NamedTuples._NT_HeightWeightclass,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}}, ::Query.EnumerableWhereState{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Int64}) at enumerable_select.jl:31
 in macro expansion at sink_dataframe.jl:16 [inlined]
 in _filldf(::Tuple{Array{Any,1},Array{Any,1},Array{Any,1}}, ::Query.EnumerableSelect{NamedTuples._NT_HeightWeightclass,Query.EnumerableWhere{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Query.EnumerableDF{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Tuple{DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}}},FunctionWrappers.FunctionWrapper{Bool,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}},FunctionWrappers.FunctionWrapper{NamedTuples._NT_HeightWeightclass,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}}) at sink_dataframe.jl:4
 in collect(::Query.EnumerableSelect{NamedTuples._NT_HeightWeightclass,Query.EnumerableWhere{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Query.EnumerableDF{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}},Tuple{DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}}},FunctionWrappers.FunctionWrapper{Bool,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}},FunctionWrappers.FunctionWrapper{NamedTuples._NT_HeightWeightclass,Tuple{NamedTuples._NT_HeightWeight{Nullable{Int64},Nullable{Int64}}}}}, ::Type{DataFrames.DataFrame}) at sink_dataframe.jl:34
 in include_string(::String, ::String) at loading.jl:441
 in include_string(::String, ::String, ::Int64) at eval.jl:30
 in include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N}) at eval.jl:34
 in (::Atom.##53#56{String,Int64,String})() at eval.jl:50
 in withpath(::Atom.##53#56{String,Int64,String}, ::String) at utils.jl:30
 in withpath(::Function, ::String) at eval.jl:38
 in macro expansion at eval.jl:49 [inlined]
 in (::Atom.##52#55{Dict{String,Any}})() at task.jl:60

davidanthoff · April 30, 2017, 6:54pm

I added the DataValue type in Query 0.4.0, I think you are on an older version.

But given that you filter the rows with missing values first, your code is better in any case.

DataValue is used for missing values in Query. It is almost identical to Nullable, with a few differences that make it easier to use.

Oh, one more thing: my code filters out rows with a missing value in any column. If you know which column might have a missing value, you can test just that column with isnull(i.column_name) instead, which should be faster.
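The difference between the two filters can be sketched in plain current Julia, using named tuples as stand-ins for rows and `missing` in place of a null value. This is only an illustration of the logic, not the Query.jl API, and the `rows` data is made up:

```julia
# Illustrative rows only; each named tuple plays the role of one table row.
rows = [
    (Height = 58,      Weight = 115),
    (Height = 72,      Weight = missing),
    (Height = missing, Weight = 1),
]

# Drop a row if any of its fields is missing (checks every column).
complete_rows = filter(r -> !any(ismissing, values(r)), rows)

# Drop a row only if the Height field is missing (checks one column).
height_known = filter(r -> !ismissing(r.Height), rows)

println(length(complete_rows))
println(length(height_known))
```

The single-column check keeps the row whose Weight is missing, so the two filters are only interchangeable when the other columns are known to be complete.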
