Ifelse in a DataFrame

I recently found out about the ifelse statement in Julia.

test = DataFrame(a = 1:10, b = rand(10), c = [2,2,2,2,2,5,5,5,5,5], d = 1:10)

 Row │ a      b         c      d     
     │ Int64  Float64   Int64  Int64 
─────┼───────────────────────────────
   1 │     1  0.604762      2      1
   2 │     2  0.329433      2      2
   3 │     3  0.491057      2      3
   4 │     4  0.468994      2      4
   5 │     5  0.359621      2      5
   6 │     6  0.16639       5      6
   7 │     7  0.259958      5      7
   8 │     8  0.591549      5      8
   9 │     9  0.48926       5      9
  10 │    10  0.110892      5     10


f(x) = (x==5) ? 1 : 0
test.d = f.(test.c)

10×4 DataFrame
 Row │ a      b         c      d     
     │ Int64  Float64   Int64  Int64 
─────┼───────────────────────────────
   1 │     1  0.604762      2      0
   2 │     2  0.329433      2      0
   3 │     3  0.491057      2      0
   4 │     4  0.468994      2      0
   5 │     5  0.359621      2      0
   6 │     6  0.16639       5      1
   7 │     7  0.259958      5      1
   8 │     8  0.591549      5      1
   9 │     9  0.48926       5      1
  10 │    10  0.110892      5      1

Is there a way to use the if without the else, so that rows that don’t match keep their existing values in column d?

I would like to get something like this:

 Row │ a      b         c      d     
     │ Int64  Float64   Int64  Int64 
─────┼───────────────────────────────
   1 │     1  0.604762      2      1
   2 │     2  0.329433      2      2
   3 │     3  0.491057      2      3
   4 │     4  0.468994      2      4
   5 │     5  0.359621      2      5
   6 │     6  0.16639       5      1
   7 │     7  0.259958      5      1
   8 │     8  0.591549      5      1
   9 │     9  0.48926       5      1
  10 │    10  0.110892      5      1

test.d = [(t.c == 5 ? 1 : t.a) for t in eachrow(test)];

Don’t use eachrow for this; assign through a Boolean mask instead:

test  = DataFrame(a = 1:10, b = rand(10), c=[2,2,2,2,2,5,5,5,5,5], d = 1:10)
test[test.c .== 5, :d] .= 1

This method mentioned by @DataFrames is the way. The test.c .== 5 part specifies which rows to choose, and :d specifies which column to change within those rows.
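To make the indexing concrete, here is a small sketch that builds the mask as a separate step. The variable name mask is only for illustration, and column b is left out to keep the example deterministic.

```julia
using DataFrames

test = DataFrame(a = 1:10, c = [2,2,2,2,2,5,5,5,5,5], d = 1:10)

# Elementwise comparison: one Bool per row, true where c equals 5.
mask = test.c .== 5

# Row selector = mask, column selector = :d; only the selected cells are overwritten.
test[mask, :d] .= 1
```

Rows where the mask is false are never touched, which is why no else branch is needed.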

By the way, the (x==5) ? 1 : 0 syntax you mention is called the “ternary operator”. ifelse is something different that is also present in Julia: it is an ordinary function, so both of its value arguments are evaluated before one is returned (?ifelse will describe the difference, though the ternary operator and normal if are much more common than ifelse).
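A minimal sketch of that difference, using only Base Julia. The helper noisy is a made-up name that prints when it is evaluated, so you can see which branches actually run.

```julia
# Hypothetical helper: announces its evaluation, then returns the value.
noisy(label, value) = (println("evaluated ", label); value)

x = 5

# Ternary operator: only the selected branch is evaluated.
a = (x == 5) ? noisy("then", 1) : noisy("else", 0)

# ifelse is a regular function: both value arguments are evaluated first.
b = ifelse(x == 5, noisy("then", 1), noisy("else", 0))

# Being a function, ifelse broadcasts directly over vectors.
c = [2, 2, 5, 5]
d = [1, 2, 3, 4]
d_new = ifelse.(c .== 5, 1, d)
```

The ternary line prints once, while the ifelse line prints twice; both return 1.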

I found that I can pass a second argument to the function to do the trick (even though I will be using the accepted solution).

f(x,y) = (x==5) ? 1 : y
test.d = f.(test.c, test.d)

Well, test[test.c .== 5, :d] .= 1 creates a temporary mask array, so depending on how large your problem is…

@. test.d = ifelse(test.c == 5, 1, test.d)

Out of curiosity, what would be the most efficient way of doing this (memory and speed)?

A for loop. (It is implicit in that comprehension.)

I get it, DataFrame users look down on for loops, but come on, do we never play with big data? I know DataFrames.jl is not good at for loops, because it trades loop speed for lower latency by not recording column types in the DataFrame type, but you can use DataFramesMeta.jl.

My solution is essentially a for loop, just without writing it out explicitly; what I propose is more compact.
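For comparison, an explicit loop could look like the sketch below. The helper set_where! is a hypothetical name, not part of DataFrames.jl. Passing the column vectors into a function gives the compiler their concrete element types, which a loop over eachrow does not have.

```julia
using DataFrames

# Hypothetical helper: overwrite d[i] with `value` wherever c[i] == target.
function set_where!(d::AbstractVector, c::AbstractVector, target, value)
    # eachindex(d, c) also checks that both vectors share the same indices.
    for i in eachindex(d, c)
        if c[i] == target
            d[i] = value
        end
    end
    return d
end

test = DataFrame(a = 1:10, c = [2,2,2,2,2,5,5,5,5,5], d = 1:10)

# test.d and test.c return the stored column vectors, so d is updated in place.
set_where!(test.d, test.c, 5, 1)
```

This does the same work as the broadcasted ifelse without building a mask, at the cost of a few more lines.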

julia> using DataFrames

julia> test  = DataFrame(a = 1:10, b = rand(10), c=[2,2,2,2,2,5,5,5,5,5]);

julia> test_large = repeat(test, 10^7);

julia> @time @. test.d = ifelse(test.c == 5, 1, test.c);
  0.141236 seconds (301.00 k allocations: 15.882 MiB, 99.81% compilation time)

julia> @time @. test_large.d = ifelse(test_large.c == 5, 1, test_large.c);
  0.212442 seconds (23 allocations: 762.940 MiB, 29.64% gc time)

@jling - can you please show your proposal so that we can compare timing on the same data?

Let us settle on the use case first: we want to measure the time to create a new column :d. I mention this because updating an existing :d in place needs slightly different syntax and is faster:

julia> @time @. test_large[:, :d] = ifelse(test_large.c == 5, 1, test_large.c);
  0.097687 seconds (7 allocations: 272 bytes)

Oh, I didn’t see it was a reply quoting your solution. Yes, your solution is the best of both worlds: it doesn’t allocate a mask array, and it doesn’t loop over eachrow() either.

(To make a comprehension or a loop over eachrow() as fast, we need DataFramesMeta, I think?)

Thanks!

You need @eachrow! from DataFramesMeta.jl (see the Introduction page of the DataFramesMeta documentation).
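A rough sketch of how that could look for the example in this thread, assuming DataFramesMeta.jl is installed. Inside the block, symbols such as :c and :d refer to the current row’s values in those columns.

```julia
using DataFrames
using DataFramesMeta

test = DataFrame(a = 1:10, c = [2,2,2,2,2,5,5,5,5,5], d = 1:10)

# @eachrow! modifies `test` in place, one row at a time.
@eachrow! test begin
    if :c == 5
        :d = 1   # rows that don't match keep their current :d
    end
end
```

This reads like the "if without else" from the original question.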

The masked assignment is also an in-place variant, but it will be slower because it allocates a mask array, as @jling commented:

julia> @time test_large[test_large.c .== 5, :d] .= 1;
  0.292150 seconds (10 allocations: 393.395 MiB, 23.50% gc time)