Ifelse in a DataFrame

CompulsoryCoffee · February 14, 2022, 10:31pm

I recently found out about the ifelse statement in Julia.

test = DataFrame(a = 1:10, b = rand(10), c = [2,2,2,2,2,5,5,5,5,5], d = 1:10)

 Row │ a      b         c      d     
     │ Int64  Float64   Int64  Int64 
─────┼───────────────────────────────
   1 │     1  0.604762      2      1
   2 │     2  0.329433      2      2
   3 │     3  0.491057      2      3
   4 │     4  0.468994      2      4
   5 │     5  0.359621      2      5
   6 │     6  0.16639       5      6
   7 │     7  0.259958      5      7
   8 │     8  0.591549      5      8
   9 │     9  0.48926       5      9
  10 │    10  0.110892      5     10


f(x) = (x==5) ? 1 : 0
test.d = f.(test.c)

10×4 DataFrame
 Row │ a      b         c      d     
     │ Int64  Float64   Int64  Int64 
─────┼───────────────────────────────
   1 │     1  0.604762      2      0
   2 │     2  0.329433      2      0
   3 │     3  0.491057      2      0
   4 │     4  0.468994      2      0
   5 │     5  0.359621      2      0
   6 │     6  0.16639       5      1
   7 │     7  0.259958      5      1
   8 │     8  0.591549      5      1
   9 │     9  0.48926       5      1
  10 │    10  0.110892      5      1

Is there a way to use the if without the else, so that rows that don't match keep their existing values in the column?

I'd like to get something like this:

 Row │ a      b         c      d     
     │ Int64  Float64   Int64  Int64 
─────┼───────────────────────────────
   1 │     1  0.604762      2      1
   2 │     2  0.329433      2      2
   3 │     3  0.491057      2      3
   4 │     4  0.468994      2      4
   5 │     5  0.359621      2      5
   6 │     6  0.16639       5      1
   7 │     7  0.259958      5      1
   8 │     8  0.591549      5      1
   9 │     9  0.48926       5      1
  10 │    10  0.110892      5      1

jling · February 14, 2022, 10:38pm

test.d = [(t.c == 5 ? 1 : t.a) for t in eachrow(test)];

DataFrames · February 15, 2022, 3:46am

Don't use eachrow:

test  = DataFrame(a = 1:10, b = rand(10), c=[2,2,2,2,2,5,5,5,5,5], d = 1:10)
test[test.c .== 5, :d] .= 1

digital_carver · February 15, 2022, 3:59am

This method mentioned by @DataFrames is the way. The test.c .== 5 specifies which rows to choose, and the :d specifies which column to change within those rows.
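To see what the two parts of that indexing expression do, here is a minimal sketch on plain vectors standing in for the columns. The names c, d and mask are only for illustration, and the sketch assumes the columns are ordinary vectors, as they are in the example table.

```julia
# Plain vectors standing in for columns :c and :d of the example table.
c = [2, 2, 2, 2, 2, 5, 5, 5, 5, 5]
d = collect(1:10)

# The comparison is broadcast over c and yields a Boolean mask (a BitVector).
mask = c .== 5

# Assign 1 only where the mask is true; all other entries keep their values.
d[mask] .= 1
```

The mask is a temporary array with one entry per row, which is the allocation discussed later in the thread.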

By the way, the (x==5) ? 1 : 0 syntax you mention is called the “ternary operator”. ifelse is something different, that’s also present in Julia (?ifelse will describe the difference, though the ternary operator and normal if are much more common than ifelse).
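A small sketch of the practical difference, using only Base Julia. The helper name safe_sqrt and the sample vectors are made up for illustration. The ternary operator is control flow and evaluates only the branch it selects, whereas ifelse is an ordinary function, so both value arguments are evaluated before it is called.

```julia
# Ternary operator: sqrt is never called for negative input.
safe_sqrt(x) = x >= 0 ? sqrt(x) : 0.0

# ifelse: sqrt(-4.0) is evaluated as an argument first, so this throws
# a DomainError even though the condition would pick the other value.
ifelse_threw = try
    ifelse(-4.0 >= 0, sqrt(-4.0), 0.0)
    false
catch err
    err isa DomainError
end

# When both values are cheap and always valid, the two forms agree.
c = [2, 2, 5, 5]
d = [10, 20, 30, 40]
ternary_result = [ci == 5 ? 1 : di for (ci, di) in zip(c, d)]
ifelse_result  = ifelse.(c .== 5, 1, d)
```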

CompulsoryCoffee · February 15, 2022, 9:17am

I found that I can pass a second argument to the function to do the trick (even though I will be using the ticked solution).

f(x,y) = (x==5) ? 1 : y
test.d = f.(test.c, test.d)

jling · February 15, 2022, 9:25am

Well, this creates a temporary mask array, so it depends on how large your problem is…

bkamins · February 15, 2022, 9:28am

@. test.d = ifelse(test.c == 5, 1, test.d)

CompulsoryCoffee · February 15, 2022, 9:32am

Out of curiosity, what would be the most efficient way of doing this (memory and speed)?

jling · February 15, 2022, 9:35am

For loop. (Implicit in that comprehension)

I get it, DataFrame users look down on for loops, but come on, do we never work with big data? I know DataFrames.jl is not good at for loops, because we traded loop speed for lower latency by not recording column types, but you can use DataFramesMeta.jl.
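One way the explicit loop could be written without extra packages is a small function that takes the column vectors as arguments, so the compiler sees their concrete types inside the loop. This is only a sketch: set_where! is a hypothetical helper name, not part of DataFrames.jl, and plain vectors stand in for the columns.

```julia
# Hypothetical helper: overwrite d[i] with `value` wherever c[i] == target.
function set_where!(d::AbstractVector, c::AbstractVector, target, value)
    for i in eachindex(d, c)   # errors if the two vectors differ in shape
        if c[i] == target      # an `if` with no `else`: other entries untouched
            d[i] = value
        end
    end
    return d
end

c = [2, 2, 2, 2, 2, 5, 5, 5, 5, 5]
d = collect(1:10)
set_where!(d, c, 5, 1)
```

With the table from this thread, the call would presumably be set_where!(test.d, test.c, 5, 1), assuming column :d already exists.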

bkamins · February 15, 2022, 9:47am

My solution is essentially a for loop, just without writing it out explicitly, so what I propose is more compact.

julia> using DataFrames

julia> test  = DataFrame(a = 1:10, b = rand(10), c=[2,2,2,2,2,5,5,5,5,5]);

julia> test_large = repeat(test, 10^7);

julia> @time @. test.d = ifelse(test.c == 5, 1, test.c);
  0.141236 seconds (301.00 k allocations: 15.882 MiB, 99.81% compilation time)

julia> @time @. test_large.d = ifelse(test_large.c == 5, 1, test_large.c);
  0.212442 seconds (23 allocations: 762.940 MiB, 29.64% gc time)

@jling - can you please show your proposal so that we can compare timing on the same data?

Let us settle on the use case: we want to measure the time taken to create a new column :d. I mention this because updating :d in place would need slightly different syntax and would be faster:

julia> @time @. test_large[:, :d] = ifelse(test_large.c == 5, 1, test_large.c);
  0.097687 seconds (7 allocations: 272 bytes)
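The same distinction can be seen on plain vectors, which is a rough sketch of why the in-place form allocates almost nothing. The names d_new and d_same are only for illustration.

```julia
c = [2, 2, 2, 2, 2, 5, 5, 5, 5, 5]
d = collect(1:10)

# `=` with a broadcast on the right allocates a fresh result vector.
d_new = ifelse.(c .== 5, 1, d)

# `.=` fuses the whole expression into one loop that writes into the
# existing vector, so neither a result vector nor a mask is allocated.
d_same = d
d .= ifelse.(c .== 5, 1, d)
```

After the second assignment, d_same and d are still the same array, while d_new is a separate one holding equal values.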

jling · February 15, 2022, 9:52am

Oh I didn’t see it was a reply quoting your solution. Yes, your solution is the best of both worlds – doesn’t have mask array, also doesn’t loop over eachrow()

(To make a comprehension or a loop over eachrow() as fast, we need DataFramesMeta, I think?)

CompulsoryCoffee · February 15, 2022, 9:53am

Thanks!

bkamins · February 15, 2022, 9:54am

You need @eachrow! from DataFramesMeta.jl (see the Introduction in the DataFramesMeta documentation).

bkamins · February 15, 2022, 9:56am

This is an in-place variant, and it will be slower because it allocates a mask array, as @jling commented:

julia> @time test_large[test_large.c .== 5, :d] .= 1;
  0.292150 seconds (10 allocations: 393.395 MiB, 23.50% gc time)
