Indexing a DataFrame with a boolean DataFrame

I would like to set some values across a whole data frame using a boolean mask, similar to this example with plain arrays:

julia> x = rand(3, 3)
3×3 Matrix{Float64}:
 0.0559191  0.249678   0.165318
 0.845124   0.83681    0.248268
 0.291518   0.0380153  0.205101

julia> x[x .< 0.5] .= 0;

julia> x
3×3 Matrix{Float64}:
 0.0       0.0      0.0
 0.845124  0.83681  0.0
 0.0       0.0      0.0

It turns out, though, that this pattern is not allowed with data frames:

julia> df = DataFrame(rand(3,3), :auto)
3×3 DataFrame
 Row │ x1         x2        x3       
     │ Float64    Float64   Float64  
─────┼───────────────────────────────
   1 │ 0.045519   0.468771  0.387336
   2 │ 0.0133922  0.383619  0.418809
   3 │ 0.870746   0.898979  0.628106


julia> df .< 0.5
3×3 DataFrame
 Row │ x1     x2     x3    
     │ Bool   Bool   Bool  
─────┼─────────────────────
   1 │  true   true   true
   2 │  true   true   true
   3 │ false  false  false

julia> df[df .< 0.5] .= 0
ERROR: MethodError: no method matching getindex(::DataFrame, ::DataFrame)
Closest candidates are:
  getindex(::AbstractDataFrame, ::CartesianIndex{2}) at ~/.julia/packages/DataFrames/zqFGs/src/other/broadcasting.jl:3
  getindex(::AbstractDataFrame, ::Integer, ::Colon) at ~/.julia/packages/DataFrames/zqFGs/src/dataframerow/dataframerow.jl:210
  getindex(::AbstractDataFrame, ::Integer, ::Union{Colon, Regex, AbstractVector, All, Between, Cols, InvertedIndex}) at ~/.julia/packages/DataFrames/zqFGs/src/dataframerow/dataframerow.jl:208
  ...
Stacktrace:
 [1] maybeview
   @ ./views.jl:145 [inlined]
 [2] dotview(::DataFrame, ::DataFrame)
   @ Base.Broadcast ./broadcast.jl:1200
 [3] top-level scope
   @ REPL[39]:1

What is an alternative syntax?

df .= ifelse.(df .< 0.5, 0.0, df)
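
To make the effect easier to check, here is a minimal sketch of the same one-liner on a small data frame with fixed values (the column names and numbers are made up for illustration). The comparison `df .< 0.5` builds the Boolean data frame, `ifelse.` picks `0.0` or the original value element by element, and `.=` assigns the result back into `df`:

```julia
using DataFrames

# Hypothetical data, chosen so the result is easy to verify by eye
df = DataFrame(x1 = [0.1, 0.9], x2 = [0.7, 0.2])

# Replace every element below 0.5 with 0.0, keep the rest unchanged
df .= ifelse.(df .< 0.5, 0.0, df)
```

After this, `df.x1` should be `[0.0, 0.9]` and `df.x2` should be `[0.7, 0.0]`. I used `0.0` rather than `0` so that the replacement value matches the `Float64` columns.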

Thanks, that solves my issue. Do you think the Boolean indexing syntax should be allowed anyway, and should I open an issue in DataFrames.jl?

No, we will not allow it. The reason is as follows. For matrices, Boolean indexing drops a dimension:

julia> x = rand(3, 3)
3×3 Matrix{Float64}:
 0.565156  0.522537  0.151524
 0.934673  0.211889  0.807452
 0.248015  0.60691   0.72761

julia> x[x .< 0.5]
3-element Vector{Float64}:
 0.24801543258391845
 0.21188869476310024
 0.1515237855514393

and that is why the code you used works.

However, in the DataFrames.jl context, dropping a dimension only makes sense if you select a single row, like df[1, :], or a single column, like df[:, 1]. Dropping a dimension across a mix of different rows and columns does not have a reasonable interpretation.

For this reason, in DataFrames.jl, as opposed to arrays, we require that indexing always takes exactly two arguments, one for rows and one for columns, so that we are sure the result has a rectangular shape. Boolean indexing is therefore allowed, but only along a given dimension (rows or columns).
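
As an illustration of Boolean indexing along one dimension at a time, here is a small sketch (the data and the names `rows` and `cols` are made up for the example). It also shows a column-by-column loop as another way to get the element-wise replacement asked about above, relying on `eachcol` giving access to the stored column vectors:

```julia
using DataFrames

# Hypothetical data with fixed values
df = DataFrame(x1 = [0.1, 0.2, 0.9], x2 = [0.4, 0.3, 0.8], x3 = [0.3, 0.6, 0.7])

# Boolean mask along rows: keep the rows where x1 < 0.5, with all columns
rows = df[df.x1 .< 0.5, :]

# Boolean mask along columns: keep the first and third columns, with all rows
cols = df[:, [true, false, true]]

# Element-wise replacement done one column at a time; within a single
# column vector, Boolean indexing is ordinary array indexing
for col in eachcol(df)
    col[col .< 0.5] .= 0.0
end
```

Both `rows` and `cols` are still rectangular data frames, which is the point of requiring separate row and column arguments.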
