Replacing some elements of an array with corresponding elements from another array

My goal is to replace any elements in array A that are either NaN or < 0 with the corresponding elements of array B.

Below is my code:

A = [1, 2.1, -5, 6, -3, 8, NaN];
B = [8, 8.2,  8, 8,  8, 8, 8];
C = replace!(x -> (x<0 || isnan(x)) ? B : x, A)

Here is the error:

MethodError: Cannot `convert` an object of type Vector{Float64} to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...

Why isn’t it working? Many thanks!

Here is one way to do that.

A = [1, 2.1, -5, 6, -3, 8, NaN];
B = [8, 8.2,  8, 8,  8, 8, 8];

isreplaceable(x) = x < 0 || isnan(x)

function replacewith!(A::AT, B::AT, predicate) where
                      {N, T, AT<:AbstractArray{T,N}}
    here = map(predicate, A)
    A[here] .= B[here]
    A
end

# and use this way
replacewith!(A,B, isreplaceable);

# A becomes [1.0 , 2.1,  8.0,  6.0,  8.0,  8.0,  8.0]

Be that as it may, consider the @DataFrames approach (next).


Use a loop:

isreplaceable(x) = x < 0 || isnan(x)

function replacewith!(A, B, predicate)
    for i in eachindex(A)
        if predicate(A[i])
            A[i] = B[i]
        end
    end
    A
end

Possibly because x iterates over the scalar elements of A and that code tries to replace such elements by full B arrays.
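
As a cross-check of that explanation: since x is a scalar, the predicate has no way to index into B, but broadcasting ifelse over both arrays pairs up corresponding elements (a minimal sketch, not from the original posts):

```julia
A = [1, 2.1, -5, 6, -3, 8, NaN]
B = [8, 8.2,  8, 8,  8, 8, 8]

# Broadcast ifelse over both arrays: each element of A is paired with
# the corresponding element of B, so no Vector-to-Float64 conversion
# is ever attempted.
A .= ifelse.((A .< 0) .| isnan.(A), B, A)
# A is now [1.0, 2.1, 8.0, 6.0, 8.0, 8.0, 8.0]
```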

A comprehension could be used:

C = [a < 0 || isnan(a) ? b : a for (a,b) in zip(A,B)]

Another option:

inds = findall(x->x<0 || isnan(x), A)
A[inds] .= B[inds]

I wish we could write findall(<(0) || isnan, A) 🙂


One could even get rid of findall:

inds = (A .< 0) .| isnan.(A)
A[inds] .= B[inds]

I wonder though if it’s maybe less efficient, since it then has to loop over all the false values too, in the assignment?
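
One way to explore this is to time both strategies side by side (a Base-only sketch with made-up data; for trustworthy numbers use BenchmarkTools, and expect the answer to depend on how many elements actually match):

```julia
# Base-only timing sketch; for reliable numbers, prefer
# BenchmarkTools' @btime over a single @elapsed sample.
A = rand(10_000) .- 0.05          # sprinkle in some negative values
B = rand(10_000)

# findall materializes an index vector, then touches only the hits.
replace_findall!(a, b) = (inds = findall(<(0), a); a[inds] .= b[inds]; a)

# A Bool mask is cheap to build, but logical indexing still scans
# the whole mask to locate the true positions.
replace_mask!(a, b) = (m = a .< 0; a[m] .= b[m]; a)

# Both strategies give identical results; only the indexing differs.
@assert replace_findall!(copy(A), B) == replace_mask!(copy(A), B)

t1 = @elapsed replace_findall!(copy(A), B)
t2 = @elapsed replace_mask!(copy(A), B)
println("findall: $t1 s, mask: $t2 s")
```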


Many thanks for all the solutions. My data has millions of rows, so I was able to test their performance:

This is the fastest way of doing this (106 seconds):

B= replace!(x -> (x<0 || isnan(x)) ? 0 : x, B);
A = [a < 0 || isnan(a) ? b : a for (a, b) in zip(A, B)];

This is the 2nd fastest (196 seconds):

inds1 = findall(x -> x<0 || isnan(x), B);
B[inds1] .= 0;

inds2 = findall(x -> x<0 || isnan(x), A);
A[inds2] .= B[inds2]

This is the slowest method (330 seconds):

B[B.<0 .| isnan.(B)] .= 0;

Ind2 = map(A) do a
    a < 0 || isnan(a)
end
Ind2 = vec(Ind2);
if any(Ind2)
    A[Ind2] .= B[Ind2];
end

@leon, it looks like you’ve changed the original problem and now B also has negative values and NaNs. This may confuse the logic of the post a bit.
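
If B can itself contain negatives and NaNs, the two cleanup steps in the benchmarks above can be fused into a single pass (a sketch of what that combined logic might look like; the function name is made up):

```julia
isreplaceable(x) = x < 0 || isnan(x)

# One pass over both arrays: sanitize B to 0 first, then use the
# (already sanitized) B value wherever A needs replacing.
function fixboth!(A, B)
    for i in eachindex(A, B)
        if isreplaceable(B[i])
            B[i] = 0
        end
        if isreplaceable(A[i])
            A[i] = B[i]
        end
    end
    return A, B
end

A = [1, 2.1, -5, 6, -3, 8, NaN];
B = [8, 8.2, -1, 8, NaN, 8, 8];
fixboth!(A, B)
# A == [1.0, 2.1, 0.0, 6.0, 0.0, 8.0, 8.0]
```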

The fastest way is quite certainly a loop, as suggested above. Maybe something like this is close to optimal:

julia> function replacewith!(A, B, isreplaceable)
           for i in eachindex(A)
               @inbounds A[i] = ifelse(isreplaceable(A[i]), B[i], A[i])
           end
           return A
       end
replacewith! (generic function with 1 method)

julia> @btime replacewith!(A,B,x -> x < 0 || isnan(x)) setup=(A = [1, 2.1, -5, 6, -3, 8, NaN]; B = [8, 8.2,  8, 8,  8, 8, 8]);
  7.139 ns (0 allocations: 0 bytes)

To compare with, for example:

julia> function rep(A,B)
       B= replace!(x -> (x<0 || isnan(x)) ? 0 : x, B);
       A = [a < 0 || isnan(a) ? b : a for (a, b) in zip(A, B)];
       end
rep (generic function with 1 method)

julia> @btime rep(A,B) setup=(A = [1, 2.1, -5, 6, -3, 8, NaN]; B = [8, 8.2,  8, 8,  8, 8, 8]);
  39.938 ns (1 allocation: 144 bytes)


Or maybe this is a better benchmark:

julia> A = rand(10000); B = rand(10000);

julia> @btime replacewith!(a,b,x -> x < 0.1) setup=(a = copy(A); b = copy(B)) evals=1;
  2.080 μs (0 allocations: 0 bytes)

julia> @btime rep(a,b) setup=(a = copy(A); b = copy(B)) evals=1; # adapted inner conditional
  23.956 μs (2 allocations: 78.20 KiB)


Wow, this is fascinating! I was always told that a loop will be the slowest way of handling anything. This is so interesting to know.

Loops are the lowest abstraction here, i.e. closest to the actual hardware process.
Loops are only slow when you compare interpreted loops (e.g. CPython or MATLAB) with compiled libraries.
Because Julia loops are compiled, there should be no way to get faster with more abstract expressions (comparing sequential counterparts).

This does not imply that abstract expressions should be systematically avoided: they can be more compact, less error-prone, and even allow for parallel implementations on various hardware.
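
To illustrate, the loop and the broadcast below express the same replacement, and both are compiled to tight code in Julia (a sketch reusing the thread's running example):

```julia
isreplaceable(x) = x < 0 || isnan(x)

# Explicit loop: closest to the hardware, nothing hidden.
function replace_loop!(A, B)
    for i in eachindex(A, B)
        if isreplaceable(A[i])
            A[i] = B[i]
        end
    end
    return A
end

# Broadcast: more abstract, same result, also compiled.
replace_bcast!(A, B) = (A .= ifelse.(isreplaceable.(A), B, A); A)

A = [1, 2.1, -5, 6, -3, 8, NaN]
B = [8, 8.2,  8, 8,  8, 8, 8]
@assert replace_loop!(copy(A), B) == replace_bcast!(copy(A), B)
```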


Getting used to loops is also sort of liberating, because you don't need to remember, learn, or search for a function that does what you want every time. Just write what is logically natural and move forward. The impossibility of doing this (and getting fast code) is what always kept me away from other high-level languages; here you don't need to adapt the logic to the language.


And you can enjoy another 7X speedup using @Elrod’s magic (aka LoopVectorization):


using LoopVectorization

function replacewith!(A, B)
    @tturbo for i in eachindex(A) 
        A[i] = (A[i] < 0) | isnan(A[i]) ? B[i] : A[i]
    end
    return A
end

A = rand(10000); B = rand(10000);

@btime replacewith!(a,b) setup=(a = copy(A); b = copy(B));
  956.098 ns (0 allocations: 0 bytes)

Using IfElse.ifelse here may be even faster.


Very cool. Thanks for sharing.

Is this kind of like parallel computing? BTW, my computer has an 8-core Intel Core i9 processor. What is the best way to make sure all 8 cores will be put to good use without crashing my computer?

Yes, that’s multithreading, a lightweight form of parallelization. My results are from a 4-core machine with the number of threads set to 8; for your processor the number of threads can be set to 16, and you may get more than a 10X speedup. The number of threads can be set at the Julia command line with -t auto. You can find this info in the Julia manual:

-t, --threads {N|auto} Enable N threads; auto currently sets N to the number of local CPU threads but this might change in the future
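
For example, you can start Julia with a chosen thread count and verify it from within the session; a plain Base Threads.@threads version of the replacement loop (a sketch; this is ordinary multithreading, separate from LoopVectorization's @tturbo) might look like:

```julia
# Start Julia with, e.g.:  julia -t auto   or   julia -t 16
using Base.Threads
println("Threads available: ", nthreads())

# Multithreaded replacement using only Base's Threads.@threads;
# each thread handles a chunk of the index range.
function replacewith_threads!(A, B)
    @threads for i in eachindex(A)
        if A[i] < 0 || isnan(A[i])
            A[i] = B[i]
        end
    end
    return A
end

A = [1, 2.1, -5, 6, -3, 8, NaN];
B = [8, 8.2,  8, 8,  8, 8, 8];
replacewith_threads!(A, B)
# A becomes [1.0, 2.1, 8.0, 6.0, 8.0, 8.0, 8.0]
```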


LoopVectorization will do that automatically.
