Make function more julian / simpler

Can someone tell me how to simplify (or make more) my following function.

It works. I am just curious if (and quite certain) there is a simpler way.

I have two arrays and I want to remove from first array all instances values that are in the second array.

The following needs to be ensured:

  • if first array contains multiple instances of values of second array only the number of instances that are in second array should be removed from first array
  • if second array contains values that are not in first array nothing needs to be removed

So essentially it is like removing cards from a deck where a deck may include multiple instances of the same card.

Example:

array1 = [1,1,1,2,2,3,4,5,3,2]
array2 = [34, 2,4,1,2]

Should yield for array1
[1,1,3,5,3,2]

Here is my function (which is working!)

function remove!(out, deck) 
    for card in out
        idx = findfirst(x -> x == card, deck)
        if isnothing(idx)
            continue
        end
        deleteat!(deck, idx)
    end
end

Thanks for your help!

I think it’s faster for large numbers if you count the elements in arr2 first, then go over arr1 once and update the dict with how many elements you delete. Then do the mask index once. In your version you loop over the array again and again (partially)

julia> function remove(arr1, arr2)
           d = Dict{eltype(arr2), Int}()
           for a in arr2
               d[a] = get!(d, a, 0) + 1
           end
           
           mask = ones(Bool, length(arr1))
           for (i, a) in enumerate(arr1)
               x = get(d, a, 0)
               if x > 0
                   d[a] = x - 1
                   mask[i] = false
               end
           end
           arr1[mask]
       end

julia> remove(array1, array2)
6-element Vector{Int64}:
 1
 1
 3
 5
 3
 2

Lowest hanging fruit is to reverse the sense of the if test

function remove!(out, deck) 
    for card in out
        idx = findfirst(x -> x == card, deck)
        if !isnothing(idx)
             deleteat!(deck, idx)
        end
    end

Some people like this shortcut

    !isnothing(idx) && deleteat!(deck, idx)

Nice, but this doesn’t change the argument in place as the OP’s code does.

Use a list comprehension to find the indices for all cards and then deleteat! them at once

indices = [findfirst(x->x==card, deck) for card in out if card in deck]
deleteat!(deck, indices)

Edit, I realized this doesn’t handle multiple occurences of card correctly

Thanks for the suggestion. It doesn’t exactly do what my code does, because it doesn’t change the original array. But if gives me some great new approaches. Thanks!

Yes, this was my first approach as well but it doesn’t work.

The && is new to me. So something to look into. Thanks.

function remove!(out, deck) 
    removable = countmap(out)
    to_remove = Int[]
    for (i,card) in deck
        if haskey(removable, card)
            push!(i,to_remove)
            removable[card] -= 1
            removable[card] == 0 && deleteat!(removable, card)
        end
    end
    deleteat!.(deck, reverse(to_remove))
    return deck
end

Perhaps something like this? You iterate over out and deck once each and otherwise are hitting hash indices I think.

I think countmap is maybe from statsbase?

On phone so I couldn’t make sure that deleteat! works for dicts

You can do the in place change at the end. I wouldn’t do it with deleteat everytime you find a value because then the whole tail of the array has to be moved. Just mutate the array once at the end.

1 Like

Thanks to everyone for your suggestions. Also my take away is that there isn’t an easy oneliner that could do the trick. Which is kind of comforting to realize.

I’m not sure if this will speed things up, but if you have the option you might consider sorting the lists and then using things like searchsortedfirst instead of findfirst. In general, I would guess that if you were willing to sort the lists before you do this you could write something that is a bit more manual but pretty quick.

EDIT: okay, didn’t read the docs of searchsortedfirst very closely, sorry. But still, sorting might be a helpful idea.

I am not an expert, so don’t take this answer too seriously.
I don’t know if this is a bad practice and would like to get the opinion of someone more experienced:

remove!(deck, out) = 
    for i in out
        indx = findfirst(==(i), deck)
        indx |> isnothing ? 
            nothing : 
            deleteat!(deck, indx)
    end

Another solution that seems to work (similar to the one proposed by jules):

find_n_indx(array, number, first_n) =       # get indices of first n occurrences of number in array
    (index for (index, value) in pairs(array) if value == number) |> 
        x -> Iterators.take(x, first_n) |> 
        collect

function remove_2!(deck, out)
    s = ((i, count(==(i), out)) for i in Set(out))      # unique out elements with relative frequencies in out [(out_elem, count out_elem in out), ..]
    indexes_to_remove = vcat((find_n_indx(deck, i[1], i[2]) for i in s)...)
    sort!(indexes_to_remove)
    deleteat!(deck, indexes_to_remove)
    nothing
end

But probably there are better solutions to calculate the indexes to remove

deleteat! is rather heavy operation. Faster is to do some sort of double cursor, when you shift elements as you go. In this case you can achieve O(N) complexity.

function mycountmap(arr)
    res = Dict{Int, Int}()
    for x in arr
        res[x] = get(res, x, 0) + 1
    end
    return res
end

function cleanup!(arr1, arr2)
    cmap = mycountmap(arr2)
    i1 = 1
    @inbounds for (i2, v) in pairs(arr1)
        i1 != i2 && (arr1[i1] = v)
        cval = get(cmap, v, 0)
        if cval > 0
            cmap[v] = cval - 1
        else
            i1 += 1
        end
    end
    resize!(arr1, i1 - 1)
    return arr1
end

julia> array1 = [1,1,1,2,2,3,4,5,3,2];
julia> array2 = [34, 2, 4, 1, 2];
julia> cleanup!(x, array2)
6-element Vector{Int64}:
 1
 1
 3
 5
 3
 2

UPD: This one is slower then initial version with find and delete. mycountmap take it’s time to initialize and generates overhead. So this approach is useful, when the size of the deck and out is big enough, that update and find time is much larger than mycountmap construction.

I took another swing as well, this time with a julia instance to check code against:

function remove!(out::Vector{Int64}, deck::Vector{Int64})
    # Iterate once over `out` to make a DefaultDict
    removable = Dict{eltype(out),Int64}()
    cnt = 0
    for v in out
        if haskey(removable, v)
            removable[v] += 1
        else
            removable[v] = 1
        end
        cnt += 1
    end

    # Iterate once over `deck` to do a one-pass version of findfirst
    to_remove = BitArray(undef,length(deck))
    @inbounds for (i,card) in enumerate(deck)
        res = haskey(removable, card)
        to_remove[i] = res
        if res
            if isone(cnt) 
                to_remove[(i+1):end] .= false
                break
            end
            cnt -= 1
            if isone(removable[card])
                pop!(removable, card)
            else
                removable[card] -= 1
            end
        end
    end
    # remove cards using the BitArray
    deleteat!(deck, to_remove)
end

This seems to perform worse than the OP’s code at small deck sizes and only marginally better at huge out/deck sizes. I am a bit flummoxed as to why. findfirst and deleteat! must both be way more efficient than I expected.

What you are describing is a multiset difference, and the easiest (and perhaps most efficient) way would probably be to use the Multisets.jl package:

julia> using Multisets

julia> m = Multiset(array1) - Multiset(array2)
{1,1,2,3,3,5}

Although you can convert the result back to an array with collect(m), if this is the kind of operation you want then you might consider not using arrays at all — just leave everything as a Multiset.

10 Likes

This is great! Thanks for the hint.

Although there is no “solution” to my answer, I will set this as the solution for future reference.

Thanks!

It’s actually great, that the best solution to the question “how to write better function?” is “change input data structure”. This is genuine out-of-the-box thinking, and it helps a lot in real world tasks.

2 Likes

Fully agree!

The multiset is actually implemented as a dict with integers, just like in the solutions above. But of course it’s faster to use the pre-made version if it’s available

1 Like