Can someone tell me how to simplify (or make more) my following function.
It works. I am just curious if (and quite certain) there is a simpler way.
I have two arrays and I want to remove from first array all instances values that are in the second array.
The following needs to be ensured:
if first array contains multiple instances of values of second array only the number of instances that are in second array should be removed from first array
if second array contains values that are not in first array nothing needs to be removed
So essentially it is like removing cards from a deck where a deck may include multiple instances of the same card.
I think it’s faster for large numbers if you count the elements in arr2 first, then go over arr1 once and update the dict with how many elements you delete. Then do the mask index once. In your version you loop over the array again and again (partially)
julia> function remove(arr1, arr2)
d = Dict{eltype(arr2), Int}()
for a in arr2
d[a] = get!(d, a, 0) + 1
end
mask = ones(Bool, length(arr1))
for (i, a) in enumerate(arr1)
x = get(d, a, 0)
if x > 0
d[a] = x - 1
mask[i] = false
end
end
arr1[mask]
end
julia> remove(array1, array2)
6-element Vector{Int64}:
1
1
3
5
3
2
Thanks for the suggestion. It doesn’t exactly do what my code does, because it doesn’t change the original array. But if gives me some great new approaches. Thanks!
function remove!(out, deck)
removable = countmap(out)
to_remove = Int[]
for (i,card) in deck
if haskey(removable, card)
push!(i,to_remove)
removable[card] -= 1
removable[card] == 0 && deleteat!(removable, card)
end
end
deleteat!.(deck, reverse(to_remove))
return deck
end
Perhaps something like this? You iterate over out and deck once each and otherwise are hitting hash indices I think.
I think countmap is maybe from statsbase?
On phone so I couldn’t make sure that deleteat! works for dicts
You can do the in place change at the end. I wouldn’t do it with deleteat everytime you find a value because then the whole tail of the array has to be moved. Just mutate the array once at the end.
Thanks to everyone for your suggestions. Also my take away is that there isn’t an easy oneliner that could do the trick. Which is kind of comforting to realize.
I’m not sure if this will speed things up, but if you have the option you might consider sorting the lists and then using things like searchsortedfirst instead of findfirst. In general, I would guess that if you were willing to sort the lists before you do this you could write something that is a bit more manual but pretty quick.
EDIT: okay, didn’t read the docs of searchsortedfirst very closely, sorry. But still, sorting might be a helpful idea.
I am not an expert, so don’t take this answer too seriously.
I don’t know if this is a bad practice and would like to get the opinion of someone more experienced:
remove!(deck, out) =
for i in out
indx = findfirst(==(i), deck)
indx |> isnothing ?
nothing :
deleteat!(deck, indx)
end
Another solution that seems to work (similar to the one proposed by jules):
find_n_indx(array, number, first_n) = # get indices of first n occurrences of number in array
(index for (index, value) in pairs(array) if value == number) |>
x -> Iterators.take(x, first_n) |>
collect
function remove_2!(deck, out)
s = ((i, count(==(i), out)) for i in Set(out)) # unique out elements with relative frequencies in out [(out_elem, count out_elem in out), ..]
indexes_to_remove = vcat((find_n_indx(deck, i[1], i[2]) for i in s)...)
sort!(indexes_to_remove)
deleteat!(deck, indexes_to_remove)
nothing
end
But probably there are better solutions to calculate the indexes to remove
deleteat! is rather heavy operation. Faster is to do some sort of double cursor, when you shift elements as you go. In this case you can achieve O(N) complexity.
function mycountmap(arr)
res = Dict{Int, Int}()
for x in arr
res[x] = get(res, x, 0) + 1
end
return res
end
function cleanup!(arr1, arr2)
cmap = mycountmap(arr2)
i1 = 1
@inbounds for (i2, v) in pairs(arr1)
i1 != i2 && (arr1[i1] = v)
cval = get(cmap, v, 0)
if cval > 0
cmap[v] = cval - 1
else
i1 += 1
end
end
resize!(arr1, i1 - 1)
return arr1
end
julia> array1 = [1,1,1,2,2,3,4,5,3,2];
julia> array2 = [34, 2, 4, 1, 2];
julia> cleanup!(x, array2)
6-element Vector{Int64}:
1
1
3
5
3
2
UPD: This one is slower then initial version with find and delete. mycountmap take it’s time to initialize and generates overhead. So this approach is useful, when the size of the deck and out is big enough, that update and find time is much larger than mycountmap construction.
I took another swing as well, this time with a julia instance to check code against:
function remove!(out::Vector{Int64}, deck::Vector{Int64})
# Iterate once over `out` to make a DefaultDict
removable = Dict{eltype(out),Int64}()
cnt = 0
for v in out
if haskey(removable, v)
removable[v] += 1
else
removable[v] = 1
end
cnt += 1
end
# Iterate once over `deck` to do a one-pass version of findfirst
to_remove = BitArray(undef,length(deck))
@inbounds for (i,card) in enumerate(deck)
res = haskey(removable, card)
to_remove[i] = res
if res
if isone(cnt)
to_remove[(i+1):end] .= false
break
end
cnt -= 1
if isone(removable[card])
pop!(removable, card)
else
removable[card] -= 1
end
end
end
# remove cards using the BitArray
deleteat!(deck, to_remove)
end
This seems to perform worse than the OP’s code at small deck sizes and only marginally better at huge out/deck sizes. I am a bit flummoxed as to why. findfirst and deleteat! must both be way more efficient than I expected.
What you are describing is a multiset difference, and the easiest (and perhaps most efficient) way would probably be to use the Multisets.jl package:
julia> using Multisets
julia> m = Multiset(array1) - Multiset(array2)
{1,1,2,3,3,5}
Although you can convert the result back to an array with collect(m), if this is the kind of operation you want then you might consider not using arrays at all — just leave everything as a Multiset.
It’s actually great, that the best solution to the question “how to write better function?” is “change input data structure”. This is genuine out-of-the-box thinking, and it helps a lot in real world tasks.
The multiset is actually implemented as a dict with integers, just like in the solutions above. But of course it’s faster to use the pre-made version if it’s available