Performance Help, Algorithm Is Not Scaling Well

I have an array with arrays inside. I created an algorithm to find duplicated elements in arrays in the array. It works correctly, but when I start scaling up to thousands or tens of thousands of records, then it is not efficient. How can I make this more efficient? How could I approach this problem differently?

# The goal here is to identify which arrays have duplicated items
# even if the elements are in a different order. Secondly, I want
# to keep the array with the lowest "test" score.
let
    check = 1
    lowestTest = 9999
    arrEnd = []
    arrStart = []
    push!(arrStart, (elements = [1,2], test = 1))
    push!(arrStart, (elements = [3,1,2], test = 7)) #duplicate
    push!(arrStart, (elements = [1,4,5], test = 3))
    push!(arrStart, (elements = [3,2,1], test = 4)) #duplicate
    push!(arrStart, (elements = [4,5,6,2], test = 5))
    push!(arrStart, (elements = [1,2,3], test = 6)) #duplicate
    push!(arrStart, (elements = [2,3,1], test = 2)) #duplicate
    push!(arrStart, (elements = [2,1,3], test = 8)) #duplicate
    for x in arrStart
        lowestTest = x.test
        for y in arrStart
            if x.elements != y.elements && length(x.elements) == length(y.elements)
                check = 1
                for xx in x.elements
                    if in(xx, y.elements)
                        check *= 1
                    else
                        check *= 0
                    end
                end
                if check == 1
                    if x.test > y.test
                        lowestTest = y.test
                    end
                end
            end
        end
        # if it has the lowest score, then it wins
        if x.test == lowestTest
            push!(arrEnd,x)
        end
    end
    # check and see if it worked
    for x in arrEnd
        println(x)
    end
end

Have you looked at https://docs.julialang.org/en/v1/manual/performance-tips/ ?

In particular you should first try wrapping your code in a function.

2 Likes

Without getting into all the details, I think you can speed this up a lot by sorting your arrays. As I understand your code now, you are one by one checking whether any single array is equal to any of the other arrays. This is clearly O(n^2) so I’d expect it to do poorly at scale.

Instead, if you first sorted all your arrays, you would no longer need to compare each array against all other arrays, but rather only next to it’s immediate neighbor. That means checking if there are duplicates is only O(n), with O(n \log(n) ) cost to sort the arrays.

1 Like

Can you explain further? How could I make [1,2,3], [3,2,1], [2,1,3] next to each other by sorting? Wouldn’t [1,4,5] end up between them?

I’m very new to Julia so take this with a grain of salt.

You should break out of the for xx in x.elements loop if in(xx, y.elements) is false, since there doesn’t appear to be a difference in the code if one element isn’t found vs multiple elements not found. In fact check could just be a boolean, once in(xx, y.elements) is false it gets set to false and you break.

Additionally you might consider a “pre-process” step to convert the elements array into an Set. That will vastly speed up the check for duplicate elements. (You would need to keep the length of the array so that [1, 1, 2, 3] isn’t not incorrectly tested as equal to [1, 2, 3].) And it would also remove duplicate checks, if 3 is in the array 5 times, you don’t need to check for 3 all 5 times.

4 Likes

That’s your problem. Those arrays are not typed. arrEnd = Int[] etc.

4 Likes

How could I make [1,2,3], [3,2,1], [2,1,3] next to each other by sorting?

If I understand your problem correctly, the order in the individual arrays don’t matter. So you should also sort your individual arrays too.

In fact, you should be able to get all the unique sorted arrays by

unq_sorted_arrays = unique(sort.(arrayOfArrays) ) 

This should scale much better if I understand the problem correctly. But I may be mistaken on what you need.

1 Like

I’ve made some improvements as some of you suggested. I’ve added a sorted column to the array. I realized that I could use filter on the sorted column to isolate the specific rows to compare. I was able to eliminate the innermost loop as well.

With 30,000 rows or so it still takes a couple of minutes to run. I’m sure there is more I could do to improve it.

New Code:

# The goal here is to identify which arrays have duplicated items
# even if the elements are in a different order. Secondly, I want
# to keep the array with the lowest "test" score.
let
    check = 1
    lowestTest = 9999
    arrEnd = []
    arrMiddle = []
    arrStart = []
    push!(arrStart, (elements = [1,2], test = 1))
    push!(arrStart, (elements = [3,1,2], test = 7)) #duplicate
    push!(arrStart, (elements = [1,4,5], test = 3))
    push!(arrStart, (elements = [3,2,1], test = 4)) #duplicate
    push!(arrStart, (elements = [4,5,6,2], test = 5))
    push!(arrStart, (elements = [1,2,3], test = 6)) #duplicate
    push!(arrStart, (elements = [2,3,1], test = 2)) #duplicate
    push!(arrStart, (elements = [2,1,3], test = 8)) #duplicate
    for x in arrStart
        push!(arrMiddle,(elements = x.elements, test = x.test, elemSort = sort(x.elements)))
    end
    sort!(arrMiddle, by = x -> x[3])
    for x in arrMiddle
        lowestTest = x.test
        for y in filter(r -> r.test < lowestTest && r.elemSort == x.elemSort, arrMiddle)
            if x.test > y.test
                lowestTest = y.test
            end
        end
        if x.test == lowestTest
            push!(arrEnd,x)
        end
    end
    print("\n")
    for x in arrEnd
        println(x)
    end
end

You’ve still got untyped arrays. Give @ChrisRackauckas’s suggestion a try.

1 Like

Check out for instance arrEnd. What is the type? If it is to be an array of Ints, say so: arrEnd = Int64[].

Any particular reason you construct the arrays via push!? It will be more efficient if you construct them directly, or at least use sizehint!, if you can estimate the size in advance.

1 Like

Please clarify what you want. Especially clarify how you want to deal with (elements = [1,2,2], test=1). Are you dealing with “Vector without duplicates, modulo permutations” aka “Set”, or are you dealing with “Vector modulo permuations” aka sets with multiplicity?

#cf https://github.com/JuliaLang/julia/pull/31367
function update!(combine, h::Dict{K,V}, key, val)  where {K,V}
    idx = Base.ht_keyindex2!(h, key)
    if idx > 0
        @inbounds h.keys[idx] = convert(K, key)
        @inbounds vold = h.vals[idx]
        vnew = combine(vold, val)
        @inbounds h.vals[idx] = vnew
        return vnew
    else
        vnew = combine(val)
        @inbounds Base._setindex!(h, vnew, key, -idx)
        return vnew
    end
end

_combine(a,b) = (a[1]<b[1] ? a : b)
_combine(a) = a

function whatever(arrStart)
acc = Dict{Vector{Int}, Tuple{Int,Int}}()
sizehint!(acc, length(arrStart))
for i in 1:length(arrStart)
x = arrStart[i]
e,t = x.elements, x.test
update!(_combine, acc, sort(e), (t, i))
end
return [arrStart[idx] for (k,(t,idx)) in acc]
end

yielding

julia> whatever(arrStart)
4-element Array{NamedTuple{(:elements, :test),Tuple{Array{Int64,1},Int64}},1}:
 (elements = [2, 3, 1], test = 2)   
 (elements = [1, 2], test = 1)      
 (elements = [4, 5, 6, 2], test = 5)
 (elements = [1, 4, 5], test = 3)  
1 Like

Chris/Stephan/Petr, what type is this array push!(arrStart, (elements = [1,2], test = 1)) ? It seems to be a mix. I don’t know how to do this.

Simeon, I constructed the array using push! just to have an example array. I have a larger projected with a larger array that is already constructed. I did not know about sizehint!, so thank you! I’ll have to learn more about that.

foobar, your code is much more efficient than mine at scale. Wow! It just looks like magic to me, so I need to take some time to study it. Thank you!

EDIT: FYI - that box to check that the problem is solved is not available to me for some reason.

Usually, I would just ask Julia to tell me what type it is.

a = (elements = [1,2], test = 1)
typeof(a)

It should tell you

NamedTuple{(:elements, :test),Tuple{Array{Int64,1},Int64}}

So now, knowing it is a NamedTuple, we can initialize the array these ways:

arrStart = NamedTuple{(:elements, :test),Tuple{Array{Int64,1},Int64}}[]

t = typeof((elements = [1,2], test = 1))
arrStart = Array{t}(undef,0)

Also, if you know the value inside the tuple can be mixed, for example, it could be Int or Float, but not Complex, then initialize this way…

arrStart = NamedTuple{(:elements, :test),Tuple{Array{Real,1},Real}}[]
1 Like

Simplified:

function whatever(arrStart)
acc = Dict{Vector{Int}, Tuple{Int,Int}}()
sizehint!(acc, length(arrStart))
for i in 1:length(arrStart)
x = arrStart[i]
e,t = x.elements, x.test
es = sort(e)
if haskey(acc, es)
tst,idx = acc[es]
if tst >= t
   acc[es] = (t,i)
end
else
    acc[es] = (t,i)
end
end
return [arrStart[idx] for (k,(t,idx)) in acc]
end

The main point is that you want to minimize test over all key => test pairs with matching key. That is what hash-tables / associative arrays / dictionaries / unordered maps are made for.

The julia base dictionary API lacks convenient update functions that search the hash table only once, which is very relevant when hash/comparison is expensive (as in your case, where key is itself a set). This is just cosmetics (factor 1-3 speedup). Since I happened to make a PR to that effect, I could not resist showing off my proposed API.

Since your keys (unordered list of integers) are sortable, it is faster to normalize them by sorting than by using the built-in Set.

1 Like

foobar_lv2, thank you for that simplified example. It is much easier to follow for me. Julia is so fast. I was hoping to maybe improve the algorithm by 10x, but now it executes almost instantly, even with 10s of thousands of records. I would mark your answer as the solution, but I am not seeing the solution checkbox under any of the posts.

cchderrick, thank you for the examples on creating arrays with multiple types. Now, I know how to do that.

I do appreciate the community here. Thank you all!