 # Performance Help, Algorithm Is Not Scaling Well

I have an array with arrays inside. I created an algorithm to find duplicated elements in arrays in the array. It works correctly, but when I start scaling up to thousands or tens of thousands of records, then it is not efficient. How can I make this more efficient? How could I approach this problem differently?

``````# The goal here is to identify which arrays have duplicated items
# even if the elements are in a different order. Secondly, I want
# to keep the array with the lowest "test" score.
let
check = 1
lowestTest = 9999
arrEnd = []
arrStart = []
push!(arrStart, (elements = [1,2], test = 1))
push!(arrStart, (elements = [3,1,2], test = 7)) #duplicate
push!(arrStart, (elements = [1,4,5], test = 3))
push!(arrStart, (elements = [3,2,1], test = 4)) #duplicate
push!(arrStart, (elements = [4,5,6,2], test = 5))
push!(arrStart, (elements = [1,2,3], test = 6)) #duplicate
push!(arrStart, (elements = [2,3,1], test = 2)) #duplicate
push!(arrStart, (elements = [2,1,3], test = 8)) #duplicate
for x in arrStart
lowestTest = x.test
for y in arrStart
if x.elements != y.elements && length(x.elements) == length(y.elements)
check = 1
for xx in x.elements
if in(xx, y.elements)
check *= 1
else
check *= 0
end
end
if check == 1
if x.test > y.test
lowestTest = y.test
end
end
end
end
# if it has the lowest score, then it wins
if x.test == lowestTest
push!(arrEnd,x)
end
end
# check and see if it worked
for x in arrEnd
println(x)
end
end
``````

Have you looked at https://docs.julialang.org/en/v1/manual/performance-tips/ ?

In particular you should first try wrapping your code in a function.

2 Likes

Without getting into all the details, I think you can speed this up a lot by sorting your arrays. As I understand your code now, you are one by one checking whether any single array is equal to any of the other arrays. This is clearly O(n^2) so I’d expect it to do poorly at scale.

Instead, if you first sorted all your arrays, you would no longer need to compare each array against all other arrays, but rather only next to it’s immediate neighbor. That means checking if there are duplicates is only O(n), with O(n \log(n) ) cost to sort the arrays.

1 Like

Can you explain further? How could I make [1,2,3], [3,2,1], [2,1,3] next to each other by sorting? Wouldn’t [1,4,5] end up between them?

I’m very new to Julia so take this with a grain of salt.

You should break out of the for xx in x.elements loop if in(xx, y.elements) is false, since there doesn’t appear to be a difference in the code if one element isn’t found vs multiple elements not found. In fact check could just be a boolean, once in(xx, y.elements) is false it gets set to false and you break.

Additionally you might consider a “pre-process” step to convert the elements array into an Set. That will vastly speed up the check for duplicate elements. (You would need to keep the length of the array so that [1, 1, 2, 3] isn’t not incorrectly tested as equal to [1, 2, 3].) And it would also remove duplicate checks, if 3 is in the array 5 times, you don’t need to check for 3 all 5 times.

4 Likes

That’s your problem. Those arrays are not typed. `arrEnd = Int[]` etc.

4 Likes

How could I make [1,2,3], [3,2,1], [2,1,3] next to each other by sorting?

If I understand your problem correctly, the order in the individual arrays don’t matter. So you should also sort your individual arrays too.

In fact, you should be able to get all the unique sorted arrays by

``````unq_sorted_arrays = unique(sort.(arrayOfArrays) )
``````

This should scale much better if I understand the problem correctly. But I may be mistaken on what you need.

1 Like

I’ve made some improvements as some of you suggested. I’ve added a sorted column to the array. I realized that I could use filter on the sorted column to isolate the specific rows to compare. I was able to eliminate the innermost loop as well.

With 30,000 rows or so it still takes a couple of minutes to run. I’m sure there is more I could do to improve it.

New Code:

``````# The goal here is to identify which arrays have duplicated items
# even if the elements are in a different order. Secondly, I want
# to keep the array with the lowest "test" score.
let
check = 1
lowestTest = 9999
arrEnd = []
arrMiddle = []
arrStart = []
push!(arrStart, (elements = [1,2], test = 1))
push!(arrStart, (elements = [3,1,2], test = 7)) #duplicate
push!(arrStart, (elements = [1,4,5], test = 3))
push!(arrStart, (elements = [3,2,1], test = 4)) #duplicate
push!(arrStart, (elements = [4,5,6,2], test = 5))
push!(arrStart, (elements = [1,2,3], test = 6)) #duplicate
push!(arrStart, (elements = [2,3,1], test = 2)) #duplicate
push!(arrStart, (elements = [2,1,3], test = 8)) #duplicate
for x in arrStart
push!(arrMiddle,(elements = x.elements, test = x.test, elemSort = sort(x.elements)))
end
sort!(arrMiddle, by = x -> x)
for x in arrMiddle
lowestTest = x.test
for y in filter(r -> r.test < lowestTest && r.elemSort == x.elemSort, arrMiddle)
if x.test > y.test
lowestTest = y.test
end
end
if x.test == lowestTest
push!(arrEnd,x)
end
end
print("\n")
for x in arrEnd
println(x)
end
end
``````

You’ve still got untyped arrays. Give @ChrisRackauckas’s suggestion a try.

1 Like

Check out for instance `arrEnd`. What is the type? If it is to be an array of Ints, say so: `arrEnd = Int64[]`.

Any particular reason you construct the arrays via `push!`? It will be more efficient if you construct them directly, or at least use `sizehint!`, if you can estimate the size in advance.

1 Like

Please clarify what you want. Especially clarify how you want to deal with `(elements = [1,2,2], test=1)`. Are you dealing with “Vector without duplicates, modulo permutations” aka “Set”, or are you dealing with “Vector modulo permuations” aka sets with multiplicity?

``````#cf https://github.com/JuliaLang/julia/pull/31367
function update!(combine, h::Dict{K,V}, key, val)  where {K,V}
idx = Base.ht_keyindex2!(h, key)
if idx > 0
@inbounds h.keys[idx] = convert(K, key)
@inbounds vold = h.vals[idx]
vnew = combine(vold, val)
@inbounds h.vals[idx] = vnew
return vnew
else
vnew = combine(val)
@inbounds Base._setindex!(h, vnew, key, -idx)
return vnew
end
end

_combine(a,b) = (a<b ? a : b)
_combine(a) = a

function whatever(arrStart)
acc = Dict{Vector{Int}, Tuple{Int,Int}}()
sizehint!(acc, length(arrStart))
for i in 1:length(arrStart)
x = arrStart[i]
e,t = x.elements, x.test
update!(_combine, acc, sort(e), (t, i))
end
return [arrStart[idx] for (k,(t,idx)) in acc]
end
``````

yielding

``````julia> whatever(arrStart)
4-element Array{NamedTuple{(:elements, :test),Tuple{Array{Int64,1},Int64}},1}:
(elements = [2, 3, 1], test = 2)
(elements = [1, 2], test = 1)
(elements = [4, 5, 6, 2], test = 5)
(elements = [1, 4, 5], test = 3)
``````
1 Like

Chris/Stephan/Petr, what type is this array `push!(arrStart, (elements = [1,2], test = 1))` ? It seems to be a mix. I don’t know how to do this.

Simeon, I constructed the array using `push!` just to have an example array. I have a larger projected with a larger array that is already constructed. I did not know about `sizehint!`, so thank you! I’ll have to learn more about that.

foobar, your code is much more efficient than mine at scale. Wow! It just looks like magic to me, so I need to take some time to study it. Thank you!

EDIT: FYI - that box to check that the problem is solved is not available to me for some reason.

Usually, I would just ask Julia to tell me what type it is.

``````a = (elements = [1,2], test = 1)
typeof(a)
``````

It should tell you

``````NamedTuple{(:elements, :test),Tuple{Array{Int64,1},Int64}}
``````

So now, knowing it is a `NamedTuple`, we can initialize the array these ways:

``````arrStart = NamedTuple{(:elements, :test),Tuple{Array{Int64,1},Int64}}[]

t = typeof((elements = [1,2], test = 1))
arrStart = Array{t}(undef,0)
``````

Also, if you know the value inside the tuple can be mixed, for example, it could be Int or Float, but not Complex, then initialize this way…

``````arrStart = NamedTuple{(:elements, :test),Tuple{Array{Real,1},Real}}[]
``````
1 Like

Simplified:

``````function whatever(arrStart)
acc = Dict{Vector{Int}, Tuple{Int,Int}}()
sizehint!(acc, length(arrStart))
for i in 1:length(arrStart)
x = arrStart[i]
e,t = x.elements, x.test
es = sort(e)
tst,idx = acc[es]
if tst >= t
acc[es] = (t,i)
end
else
acc[es] = (t,i)
end
end
return [arrStart[idx] for (k,(t,idx)) in acc]
end
``````

The main point is that you want to minimize `test` over all `key => test` pairs with matching `key`. That is what hash-tables / associative arrays / dictionaries / unordered maps are made for.

The julia base dictionary API lacks convenient update functions that search the hash table only once, which is very relevant when hash/comparison is expensive (as in your case, where `key` is itself a set). This is just cosmetics (factor 1-3 speedup). Since I happened to make a PR to that effect, I could not resist showing off my proposed API.

Since your keys (unordered list of integers) are sortable, it is faster to normalize them by sorting than by using the built-in `Set`.

1 Like

foobar_lv2, thank you for that simplified example. It is much easier to follow for me. Julia is so fast. I was hoping to maybe improve the algorithm by 10x, but now it executes almost instantly, even with 10s of thousands of records. I would mark your answer as the solution, but I am not seeing the solution checkbox under any of the posts.

cchderrick, thank you for the examples on creating arrays with multiple types. Now, I know how to do that.

I do appreciate the community here. Thank you all!