I am trying to speed up my solution to the following exercise:
Given an array of int values (from 1 to p), return the sum of the doubles. If one of the values occurs only once, it does not count towards the sum.
loneSum([1, 2, 3]) → 0
loneSum([3, 2, 3]) → 6
loneSum([3, 3, 3]) → 9
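For reference, a straightforward Dict-based baseline that pins down the semantics (my own sketch, not meant to be fast):

```julia
# Count occurrences in a Dict, then sum every occurrence of each value
# that appears more than once.
function loneSum(values)
    counts = Dict{Int,Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    s = 0
    for (v, c) in counts
        c > 1 && (s += v * c)
    end
    return s
end

loneSum([1, 2, 3])  # → 0
loneSum([3, 2, 3])  # → 6
loneSum([3, 3, 3])  # → 9
```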
The current solution:
using BenchmarkTools
function sumofdoubles(values, p)
    sum = 0
    counts = zeros(UInt16, p)
    @inbounds for value in values
        counts[value] += 1
    end
    @inbounds for (i, v) in enumerate(counts)
        (v > 1) && (sum += i * v)
    end
    sum
end
p = 10^4
x = rand(1:p, 10^8)
@btime sumofdoubles($x, $p) #68.616 ms (2 allocations: 19.64 KiB)
Does anyone have an idea how to make it even faster?
Do you have the original task description? There are quite a few open questions:
Given an array of int values (from 1 to p), return the sum of the doubles.
There are no doubles in 1:p, so probably you meant (of length p)
The loneSum examples work on a single array, your sumofdoubles(values, p) has two inputs
If the second parameter is just there to pass the length (which could be inferred from values), then your running example is inconsistent: p = 10^4, but the array has 10^8 elements.
I would go for building up a set (not dict), check for membership and add up.
But wait, your solution is better, you just count and multiply later, instead of (more costly) adding values.
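For completeness, the set-based idea I had in mind would look roughly like this (a sketch; as said, the counting version should be faster, since it adds up value-by-value instead of count-times-value):

```julia
# Track values seen once and values confirmed as "doubles";
# on the second occurrence, retroactively add the first one too.
function sumofdoubles_set(values)
    seen = Set{Int}()
    doubled = Set{Int}()
    s = 0
    for v in values
        if v in seen
            if !(v in doubled)
                s += v            # retroactively count the first occurrence
                push!(doubled, v)
            end
            s += v
        else
            push!(seen, v)
        end
    end
    return s
end

sumofdoubles_set([3, 2, 3])  # → 6
```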
Sorry my post was not very clear. This is only a toy exercise, but still fun to see how far one can go with performance. I expanded on the original task description:
Given an array of int values (all between 0 and 9), return the sum of the doubles. If one of the values occurs only once, it does not count towards the sum.
loneSum([1, 2, 3]) → 0
loneSum([3, 2, 3]) → 6
loneSum([3, 3, 3]) → 9
I wanted to expand on the original question. In my case there are 10^8 array elements (instead of 3), where each element is a random integer between 1 and 10^4 (instead of 0 to 9).
If I didn’t pass p as an argument, it would already take 54 ms just to find the maximum value (which is 10^4) in the array.
Thank you for your suggestion, but I can’t see a way how a solution using a dict could be faster.
That could be it. How can this be tested without it getting optimized away by the compiler?
When I profile the code most of the time is spent on getindex.
I guess that the time for a linear sum over the array's elements (maybe not Julia's reduce-based Base.sum) is mostly dominated by read bandwidth for arrays larger than the L3 cache. This should give you an upper bound on your algorithm's performance.
Another way to see it is to use a smaller array (< 1e6 elements): you should see a large per-element performance bump because BenchmarkTools keeps the cache hot.
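To put a number on that upper bound, you can time one linear pass over the same array and convert it to GB/s (a rough sketch; the variable names are mine and the result depends on your machine):

```julia
# Rough read-bandwidth estimate: one linear pass over a large array.
x = rand(1:10^4, 10^8)
sum(x)                     # warm-up run to exclude compilation time
t = @elapsed sum(x)
gb = sizeof(x) / 1e9       # bytes read, in GB (0.8 GB for 10^8 Int64)
println("≈ ", round(gb / t; digits = 1), " GB/s effective read bandwidth")
```

Your counting loop cannot be faster than reading the array once at this rate.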
You may get some acceleration with threads (2x on dual-channel or 4x on quad-channel memory), but you must replicate the counters to avoid data races.
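A sketch of that threaded variant with one counter array per thread (the function name is mine; assumes Julia was started with several threads, e.g. julia -t 4):

```julia
using Base.Threads

# One counter array per thread, merged at the end, so no data races.
function sumofdoubles_threaded(values, p)
    nt = nthreads()
    counts = [zeros(UInt16, p) for _ in 1:nt]
    # :static scheduling pins each chunk to one thread, so threadid()
    # is stable inside the loop body.
    @threads :static for i in eachindex(values)
        @inbounds counts[threadid()][values[i]] += 1
    end
    s = 0
    for i in 1:p
        c = 0
        for t in 1:nt
            c += Int(counts[t][i])
        end
        c > 1 && (s += i * c)
    end
    s
end
```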
Another way is to reduce the size of your integers: try it with values of type Vector{Int32}.
using LoopVectorization  # provides @turbo

function sumofdoubles(values, p)
    sum = 0
    counts = zeros(UInt16, p)
    @turbo for i in eachindex(values)
        value = values[i]
        counts[value] += 1
    end
    @inbounds for (i, v) in enumerate(counts)
        (v > 1) && (sum += i * v)
    end
    sum
end
Or, alternatively, counts[values[i]] += 1. Here is the associated issue.
Yes, the above would give an incorrect answer if multiple copies of the same value occur close enough together.
It also requires very slow memory operations, so it’s unlikely to really be faster anyway: you can’t really speed up non-consecutive memory accesses with SIMD, unless they follow particular patterns that allow consecutive accesses plus shuffles.
If the rest of the computation is expensive enough, SIMD can still help.
Here, the rest of the computation (+=1) is not expensive.
And unlike threads, SIMD can’t help with memory bottlenecks.
I was just trying to do it differently (without success!).
Anyway, creating two BitVectors to track whether a value has already occurred once or twice needs less memory, but it is slower on my computer.
using BenchmarkTools
function sumofdoubles_bitarray(values, p)
    s = 0
    first_val = trues(p)
    second_val = trues(p)
    @inbounds for v in values
        if first_val[v]
            first_val[v] = false
        elseif second_val[v]
            s += 2 * v
            second_val[v] = false
        else
            s += v
        end
    end
    s
end
p = 10^4
x = rand(1:p, 10^8)
@btime sumofdoubles_bitarray($x, $p) # 173.226 ms (4 allocations: 2.72 KiB)
# for comparison
@btime sumofdoubles($x, $p) # 102.769 ms (2 allocations: 19.64 KiB)
Again, the second loop consumes little time; the performance is memory-bound.
The following (marginally improved) code gives these results on my laptop:
using BenchmarkTools
using Random
Random.seed!(1234)
function sumofdoubles(values, p)
    sum = 0
    counts = zeros(UInt16, p)
    _1 = one(UInt16)  # this seems to give a small acceleration
    @inbounds for value in values
        counts[value] += _1
    end
    for (i, v) in enumerate(counts)
        (v > 1) && (sum += i * v)
    end
    sum
end
function only_counts(values, p)
    sum = 0
    counts = zeros(UInt16, p)
    _1 = one(UInt16)
    @inbounds for value in values
        counts[value] += _1
    end
    sum += Int(counts[1])
end
p = 10^4
x = rand(1:p, 10^8)
t_with_Int64 = @belapsed sumofdoubles($x, $p)
x32 = Vector{Int32}(x)
t_with_Int32 = @belapsed sumofdoubles($x32, $p)
t_1_loop_with_Int32 = @belapsed only_counts($x32, $p)
@show t_with_Int64
@show t_with_Int32
@show t_1_loop_with_Int32