Sum of doubles

I am trying to speed up my solution to the following exercise:

Given an Array of int values (from 1 to p), return the sum of doubles. If one of the values occurs only once, it does not count towards the sum.

  • loneSum([1, 2, 3]) → 0
  • loneSum([3, 2, 3]) → 6
  • loneSum([3, 3, 3]) → 9

The current solution:

using BenchmarkTools
function sumofdoubles(values, p)
    sum = 0
    counts = zeros(UInt16, p)

    @inbounds for value in values
        counts[value] += 1
    end

    @inbounds for (i, v) in enumerate(counts)
        (v > 1) && (sum += i * v)
    end

    sum
end

p = 10^4
x = rand(1:p, 10^8)

@btime sumofdoubles($x, $p) #68.616 ms (2 allocations: 19.64 KiB)

Does anyone have an idea how to make it even faster?

Do you have the original task description? There are quite a few question marks:

Given an Array of int values (from 1 to p), return the sum of doubles.

There are no doubles in 1:p, so probably you meant (of length p)

The loneSum examples work on a single array, your sumofdoubles(values, p) has two inputs

If the second parameter is just to pass the length (which could be inferred from values), then your actual running example is not consistent: 10^4 vs. 10^8.

I would go for building up a set (not a dict), checking for membership, and adding up.
But wait, your solution is better: you just count and multiply later, instead of (more costly) adding values as you go.
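For completeness, the set version I had in mind would be roughly like this (just an untested sketch; the names are only for illustration). It tracks first and second occurrences so that doubles are counted with their multiplicity, matching your semantics:

function sumofdoubles_set(values)
    seen_once  = Set{Int}()
    seen_twice = Set{Int}()
    s = 0
    for v in values
        if !(v in seen_once)
            push!(seen_once, v)          # first occurrence: don't count yet
        elseif !(v in seen_twice)
            push!(seen_twice, v)
            s += 2v                      # second occurrence: count both
        else
            s += v                       # third and later occurrences
        end
    end
    s
end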

Sorry, my post was not very clear. This is only a toy exercise, but it is still fun to see how far one can go with performance. I expanded on the original task description:

Given an Array of int values (all between 0 and 9), return the sum of doubles. If one of the values occurs only once, it does not count towards the sum.

  • loneSum([1, 2, 3]) → 0
  • loneSum([3, 2, 3]) → 6
  • loneSum([3, 3, 3]) → 9

I wanted to expand on the original question. In my case there are 10^8 array elements (instead of 3) where each element is a random integer between 1 and 10^4 (instead of 0 to 9).

If I didn’t pass p as an argument, it would already take 54 ms just to find the maximum value (which is 10^4) in the array.

Thank you for your suggestion, but I can’t see how a solution using a dict could be faster.

The effective bandwidth is

julia> GBpS=(8*10^8)/(68.61*10^6)
11.660107855997667

if I only take into account the big array reads (8 bytes per Int64 × 10^8 elements, divided by the runtime in nanoseconds, gives GB/s).

With one thread, it may be close to the maximum read-from-RAM bandwidth of your computer… How long does it take to simply read the values array?

You could eliminate all the multiplications by summing all the values and then subtracting the values which only occur once.
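Something like this sketch (untested; the name is just for illustration): accumulate the running total while counting, then subtract the values that occur exactly once, so the second loop needs no multiplication.

function sumofdoubles_subtract(values, p)
    counts = zeros(UInt16, p)
    s = 0
    @inbounds for v in values
        counts[v] += one(UInt16)
        s += v                       # running total of every value
    end
    @inbounds for (i, c) in enumerate(counts)
        (c == 1) && (s -= i)         # drop values that occur exactly once
    end
    s
end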

The second loop costs almost nothing. Nearly all the time is spent filling the counts array.

You could use LoopVectorization.jl: replace @inbounds with @turbo and it should go much faster.

That could be it. How can this be tested without it getting optimized away by the compiler?
When I profile the code, most of the time is spent in getindex.

I just tried that, and when I simply replace @inbounds with @turbo I get the following error:

LoadError: MethodError: no method matching canonicalize_range(::Vector{Int64})

Would there not be a problem with a race condition anyway?

I guess that the time for a linear sum of the array’s elements (maybe not Julia’s Base.sum, which is based on reduce) is mostly dominated by the read bandwidth for arrays larger than the L3 cache. This should give you an upper bound for your algorithm’s performance.

Another way to see it is to use a smaller array (< 1e6 elements): you should see a large per-element performance bump because BenchmarkTools will keep the cache hot.
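For example, something along these lines (a rough sketch; the variable names are just for illustration and the numbers are machine dependent). Summing uses every element, so the reads cannot be optimized away, and the $ interpolation avoids benchmarking the global-variable lookup:

using BenchmarkTools

x_big   = rand(1:10^4, 10^8)   # much larger than the L3 cache
x_small = rand(1:10^4, 10^5)   # small enough to stay cache resident

t_big   = @belapsed sum($x_big)
t_small = @belapsed sum($x_small)

@show t_big / length(x_big)      # seconds per element, memory bound
@show t_small / length(x_small)  # seconds per element, hot cache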

That must be the limiting factor. A simple sum(x) already takes 53.779 ms.
Thank you for helping me understand the bottleneck.

You may get some acceleration with threads (2x on dual-channel or 4x on quad-channel memory), but you must replicate the counters to avoid data races; see the sketch below.
Another way is to reduce the size of your integers: try it with values of type Vector{Int32}.
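Roughly what I have in mind for the threaded version (an untested sketch; the name and the chunking are only for illustration): give each chunk of the array its own counter column and merge the columns at the end, so no two tasks ever write to the same counter.

using Base.Threads

function sumofdoubles_threaded(values, p)
    nt = nthreads()
    counts = zeros(UInt16, p, nt)        # one counter column per chunk
    n = length(values)

    @threads for t in 1:nt
        lo = 1 + (t - 1) * n ÷ nt        # chunk boundaries for this task
        hi = t * n ÷ nt
        @inbounds for i in lo:hi
            counts[values[i], t] += one(UInt16)
        end
    end

    s = 0
    @inbounds for i in 1:p
        c = 0
        for t in 1:nt
            c += Int(counts[i, t])       # merge the per-chunk counts
        end
        (c > 1) && (s += i * c)
    end
    s
end

Remember to start Julia with several threads (e.g. julia -t auto), otherwise nthreads() is 1 and this falls back to the serial version.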

Let’s leave it at this. My friend, who used Java with the same parameters, got a time of 134 ms. I just wanted to flex on him with Julia.

You’d need something like:

function sumofdoubles(values, p)
    sum = 0
    counts = zeros(UInt16, p)

    @turbo for i in eachindex(values)
        value = values[i]
        counts[value] += 1
    end

    @inbounds for (i, v) in enumerate(counts)
        (v > 1) && (sum += i * v)
    end

    sum
end

Or, alternatively, counts[values[i]] += 1.
Here is the associated issue.

Yes, the above would get an incorrect answer if multiple copies of the same value occur close enough together.
It also requires very slow memory operations, so it’s unlikely to really be faster anyway: you can’t really speed up non-consecutive memory operations with SIMD, unless they have particular patterns allowing you to use consecutive memory accesses + shuffles.
If the rest of the computation is expensive enough, SIMD can still help.
Here, the rest of the computation (+=1) is not expensive.

And unlike threads, SIMD can’t help with memory bottlenecks.

I was just trying to do it differently (without success!).
Anyway, creating two BitVectors to track whether a value has already occurred once or twice needs less memory, but is slower on my computer.

using BenchmarkTools
function sumofdoubles_bitarray(values, p)
    s = 0
    first_val = trues(p)    # true means the value has not been seen yet
    second_val = trues(p)   # true means the value has not been seen a second time yet

    @inbounds for v in values
        if first_val[v]
            first_val[v] = false
        else
            if second_val[v]
                s += 2*v
                second_val[v] = false
            else
                s += v
            end
        end
    end

    s
end

p = 10^4
x = rand(1:p, 10^8)

@btime sumofdoubles_bitarray($x, $p) # 173.226 ms (4 allocations: 2.72 KiB)
# for comparison
@btime sumofdoubles($x, $p) # 102.769 ms (2 allocations: 19.64 KiB)

Again the second loop consumes little time and the performance is memory bound.
The following (marginally improved) code gives the following results on my laptop:

t_with_Int64 = 0.079648421
t_with_Int32 = 0.056981823
t_1_loop_with_Int32 = 0.067936515
using BenchmarkTools
using Random

Random.seed!(1234)

function sumofdoubles(values, p)
    sum = 0
    counts = zeros(UInt16, p)
    _1 = one(UInt16) # this seems to give a small acceleration

    @inbounds for value in values
        counts[value] += _1
    end

    for (i, v) in enumerate(counts)
        (v > 1) && (sum += i * v)
    end

    sum
end

function only_counts(values, p)
    sum = 0
    counts = zeros(UInt16, p)
    _1 = one(UInt16)

    @inbounds for value in values
        counts[value] += _1
    end

    sum += Int(counts[1])
end

p = 10^4
x = rand(1:p, 10^8)

t_with_Int64 = @belapsed sumofdoubles($x, $p)

x32 = Vector{Int32}(x)
t_with_Int32 = @belapsed sumofdoubles($x32, $p)

x32 = Vector{Int32}(x)
t_1_loop_with_Int32 = @belapsed only_counts($x32, $p)
@show t_with_Int64
@show t_with_Int32
@show t_1_loop_with_Int32

Nice idea. I wonder why it is slower given it uses so much less memory.

You can use slightly less memory by going with a DiBitVector from DataStructures.jl, but the performance is the same.
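Roughly like this (an untested sketch; I’m assuming the DiBitVector(n, v) constructor from DataStructures.jl with 2-bit elements in 0:3). The state per value is 0 = unseen, 1 = seen once, 2 = seen at least twice:

using DataStructures

function sumofdoubles_dibit(values, p)
    state = DiBitVector(p, 0)
    s = 0
    @inbounds for v in values
        st = state[v]
        if st == 0
            state[v] = 1          # first occurrence: just remember it
        elseif st == 1
            state[v] = 2
            s += 2v               # second occurrence: count both
        else
            s += v                # third and later occurrences
        end
    end
    s
end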