Why isn't the list comprehension as fast as the for loop?
using Random
using BenchmarkTools
y = zeros(UInt8,256);
y[UInt8('A')] = UInt8('U');
y[UInt8('C')] = UInt8('G');
y[UInt8('G')] = UInt8('C');
y[UInt8('T')] = UInt8('A');
n = 100_000_000;
x = String([ rand("ACGT") for i in 1:n ]);
function foo1(x,y)
    [ @inbounds y[i] for i in transcode(UInt8, x) ]
end
function foo2(x,y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for i in 1:length(z)
        @inbounds z[i] = y[x[i]]
    end
    return z
end
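Before comparing timings, it is worth checking that the two functions actually agree. The following sanity check is my addition, not part of the original post; it rebuilds the lookup table from above, which maps each DNA base to its RNA complement:

```julia
# Sanity check (my addition, not in the original post): both functions
# should produce the RNA complement of the input, so they must agree.
y = zeros(UInt8, 256)          # lookup table: input byte -> complement byte
y[UInt8('A')] = UInt8('U')
y[UInt8('C')] = UInt8('G')
y[UInt8('G')] = UInt8('C')
y[UInt8('T')] = UInt8('A')

foo1(x, y) = [ @inbounds y[i] for i in transcode(UInt8, x) ]

function foo2(x, y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for i in 1:length(z)
        @inbounds z[i] = y[x[i]]
    end
    return z
end

s = "ACGT"
@assert foo1(s, y) == foo2(s, y)
@assert String(foo1(s, y)) == "UGCA"   # A->U, C->G, G->C, T->A
```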
julia> @btime foo1(x,y);
110.531 ms (2 allocations: 95.37 MiB)
julia> @btime foo2(x,y);
93.206 ms (2 allocations: 95.37 MiB)
Now I see it: your functions aren't equivalent.
foo2 should be:
julia> function foo2(x,y)
           x = transcode(UInt8, x)
           z = Vector{UInt8}(undef, length(x))
           index = 1
           for i in x
               @inbounds z[index] = y[i]
               index += 1
           end
           return z
       end
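As an aside, the manual `index` counter can also be written with `enumerate`. This variant (`foo2_enum` is my name, not from the thread, and it was not benchmarked there) iterates over the bytes directly in the same way:

```julia
# Equivalent variant of the corrected foo2 using enumerate instead of
# a manual counter (my sketch; not part of the original benchmarks).
y = zeros(UInt8, 256)
y[UInt8('A')] = UInt8('U')
y[UInt8('C')] = UInt8('G')
y[UInt8('G')] = UInt8('C')
y[UInt8('T')] = UInt8('A')

function foo2_enum(x, y)
    xb = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(xb))
    # enumerate yields (position, byte) pairs, replacing the manual index
    for (index, b) in enumerate(xb)
        @inbounds z[index] = y[b]
    end
    return z
end

@assert String(foo2_enum("ACGT", y)) == "UGCA"
```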
With that I get:
julia> @btime foo2(x,y);
50.135 ms (2 allocations: 95.37 MiB)
I don't understand assembly, but it is interesting to see that the compiler does not seem able to eliminate this small overhead in what one would consider a simple loop.
julia> @btime foo1(x,y);
80.580 ms (2 allocations: 95.37 MiB)
julia> @btime foo4(x,y);
92.395 ms (2 allocations: 95.37 MiB)
which is quite astounding.
In 1.6.0 the timings are nearly identical.
I have now installed 1.5.4, and I get:
julia> @btime foo1(x,y);
80.843 ms (2 allocations: 95.37 MiB)
julia> @btime foo2(x,y);
81.314 ms (2 allocations: 95.37 MiB)
julia> @btime foo3(x,y);
81.172 ms (2 allocations: 95.37 MiB)
julia> @btime foo4(x,y);
93.333 ms (2 allocations: 95.37 MiB)
which doesn't reproduce your timings, so my guess is that there are differences in the hardware (CPU).
I don't want to investigate why foo4 is slower even though its @code_lowered is exactly identical to foo1's, because I suspect the cause would be very hard to find.
I believe we are seeing differences between the CPUs here, perhaps in the L1/L2/L3 cache sizes.
Perhaps, for you, at some smaller n the timings become equal.
If I try n = 1_000_000_000 instead of n = 100_000_000, I see the same step in the timings that you do:
julia> @btime foo1(x,y);
572.141 ms (2 allocations: 953.67 MiB)
julia> @btime foo2(x,y);
821.871 ms (2 allocations: 953.67 MiB)
julia> @btime foo3(x,y);
551.218 ms (2 allocations: 953.67 MiB)
julia> @btime foo4(x,y);
572.606 ms (2 allocations: 953.67 MiB)
But proving that this is caused by different CPUs is beyond my capability.
Can we perhaps conclude, by comparing foo2 vs foo3, that the extra getindex causes the performance difference? If so, list-comprehension performance is quite acceptable in this case.
It would be interesting to be able to profile code at the LLVM level.
Of course that is the case. I thought this was the easy part of this whole discussion. Conclusion: comprehensions, loops, and generators can have similar timings (at least in 1.6, and if the problem size is not too big).
After considering that settled, I analyzed the different Julia versions and problem sizes, which led me to think that CPU differences also play a role in the timing differences. Suspicion: in 1.6, above a CPU-specific threshold, the comprehension (and the equivalent generator) are superior to loops in this special case (the proof is beyond my skill).
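For completeness, the "equivalent generator" mentioned above can be written, as I read it, by collecting a generator expression instead of building the array with a comprehension (my sketch; this exact form was not benchmarked in the thread):

```julia
# Generator version: collect() materializes the lazy generator into a
# Vector{UInt8}, analogous to the comprehension in foo1 (my sketch).
y = zeros(UInt8, 256)
y[UInt8('A')] = UInt8('U')
y[UInt8('C')] = UInt8('G')
y[UInt8('G')] = UInt8('C')
y[UInt8('T')] = UInt8('A')

foo_gen(x, y) = collect(y[i] for i in transcode(UInt8, x))

@assert String(foo_gen("ACGT", y)) == "UGCA"
```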