# For-loop vs list-comprehension

Why isn't the list-comprehension as fast as the for-loop?

``````
using Random
using BenchmarkTools

y = zeros(UInt8, 256);
y[UInt8('A')] = UInt8('U');
y[UInt8('C')] = UInt8('G');
y[UInt8('G')] = UInt8('C');
y[UInt8('T')] = UInt8('A');

n = 100_000_000;
x = String([rand("ACGT") for i in 1:n]);

function foo1(x, y)
    [@inbounds y[i] for i in transcode(UInt8, x)]
end

function foo2(x, y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for i in 1:length(z)
        @inbounds z[i] = y[x[i]]
    end
    return z
end
``````
``````
julia> @btime foo1(x,y);
110.531 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
93.206 ms (2 allocations: 95.37 MiB)
``````
``````
julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) m3-7Y32 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
``````
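As a side note, the same table lookup can also be written as a single vectorized `getindex`, indexing `y` with the whole byte vector at once. This is just a sketch for comparison, not from the discussion: `foo0` is a made-up name, and `y` is rebuilt here so the snippet is self-contained.

```julia
# Sketch: the byte-level lookup expressed as one vectorized getindex.
# foo0 is a hypothetical name; y is the same 256-entry table as above.
y = zeros(UInt8, 256)
y[UInt8('A')] = UInt8('U')
y[UInt8('C')] = UInt8('G')
y[UInt8('G')] = UInt8('C')
y[UInt8('T')] = UInt8('A')

# Index the lookup table with the whole byte vector in one call:
foo0(x, y) = y[transcode(UInt8, x)]

foo0("ACGT", y)   # the mapped bytes, i.e. the code units of "UGCA"
```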

For me the comprehension is faster:

``````
julia> @btime foo1(x,y);
56.660 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
80.200 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
``````

Now I see it: your functions aren't equivalent.
`foo2` should be:

``````
julia> function foo2(x,y)
           x = transcode(UInt8, x)
           z = Vector{UInt8}(undef, length(x))
           index = 1
           for i in x
               @inbounds z[index] = y[i]
               index += 1
           end
           return z
       end
``````

With that I get:

``````
julia> @btime foo2(x,y);
50.135 ms (2 allocations: 95.37 MiB)
``````

This removes one `getindex` call per iteration from the code.
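The manual index counter can also be avoided with `enumerate`, which should produce essentially the same loop. A minimal sketch (`foo2e` is a made-up name, and the lookup table is rebuilt so the snippet runs on its own):

```julia
# Sketch: same loop as the corrected foo2, but using enumerate instead
# of a manually incremented index. foo2e is a hypothetical name.
function foo2e(x, y)
    xb = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(xb))
    for (j, i) in enumerate(xb)   # j is the output position, i the input byte
        @inbounds z[j] = y[i]
    end
    return z
end

# Small lookup table for demonstration (same construction as above):
y = zeros(UInt8, 256)
y[UInt8('A')] = UInt8('U'); y[UInt8('C')] = UInt8('G')
y[UInt8('G')] = UInt8('C'); y[UInt8('T')] = UInt8('A')

foo2e("ACGT", y)   # the code units of "UGCA"
```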

Ah, good point, yes, but I get slower times with that modification in v1.5.4:

``````
function foo3(x,y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    j = 1
    for i in x
        @inbounds z[j] = y[i]
        j += 1
    end
    return z
end
``````
``````
julia> @btime foo3(x,y);
132.790 ms (2 allocations: 95.37 MiB)
``````

On v1.6.0:

``````
julia> @btime foo1(x,y);
84.296 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
79.005 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
75.708 ms (2 allocations: 95.37 MiB)
``````
``````
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) m3-7Y32 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
``````

I don't understand assembly, but it is interesting to see that the compiler seems unable to remove some small overhead in what one would consider a simple loop.

What's your hardware? You are on macOS with an Intel Core m3 (a mobile CPU?).
This is what I get in 1.5.3:

``````
julia> @btime foo1(x,y);
81.065 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
81.338 ms (2 allocations: 95.37 MiB)
``````

The functions compile quite differently:

``````
julia> @code_lowered foo1(x,y)
CodeInfo(
1 ─ %1 = Main.:(var"#3#4")
│   %2 = Core.typeof(y)
│   %3 = Core.apply_type(%1, %2)
│        #3 = %new(%3, y)
│   %5 = #3
│   %6 = Main.transcode(Main.UInt8, x)
│   %7 = Base.Generator(%5, %6)
│   %8 = Base.collect(%7)
└──      return %8
)

julia> @code_lowered foo3(x,y)
CodeInfo(
1 ─       x@_9 = x@_2
│         x@_9 = Main.transcode(Main.UInt8, x@_9)
│   %3  = Core.apply_type(Main.Vector, Main.UInt8)
│   %4  = Main.length(x@_9)
│         z = (%3)(Main.undef, %4)
│         j = 1
│   %7  = x@_9
│         @_6 = Base.iterate(%7)
│   %9  = @_6 === nothing
│   %10 = Base.not_int(%9)
└──       goto #4 if not %10
2 ┄ %12 = @_6
│         i = Core.getfield(%12, 1)
│   %14 = Core.getfield(%12, 2)
│         $(Expr(:inbounds, true))
│   %16 = Base.getindex(y, i)
│         Base.setindex!(z, %16, j)
│         val = %16
│         $(Expr(:inbounds, :pop))
│         val
│         j = j + 1
│         @_6 = Base.iterate(%7, %14)
│   %23 = @_6 === nothing
│   %24 = Base.not_int(%23)
└──       goto #4 if not %24
3 ─       goto #2
4 ┄       return z
)
``````

In the case of `foo3` we see two loops, one for `transcode` and the second for the `for` loop. (This turned out to be a wrong interpretation.)


Or perhaps a single regression in 1.5.4.

If we create `foo4` as

``````
julia> function foo4(x,y)
           collect(y[x] for x in transcode(UInt8, x))
       end
``````

which has lowered code identical to `foo1`'s:

``````
julia> @code_lowered foo4(x,y)
CodeInfo(
1 ─ %1 = Main.:(var"#27#28")
│   %2 = Core.typeof(y)
│   %3 = Core.apply_type(%1, %2)
│        #27 = %new(%3, y)
│   %5 = #27
│   %6 = Main.transcode(Main.UInt8, x)
│   %7 = Base.Generator(%5, %6)
│   %8 = Main.collect(%7)
└──      return %8
)
``````

I get in 1.5.3:

``````
julia> @btime foo1(x,y);
80.580 ms (2 allocations: 95.37 MiB)

julia> @btime foo4(x,y);
92.395 ms (2 allocations: 95.37 MiB)
``````

which is quite astounding.
In 1.6.0 the timings are nearly identical.

I have installed 1.5.4 now and I get:

``````
julia> @btime foo1(x,y);
80.843 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
81.314 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
81.172 ms (2 allocations: 95.37 MiB)

julia> @btime foo4(x,y);
93.333 ms (2 allocations: 95.37 MiB)
``````

which doesn't reproduce your timings, so my guess is that the differences come from the hardware (CPU).

I don't want to investigate why `foo4` is slower even though its `@code_lowered` output is exactly identical to `foo1`'s, because I suspect the cause would be very hard to find.

What I learned: stay with 1.6, don't look back.


I have a MacBook Air (2017), not the fastest. Many thanks!

``````
julia> @btime z1=foo1(x,y);
69.349 ms (2 allocations: 95.37 MiB)

julia> @btime z2=foo2(x,y);
117.741 ms (2 allocations: 95.37 MiB)

julia> @btime z3=foo3(x,y);
62.121 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
``````

I believe we are seeing differences between the CPUs here, perhaps in the L1/L2/L3 cache sizes.
Perhaps, for you, at some smaller n the timings become equal.
If I try with n = 1_000_000_000 instead of n = 100_000_000, I see the same step in the timings as you:

``````
julia> @btime foo1(x,y);
572.141 ms (2 allocations: 953.67 MiB)

julia> @btime foo2(x,y);
821.871 ms (2 allocations: 953.67 MiB)

julia> @btime foo3(x,y);
551.218 ms (2 allocations: 953.67 MiB)

julia> @btime foo4(x,y);
572.606 ms (2 allocations: 953.67 MiB)
``````

But proving that this is caused by the different CPUs is beyond my capability.
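One way to at least probe the cache hypothesis: sweep the problem size n and watch for a threshold where the relative timings change. A minimal, self-contained sketch (it repeats the definitions from above and uses Base's `@elapsed` for simplicity; BenchmarkTools' `@belapsed` would be more precise):

```julia
# Sketch: sweep n and compare comprehension (foo1) vs loop (foo3).
# Definitions are repeated from the thread so the snippet runs on its own.
using Random

y = zeros(UInt8, 256)
y[UInt8('A')] = UInt8('U'); y[UInt8('C')] = UInt8('G')
y[UInt8('G')] = UInt8('C'); y[UInt8('T')] = UInt8('A')

foo1(x, y) = [@inbounds y[i] for i in transcode(UInt8, x)]

function foo3(x, y)
    xb = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(xb))
    j = 1
    for i in xb
        @inbounds z[j] = y[i]
        j += 1
    end
    return z
end

for n in (10^4, 10^5, 10^6)
    s = String([rand("ACGT") for _ in 1:n])
    foo1(s, y); foo3(s, y)                          # warm up / compile first
    t1 = minimum([@elapsed(foo1(s, y)) for _ in 1:5])
    t3 = minimum([@elapsed(foo3(s, y)) for _ in 1:5])
    println("n = ", n, "  foo1: ", t1, "  foo3: ", t3)
end
```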

Can we perhaps conclude, comparing `foo2` vs `foo3`, that the extra `getindex` causes the difference in performance? If so, list-comprehension performance is quite okay in this case.

It would be interesting to be able to profile code at the LLVM level.
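Short of a true LLVM-level profiler, the standard `InteractiveUtils` tools at least let you inspect the generated IR and machine code. A sketch using a trivial stand-in function (`double` is a made-up example, not from the thread):

```julia
# Sketch: inspecting a function at the LLVM and native level.
# InteractiveUtils is loaded automatically in the REPL, but scripts
# need the explicit `using`.
using InteractiveUtils

double(v) = [2x for x in v]   # trivial stand-in for foo1/foo3

# Print the LLVM IR (debuginfo=:none hides line-number metadata):
code_llvm(stdout, double, (Vector{UInt8},); debuginfo=:none)

# The macro forms are more convenient interactively:
# @code_llvm double(rand(UInt8, 8))
# @code_native double(rand(UInt8, 8))
```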

Of course that is the case. I thought this was the easy part of this whole discussion. Conclusion: comprehensions, loops and generators can have similar timings (at least in 1.6, and if the problem size is not too big).

After marking that as solved in my mind, I was analyzing the different Julia versions and problem sizes, which led me to think that CPU differences also play a role in the timing differences. Suspicion: in 1.6, above a CPU-specific problem-size threshold, the comprehension (and the equivalent generator) are superior to loops in this special case (proof beyond my skill).

Use `$` to interpolate global variables in benchmarks, otherwise the results can be meaningless.
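For the record, interpolation means the global is spliced into the benchmark as a local value of known type, so `@btime` measures the work itself rather than the cost of accessing an untyped global. A minimal sketch (variable names are made up; `samples`/`evals` are only set to keep the run short):

```julia
# Sketch: interpolate globals with $ so BenchmarkTools times the work
# itself rather than the untyped-global access.
using BenchmarkTools

v = rand(1000)                        # a non-const global

@btime sum(v)  samples=10 evals=1     # v is looked up as a global each call
@btime sum($v) samples=10 evals=1     # v is spliced in as a typed local
```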

I checked; it makes no difference here, so I stayed without `$` for this discussion.