For-loop vs list-comprehension

Why isn’t the list-comprehension as fast as the for-loop?

using Random
using BenchmarkTools

# Lookup table: maps each DNA base (as a byte) to its RNA complement.
y = zeros(UInt8, 256);
y[UInt8('A')] = UInt8('U');
y[UInt8('C')] = UInt8('G');
y[UInt8('G')] = UInt8('C');
y[UInt8('T')] = UInt8('A');

n = 100_000_000;
x = String([ rand("ACGT") for i in 1:n ]);   # random DNA string of length n

function foo1(x,y)
    [ @inbounds y[i] for i in transcode(UInt8, x) ]
end


function foo2(x,y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for i in 1:length(z)
        @inbounds z[i] = y[x[i]]
    end
    return z
end
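
As a quick sanity check that both functions agree (my own snippet, not from the original post; "ACGT" should map to its RNA complement "UGCA"):

@assert String(foo1("ACGT", y)) == "UGCA"
@assert String(foo2("ACGT", y)) == "UGCA"
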
julia> @btime foo1(x,y);
  110.531 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  93.206 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) m3-7Y32 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

For me the comprehension is faster:

julia> @btime foo1(x,y);
  56.660 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  80.200 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

Now that I look at it, your functions aren’t equivalent: foo1 iterates over the byte vector directly, while foo2 indexes into it.
foo2 should be:

julia> function foo2(x,y)
           x = transcode(UInt8, x)
           z = Vector{UInt8}(undef, length(x))
           index = 1
           for i in x
               @inbounds z[index] = y[i]
               index += 1
           end
           return z
       end

With that I get:

julia> @btime foo2(x,y);
  50.135 ms (2 allocations: 95.37 MiB)

This is one getindex fewer in the code: iterating over x directly avoids the x[i] lookup.
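
For reference, the same loop can also be written with pairs, which yields the index and the byte together and therefore also avoids the extra x[i] lookup (a sketch of mine, not benchmarked in this thread):

function foo2_pairs(x, y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for (j, b) in pairs(x)       # j is the index, b the byte
        @inbounds z[j] = y[b]    # a single getindex per iteration, into the table
    end
    return z
end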

Ah, good point, yes. But I get slower times with that modification on v1.5.4:

function foo3(x,y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    j = 1
    for i in x
        @inbounds z[j] = y[i]
        j += 1
    end
    return z
end
julia> @btime foo3(x,y);
  132.790 ms (2 allocations: 95.37 MiB)

On v1.6.0:

julia> @btime foo1(x,y);
  84.296 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  79.005 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
  75.708 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) m3-7Y32 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

I don’t understand assembly, but it’s interesting to see that the compiler doesn’t seem able to eliminate such a small overhead in what one would consider a simple loop.

What’s your hardware? You are on macOS with an Intel Core m3 (a mobile CPU?).
This is what I get in 1.5.3:

julia> @btime foo1(x,y);
  81.065 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
  81.338 ms (2 allocations: 95.37 MiB)

The functions do lower quite differently:

julia> @code_lowered foo1(x,y)
CodeInfo(
1 ─ %1 = Main.:(var"#3#4")
β”‚   %2 = Core.typeof(y)
β”‚   %3 = Core.apply_type(%1, %2)
β”‚        #3 = %new(%3, y)
β”‚   %5 = #3
β”‚   %6 = Main.transcode(Main.UInt8, x)
β”‚   %7 = Base.Generator(%5, %6)
β”‚   %8 = Base.collect(%7)
└──      return %8
)

julia> @code_lowered foo3(x,y)
CodeInfo(
1 ─       x@_9 = x@_2
β”‚         x@_9 = Main.transcode(Main.UInt8, x@_9)
β”‚   %3  = Core.apply_type(Main.Vector, Main.UInt8)
β”‚   %4  = Main.length(x@_9)
β”‚         z = (%3)(Main.undef, %4)
β”‚         j = 1
β”‚   %7  = x@_9
β”‚         @_6 = Base.iterate(%7)
β”‚   %9  = @_6 === nothing
β”‚   %10 = Base.not_int(%9)
└──       goto #4 if not %10
2 β”„ %12 = @_6
β”‚         i = Core.getfield(%12, 1)
β”‚   %14 = Core.getfield(%12, 2)
β”‚         $(Expr(:inbounds, true))
β”‚   %16 = Base.getindex(y, i)
β”‚         Base.setindex!(z, %16, j)
β”‚         val = %16
β”‚         $(Expr(:inbounds, :pop))
β”‚         val
β”‚         j = j + 1
β”‚         @_6 = Base.iterate(%7, %14)
β”‚   %23 = @_6 === nothing
β”‚   %24 = Base.not_int(%23)
└──       goto #4 if not %24
3 ─       goto #2
4 β”„       return z
)

In the case of foo3 we see two loops, one for transcode and the second the explicit for loop. (Edit: wrong interpretation — transcode is a single call; there is only the one explicit loop.)

Or perhaps it is simply a regression in 1.5.4.

If we create foo4 as

julia> function foo4(x,y)
           collect( y[x] for x in transcode(UInt8, x) )
       end

which has lowered code identical to foo1’s:

julia> @code_lowered foo4(x,y)
CodeInfo(
1 ─ %1 = Main.:(var"#27#28")
β”‚   %2 = Core.typeof(y)
β”‚   %3 = Core.apply_type(%1, %2)
β”‚        #27 = %new(%3, y)
β”‚   %5 = #27
β”‚   %6 = Main.transcode(Main.UInt8, x)
β”‚   %7 = Base.Generator(%5, %6)
β”‚   %8 = Main.collect(%7)
└──      return %8
)

I get in 1.5.3:

julia> @btime foo1(x,y);
  80.580 ms (2 allocations: 95.37 MiB)

julia> @btime foo4(x,y);
  92.395 ms (2 allocations: 95.37 MiB)

which is quite astounding.
In 1.6.0 the timings are nearly identical.

I have installed 1.5.4 now and I get:

julia> @btime foo1(x,y);
  80.843 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  81.314 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
  81.172 ms (2 allocations: 95.37 MiB)

julia> @btime foo4(x,y);
  93.333 ms (2 allocations: 95.37 MiB)

which doesn’t reproduce your timings, so my guess is that there are differences in the hardware (CPU).

I don’t want to investigate why foo4 is slower even though its @code_lowered output is exactly identical to foo1’s, because I suspect the cause would be very hard to find.
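
One caveat (my note, not from the thread): @code_lowered shows the earliest compiler stage, before type inference and optimization, so identical lowered code does not guarantee identical machine code. To rule out differences further down the pipeline, one could compare the later stages (hypothetical invocations, not run here):

julia> @code_typed foo1(x,y)
julia> @code_typed foo4(x,y)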

What I learned: stay with 1.6, don’t look back.

I have a MacBook Air (2017), not the fastest. Many thanks!

julia> @btime z1=foo1(x,y);
  69.349 ms (2 allocations: 95.37 MiB)

julia> @btime z2=foo2(x,y);
  117.741 ms (2 allocations: 95.37 MiB)

julia> @btime z3=foo3(x,y);
  62.121 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

I believe what we see here are differences between the CPUs, perhaps in the L1/L2/L3 cache sizes.
Perhaps, for you, the timings become equal at some smaller n.
If I try n = 1_000_000_000 instead of n = 100_000_000, I get the same step in the timings as you:

julia> @btime foo1(x,y);
  572.141 ms (2 allocations: 953.67 MiB)

julia> @btime foo2(x,y);
  821.871 ms (2 allocations: 953.67 MiB)

julia> @btime foo3(x,y);
  551.218 ms (2 allocations: 953.67 MiB)

julia> @btime foo4(x,y);
  572.606 ms (2 allocations: 953.67 MiB)

But proving that this is caused by the different CPUs is beyond my capability.
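
At least the cache part can be bounded with back-of-the-envelope arithmetic (my own estimate, not from the thread): the loop reads n bytes and writes n bytes, while the 256-byte table y stays in L1 throughout.

n = 100_000_000
(2n) / 2^20    # ≈ 190.7 MiB working set, far larger than any L1/L2/L3 cache

So every run streams through main memory, and differences in memory bandwidth between the machines show up directly in the timings.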

Can we maybe conclude, comparing foo2 vs foo3, that the extra getindex causes the difference in performance? If so, list-comprehension performance is quite okay in this case.

It would be interesting to be able to profile code at the LLVM level.
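
Even without a profiler at that level, the optimized IR and the assembly can at least be inspected with the standard reflection macros (shown as hypothetical invocations, not run in this thread):

julia> @code_llvm debuginfo=:none foo3(x,y)   # optimized LLVM IR
julia> @code_native foo3(x,y)                 # native assembly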

Of course that is the case; I thought this was the easy part of this whole discussion. Conclusion: comprehensions, loops and generators can have similar timings (at least in 1.6, and if the problem size is not too big).

After checking that off as solved in my mind, I was analyzing the different Julia versions and problem sizes, which led me to think that CPU differences also play a role in the timing differences. Suspicion: in 1.6, above a CPU-specific threshold, the comprehension (and the equivalent generator) are superior to loops in this special case (proving it is beyond my skill).

Use $ to interpolate global variables in your benchmarks, otherwise the results can be meaningless.
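
That is, interpolate the globals so BenchmarkTools measures the call itself rather than access to untyped globals, e.g.:

julia> @btime foo1($x, $y);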

I checked, and there were no differences here, so I stayed without $ for this discussion.