For-loop vs list-comprehension

Why isn’t the list-comprehension as fast as the for-loop?

using Random
using BenchmarkTools

# Lookup table: maps each DNA base (as a byte) to its RNA complement.
y = zeros(UInt8, 256);
y[UInt8('A')] = UInt8('U');
y[UInt8('C')] = UInt8('G');
y[UInt8('G')] = UInt8('C');
y[UInt8('T')] = UInt8('A');

n = 100_000_000;
x = String([ rand("ACGT") for i in 1:n ]);   # random DNA string of length n

function foo1(x,y)
    [ @inbounds y[i] for i in transcode(UInt8, x) ]
end


function foo2(x,y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for i in 1:length(z)
        @inbounds z[i] = y[x[i]]
    end
    return z
end
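
As a quick sanity check that both functions agree (my own snippet, not from the original post; "ACGT" should map to its RNA complement "UGCA"):

@assert String(foo1("ACGT", y)) == "UGCA"
@assert String(foo2("ACGT", y)) == "UGCA"
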
julia> @btime foo1(x,y);
  110.531 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  93.206 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) m3-7Y32 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

For me the comprehension is faster:

julia> @btime foo1(x,y);
  56.660 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  80.200 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)

Now that I look at it, your functions aren’t equivalent: foo1 iterates over the byte vector directly, while foo2 indexes into it.
foo2 should be:

julia> function foo2(x,y)
           x = transcode(UInt8, x)
           z = Vector{UInt8}(undef, length(x))
           index = 1
           for i in x
               @inbounds z[index] = y[i]
               index += 1
           end
           return z
       end

With that I get:

julia> @btime foo2(x,y);
  50.135 ms (2 allocations: 95.37 MiB)

This is one getindex fewer in the code: iterating over x directly avoids the x[i] lookup.
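
For reference, the same loop can also be written with pairs, which yields the index and the byte together and therefore also avoids the extra x[i] lookup (a sketch of mine, not benchmarked in this thread):

function foo2_pairs(x, y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    for (j, b) in pairs(x)       # j is the index, b the byte
        @inbounds z[j] = y[b]    # a single getindex per iteration, into the table
    end
    return z
end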

Ah, good point, yes. But I get slower times with that modification on v1.5.4:

function foo3(x,y)
    x = transcode(UInt8, x)
    z = Vector{UInt8}(undef, length(x))
    j = 1
    for i in x
        @inbounds z[j] = y[i]
        j += 1
    end
    return z
end
julia> @btime foo3(x,y);
  132.790 ms (2 allocations: 95.37 MiB)

On v1.6.0:

julia> @btime foo1(x,y);
  84.296 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  79.005 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
  75.708 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) m3-7Y32 CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

I don’t understand assembly, but it’s interesting to see that the compiler doesn’t seem able to eliminate such a small overhead in what one would consider a simple loop.

What’s your hardware? You are on macOS with an Intel Core m3 (a mobile CPU?).
This is what I get in 1.5.3:

julia> @btime foo1(x,y);
  81.065 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
  81.338 ms (2 allocations: 95.37 MiB)

The functions do lower quite differently:

julia> @code_lowered foo1(x,y)
CodeInfo(
1 ─ %1 = Main.:(var"#3#4")
β”‚   %2 = Core.typeof(y)
β”‚   %3 = Core.apply_type(%1, %2)
β”‚        #3 = %new(%3, y)
β”‚   %5 = #3
β”‚   %6 = Main.transcode(Main.UInt8, x)
β”‚   %7 = Base.Generator(%5, %6)
β”‚   %8 = Base.collect(%7)
└──      return %8
)

julia> @code_lowered foo3(x,y)
CodeInfo(
1 ─       x@_9 = x@_2
β”‚         x@_9 = Main.transcode(Main.UInt8, x@_9)
β”‚   %3  = Core.apply_type(Main.Vector, Main.UInt8)
β”‚   %4  = Main.length(x@_9)
β”‚         z = (%3)(Main.undef, %4)
β”‚         j = 1
β”‚   %7  = x@_9
β”‚         @_6 = Base.iterate(%7)
β”‚   %9  = @_6 === nothing
β”‚   %10 = Base.not_int(%9)
└──       goto #4 if not %10
2 β”„ %12 = @_6
β”‚         i = Core.getfield(%12, 1)
β”‚   %14 = Core.getfield(%12, 2)
β”‚         $(Expr(:inbounds, true))
β”‚   %16 = Base.getindex(y, i)
β”‚         Base.setindex!(z, %16, j)
β”‚         val = %16
β”‚         $(Expr(:inbounds, :pop))
β”‚         val
β”‚         j = j + 1
β”‚         @_6 = Base.iterate(%7, %14)
β”‚   %23 = @_6 === nothing
β”‚   %24 = Base.not_int(%23)
└──       goto #4 if not %24
3 ─       goto #2
4 β”„       return z
)

In the case of foo3 we see two loops, one for transcode and the second the explicit for loop. (Edit: wrong interpretation — transcode is a single call; there is only the one explicit loop.)

Or perhaps it is simply a regression in 1.5.4.

If we create foo4 as

julia> function foo4(x,y)
           collect( y[x] for x in transcode(UInt8, x) )
       end

which has lowered code identical to foo1’s:

julia> @code_lowered foo4(x,y)
CodeInfo(
1 ─ %1 = Main.:(var"#27#28")
β”‚   %2 = Core.typeof(y)
β”‚   %3 = Core.apply_type(%1, %2)
β”‚        #27 = %new(%3, y)
β”‚   %5 = #27
β”‚   %6 = Main.transcode(Main.UInt8, x)
β”‚   %7 = Base.Generator(%5, %6)
β”‚   %8 = Main.collect(%7)
└──      return %8
)

I get in 1.5.3:

julia> @btime foo1(x,y);
  80.580 ms (2 allocations: 95.37 MiB)

julia> @btime foo4(x,y);
  92.395 ms (2 allocations: 95.37 MiB)

which is quite astounding.
In 1.6.0 the timings are nearly identical.

I have installed 1.5.4 now and I get:

julia> @btime foo1(x,y);
  80.843 ms (2 allocations: 95.37 MiB)

julia> @btime foo2(x,y);
  81.314 ms (2 allocations: 95.37 MiB)

julia> @btime foo3(x,y);
  81.172 ms (2 allocations: 95.37 MiB)

julia> @btime foo4(x,y);
  93.333 ms (2 allocations: 95.37 MiB)

which doesn’t reproduce your timings, so my guess is that there are differences in the hardware (CPU).

I don’t want to investigate why foo4 is slower even though its @code_lowered output is exactly identical to foo1’s, because I suspect the cause would be very hard to find.
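
One caveat (my note, not from the thread): @code_lowered shows the earliest compiler stage, before type inference and optimization, so identical lowered code does not guarantee identical machine code. To rule out differences further down the pipeline, one could compare the later stages (hypothetical invocations, not run here):

julia> @code_typed foo1(x,y)
julia> @code_typed foo4(x,y)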

What I learned: stay with 1.6, don’t look back.

I have a MacBook Air (2017), not the fastest. Many thanks!

julia> @btime z1=foo1(x,y);
  69.349 ms (2 allocations: 95.37 MiB)

julia> @btime z2=foo2(x,y);
  117.741 ms (2 allocations: 95.37 MiB)

julia> @btime z3=foo3(x,y);
  62.121 ms (2 allocations: 95.37 MiB)

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

I believe what we see here are differences between the CPUs, perhaps in the L1/L2/L3 cache sizes.
Perhaps, for you, the timings become equal at some smaller n.
If I try n = 1_000_000_000 instead of n = 100_000_000, I get the same step in the timings as you:

julia> @btime foo1(x,y);
  572.141 ms (2 allocations: 953.67 MiB)

julia> @btime foo2(x,y);
  821.871 ms (2 allocations: 953.67 MiB)

julia> @btime foo3(x,y);
  551.218 ms (2 allocations: 953.67 MiB)

julia> @btime foo4(x,y);
  572.606 ms (2 allocations: 953.67 MiB)

But proving that this is caused by the different CPUs is beyond my capability.
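
At least the cache part can be bounded with back-of-the-envelope arithmetic (my own estimate, not from the thread): the loop reads n bytes and writes n bytes, while the 256-byte table y stays in L1 throughout.

n = 100_000_000
(2n) / 2^20    # ≈ 190.7 MiB working set, far larger than any L1/L2/L3 cache

So every run streams through main memory, and differences in memory bandwidth between the machines show up directly in the timings.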

Can we maybe conclude, comparing foo2 vs foo3, that the extra getindex causes the difference in performance? If so, list-comprehension performance is quite okay in this case.

It would be interesting to be able to profile code at the LLVM level.
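
Even without a profiler at that level, the optimized IR and the assembly can at least be inspected with the standard reflection macros (shown as hypothetical invocations, not run in this thread):

julia> @code_llvm debuginfo=:none foo3(x,y)   # optimized LLVM IR
julia> @code_native foo3(x,y)                 # native assembly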

Of course that is the case; I thought this was the easy part of this whole discussion. Conclusion: comprehensions, loops and generators can have similar timings (at least in 1.6, and if the problem size is not too big).

After checking that off as solved in my mind, I was analyzing the different Julia versions and problem sizes, which led me to think that CPU differences also play a role in the timing differences. Suspicion: in 1.6, above a CPU-specific threshold, the comprehension (and the equivalent generator) are superior to loops in this special case (proving it is beyond my skill).

Use $ to interpolate global variables in your benchmarks, otherwise the results can be meaningless.
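
That is, interpolate the globals so BenchmarkTools measures the call itself rather than access to untyped globals, e.g.:

julia> @btime foo1($x, $y);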

I checked, and there were no differences here, so I stayed without $ for this discussion.