julia> function copy2(A::Array)
           B = similar(A)
           @inbounds @simd for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copy2 (generic function with 1 method)
julia> A = rand(2000, 2000);
julia> @btime copy($A);
4.162 ms (2 allocations: 30.52 MiB)
julia> @btime copy2($A);
6.213 ms (2 allocations: 30.52 MiB)
julia> A = rand(20, 20);
julia> @btime copy($A);
459.298 ns (1 allocation: 3.25 KiB)
julia> @btime copy2($A);
432.207 ns (1 allocation: 3.25 KiB)
julia> copy(A) == copy2(A)
true
So copy2 is faster for small arrays, where the time doesn’t matter much anyway, but significantly slower for large arrays. How can I get that C-like performance in Julia?
My versioninfo:
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
JULIA_EDITOR = subl
A possible explanation:
The built-in copy simply calls the builtin function jl_array_copy, which in turn calls a builtin array constructor and memcpy. Your simple implementation cannot beat C’s highly optimized memcpy (memcpy is really smart and can sometimes even use special features offered by the operating system). The small-array case may come down to alignment or per-call overhead in memcpy: since you copy directly and LLVM has more information at the call site, some of memcpy’s internal logic can be skipped.
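To see concretely that this path boils down to memcpy, here is a sketch that calls libc’s memcpy directly from Julia via ccall. This mimics what the built-in copy ultimately does; the function name copy_memcpy and the isbits restriction are illustrative choices, not the actual internals of jl_array_copy.

```julia
# Sketch: copy an Array by calling libc's memcpy directly via ccall.
# Only valid for isbits element types, where elements are stored inline.
function copy_memcpy(A::Array{T}) where {T}
    isbitstype(T) || error("this sketch only handles isbits element types")
    B = similar(A)
    # GC.@preserve keeps A and B alive while we hold raw pointers to them.
    GC.@preserve A B begin
        ccall(:memcpy, Ptr{Cvoid},
              (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              pointer(B), pointer(A), sizeof(A))
    end
    return B
end

A = rand(2000, 2000)
copy_memcpy(A) == copy(A)  # true
```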
I get timing results similar to the OP’s (4 ms copy vs 6 ms copy2), and I also use a Linux system, so the difference may be explained by different implementations of memcpy on different OSes or architectures. Anyway, I think sticking to the built-in implementation is always a good choice: just let the core developers handle these things, as they are more experienced with such micro-optimizations than normal users.
I think this is memory-bandwidth bound, not compute bound. So if every thread can access memory independently, multithreading could greatly improve performance. But all the implementations I know of are single-threaded, and the standard might not specify whether memcpy should be single- or multi-threaded. That is reasonable, because it’s a standard library function that provides basic utilities and deals with the common cases…
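A minimal sketch of such a multithreaded copy, using plain Threads.@threads (this assumes Julia was started with multiple threads, e.g. julia -t 8; whether it actually helps depends on whether one core can already saturate memory bandwidth):

```julia
# Sketch: a multithreaded elementwise copy. The name copy_threaded! is
# illustrative, not a standard-library function.
function copy_threaded!(B, A)
    axes(B) == axes(A) || throw(DimensionMismatch("B and A must match"))
    # eachindex(A, B) is a unit range for plain Arrays, so @threads can
    # split it evenly across the available threads.
    Threads.@threads for i in eachindex(A, B)
        @inbounds B[i] = A[i]
    end
    return B
end

A = rand(2000, 2000); B = similar(A)
copy_threaded!(B, A) == A  # true
```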
julia> function copyto2!(B, A)
           @inbounds @simd for i in eachindex(A, B)
               B[i] = A[i]
           end
           B
       end
copyto2! (generic function with 1 method)
julia> using LoopVectorization
julia> copytont!(B, A) = vmapnt!(identity, B, A)
copytont! (generic function with 1 method)
julia> A = rand(2000, 2000); B = similar(A);
julia> @btime copyto!($B, $A);
2.724 ms (0 allocations: 0 bytes)
julia> @btime copyto2!($B, $A);
3.933 ms (0 allocations: 0 bytes)
julia> @btime copytont!($B, $A);
2.952 ms (0 allocations: 0 bytes)
So copyto! still wins, but copytont! is much closer.
If you want multithreaded:
julia> copytontt!(B, A) = vmapntt!(identity, B, A)
copytontt! (generic function with 1 method)
julia> @btime copytontt!($B, $A);
498.435 μs (0 allocations: 0 bytes)
Aside from the fact that copyto! was still faster than vmapnt! at large sizes, it should be faster at small sizes, too. Non-temporal stores must be aligned, and they’ll hurt performance if you need to read the data again soon.
memmove and memcpy should/will do the right thing as a function of size.
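That size-dependent behavior can be sketched in Julia too: use a simple SIMD loop for small arrays (low overhead) and fall back to the built-in copyto! (which reaches memcpy) for large ones. The function name copy_hybrid! and the 1024-element cutoff are illustrative guesses, not tuned constants.

```julia
# Sketch: hybrid copy that picks a strategy based on array size.
function copy_hybrid!(B, A)
    if length(A) <= 1024
        # Small arrays: a plain SIMD loop avoids any call overhead.
        @inbounds @simd for i in eachindex(A, B)
            B[i] = A[i]
        end
        return B
    else
        # Large arrays: delegate to the built-in, memcpy-backed path.
        return copyto!(B, A)
    end
end

A = rand(20, 20); B = similar(A)
copy_hybrid!(B, A) == A  # true
```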