Slicing array in Julia: 4000 ms vs C++ 400 ms

You can do something like:

julia> my_rates_buffer = zeros(1001)

julia> function custom_test!(rates, data::Vector{Float64}, i::Int64)
           if(i < 1001)
               return nothing
           end
           copyto!(rates, @view data[i-1000:i])
           return nothing
       end
custom_test! (generic function with 1 method)

julia> @time custom_test!(my_rates_buffer, my_test, 5000)  # Only first use, no problem at all.
  0.076472 seconds (112.11 k allocations: 5.599 MiB, 99.96% compilation time)

julia> @time custom_test!(my_rates_buffer, my_test, 5000)
  0.000009 seconds

Now there’s no allocation, but still a copy, despite the @view. You need the @view, otherwise you get two allocations (I believe one more since 1.11). copyto! returns something (its destination), and without the trailing return nothing that would be returned instead. If I comment the return nothing out I still see no allocation; I think the function then returns the view (i.e. without the return nothing it may allocate for that, though no more than one allocation). I mainly include the return nothing so the function returns neither the destination nor the view, and that type stability seems to cost you nothing (probably unless you use the result?).
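To make the return-value point concrete, a minimal sketch (nothing here beyond Base):

```julia
src = collect(1.0:5.0)
dst = zeros(5)

# copyto! returns its first argument, the destination.
ret = copyto!(dst, @view src[1:5])
ret === dst  # true: the same buffer comes back

# With a trailing `return nothing`, the caller gets nothing back
# rather than a reference to the destination (or to the view).
function fill_buf!(dst, src)
    copyto!(dst, @view src[1:length(dst)])
    return nothing
end
fill_buf!(dst, src)  # returns nothing
```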

Before, in your code, you called malloc implicitly, as in C++ (new in C++ need not actually call malloc according to the C++ standard, though in practice it’s always implemented that way; your “new” is implicit too, coming from invoking std::vector).

That malloc isn’t any slower in Julia (maybe even faster…). But in Julia you’re allocating a lot (and copying to a memory buffer that’s never in the same place, since the previous one hasn’t yet been released), i.e.:

julia> @time for i in 1:10000 Libc.malloc(200) end
  0.000889 seconds

julia> @time for i in 1:10000 b = Libc.malloc(200); Libc.free(b) end
  0.000201 seconds

Julia could effectively do the latter, like C++, with your code in between. But at the very least, freeing (the most recent allocation) makes malloc (in Libc) faster.

With Bumper.jl or something similar you can in effect do the latter, except the “malloc” and “free” are even faster, since they’re not really done at all. Something in Julia that does implicit freeing is also coming, i.e. allocating on the stack when possible, at no cost, when it applies.
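For illustration, a sketch of the Bumper.jl pattern (assuming the package’s @no_escape/@alloc API; the allocated buffer must not escape the block):

```julia
using Bumper  # assumes Bumper.jl is installed

function windowed_sums(data::Vector{Float64})
    total = 0.0
    for i in 1001:length(data)
        @no_escape begin
            # "malloc": just bumps a pointer in a task-local buffer
            rates = @alloc(Float64, 1001)
            copyto!(rates, @view data[i-1000:i])
            total += sum(rates)
        end  # "free": the bump pointer is reset here
    end
    return total
end
```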


Hi everyone,

I’ll post an update on what I found that gets close to the C++ speed of 400 ms. Correct me if you think any of the code is wrong; I have also put some questions as comments inside the code.
Remember that this is only about copying a slice of a vector, not using references or views:

the initial Julia code (4000 ms) that ran in a loop was this:

rates = data[i-1000:i]  # maybe this is slow because we have to allocate a new vector in the loop each time, and we have to copy the accessed data

the C++ code was this (400 ms):

std::vector<double> rates = std::vector<double>(data.begin() + (i - 1000), data.begin() + i); // this was also allocating a new vector each time in the loop; why do you think the same code runs fast in C++?
we can also use std::copy

Let’s look at the three options that measure around 300 ms; I will just show the important piece of code, not the full code.

Code 1
this preallocates the vector once instead of creating a new array each time inside the loop; also, I can change any value of rates without modifying the values of d.c, because it’s a separate preallocated buffer.

rates::Vector{Float64} = Vector{Float64}(undef, 1001)
rates .= @view d.c[i-1000:i]

Code 2
both of these functions copy in place into the rates vector. I also use @view for the data being copied (d.c), so we avoid a second copy of the slice of d.c, I guess; the difference between using @view and not is from 4000 ms to 300 ms. We can also change rates values without affecting the values of d.c.

copyto!(rates, @view d.c[i-1000:i])
copy!(rates, @view d.c[i-1000:i])

Code 3
this one is similar to the code we used before, rates .= @view d.c[i-1000:i], written as an explicit loop:

@inbounds begin
    for j in 1:1001
        rates[j] = data[i - 1001 + j]
    end
end

That’s it. Post your opinions before we close this thread with a solution, and a big thank you to everyone for the tips and corrections.
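For anyone who wants to reproduce this, here’s a self-contained sketch of the three options in one harness (d.c replaced by a plain vector; absolute timings will of course differ per machine):

```julia
data = fill(1.0, 1_000_000)
rates = Vector{Float64}(undef, 1001)  # preallocated once, outside the loop

# Option 1: broadcast assignment from a view
opt1!(r, d, i) = (r .= @view d[i-1000:i]; nothing)

# Option 2: in-place copy from a view
opt2!(r, d, i) = (copyto!(r, @view d[i-1000:i]); nothing)

# Option 3: explicit loop
function opt3!(r, d, i)
    @inbounds for j in 1:1001
        r[j] = d[i-1001+j]
    end
    return nothing
end

for f! in (opt1!, opt2!, opt3!)
    f!(rates, data, 1001)          # compile first
    @time for i in 1001:length(data)
        f!(rates, data, i)
    end
end
```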

That’s really interesting, Palli. I’m checking this out, but I can’t make it work using loops in my code.
This first code shows an improvement in time on the second call:

@time custom_test!(d.data, rates, 5000)
@time custom_test!(d.data, rates, 5000)

but calling custom_test! two times in a loop doesn’t show an improvement in speed:

@elapsed begin
        for i in 1:length(d.data)
            custom_test!(d.data, rates, i)
        end
end
@elapsed begin
        for i in 1:length(d.data)
            custom_test!(d.data, rates, i)
        end
end

If I do this in your code, then I see the improvement in speed that you showed; I’ll paste the example:

my_rates_buffer = zeros(1001)
my_test = fill(1.0, 10001)

function custom_test!(rates, data::Vector{Float64}, i::Int64, firstcall)
    if firstcall
        return nothing
    end
    if i < 1001
        return nothing
    end
    copyto!(rates, @view data[i-1000:i])
    return nothing
end

@time begin 
    for i in 1:10000
        custom_test!(my_rates_buffer, my_test, i, true)
    end
end

@time begin 
    for i in 1:10000
        custom_test!(my_rates_buffer, my_test, i, false)
    end
end

I don’t know why it’s not working as expected in my code; I’ll keep trying.


Since I don’t have the full code, i.e. I’m only testing one iteration, I get lower:

julia> @btime custom_test!(my_rates_buffer, my_test, 5000)  # wrong way: forgot the $ interpolation
  199.713 ns (0 allocations: 0 bytes)

julia> @btime custom_test!($my_rates_buffer, $my_test, 5000)
  178.102 ns (0 allocations: 0 bytes)


and even lower:

julia> @btime copyto!($my_rates_buffer, @view $my_test[1001-1000:1001]);
  121.402 ns (0 allocations: 0 bytes)

julia> @code_native @inbounds copyto!(my_rates_buffer, @view my_test[1001-1000:1001]);

Note the @inbounds there is helpful for the timing; still, strangely, I seemingly see some assembly related to bounds checks. I think it’s mostly gone, though.

You mean:

It’s seemingly plausible, but while loops are ok in general, I think they’re not needed here. copyto! should be optimized, maybe even more so. My only concern was whether it does in-bounds checks; it does (good, by default), and whether you could fully do away with them if you wanted.

I don’t see any reason to use:

I think/thought copy! was only for copying from within the same array:

help?> copy!
..

  │ Warning
  │
  │  Behavior can be unexpected when any mutated argument shares memory with any other argument.

  See also copyto!.

  │ Julia 1.1
  │
  │  This method requires at least Julia 1.1. In Julia 1.0 this method is available from the Future standard library as Future.copy!.

The warning can’t apply when dst (the destination array) is separate.
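A minimal sketch of what the warning is about (for a separate destination, as here, copy! and copyto! give the same result):

```julia
a = collect(1.0:5.0)
b = zeros(5)

copy!(b, @view a[1:5])    # fine: b and a don't share memory
copyto!(b, @view a[1:5])  # same result

# The warning targets cases like copying between overlapping views
# of the *same* array, where the source mutates as you write to it.
```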

I don’t recall ever hearing of this Future stdlib before,

julia> using Future

help?> Future
search: Future outer return

  The Future module implements future behavior of already existing functions, which will replace the current version in a future release of Julia.

julia> @edit copy!(my_rates_buffer, @view my_test[1001-1000:1001])  # it ends up calling copyto!, so both seem ok

copyto! also does some checks, then:

help?> Base.copyto_unaliased!

I must be doing something wrong, since calling Base.copyto_unaliased! directly is 4.5x (or, with $-interpolation, 20%) slower…:

julia> @time s_prime = Base.unalias(my_rates_buffer, @view my_test[1001-1000:1001])
  0.000012 seconds (2 allocations: 96 bytes)

julia> @btime Base.copyto_unaliased!(IndexStyle(my_rates_buffer), my_rates_buffer, IndexStyle(s_prime), s_prime)
  548.102 ns (0 allocations: 0 bytes)

julia> @btime Base.copyto_unaliased!(IndexStyle($my_rates_buffer), my_rates_buffer, IndexStyle($s_prime), $s_prime)
  145.298 ns (1 allocation: 48 bytes)

Correcting the off-by-1 errors wasn’t expected to make a difference; shaving off one element or iteration out of thousands is negligible. Changing the returns could possibly make more of a difference, but without the full benchmarking calls and compiled code analysis, we can’t tell you what ended up happening; we don’t even know when you instantiate custom_struct. As mentioned, compilers can make very strange decisions (the large difference between the two languages, despite both semantically allocating a buffer and copying values to it, does suggest C++ is eliminating some code), so it’s worth making a benchmark specifically for copying vectors, without the extra composite type and indices checking, and verifying the copy occurs at runtime. I’ve never gotten far enough into C++ to benchmark and analyze compiled code, so I won’t be of help there.

Julia doesn’t have pass-by-copy or pass-by-reference semantics on the language level, not even pointers. However, mutable types like Vector do have to be allocated in a central location and shared by references (variables, elements, fields), so it’s communicated as a memory address, the same way passing by reference compiles to. If you’re preallocating rates because you’re worried about large buffer copying when passing arguments, then the worry is unwarranted and you’re actually changing the benchmark by allocating before the custom_test loop, not per iteration (then again, it’s possible the C++ compiler did such an optimization and that’s why the preallocated Julia performance is so close, we need compiled code analysis to know).

Unannotated global variables have uninferrable types (or rather they must be inferred as ::Any) and cannot be optimized well; whether this significantly affects the runtime depends on how long custom_test! takes to run compared to the overhead of its dynamically dispatched call per iteration. However, it’s recommended to put benchmarking code in a function and pass preallocated buffers as arguments; in fact BenchmarkTools’ macros and $-interpolation support does this for you.
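As a sketch of that recommendation (buffer sizes and names are just illustrative):

```julia
using BenchmarkTools  # assumes BenchmarkTools.jl is installed

my_rates_buffer = zeros(1001)
my_test = fill(1.0, 10001)

# Put the work in a function so the loop body is type-stable;
# non-const globals would otherwise force dynamic dispatch per call.
function run_all!(rates, data)
    for i in 1001:length(data)
        copyto!(rates, @view data[i-1000:i])
    end
    return nothing
end

# $-interpolation passes the globals as typed arguments to the benchmark.
@btime run_all!($my_rates_buffer, $my_test)
```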

An aside, I confirmed in v1.11.1 that unconditionally returning nothing after allocating a vector is NOT compiled to removing the vector allocation. It could happen if a constant call is inlined e.g. into a benchmark loop, so it does appear that a minor degree of side effect analysis happens sometimes to confirm the vector is unnecessary. It’s a fair design for the compiler to not eliminate obviously dead code like this because dead code elimination is more useful for optimizing constants and inlined calls; your code is neater and the compiler saves unnecessary work if you remove the obviously dead code manually.