Memory allocations in for loop variable

Can someone explain how to minimize the memory allocations reported for the line for i = 1:N? The following is the output from running with --track-allocation=user on a simple function.

        - using LinearAlgebra: norm

        - function get_orientations!(rotvec::AbstractArray{Float64,2},
        -     orientations::AbstractArray{Float64,2})
        -     N = size(rotvec, 2)::Int
114299917     for i = 1:N
        -         r = @view rotvec[:,i]
  4800000         R = norm(r)
        0         q0 = cos(0.5*R)
        0         q1, q2, q3 = sin(0.5*R)*(r/R)
 38400000         orientations[:,i] = [2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3),
        -         q0^2 - q1^2 - q2^2 + q3^2]
        -     end
        - end
        - 
        - rotvec = rand(3,100000)::Array{Float64,2}
        - ori = zeros(Float64, size(rotvec))
        - get_orientations!(rotvec, ori)

This code is an excellent use case for StaticArrays.

Currently, you’re using a 2D matrix to store a list of 3-element points. That’s a very common idiom in Matlab and Numpy, but Julia can do better. If you instead store your vector of points as a Vector of 3-element static vectors, then it becomes easy to manipulate each element without allocating any memory at all.
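
If your data already lives in a 3×N matrix, one simple way to convert it to a Vector of 3-element static vectors is a copying comprehension like the sketch below (the name rotvec_svec is just illustrative; a reinterpret-based view is another option):

julia> using StaticArrays

julia> rotvec_svec = [SVector{3,Float64}(rotvec[1,i], rotvec[2,i], rotvec[3,i])
                      for i in 1:size(rotvec, 2)];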

Benchmarking your code:

julia> using BenchmarkTools

julia> @btime get_orientations!($rotvec, $ori)
  12.987 ms (700000 allocations: 41.20 MiB)

And here’s an alternative implementation using static vectors instead:

julia> using StaticArrays

julia> function orientation(r::SVector{3, Float64})
           R = norm(r)
           q0 = cos(0.5 * R)
           q1, q2, q3 = sin(0.5 * R) * (r / R)
           SVector(2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3),
                   q0^2 - q1^2 - q2^2 + q3^2)
       end
orientation (generic function with 1 method)

julia> rotvec = [rand(SVector{3, Float64}) for i in 1:100000];

julia> ori = similar(rotvec);

julia> @btime $ori .= orientation.($rotvec)
  1.830 ms (0 allocations: 0 bytes)

That’s about 7 times faster, uses less code, and allocates exactly zero memory.


Thank you very much for the help. I am a newbie; I just began exploring Julia by porting some of my Python code.


Also, see the excellent Rotations.jl package.


Thank you. You are right. My function implements some quaternion multiplication. So Rotations.jl seems very useful.

@rdeits I am unable to find a reference for the $ operator you are using. I am using Julia v1.0, and when I try anything like $ori in the REPL, I get ERROR: syntax: "$" expression outside quote. The only context in which it works for me is string interpolation.

rdeits can give a more technical answer, but you can find a description of interpolation with $ in calls to BenchmarkTools at GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language


It’s specific to the @btime benchmarking macro from BenchmarkTools: GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language
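
Roughly speaking, $ interpolates the value into the benchmark expression, so @btime doesn't also measure the overhead of looking up a non-constant global. A small sketch of the difference (the variable name x is arbitrary):

julia> x = rand(1000);

julia> @btime sum(x);    # x is treated as a non-constant global
julia> @btime sum($x);   # value interpolated; benchmarked as if x were a local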

Edit: that means that you only need to use it when benchmarking. Inside your own functions, you can just do ori .= orientation.(rotvec)
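
For example, a sketch of what that might look like in ordinary code (the function name update_orientations! is just illustrative):

function update_orientations!(ori, rotvec)
    ori .= orientation.(rotvec)   # broadcast orientation over the vector of SVectors
    return ori
end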


You can get about 25% more speed by marking orientation with @inline. The original timing:

julia> @btime $ori .= orientation.($rotvec);
  1.827 ms (0 allocations: 0 bytes)

and with @inline:

julia> @btime $ori .= orientation.($rotvec);
  1.473 ms (0 allocations: 0 bytes)
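
For reference, the change is presumably just redefining the same orientation function with @inline in front, along these lines:

@inline function orientation(r::SVector{3, Float64})
    R = norm(r)
    q0 = cos(0.5 * R)
    q1, q2, q3 = sin(0.5 * R) * (r / R)
    SVector(2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3),
            q0^2 - q1^2 - q2^2 + q3^2)
end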

That is impressive, of course, though I’m still wondering why Julia didn’t close the gap with ifort in this case. The following Fortran translation runs 1.76x faster, at 0.837 ms per call (with aggressive optimizations on). Will Julia be able to close this gap in coming releases, or is this an unavoidable cost of dynamic dispatch? I have a partial answer, which lies in optimizing math functions such as sin, cos, and norm, but I’m not sure; I wanted to hear from you.

program main 
   implicit none 
   real*8  :: rotvec(3,100000), ori(3,100000), sm = 0
   integer :: i, t0, t1, count_rate, count_max

   call random_number(rotvec)

   call system_clock(t0, count_rate, count_max)
   do i = 1, 1000 
      call get_orientations(rotvec, ori)
      sm = sm + ori(1,1000) ! to force calculation
   end do
   call system_clock(t1)
   
   print *, real(t1-t0)/count_rate, sm 

   contains 
   subroutine get_orientations(rotvec,ori)
      real*8  :: rotvec(:,:), ori(:,:)
      real*8  :: q(3), q0, q1, q2, q3, R
      integer :: i 
      do i  = 1, size(rotvec, 2)
         q  = rotvec(:,i)
         R  = norm2(q)
         q0 = cos(0.5*R)
         q  = sin(0.5*R) * q/R
         q1 = q(1);  q2 = q(2);  q3 = q(3)
         ori(:,i) = [2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3), q0**2 - q1**2 - q2**2 + q3**2]
      end do
   end
end program main

Output:

0.8370000 sec     sm = 818.936749149117

So no one has explained why the line for i = 1:N shows such a high memory allocation? Isn’t that the key question here?


Memory allocation reports are often a bit fuzzy, in my experience. Since each line of julia code doesn’t necessarily correspond to exactly one line of lowered code or LLVM IR or assembly, it’s hard to align the reports perfectly (there may also just be bugs). I’d guess that those allocations are actually due to the objects allocated inside the for loop.
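
For what it's worth, the temporaries created inside the loop body (the column view, r/R, and the 3-element array literal on the right-hand side) account for that kind of allocation. Here is a sketch of an allocation-free version of the original function that sticks with plain matrices; it replaces norm with an explicit sqrt of squares and writes the components in place (not benchmarked here):

function get_orientations!(rotvec::AbstractMatrix{Float64},
                           orientations::AbstractMatrix{Float64})
    N = size(rotvec, 2)
    for i in 1:N
        # scalar reads instead of a column view or temporary vector
        x = rotvec[1, i]; y = rotvec[2, i]; z = rotvec[3, i]
        R = sqrt(x^2 + y^2 + z^2)
        q0 = cos(0.5 * R)
        s  = sin(0.5 * R) / R            # scalar factor instead of the temporary r/R
        q1 = s * x; q2 = s * y; q3 = s * z
        # write components directly instead of allocating a 3-element array literal
        orientations[1, i] = 2 * (q0 * q2 + q1 * q3)
        orientations[2, i] = -2 * (q0 * q1 - q2 * q3)
        orientations[3, i] = q0^2 - q1^2 - q2^2 + q3^2
    end
    return orientations
end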


I have seen memory allocations reported in such (and other similarly strange) places dozens of times.
I have always assumed that the line number is off (which rdeits seems to confirm).
After all, in my cases it is often a temporary array that causes the allocation, and these are easy to spot.

@Seif_Shebl, I’m curious what others will say! FWIW,

Julia + openlibm (default LIBM)

julia> x = rand(10000);

julia> @btime sin.(x);
  89.980 μs (26 allocations: 79.23 KiB)

Julia + Intel LIBM

julia> @btime sin.(x);
  83.800 μs (26 allocations: 79.23 KiB)

and the same for cos.

Yes, that explains part of it; it seems Julia doesn’t have fast (less accurate) versions of the core math functions. I’m not sure, though, and people here know more, so I’d also love to hear from others.

The following table compares run times of four math functions in Julia and Fortran (ifort), which supports my assumption. Allocations in all Julia versions are (2 allocations: 76.29 MiB).

Function    Julia        Julia (Yeppp)    Fortran (seconds)
sin         78.356 ms    27.854 ms        12.9999999E-03
cos         72.389 ms    26.460 ms        15.0000000E-03
exp         92.180 ms    23.605 ms        12.9999999E-03
norm        23.066 ms    —                 4.9999999E-03

EDIT:

Corrected results after removing allocations as suggested by @Elrod and @Sukera:

Function    Julia        Julia (Yeppp)    Fortran (seconds)
sin         58.230 ms    15.578 ms        12.9999999E-03
cos         60.838 ms    14.081 ms        15.0000000E-03
exp         80.156 ms    11.975 ms        12.9999999E-03
norm         5.381 ms    —                 4.9999999E-03

Yeppp seems really promising, but we still need fast pure Julia versions of these functions for scalar inputs.

Code used, in case someone wants to reproduce:

Pure Julia

julia> using BenchmarkTools, LinearAlgebra

julia> x = rand(10000000);

julia> @btime sin.($x);
  78.356 ms (2 allocations: 76.29 MiB)

julia> @btime cos.($x);
  72.389 ms (2 allocations: 76.29 MiB)

julia> @btime exp.($x);
  92.180 ms (2 allocations: 76.29 MiB)

julia> @btime norm.($x);
  23.066 ms (2 allocations: 76.29 MiB)

Julia using Yeppp package

julia> using Yeppp

julia> @btime Yeppp.sin($x);
  27.854 ms (2 allocations: 76.29 MiB)

julia> @btime Yeppp.cos($x);
  26.460 ms (2 allocations: 76.29 MiB)

julia> @btime Yeppp.exp($x);
  23.605 ms (2 allocations: 76.29 MiB)

And Fortran

program main 
   real*8  :: w(10000000), x(10000000), y(10000000), z(10000000), r
   integer :: t0, t1, count_rate, count_max

   call random_seed()
   call random_number(w)
   call random_number(x)
   call random_number(y)
   call random_number(z)

   call system_clock(t0, count_rate, count_max)
   w = sin(w)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, sum(w)

   call system_clock(t0)
   x = cos(x)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, sum(x)

   call system_clock(t0)
   y = exp(y)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, sum(y)

   call system_clock(t0)
   r = norm2(z)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, r
end program main 

Store the results, e.g., $y .= sin.($x) or Yeppp.sin!($y, $x). You can also try @fastmath.
Unfortunately, Yeppp is a poor cross-platform solution: there are lots of CPUs on which it is slow, and it was unmaintained, last I checked.
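
A sketch of what the in-place versions look like (assuming y = similar(x) has been allocated once up front):

julia> y = similar(x);

julia> @btime $y .= sin.($x);        # Base sin, writing into a preallocated buffer

julia> @btime Yeppp.sin!($y, $x);    # Yeppp's in-place variant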

Storing the result cuts the time by more than half for me, since there are no more allocations (I can’t compare to Fortran at the moment), while @fastmath does virtually nothing? Very strange.

julia> @btime $y .= sin.($x);
  76.100 ms (0 allocations: 0 bytes)

julia> @fastmath @btime $y .= sin.($x);
  75.789 ms (0 allocations: 0 bytes)

julia> @btime @fastmath $y .= sin.($x);    # maybe order matters?
  74.606 ms (0 allocations: 0 bytes)       # looks like it doesn't

julia> @btime @fastmath sin.($x);
  163.377 ms (2 allocations: 76.29 MiB)

julia> @btime sin.($x);
  165.506 ms (2 allocations: 76.29 MiB)

I’ve corrected the timings accordingly, and yes, it is about twice as fast now.

Have you tried @fastmath on your machine? It might just be that the machine I’m currently on can’t make much use of it, but yours might be able to.

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

There’s also julia --math-mode=fast for the same effect without the need for the macro in the code - this should make it faster but there might be some losses in accuracy.
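
For example, assuming the benchmark lives in a script called bench.jl, you would start Julia like this:

$ julia --math-mode=fast bench.jl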

EDIT:

Starting Julia with fast math shaves ~10 ms off @btime $y .= sin.($x); not much, but worth a try in my opinion.

I already use julia --math-mode=fast, because this is also what I do for Fortran when enabling aggressive optimizations (/fast). The strange thing is that --math-mode=fast only improves sin by 8% and cos by less than 20%, and does nothing in the other cases. That is what led me to the conclusion that Julia doesn’t have fast (less accurate) versions of the math functions. These matter, say, 99% of the time; only in very rare situations do we need to opt for ultra-high accuracy.

Two terrific talks from JuliaCon 2018 touching on this:

  1. The Future of Microprocessors, by Sophie Wilson
  2. Tricks and Tips in Numerical Computing, by Nick Higham