Memory allocations in for loop variable

Can someone explain how to minimize the memory allocations reported for the line for i = 1:N? The following is the output from running with --track-allocation=user on a simple function.

        - using LinearAlgebra: norm

        - function get_orientations!(rotvec::AbstractArray{Float64,2},
        -     orientations::AbstractArray{Float64,2})
        -     N = size(rotvec, 2)::Int
114299917     for i = 1:N
        -         r = @view rotvec[:,i]
  4800000         R = norm(r)
        0         q0 = cos(0.5*R)
        0         q1, q2, q3 = sin(0.5*R)*(r/R)
 38400000         orientations[:,i] = [2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3),
        -         q0^2 - q1^2 - q2^2 + q3^2]
        -     end
        - end
        - 
        - rotvec = rand(3,100000)::Array{Float64,2}
        - ori = zeros(Float64, size(rotvec))
        - get_orientations!(rotvec, ori)

This code is an excellent use case for StaticArrays.

Currently, you’re using a 2D matrix to store a list of 3-element points. That’s a very common idiom in Matlab and Numpy, but Julia can do better. If you instead store your vector of points as a Vector of 3-element static vectors, then it becomes easy to manipulate each element without allocating any memory at all.
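
If your data already lives in a 3×N matrix, one simple way to convert it to a Vector of 3-element static vectors is a copying comprehension like the sketch below (the name rotvec_svec is just illustrative; a reinterpret-based view is another option):

julia> using StaticArrays

julia> rotvec_svec = [SVector{3,Float64}(rotvec[1,i], rotvec[2,i], rotvec[3,i])
                      for i in 1:size(rotvec, 2)];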

Benchmarking your code:

julia> using BenchmarkTools

julia> @btime get_orientations!($rotvec, $ori)
  12.987 ms (700000 allocations: 41.20 MiB)

And here’s an alternative implementation using static vectors instead:

julia> using StaticArrays

julia> function orientation(r::SVector{3, Float64})
           R = norm(r)
           q0 = cos(0.5 * R)
           q1, q2, q3 = sin(0.5 * R) * (r / R)
           SVector(2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3),
                   q0^2 - q1^2 - q2^2 + q3^2)
       end
orientation (generic function with 1 method)

julia> rotvec = [rand(SVector{3, Float64}) for i in 1:100000];

julia> ori = similar(rotvec);

julia> @btime $ori .= orientation.($rotvec)
  1.830 ms (0 allocations: 0 bytes)

That’s about 7 times faster, uses less code, and allocates exactly zero memory.


Thank you very much for the help. I am a newbie; I just began exploring Julia by porting some of my Python code.


Also, see the excellent Rotations.jl package.


Thank you. You are right. My function implements some quaternion multiplication. So Rotations.jl seems very useful.

@rdeits I am unable to find a reference for the $ operator you are using. I am using Julia v1.0, and when I try anything like $ori in the REPL, I get ERROR: syntax: "$" expression outside quote. The only context in which it works for me is string interpolation.

rdeits can give a more technical answer, but you can find a description of interpolation with $ in calls to BenchmarkTools at GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language


It’s specific to the @btime benchmarking macro from BenchmarkTools: GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language
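
Roughly speaking, $ interpolates the value into the benchmark expression, so @btime doesn't also measure the overhead of looking up a non-constant global. A small sketch of the difference (the variable name x is arbitrary):

julia> x = rand(1000);

julia> @btime sum(x);    # x is treated as a non-constant global
julia> @btime sum($x);   # value interpolated; benchmarked as if x were a local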

Edit: that means that you only need to use it when benchmarking. Inside your own functions, you can just do ori .= orientation.(rotvec)
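
For example, a sketch of what that might look like in ordinary code (the function name update_orientations! is just illustrative):

function update_orientations!(ori, rotvec)
    ori .= orientation.(rotvec)   # broadcast orientation over the vector of SVectors
    return ori
end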


You can get about 25% more speed by marking orientation with @inline. The original timing:

julia> @btime $ori .= orientation.($rotvec);
  1.827 ms (0 allocations: 0 bytes)

and with @inline:

julia> @btime $ori .= orientation.($rotvec);
  1.473 ms (0 allocations: 0 bytes)
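
For reference, the change is presumably just redefining the same orientation function with @inline in front, along these lines:

@inline function orientation(r::SVector{3, Float64})
    R = norm(r)
    q0 = cos(0.5 * R)
    q1, q2, q3 = sin(0.5 * R) * (r / R)
    SVector(2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3),
            q0^2 - q1^2 - q2^2 + q3^2)
end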

That is impressive, of course, though I’m still wondering why Julia didn’t close the gap with ifort in this case. The following Fortran translation runs 1.76x faster, at 0.837 ms per call (with aggressive optimizations on). Will Julia be able to close this gap in coming releases, or is this an unavoidable cost of dynamic dispatch? I have a partial answer, which lies in optimizing math functions such as sin, cos, and norm, but I’m not sure; I wanted to hear from you.

program main 
   implicit none 
   real*8  :: rotvec(3,100000), ori(3,100000), sm = 0
   integer :: i, t0, t1, count_rate, count_max

   call random_number(rotvec)

   call system_clock(t0, count_rate, count_max)
   do i = 1, 1000 
      call get_orientations(rotvec, ori)
      sm = sm + ori(1,1000) ! to force calculation
   end do
   call system_clock(t1)
   
   print *, real(t1-t0)/count_rate, sm 

   contains 
   subroutine get_orientations(rotvec,ori)
      real*8  :: rotvec(:,:), ori(:,:)
      real*8  :: q(3), q0, q1, q2, q3, R
      integer :: i 
      do i  = 1, size(rotvec, 2)
         q  = rotvec(:,i)
         R  = norm2(q)
         q0 = cos(0.5*R)
         q  = sin(0.5*R) * q/R
         q1 = q(1);  q2 = q(2);  q3 = q(3)
         ori(:,i) = [2*(q0*q2 + q1*q3), -2*(q0*q1 - q2*q3), q0**2 - q1**2 - q2**2 + q3**2]
      end do
   end
end program main

Output:

0.8370000 sec     sm = 818.936749149117

So no one has explained why the line for i = 1:N shows such a high memory allocation? Isn’t that the key question here?


Memory allocation reports are often a bit fuzzy, in my experience. Since each line of julia code doesn’t necessarily correspond to exactly one line of lowered code or LLVM IR or assembly, it’s hard to align the reports perfectly (there may also just be bugs). I’d guess that those allocations are actually due to the objects allocated inside the for loop.
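
For what it's worth, the temporaries created inside the loop body (the column view, r/R, and the 3-element array literal on the right-hand side) account for that kind of allocation. Here is a sketch of an allocation-free version of the original function that sticks with plain matrices; it replaces norm with an explicit sqrt of squares and writes the components in place (not benchmarked here):

function get_orientations!(rotvec::AbstractMatrix{Float64},
                           orientations::AbstractMatrix{Float64})
    N = size(rotvec, 2)
    for i in 1:N
        # scalar reads instead of a column view or temporary vector
        x = rotvec[1, i]; y = rotvec[2, i]; z = rotvec[3, i]
        R = sqrt(x^2 + y^2 + z^2)
        q0 = cos(0.5 * R)
        s  = sin(0.5 * R) / R            # scalar factor instead of the temporary r/R
        q1 = s * x; q2 = s * y; q3 = s * z
        # write components directly instead of allocating a 3-element array literal
        orientations[1, i] = 2 * (q0 * q2 + q1 * q3)
        orientations[2, i] = -2 * (q0 * q1 - q2 * q3)
        orientations[3, i] = q0^2 - q1^2 - q2^2 + q3^2
    end
    return orientations
end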


I have seen memory allocations reported in such (and other similarly strange) places dozens of times.
I have always assumed that the line number is off (which rdeits seems to confirm).
After all, in my cases it is often a temporary array that causes the allocation, and these are easy to spot.

@Seif_Shebl, I’m curious what others will say! FWIW,

Julia + openlibm (default LIBM)

julia> x = rand(10000);

julia> @btime sin.(x);
  89.980 μs (26 allocations: 79.23 KiB)

Julia + Intel LIBM

julia> @btime sin.(x);
  83.800 μs (26 allocations: 79.23 KiB)

and the same for cos.

Yes, that explains part of it; it seems Julia doesn’t have fast (less accurate) versions of the core math functions. I’m not sure, though, and people here know more, so I’d also love to hear from others.

The following table compares run times of four math functions in Julia and Fortran (ifort), which supports my assumption. Allocations in all Julia versions are (2 allocations: 76.29 MiB).

Function    Julia        Julia (Yeppp)    Fortran (seconds)
sin         78.356 ms    27.854 ms        12.9999999E-03
cos         72.389 ms    26.460 ms        15.0000000E-03
exp         92.180 ms    23.605 ms        12.9999999E-03
norm        23.066 ms    —                 4.9999999E-03

EDIT:

Corrected results after removing allocations as suggested by @Elrod and @Sukera:

Function    Julia        Julia (Yeppp)    Fortran (seconds)
sin         58.230 ms    15.578 ms        12.9999999E-03
cos         60.838 ms    14.081 ms        15.0000000E-03
exp         80.156 ms    11.975 ms        12.9999999E-03
norm         5.381 ms    —                 4.9999999E-03

Yeppp seems really promising, but we still need fast pure Julia versions of these functions for scalar inputs.

Code used, in case someone wants to reproduce:

Pure Julia

julia> using BenchmarkTools, LinearAlgebra

julia> x = rand(10000000);

julia> @btime sin.($x);
  78.356 ms (2 allocations: 76.29 MiB)

julia> @btime cos.($x);
  72.389 ms (2 allocations: 76.29 MiB)

julia> @btime exp.($x);
  92.180 ms (2 allocations: 76.29 MiB)

julia> @btime norm.($x);
  23.066 ms (2 allocations: 76.29 MiB)

Julia using Yeppp package

julia> using Yeppp

julia> @btime Yeppp.sin($x);
  27.854 ms (2 allocations: 76.29 MiB)

julia> @btime Yeppp.cos($x);
  26.460 ms (2 allocations: 76.29 MiB)

julia> @btime Yeppp.exp($x);
  23.605 ms (2 allocations: 76.29 MiB)

And Fortran

program main 
   real*8  :: w(10000000), x(10000000), y(10000000), z(10000000), r
   integer :: t0, t1, count_rate, count_max

   call random_seed()
   call random_number(w)
   call random_number(x)
   call random_number(y)
   call random_number(z)

   call system_clock(t0, count_rate, count_max)
   w = sin(w)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, sum(w)

   call system_clock(t0)
   x = cos(x)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, sum(x)

   call system_clock(t0)
   y = exp(y)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, sum(y)

   call system_clock(t0)
   r = norm2(z)
   call system_clock(t1)
   print *, real(t1-t0)/count_rate, r
end program main 

Store the results, e.g., $y .= sin.($x) or Yeppp.sin!($y, $x). You can also try @fastmath.
Unfortunately, Yeppp is a poor cross-platform solution: there are lots of CPUs on which it is slow, and it was unmaintained, last I checked.
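
A sketch of what the in-place versions look like (assuming y = similar(x) has been allocated once up front):

julia> y = similar(x);

julia> @btime $y .= sin.($x);        # Base sin, writing into a preallocated buffer

julia> @btime Yeppp.sin!($y, $x);    # Yeppp's in-place variant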

Storing the result cuts the time by more than half for me, since there are no more allocations (I can’t compare to Fortran at the moment), while @fastmath does virtually nothing? Very strange.

julia> @btime $y .= sin.($x);
  76.100 ms (0 allocations: 0 bytes)

julia> @fastmath @btime $y .= sin.($x);
  75.789 ms (0 allocations: 0 bytes)

julia> @btime @fastmath $y .= sin.($x);    # maybe order matters?
  74.606 ms (0 allocations: 0 bytes)       # looks like it doesn't

julia> @btime @fastmath sin.($x);
  163.377 ms (2 allocations: 76.29 MiB)

julia> @btime sin.($x);
  165.506 ms (2 allocations: 76.29 MiB)

I’ve corrected the timings accordingly, and yes, it is about twice as fast now.

Have you tried @fastmath on your machine? It might just be that the machine I’m currently on can’t make much use of it, but yours might be able to.

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

There’s also julia --math-mode=fast for the same effect without the need for the macro in the code - this should make it faster but there might be some losses in accuracy.
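
For example, assuming the benchmark lives in a script called bench.jl, you would start Julia like this:

$ julia --math-mode=fast bench.jl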

EDIT:

Starting Julia with fast math shaves ~10 ms off @btime $y .= sin.($x); not much, but worth a try in my opinion.

I already use julia --math-mode=fast, because this is also what I do for Fortran when enabling aggressive optimizations (/fast). The strange thing is that --math-mode=fast only improves sin by 8% and cos by less than 20%, and does nothing in the other cases. That is what led me to the conclusion that Julia doesn’t have fast (less accurate) versions of the math functions. These matter, say, 99% of the time; only in very rare situations do we need to opt for ultra-high accuracy.

Two terrific talks from JuliaCon 2018 touching on this:

  1. The Future of Microprocessors, by Sophie Wilson
  2. Tricks and Tips in Numerical Computing, by Nick Higham