# Optimizing Calculation in Julia compared to C (New to Julia)

You’re welcome! And, when you time make sure to do something like the following if you are using global variables inside your function argument.

using BenchMarkTools

julia> A = rand(3,3);

julia> @btime inv($A); # we interpolate the global variable A with$A
1.191 μs (10 allocations: 2.31 KiB)

1 Like

Quick question regarding global variables and const. Coming from matlab, my work flow would be to do some minor calculations to create arrays (see example code below) then feed them into functions, like Sim1d up above, which would perform the bulk of the calculations. Would this work flow be frowned upon when using Julia? If I am reading all of the recommended documentation correctly, it seems that this would be bad as it could lead to significantly more computations than needed and potentially large data usage assuming I am not using @view properly.

Also here is a link to my full code showing how I structured it plus the two different implementations of my algorithm that was discussed above (https://github.com/loki3118/Julia_Code.git)

Example Code:

resol = 1e-6
tp = 0.005
num_pts = Int(round(tp / resol))
ts = range(-1, 1, length = num_pts)
#--------------------------------------------------------------------------
bw1 = 4.5 / tp
FOV = 10

scale = (2 * pi * bw1) / (gam * FOV)
grad = scale * ones(1, num_pts)


Yes, see below. From the link I shared, one notes that “REPL/Global scope does not allow for type-specificity”. In sum_bad() we define x, y in the global scope without a const in front so the compiler can’t specialize on the types. In the next two functions, I either pass x, y as arguments or declare them as a const. In both cases, the compiler can generate specialized code (see the output of @code_warntype) which results in a dramatic performance improvement.

Personally, I prefer the style of sum_good(x, y) because it’s easy for readability and you know what arguments the function operates on. In sum_also_good(), while there is no performance hit rel. to sum_good(), you would have to figure out from the REPL that you passed x,y with a const in front. This becomes cumbersome for large pieces of code.

julia> x = 1
1

julia> y = 2
2

x + y
end
sum_bad (generic function with 1 method)

Variables

Body::Any
1 ─ %1 = (Main.x + Main.y)::Any
└──      return %1

######################
# Two possible fixes
######################

# 1)  passing x, y as arguments
julia> function sum_good(x, y)
x + y
end
sum_good (generic function with 1 method)

julia> @code_warntype sum_good(x, y)
Variables
#self#::Core.Compiler.Const(sum_good, false)
x::Int64
y::Int64

Body::Int64
1 ─ %1 = (x + y)::Int64
└──      return %1

# 2) declaring globals with 'const' in front
julia> const a = 1
1

julia> const b = 2
2

julia> function sum_also_good()
a + b
end
sum_also_good (generic function with 1 method)

julia> @code_warntype sum_also_good()
Variables
#self#::Core.Compiler.Const(sum_also_good, false)

Body::Int64
1 ─ %1 = (Main.a + Main.b)::Core.Compiler.Const(3, false)
└──      return %1

#benchmarks
20.744 ns (0 allocations: 0 bytes)
3

julia> @btime sum_good($x,$y)
0.001 ns (0 allocations: 0 bytes)
3

julia> @btime sum_also_good()
0.001 ns (0 allocations: 0 bytes)
3

1 Like

You generally want to use a barrier function if the dynamic dispatch will have to occur. If you really want to use a global variable, you should enclose it in a barrier function to limit the number of times dispatch has to occur. In the following example, you can see that the normal usage of the global variable results in many allocations. By using a barrier function, you limit to the dynamic dispatch cost to a single function call.

using BenchmarkTools

scale = 2.0
a=rand(1000);

foo(a) = sum(x->x*scale,a)

function foo_caller(a)
foo_inner(a,scale)
end

function foo_inner(a,scale)
sum(x->x*scale,a)
end

@btime foo($a) 37.999 μs (2999 allocations: 46.86 KiB) @btime foo_caller($a)  # The barrier limits the allocations
228.454 ns (2 allocations: 32 bytes)
@btime foo_inner($a,$scale)  #No allocations were necessary past the barrier
83.090 ns (0 allocations: 0 bytes)

2 Likes

I would store M as a vector of static arrays since one dimension (3) is known and fixed. This will probably speed things up quite a lot.

See my previous answer here for a similar situation

2 Likes

That is not a problem here, M is an input argument to the function so it should be just fine.

3 Likes

I have converted a fair amount of Matlab/C code to Julia. If you have existing and tested computational C code, my advice is to keep the code as is and only do the necessary syntactical and indexing changes. Julia will handle loops perfectly fine and converting to a vectorized style is only worth it if it adds clarity without compromising performance.

5 Likes

1 Like

Threading can certainly be done with loops (in fact, I thought threaded loops were the normal way). I’m not veryfamiliar with gpu processing, but there are things like https://github.com/vchuravy/GPUifyLoops.jl

Vector/matrix operation can benefit from BLAS, though.

2 Likes

Yes, but in the context of porting code from Matlab/C, chances are that the parts that were implemented in C were so exactly because they would not fit into BLAS calls.

Thank you everyone for the feed back (and Happy New Year). It has been a big help getting though I am still running into issues where I am not utilizing the features of Julia. Per all of the recommendations, I tried to specify the input type to avoid any Type-Instabilities but have been struggling getting a sense of how to properly utilize in-place operations (like foo! and .*) and @views. I am trying to optimize the following code to better utilize these operations to cut down on garbage collecting. Also any recommendations on how to track down where I am generating garbage would be greatly appreciated.

function M_Avg(numAvg::Int64, Readout::Int64, x_pts::Int64, y_pts::Int64, M_in_vec::Array{Float64,1})

@progress for n1 in 1:Int(x_pts/numAvg)
for n2 in 1:Int(y_pts/numAvg)
M_out[:, n1, n2]=
sum(sum(M_in_mat[:,1+numAvg*(n1-1):n1*numAvg,1+numAvg*(n2-1):n2*numAvg],dims = 3),dims = 2)/numAvg
end
end
return M_out
end

1. Can you time your code using @btime and report the output? This will help us see where there might be potential for improvements.

2. Use @codewarntype to check for type-instabilities in your code and let us know the ouput in the reply. This will also be helpful as a first step.

3. Where is M_out_RO initialized?

Addressing point #3 first. I miss typed it, sorry about that. It was initialized as all zeros that is then filled during the two for loops.

Regarding the other points, I have this function inside of another function that I then run, i.e. setup like the sudo code below. I believe placing @btime when I call my outer most function wont be that informative as it includes several different functions being called. That might be also true of @codewarntype but I am not that familiar with that package. To over come this I’ve been using @profiler to get a better sense of whats going on.

# Defined in its own .jl files else where
Function Function_to_Calculate_things()
more calculation stuff
return more calculation stuff
end

Function M_Avg()
return output
end

Function Do_stuff(....)
calculates_stuff = Function_to_Calculate_things(...)
Matrix_avg = M_Avg(Calculated_stuff,...)
return Matrix_avg
end

# Main .jl file that defined everything required for Do_stuff()
include("Do_stuff.jl")
Final_out = Do_stuff(...)


Hmm, I see. I thought you wanted help in M_Avg() only. Can you give an example for the values of arguments that you are running M_Avg() with?

If you’d like, we could start looking at things starting at M_Avg() first and see if we can improve this inner-most function. Then, maybe you can scale up based on suggestions that I or others might have in the responses.

I suspect that M_Avg() is slow and an issue as I know there should be a better/cleaner way to implement what I am trying to do. I do agree that addressing the inner most loop would be the easiest. In words, I’m trying to average adjacent points together in a matrix. In 2d this would be like going from 100x50 to 100x25, assuming I’m averaging 2 points together, hence numAvg.

# Example Input for M_Avg
function M_Avg(numAvg = 2, Readout = 2742, x_pts = 80, y_pts = 80, M_in_vec =  Array{Float64,1} )

# Expected Output
size(M_out) = 2742, 40, 40


On a side note, since I’ve already constructed a somewhat complex program that calls all of my functions from an included .jl file. What is considered the best practice when trying to debug to get the most performance? Did I already construct my program in a way that is frowned upon?

is an excellent resource for profiling code.

It’s hard to say without really understanding what each function in your code is supposed to do. However, in your function definition of M_Avg() your last argument is an Array{Float64, 1} but in your latest reply where you are giving an example for function arguments to use you are passing an integer as your last argument, so you should edit that unless I am mistaken.

To make things easier, pick smaller arguments for your function, run it and post with output below.

2 Likes

Thank you for the suggestion, I made the change. I was feeding in a Array{Float64,1} but wanted to give a sense of the size.

Here is a example of my function with corresponding @btimes. Thinking about my problem more, I might be able to utilize some of Julia’s imresize functions to try to get the results I would like. In a sense I am trying to downsample an 3D image (TxMxN with T >> (M,N)).

Thank you again for the help!

using LinearAlgebra
using BenchmarkTools

## Functions
function M_Avg(numAvg::Int64, Readout::Int64, x_pts::Int64, y_pts::Int64, M_in_vec::Array{Float64,1})

for n1 in 1:Int(x_pts/numAvg)
for n2 in 1:Int(y_pts/numAvg)
M_out[:, n1, n2]=
sum(sum(M_in_mat[:,1+numAvg*(n1-1):n1*numAvg,1+numAvg*(n2-1):n2*numAvg],dims = 3),dims = 2)/numAvg
end
end
return M_out
end

## Example 1
long_dim = 10;
smaller_dim = 10;
A = rand(long_dim,smaller_dim,smaller_dim)
A_smaller = @btime M_Avg(2, long_dim, smaller_dim, smaller_dim, A[:])

@btime:
57.610 μs (554 allocations: 47.67 KiB)

## Example 2
long_dim = 3000;
smaller_dim = 80;
A = rand(long_dim,smaller_dim,smaller_dim)
A_smaller = @btime M_Avg(2, long_dim, smaller_dim, smaller_dim, A[:])

@btime:
267.356 ms (41606 allocations: 477.44 MiB)



You may find the discussion above helpful. Especially, the comment made by @baggepinnen

Re @codewarntype, it’s just a helpful quick-way to check for all type-instabilities like so

julia> function f(x, y)
return x + y
end
f (generic function with 1 method)

julia> @code_warntype f(1, 2)
Variables
#self#::Core.Compiler.Const(f, false)
x::Int64
y::Int64

Body::Int64
1 ─ %1 = (x + y)::Int64
└──      return %1

julia> @code_warntype f(1.0, 2.0)
Variables
#self#::Core.Compiler.Const(f, false)
x::Float64
y::Float64

Body::Float64
1 ─ %1 = (x + y)::Float64
└──      return %1

julia> x = 1
1

julia> y = 2
2

julia> function g()
return x + y
end
g (generic function with 1 method)

julia> @code_warntype g()
Variables
#self#::Core.Compiler.Const(g, false)

Body::Any
1 ─ %1 = (Main.x + Main.y)::Any
└──      return %1


In the above example, f(x, y) is type-stable (all types are known at compile-time) but g() is not as the return type is not known at compile time. The Any in the @code_warntype output for g() will likely show up in your REPL in red. The reason is that in f(x,y) I pass x, y as arguments but they are global variables for the function g().

Notice that for all functions a specialized compiled code is generated depending on your argument type. @code_warntype simply allows you to see that. Please keep in mind that all type-instabilities are not important as mentioned in the link below. There are some functions in Base like Base.getindex which are type-unstable by design. Similarly, the iterator interface for a for loop will also be type-unstable and so on. So don’t worry about all type-instabilities!

https://docs.julialang.org/en/v1/manual/performance-tips/#man-code-warntype-1

1 Like

Just one hint: you are using sin and cos (twice) of \theta in one expression. Julia has sincos to calculate those utilizing both even and odd exponents. I would try something like

sin_θ,cos_θ = sincos(θ)

1 Like

I don’t have experience with imresize but I changed your function into the following:

function M_Avg_new(numAvg::N, Readout::N, x_pts::N, y_pts::N, M_in_vec::Vector{T}) where {N<:Integer,T}

M_in_mat = reshape(M_in_vec, Readout, x_pts, y_pts)
M_out = zeros(T, Readout, x_pts÷numAvg, y_pts÷numAvg)

idx(n) = 1+numAvg*(n-1):n*numAvg

@inbounds for n1 in 1:x_pts÷numAvg
for n2 in 1:y_pts÷numAvg

s = zero(T)

for x in idx(n1), y in idx(n2)
s += M_in_mat[r,x,y]
end

M_out[r, n1, n2] = s/numAvg
end
end
end

return M_out
end


On Julia v1.3.1 I get the following timings:

julia> @btime M_Avg(2, long_dim, smaller_dim, smaller_dim, $(A[:])); 153.276 ms (41604 allocations: 330.96 MiB) julia> @btime M_Avg_new(2, long_dim, smaller_dim, smaller_dim,$(A[:]));
28.924 ms (4 allocations: 36.62 MiB)


• No need to specify Int64 or Float64 in this case. It will not make your code run faster and will only make your code less “generic”. If you like to annotate types of input arguments for clarity, you can do what I did above to still have “generic” code.
• You can use div or ÷ (\div tab in Julia REPL) for integer division.
• Inside of your loop you create a 3D subarray of the input array M_in_mat. This will lead to memory allocations and hurts performance. In some situations you can simply fix this with @views but I think in your case that won’t help because the 3D subarray is not a continuous block in memory. Therefore I decided to just write out things explicitly with loops instead.
• Just like indexing into an array allocates memory for the subarrays, calling sum(A, dims=3) or sum(A, dims=2) will allocate memory for the output array. Note that sum(A) will not allocate because the output is just a number (assuming the elements of A are themselves just numbers). So again, I just decided to write out a for loop explicitly.
• The @inbounds around the for loops prevents the runtime checking of whether the array indices are within bounds. This can save some time especially when there’s a lot of looping and not a lot of computation.
• Perhaps it’s possible to use things like map, dot fusion and whatnot to reduce the number of loops without introducing memory allocations. But I thought that - since you’re coming from Matlab - it would maybe be nice to see that just writing out loops is not actually bad for performance.
• When $btimeing your code, put a $ in front or the input arguments to “interpolate” them. To illustrate why: when you do @btime foo(rand(1000,1000)), you are timing the creation of the random matrix as well. With @btime foo(\$(rand(1000,1000))) you are timing what foo does.