High order approximation of a multi-country trade model

Hi,

I hope this finds you well. I recently moved over from Matlab in search of better computational performance. I am doing this post to get best practices on higher-order approximations for large problems such as the one in the title.

The current problem is to solve a multi-country trade model using bilateral trade data for 44 countries. Currently, we have a simple Armington model setup, but in the future, we would like to extend the model to allow for multiple industries and within-country geographies.

As a first approach, I tried automatic differentiation with ForwardDiff.jl. The results were very slow and caused StackOverFlowError at the 4th order approximation. The codes made a very large number of allocations. So I attributed this low performance to the ForwardDiff.jl package and moved on to the next method. This MWE can also be posted if needed.

Next, I tried numerical differentiation with FiniteDifferences.jl as a more intuitive way to obtain derivatives. I have already checked that even Matlab can handle the approximation up to 3rd order, so I thought this would not cause any speed problems. The official page of FiniteDifferences.jl also includes the following information: “FiniteDifferences.jl supports higher-order approximation and step size adaptation” and “Finite difference methods are optimized to minimize allocations,” so I was optimistic about the results.

But to my surprise, my Julia code was very slow at the second-order approximation and not practical. Again, a very large number of allocations were made, and I imagine this may have affected performance. Below is the MWE, and I also attached the results of @profview when I tried the 2nd-floor approximation using FiniteDifferences.jl.

Based on these circumstances, my current expectation is that the performance of my code may be slowed down due to the internal behavior of these packages, regardless of how the differentiation is performed (ForwardDiff.jl or FiniteDifferences.jl). How can I quickly get a higher-order approximation of the title problem? Any help is highly appreciated.

Best,
Daisuke

#---------------------------------------
# housekeeping
#---------------------------------------

using FiniteDifferences, Profile, LinearAlgebra

# function about equilibrium condition (excess labor demand)
function eha_wzet_combined(wzet,y,x,sig,N)
# define the matrix dimension
tau_mat = vcat(ones(N-1,N), [wzet[N+1]*ones(N-1); 1]');
# armington condition
excess = wzet[1:N] -
sum(
y .* ((tau_mat .* wzet[1:N]) .^ (1-sig) ./
sum( x .* (tau_mat .* wzet[1:N]) .^ (1-sig), dims = 1 )) .*
wzet[1:N]',
dims = 2
)[:,1];
return excess
end

# function to compute the sum of 2nd order derivatives
function sum_dDyFdyj_D11(wzet,N,ddw,Dg1)
out = ddw[1](wzet)[1:N-1,1:N-1] * Dg1(wzet)[1]
if N > 2
for j in 2:N-1
out += ddw[j](wzet)[1:N-1,1:N-1] * Dg1(wzet)[j];
end
end
return out
end

#---------------------------------------
# parameters and data
#---------------------------------------

N = 44;             # number of countries
sig = 4;            # elasticity of substitution

homebias = 0.9;     # the size of the domestic consumption tendency
X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N);
y = X./sum(X,dims=2);
x = X./sum(X,dims=1);

wzet = ones(N+1);   # initial point

#---------------------------------------
# 1st order solution
#---------------------------------------

ed(wzet) = eha_wzet_combined(wzet,y,x,sig,N);
dfwzet(wzet) = jacobian(central_fdm(5,1), ed, wzet)[1];
Dg1(wzet) = - dfwzet(wzet)[1:N-1,1:N-1] \ dfwzet(wzet)[1:N-1,N+1];

#---------------------------------------
# 2nd order solution
#---------------------------------------

dfzet = wzet -> dfwzet(wzet)[:,N+1];
ddfzwzet(wzet) = jacobian(central_fdm(5,1), dfzet, wzet)[1];

ddw = Array{Function}(undef, N-1)
for j in 1:N-1
dfwj = wzet -> dfwzet(wzet)[:,j];
ddw[j] = wzet -> jacobian(central_fdm(5,1), dfwj, wzet)[1]
end

H2(wzet) = ( sum_dDyFdyj_D11(wzet,N,ddw,Dg1) + ddfzwzet(wzet)[1:N-1,1:N-1] ) * Dg1(wzet) + ddfzwzet(wzet)[1:N-1,1:N-1] * Dg1(wzet) + ddfzwzet(wzet)[1:N-1,N+1];
Dg2(wzet) = - dfwzet(wzet)[1:N-1,1:N-1] \ H2(wzet);

Dg2(ones(N+1))   # run once to force compilation
@time Dg2(ones(N+1))

#@profview Dg2(ones(N+1))


1 Like

Forgot to attach the profview result:

Hello and welcome to the community
I think most of the allocations actually come from your code. In Julia, creating a slice of an array like this array[a:b] creates a copy. You can avoid this by calling view or using the macro @view array[a:b].

The sum expressions can likely be rewritten into a double loop that avoids allocations altogether, remember, in Julia, loops are fast and are sometimes the preferred way of writing code for performance.

You can learn more in the performance tips secntion of the documentation.

1 Like

@baggepinnen Thanks so much for your quick suggestions. I have followed two of your advice. The first one seems to work great as I added @views to all the lines containing slices of a matrix, and the allocations were reduced by ~400-500%. (my apologies for not carefully checking the performance tips page. I knew the page, but I had only skimmed through it)

However, the second piece of advice did not help so much as I have rewritten the sum part of my codes into the loop, but neither the allocation nor total computation time did not change. As a result, I do not see that my codes with ForwardDifferences.jl are still too slow to be practical at the second order. I am attaching the updated MWE again. If anybody can give me what went wrong and further suggestions, that would be great.

#---------------------------------------
# housekeeping
#---------------------------------------

using FiniteDifferences, Profile, LinearAlgebra

# function about equilibrium condition (excess labor demand)
function eha_wzet_combined(wzet,y,x,sig,N)
# define the matrix dimension
@views tau_mat = vcat(ones(N-1,N), [wzet[N+1]*ones(N-1); 1]');
# armington condition
# 1. compute the change in the market access (cma)
cma_mat = x .* (tau_mat .* wzet[1:N]) .^ (1-sig);
cma = zeros(1,N);
for i = 1:N
@views cma += cma_mat[[i],:];
end
# 2. sum over the change in labor demand across destination (cld)
@views cld_mat = y .* (tau_mat .* wzet[1:N]) .^ (1-sig) .* wzet[1:N]' ./ cma;
cld = zeros(N,1);
for j = 1:N
@views cld += cld_mat[:,[j]];
end
@views excess = wzet[1:N] - cld;
return excess
end

# function to compute the sum of 2nd order derivatives
function sum_dDyFdyj_D11(wzet,N,ddw,Dg1)
@views out = ddw[1](wzet)[1:N-1,1:N-1] * Dg1(wzet)[1]
if N > 2
for j in 2:N-1
@views out += ddw[j](wzet)[1:N-1,1:N-1] * Dg1(wzet)[j];
end
end
return out
end

#---------------------------------------
# parameters and data
#---------------------------------------

N = 44;             # number of countries
sig = 4;            # elasticity of substitution

homebias = 0.9;     # the size of the domestic consumption tendency
X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N);
y = X./sum(X,dims=2);
x = X./sum(X,dims=1);

wzet = ones(N+1);   # initial point

#---------------------------------------
# 1st order solution
#---------------------------------------

ed(wzet) = eha_wzet_combined(wzet,y,x,sig,N);
@views dfwzet(wzet) = jacobian(central_fdm(5,1), ed, wzet)[1];
@views Dg1(wzet) = - dfwzet(wzet)[1:N-1,1:N-1] \ dfwzet(wzet)[1:N-1,N+1];

Dg1(wzet)   # run once to force compilation
@time Dg1(wzet)

#---------------------------------------
# 2nd order solution
#---------------------------------------

@views dfzet = wzet -> dfwzet(wzet)[:,N+1];
@views ddfzwzet(wzet) = jacobian(central_fdm(5,1), dfzet, wzet)[1];

ddw = Array{Function}(undef, N-1)
for j in 1:N-1
@views dfwj = wzet -> dfwzet(wzet)[:,j];
@views ddw[j] = wzet -> jacobian(central_fdm(5,1), dfwj, wzet)[1]
end

@views H2(wzet) = ( sum_dDyFdyj_D11(wzet,N,ddw,Dg1) + ddfzwzet(wzet)[1:N-1,1:N-1] ) * Dg1(wzet) + ddfzwzet(wzet)[1:N-1,1:N-1] * Dg1(wzet) + ddfzwzet(wzet)[1:N-1,N+1];
@views Dg2(wzet) = - dfwzet(wzet)[1:N-1,1:N-1] \ H2(wzet);

Dg2(wzet)   # run once to force compilation
@time Dg2(wzet)

#@profview Dg2(ones(N+1))


Now I have started to break down the problem. I began by looking into the core function of computing the excess labor demand eha_wzet_combined. Using @baggepinnen’s and other pieces of advice from the documentation, I have arrived at the following code. Unfortunately, this still allocates 17, and I don’t know how to reduce it further. Is there any advice?

using FiniteDifferences, LinearAlgebra, BenchmarkTools

#---------------------------------------
# internal functions
#---------------------------------------

function eha_wzet_combined(wzet,y,x,sig,N)
# define the matrix dimension
@views tau_mat = vcat(ones(N-1,N), [wzet[N+1]*ones(N-1); 1]');
# armington condition
@views excess = wzet[1:N] -
sum(
y .* ((tau_mat .* wzet[1:N]) .^ (1-sig) ./
sum( x .* (tau_mat .* wzet[1:N]) .^ (1-sig), dims = 1 )) .*
wzet[1:N]',
dims = 2
)[:,1];
return excess
end

#---------------------------------------
# parameters and data
#---------------------------------------

N = 44;             # number of countries
sig = 4;            # elasticity of substitution

homebias = 0.9;     # the size of the domestic consumption tendency
X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N);
y = X./sum(X,dims=2);
x = X./sum(X,dims=1);

wzet = ones(N+1);   # initial point

#---------------------------------------
# GO!
#---------------------------------------
@btime eha_wzet_combined(wzet,y,x,sig,N)


The result is:

13.917 μs (17 allocations: 63.38 KiB)


Nice job!

Let’s break down a few statements further:

vcat(ones(N-1,N), [wzet[N+1]*ones(N-1); 1]')


this will allocate

• ones(N-1)
• the product of ones with wzet[N+1]
• the concatenation of [wzet[N+1]*ones(N-1); 1]
• ones(N-1,N)
• the outermost concatenation by vcat

so this simple line alone allocates at least 5 times.
To reduce these allocations, you may allocate a single ones and the write wzet[N+1] into that at appropriate places. Better yet would be to allocate this array once, outside of the function, and then reuse the same allocated memory each time you call the function so that it never has to be allocated again.

sum( x .* (tau_mat .* wzet[1:N]) .^ (1-sig), dims = 1 )


This line will allocate the matrix that is summed over and then one array for the output of sum. Since you are computing a sum, over something that is an elementwise product, there should be no need to allocate at all, you could simply loop over all the entries and write the sum straight into the output array.

2 Likes

Thanks so much for the quick reply. Two points:

1. I have followed your point that tau_mat should be pre-allocated. The below MWE have done that, and I have checked that @btime @views broadcast!(.^, tau_mat, wzet[N+1], kappa); produces 4.018 μs (1 allocation: 16 bytes). This is a great improvement, and I think the remaining 1 allocation is due to the slicing wzet[N+1] (am I correct?). But this point seems minor and I would like to see more significant bottlenecks.

2. Regarding the sum point, I am still stuck. I tried the following loop-based summation that gives the same solution, but the overall allocation actually did not improve compared to the previous codes using sum. Am I missing anything?

using FiniteDifferences, LinearAlgebra, BenchmarkTools

#---------------------------------------
# internal functions
#---------------------------------------

# function about equilibrium condition (excess labor demand)
function eha_wzet_combined(wzet,y,x,kappa,sig,N,tau_mat,cma,cld)
# define the matrix dimension

# armington condition
# 1. compute the change in the market access (cma)
cma_mat = x .* (tau_mat .* wzet[1:N]) .^ (1-sig);
for i = 1:N
@views cma += cma_mat[[i],:];
end
# 2. sum over the change in labor demand across destination (cld)
@views cld_mat = y .* (tau_mat .* wzet[1:N]) .^ (1-sig) .* wzet[1:N]' ./ cma;
for j = 1:N
@views cld += cld_mat[:,[j]];
end
@views excess = wzet[1:N] - cld;
return excess
end

#---------------------------------------
# parameters and data
#---------------------------------------

N = 44;             # number of countries
sig = 4;            # elasticity of substitution

homebias = 0.9;     # the size of the domestic consumption tendency
X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N);
y = X./sum(X,dims=2);
x = X./sum(X,dims=1);

kappa = vcat(zeros(N-1,N), [ones(N-1); 0]'); # the kernel of the trade cost change
tau_mat = ones(N,N);

wzet = ones(N+1);   # initial point

# pre-allocate
cma = zeros(1,N);   # change in the market access
cld = zeros(N,1);   # change in labor demand

#---------------------------------------
# GO!
#---------------------------------------

@btime eha_wzet_combined(wzet,y,x,sig,N);

# a check
@btime @views broadcast!(.^, tau_mat, wzet[N+1], kappa);


The result is:

13.958 μs (17 allocations: 63.38 KiB)


This seems like the sort of code that will be much clearer if you just write out loops (which is fast in Julia!) and other finer-grained operatoins. Matlab teaches some weird coding habits.

In particular, I don’t see why you need tau_mat at all. If I’m reading this correctly, every column of tau_mat is identical except in the last row, so you are wasting a huge amount of computational effort in constructing this matrix, exponentiating, it summing its rows, etcetera. If you just wrote out loops, you could make it dramatically faster, and probably simpler and clearer, in addition to saving memory allocations.

PS. No need to put @views on every line where you use a slice. You can just put it once in front of the whole function, i.e.  @views function eha_wzet_combined(wzet,y,x,sig,N), then every slice will become a view.

6 Likes

For example, consider y .* ((tau_mat .* wzet[1:N]) .^ (1-sig) ./ sum( x .* (tau_mat .* wzet[1:N]) .^ (1-sig), dims = 1 )) .* wzet[1:N]', which has the structure yrowvec .* matrix ./ srowvec .* wrowvec. This is algebraically equivalent to rowvec = yrowvec ./ srowvec .* wrowvec; rowvec .* matrix, but if you combine the three row vectors first then you save O(N^2) operations.

Furthermore, the first N-1 rows of the matrix (tau_mat .* wzet[1:N]) .^ (1-sig) have N identical columns, i.e. they are of the form w o^T, where o is the vector of 1’s. If you handle the last row separately, then rowvec .* u .* o' is equivalent to rowvec .* u. Your code then sums this over the columns, which is equivalent to sum(rowvec) * u, which avoids the matrix entirely.

In other words, it looks like your algorithm is currently O(N^2), but can be re-arranged to be O(N).

PS. No need for a semicolon after every line in a Julia function.

5 Likes

What do you mean by the order of approximation when you use automatic/algorithmic differentiation (AD)?

I have not really been following the thread, but since OP is talking about an economic model then I’ll take a guess. The model solution involves some unknown nonlinear functions. One way to solve it is to approximate those nonlinear functions with essentially Taylor series approximations of various orders. A general name for the solution method is perturbation.

3 Likes

probably not, you have @views in front so it should not allocate. The allocation probably comes from how you benchmark it, are you using global variables in the expression @btime @views broadcast!(.^, tau_mat, wzet[N+1], kappa); ?

here you’re allocating a vector each iteration of the loop simply to put the index i into it. You can use i:i (a range costs no allocations) instead if you want to keep the matrix when slicing.

More generally about the sum, Steven’s post details how you could rewrite the sum as loops and get rid of most arrays completely

Thanks to all for great discussions.

@stevengj Thanks so much for your comments. The two PS’s are very useful, so I will follow them. Here are my thoughts on your main points:

In particular, I don’t see why you need tau_mat at all.

I need tau_mat because I will consider a more generalized exercise later when I can resolve this computational issue. As you pointed out, it is currently a simple matrix with a straightforward structure. In the real application, we plan to perform more complex and interesting counterfactual analysis with different values in each element of tau_mat. We have discussed carefully and thought that we should keep this tau_mat for the purpose of our theory.

Furthermore, the first N-1 rows of the matrix (tau_mat .* wzet[1:N]) .^ (1-sig) have N identical columns,

Thanks for the suggestion, but due to the point above, I don’t think I can follow this one.

This is algebraically equivalent to rowvec = yrowvec ./ srowvec .* wrowvec; rowvec .* matrix, but if you combine the three row vectors first then you save O(N^2) operations.

I don’t follow this. y is a column vector, and sum(..., dims = 1) gives a row vector. So reordering does not reduce the computational order. Correct me if I am wrong.

@zdenek_hurak @tbeason Thanks for the discussion. @tbeason’s point is perfectly correct. Speaking of Taylor expansion, I have thought about applying TaylorSeries.jl to our problem, but I have not done it yet since I thought this allocation issue is of the first order importance, whatever the package I use.

@baggepinnen Thanks for the follow-up. I am not sure about the global variable here since I am not in a function and simply benchmarking in the main code. Plus I have tried the following micro code and found that it is avoiding slicing that helps but not @views.

using BenchmarkTools
N = 44;             # number of countries
kappa = vcat(zeros(N-1,N), [ones(N-1); 0]'); # the kernel of the trade cost change
tau_mat = ones(N,N);
wzet = ones(N+1);   # initial point
zet = 1;

# baseline
# @views does not help
@btime @views broadcast!(.^, tau_mat, wzet[N+1], kappa);
# avoiding slicing does help
@btime @views broadcast!(.^, tau_mat, zet, kappa);


You can use i:i (a range costs no allocations) instead if you want to keep the matrix when slicing.

Thanks, this is very helpful. I have followed your point and the allocation reduced from 23.625 μs (179 allocations: 74.94 KiB) to 22.042 μs (91 allocations: 69.44 KiB), which I think is quite an improvement. My current MWE is the following:

using FiniteDifferences, LinearAlgebra, BenchmarkTools

#---------------------------------------
# internal functions
#---------------------------------------

# function about equilibrium condition (excess labor demand)
@views function eha_wzet_combined(wzet,y,x,kappa,sig,N,tau_mat,cma,cld)
# define the matrix dimension

# armington condition
# 1. compute the change in the market access (cma)
cma_mat = x .* (tau_mat .* wzet[1:N]) .^ (1-sig)
for i = 1:N
cma += cma_mat[i:i,:];
end
# 2. sum over the change in labor demand across destination (cld)
cld_mat = y .* (tau_mat .* wzet[1:N]) .^ (1-sig) .* wzet[1:N]' ./ cma
for j = 1:N
cld += cld_mat[:,j:j]
end
excess = wzet[1:N] - cld
return excess
end

#---------------------------------------
# parameters and data
#---------------------------------------

N = 44;             # number of countries
sig = 4;            # elasticity of substitution

homebias = 0.9;     # the size of the domestic consumption tendency
X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N);
y = X./sum(X,dims=2);
x = X./sum(X,dims=1);

kappa = vcat(zeros(N-1,N), [ones(N-1); 0]'); # the kernel of the trade cost change
tau_mat = ones(N,N);

wzet = ones(N+1);   # initial point

# pre-allocate
cma = zeros(1,N);   # change in the market access
cld = zeros(N,1);   # change in labor demand

#---------------------------------------
# GO!
#---------------------------------------

@btime eha_wzet_combined(wzet,y,x,kappa,sig,N,tau_mat,cma,cld);


Then you are using global variables in the benchmark expression. Have a look at
https://juliaci.github.io/BenchmarkTools.jl/stable/manual/#Interpolating-values-into-benchmark-expressions

the solution is to interpolate the global variables into the benchmarked expression, e.g., using the $ signs like this @btime broadcast!(.^,$tau_mat, $(wzet)[N+1],$kappa)

1 Like

the solution is to interpolate the global variables into the benchmarked expression, e.g., using the $ signs Thanks again, this is great. I will do this later when benchmarking My another concern is the efficiency of FiniteDifferences.jl. As I mentioned above, currently my function allocates 91. When I @btime jacobian(central_fdm(5,1), myfunction, myevaluationvalue), the allocation becomes 55932. So taking Jacobian allocates with more than 600 times as the evaluation of the original function. To dig this point a bit deeper, I have tried the following MWE, and found that even with this simple function that allocates zero, FiniteDifferences.jl’s Jacobian allocates 116. Is this as expected? If so, I am worried about the use of FiniteDifferences.jl since I would like to take the high order derivative as I mentioned in the beginning, so that allocation in a single Jacobian may increase at an exponential order. f(x) = x^2; @btime f(2.) # this allocates 0 @btime @views jacobian($central_fdm(5,1), $f, 2.)[1] # this allocates 116  The number of allocations is not the only important thing, the size of the allocations also matters (the amount of memory). FiniteDifferences.jl uses an adaptive method to compute accurate derivatives, are you sure you need a 5:th order method? Perhaps you could get away with central_fdm(3,1)? You may also consider which claims to provide Fast non-allocating calculations of gradients, Jacobians, and Hessians with sparsity support Actually, it looks like both of us are wrong? y = X./sum(X,dims=2) is a square matrix in your example code. So I agree that you can’t get much savings at that point (except by exploiting special structure of the matrices, but you say that your matrices will eventually not have any rank-1 structure in your real application). @baggepinnen Thanks. I have tried central_fdm(3,1), but it did not look to improve the performance well, at least in the sense of the number of allocations. Plus, we would like to compute high order derivatives such as 4th order in the future. I have also tried to use FiniteDiff.jl, but I gave it up because (i) we need to compute higher-order derivatives than Hessian, and FiniteDifferences.jl say they provide it, and (ii) the results are unstable for reasons that I don’t know. @stevengj Sorry, you are right. y’ is not either a column or row vector, but a N\times N matrix. Another point I have to consider is an efficient implementation of the sum of the terms that contain second-order derivatives. Specifically, I need to compute \sum_{j=1}^{N-1}\frac{\partial D_{w}F(w,\zeta)}{\partial w_{j}}\frac{dg_{j}(\zeta)}{d\zeta}, where w \in \mathbb{R}^{N-1} is a vector, D_{w}F(w,\zeta) is a Jacobian matrix of a function F(\cdot, \zeta): \mathbb{R}^{N-1} \rightarrow \mathbb{R} with respect to w for any \zeta, and g is a N-1 vector of function g_j: \mathbb{R}^{N-1} \rightarrow \mathbb{R} for j=1,\ldots,N-1. My current implementation is the following, using a function sum_dDyFdyj_D11. As I mentioned at the beginning of this thread, we want N=44 in the end, but let’s do N=3 for now. The last line, @btime sum_dDyFdyj_D11($wzet,$N,$ddw,$Dg1), yields 4.741 ms (108907 allocations: 8.56 MiB). Given the large number of allocations in my function eha_wzet_combined and FiniteDifferences.jl, I am not particularly surprised by >100k allocations. But do you think my implementation of the sum of the terms involving second derivatives is efficient? Should it be improved? Thanks so much for your patience! #--------------------------------------- # housekeeping #--------------------------------------- using FiniteDifferences, Profile, LinearAlgebra, BenchmarkTools #--------------------------------------- # internal functions #--------------------------------------- # function about equilibrium condition (excess labor demand) @views function eha_wzet_combined(wzet,y,x,kappa,sig,N,tau_mat,cma,cld) # define the matrix dimension broadcast!(.^, tau_mat, wzet[N+1], kappa) # armington condition # 1. compute the change in the market access (cma) cma_mat = x .* (tau_mat .* wzet[1:N]) .^ (1-sig) for i = 1:N cma += cma_mat[i:i,:]; end # 2. sum over the change in labor demand across destination (cld) cld_mat = y .* (tau_mat .* wzet[1:N]) .^ (1-sig) .* wzet[1:N]' ./ cma for j = 1:N cld += cld_mat[:,j:j] end excess = wzet[1:N] - cld return excess end # function to compute the sum of 2nd order derivatives @views function sum_dDyFdyj_D11(wzet,N,ddw,Dg1) out = ddw[1](wzet)[1:N-1,1:N-1] * Dg1(wzet)[1] if N > 2 for j in 2:N-1 out += ddw[j](wzet)[1:N-1,1:N-1] * Dg1(wzet)[j]; end end return out end #--------------------------------------- # parameters and data #--------------------------------------- N = 3; # number of countries sig = 4; # elasticity of substitution homebias = 0.9; # the size of the domestic consumption tendency X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N); y = X./sum(X,dims=2); x = X./sum(X,dims=1); # trade cost setting kappa = vcat(zeros(N-1,N), [ones(N-1); 0]'); # the kernel of the trade cost change tau_mat = ones(N,N); # pre-allocate the tau matrix wzet = ones(N+1); # initial point # pre-allocate cma = zeros(1,N); # change in the market access cld = zeros(N,1); # change in labor demand #--------------------------------------- # 1st order solution #--------------------------------------- ed(wzet) = eha_wzet_combined(wzet,y,x,kappa,sig,N,tau_mat,cma,cld); @views dfwzet(wzet) = jacobian(central_fdm(5,1), ed, wzet)[1]; @views Dg1(wzet) = - dfwzet(wzet)[1:N-1,1:N-1] \ dfwzet(wzet)[1:N-1,N+1]; @btime Dg1(wzet) #--------------------------------------- # 2nd order solution #--------------------------------------- @views dfzet = wzet -> dfwzet(wzet)[:,N+1]; @views ddfzwzet(wzet) = jacobian(central_fdm(5,1), dfzet, wzet)[1]; ddw = Array{Function}(undef, N-1) for j in 1:N-1 @views dfwj = wzet -> dfwzet(wzet)[:,j]; @views ddw[j] = wzet -> jacobian(central_fdm(5,1), dfwj, wzet)[1] end @btime sum_dDyFdyj_D11($wzet,$N,$ddw,$Dg1)  1 Like Hi again, I have been struggling with the problem. After a break, I tried to look at the problem from a different angle. Namely, I tried to compute two mathematically equivalent but programmatically different functions and compare the performance. Below, es_loop computes my equilibrium condition using the loop that @baggepinnen suggested, but es_sum does the same using sum function that I used to use in Matlab The result shows the same number of allocations with slightly better computation time and total memory size. Could you help me figure out why this is the case? Why does not the loop method dominate the sum method when loop is said to be more efficient than sum function? Am I missing something? Thanks again! Daisuke #--------------------------------------- # housekeeping #--------------------------------------- using FiniteDifferences, Profile, LinearAlgebra, BenchmarkTools #--------------------------------------- # internal functions #--------------------------------------- # function about equilibrium condition (excess labor demand) @views function es_loop(wzet,y,x,kappa,sig,N,tau_mat,cma,cld) # define the matrix dimension broadcast!(.^, tau_mat, wzet[N+1], kappa) # armington condition # 1. compute the change in the market access (cma) cma_mat = x .* (tau_mat .* wzet[1:N]) .^ (1-sig) for i = 1:N cma += cma_mat[i:i,:]; end # 2. sum over the change in labor demand across destination (cld) cld_mat = y .* (tau_mat .* wzet[1:N]) .^ (1-sig) .* wzet[1:N]' ./ cma for j = 1:N cld += cld_mat[:,j:j] end excess = wzet[1:N] - cld return excess end @views function es_sum(wzet,y,x,kappa,sig,N,tau_mat) # define the matrix dimension #@views tau_mat = vcat(ones(N-1,N), [wzet[N+1]*ones(N-1); 1]'); broadcast!(.^, tau_mat, wzet[N+1], kappa); # armington condition excess = wzet[1:N] - sum( y .* ((tau_mat .* wzet[1:N]) .^ (1-sig) ./ sum( x .* (tau_mat .* wzet[1:N]) .^ (1-sig), dims = 1 )) .* wzet[1:N]', dims = 2 )[:,1]; return excess end #--------------------------------------- # parameters and data #--------------------------------------- N = 3; # number of countries sig = 4; # elasticity of substitution homebias = 0.9; # the size of the domestic consumption tendency X = homebias * Matrix{Float64}(I,N,N) + (1-homebias) * 1/N * ones(N,N); y = X./sum(X,dims=2); x = X./sum(X,dims=1); # trade cost setting kappa = vcat(zeros(N-1,N), [ones(N-1); 0]'); # the kernel of the trade cost change tau_mat = ones(N,N); # pre-allocate the tau matrix wzet = ones(N+1); # initial point # pre-allocate cma = zeros(1,N); # change in the market access cld = zeros(N,1); # change in labor demand #--------------------------------------- # check performance #--------------------------------------- @btime$es_loop(wzet,y,x,kappa,sig,N,tau_mat,cma,cld);

@btime $es_sum(wzet,y,x,kappa,sig,N,tau_mat);  The results from @btime$es_loop is
463.832 ns (9 allocations: 816 bytes)

The results from @btime \$es_sum is
372.766 ns (9 allocations: 576 bytes)

You might have to use a dot to make the operation in-place

cma .+= cma_mat[i:i,:];
`