Fortran calling Julia. Julia 10x slower than Fortran

Fortran calling Julia. Julia 10x slower than Fortran

Dear Julia community,

I am a beginner programmer and I enjoyed learning and coding in Julia a lot.

I have been working on a Julia interface for Finite Element Analysis Program (FEAP), that is in Fortran, and we want to use Julia subroutines for material models. The end goal is to use the program Fortran + Julia in a HPC and the Julia functions will be called millions of times that is why performance is crucial.

I made a minimal working example of calling Julia from Fortran, by making some adaptations to the chapter Embedding Julia · The Julia Language. The Fortran program starts Julia and load the Julia Cfunctions, after that it calls cwrapper_foo1 (Julia Cfunction), then calls the equivalent Fortran subroutine fortran_foo1 that does the same operations, both functions are called n = 10^7 times.

Results by reading the system time and calculating the differences:

  • Julia run: ~9.0 seconds
  • Fortran run: ~0.5 seconds

(using valgrind I get a similar ratio)

When benchmarking just the Julia code with BenchmarkTools.jl, I get that around half of the time is spent in unsafe_load and unsafe_store!.

So my question is:

  1. Is there a better way to transfer or share the data between Fortran-Julia to get better performance?
  2. Does jl_init() and jl_atexit_hook(0) be call once or every time the Julia Cfunction is called?
  3. Is it possible to remove the cwrapper and call the Julia Cfunction directly from Fortran?
  4. When Profiling, what valgrind shows is not understandable after calling a Julia function. What would be a proper way to profile Fortran/C + Julia?


Julia Cfunctions. File: cfunctions.jl
function foo1(
    A::Ptr{Cdouble}, B::Ptr{Cdouble}, C::Ptr{Cdouble},          # input pointers
    X::Ptr{Cdouble}, Y::Ptr{Cdouble},                           # output pointers
    N::Ptr{Cint}
)
    # Loading parameters
    # *************************************************************************************
    n = unsafe_load(N, 1)

    a1 = unsafe_load(A, 1)
    a2 = unsafe_load(A, 2)

    b1 = Array{Float64, 1}(undef, 6)
    b2 = Array{Float64, 1}(undef, 6)
    for i = 1:6
        b1[i] = unsafe_load(B, i)
        b2[i] = unsafe_load(B, i+6)
    end # i 

    c1 = Array{Float64, 2}(undef, 3, 3)
    c2 = Array{Float64, 2}(undef, 3, 3)
    c3 = Array{Float64, 2}(undef, 3, 3)
    c4 = Array{Float64, 2}(undef, 3, 3)
    k = 0
    for i = 1:3, j = 1:3
        k = k + 1
        c1[i,j] = unsafe_load(C, k + 9*0)
        c2[i,j] = unsafe_load(C, k + 9*1)
        c3[i,j] = unsafe_load(C, k + 9*2)
        c4[i,j] = unsafe_load(C, k + 9*3)
    end # i, j 

    # Calculations
    # *************************************************************************************
    E = a1
    ν = a2
    λ = E * ν / ((1 + ν) * (1 - 2 * ν))
    μ = E / (2 * (1 + ν))

    y = zeros(Float64, 6, 6)
    for i = 1:3
        y[i,i] = 2 * μ + λ
        y[i+3,i+3] = μ
    end # i
    y[1,2] = λ
    y[1,3] = λ
    y[2,1] = λ
    y[2,3] = λ
    y[3,1] = λ
    y[3,2] = λ

    x = zeros(Float64, 6)
    x[1] = (2 * μ + λ) * b1[1] + λ * b1[2] + λ * b1[3]
    x[2] = (2 * μ + λ) * b1[2] + λ * b1[1] + λ * b1[3]
    x[3] = (2 * μ + λ) * b1[3] + λ * b1[2] + λ * b1[1]
    x[4] = 2 * μ * b1[4] 
    x[5] = 2 * μ * b1[5] 
    x[6] = 2 * μ * b1[6] 

    # Storing parameters
    # *************************************************************************************
    for i = 1:6
        unsafe_store!(X, x[i], i)
    end # i
    k = 0
    for i = 1:6, j = 1:6
        k = k + 1
        unsafe_store!(Y, y[i,j], k)
    end # i, j

    return nothing
end # foo1

const julia_foo1 = @cfunction(
    foo1, 
    Cvoid, 
    (
        Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, 
        Ptr{Cdouble}, Ptr{Cdouble},
        Ptr{Cint},
    )
)
C intermediate layer. File: c_layer.c
#include <julia.h>
JULIA_DEFINE_FAST_TLS()

#include <stdio.h>
#include <stdint.h>

void *get_cfunction_pointer(const char *name);
void (*c_foo1_)(double *a, double **b, double ***c, double *x, double **y, int *n);

int julia_initialize_(){
    jl_init();
    printf("Start Julia\n");

    return 0;
}

int julia_exit_(){
    jl_atexit_hook(0);
    printf("Exit Julia\n");

    return 0;
}

int julia_load_cfunctions_(){
    printf("Loading Julia @cfunctions\n");

    /* import the file cfunctions.jl */
    jl_eval_string("Base.include(Main, \"cfunctions.jl\")");

    /* convert Julia function to C function with @cfuntion */
    c_foo1_ = get_cfunction_pointer("julia_foo1");

    printf("Finish loading Julia @cfunction\n");

    return 0;
}

void *get_cfunction_pointer(const char *name){
    void *p = 0;
    jl_value_t *boxed_pointer = jl_get_global(jl_main_module, jl_symbol(name));
    p = jl_unbox_voidpointer(boxed_pointer);

    return p;
}

int cwrapper_foo1_(double *a, double **b, double ***c, double *x, double **y, int *n){
    c_foo1_(a, b, c, x, y, n);

    return 0;
}
Fortran program. File: main.f
      program main
      implicit none
      real (kind=8) :: a(2), b(6,2), c(3,3,4)       ! input 
      real (kind=8) :: x(6), y(6,6), x1(6), y1(6,6) ! output 
      integer       :: n
      integer       :: i,j,k

      integer       :: t1, t2, rate
      real (kind=8) :: time

      ! Initialize parameters
      n = 10**7
      a = 0.0
      b = 0.0
      c = 0.0
      x = 0.0
      y = 0.0
      x1 = 0.0
      y1 = 0.0
      a(1) = 1000
      a(2) = 0.30
      b(1,1) = 0.02
      b(2,1) = -0.006
      b(3,1) = -0.006

      ! Start Julia and load Julia Cfunctions
      call julia_initialize()
      call julia_load_cfunctions()
      
      ! Compile Julia function
      call cwrapper_foo1(a, b, c, x, y, n)

      write(*,*) 'n = ', n
      ! Julia loop
      ! *************************************************************
      call system_clock(t1)

      do i=1,n
      ! more Fortran code... lots lots of operations!!!
      call cwrapper_foo1(a, b, c, x, y, n)
      !call c_foo1(a, b, c, x, y, n) ! does not work atm
      ! more Fortran code...
      end do ! i

      call system_clock(t2, rate)
      time = (t2 - t1) / float(rate)
      write(*,*) 'julia run: ', time, 'seconds'

      ! Fortran loop
      ! *************************************************************
      call system_clock(t1)

      do i=1,n
      ! more Fortran code... many operations!!
      call fortran_foo1(a, b, c, x1, y1, n)
      ! more Fortran code...
      end do ! i

      call system_clock(t2, rate)
      time = (t2 - t1) / float(rate)
      write(*,*) 'fortran run: ', time, 'seconds'

      !write(*,*) x, y
      !write(*,*) x1, y1

      call julia_exit()

      end program main



      subroutine fortran_foo1(a, b, c, x, y, n)
      implicit none
      real (kind=8) :: a(2), b(6,2), c(3,3,4) ! input 
      real (kind=8) :: x(6), y(6,6)           ! output
      integer       :: n 
      integer       :: i,j,k
      real (kind=8) :: E, nu, lambda, mu
      
      E = a(1)
      nu = a(2)
      lambda = E*nu / ((1 + nu)*(1 - 2*nu))
      mu = E / (2*(1 + nu))

      do i=1,3
            y(i,i) = 2*mu + lambda
            y(i+3,i+3) = mu
      end do ! i

      y(1,2) = lambda
      y(1,3) = lambda
      y(2,1) = lambda
      y(2,3) = lambda
      y(3,1) = lambda
      y(3,2) = lambda

      x(1) = (2*mu + lambda)*b(1,1) + lambda*b(2,1) + lambda*b(3,1)
      x(2) = (2*mu + lambda)*b(2,1) + lambda*b(1,1) + lambda*b(3,1)
      x(3) = (2*mu + lambda)*b(3,1) + lambda*b(2,1) + lambda*b(1,1)
      x(4) = 2*mu*b(4,1)
      x(5) = 2*mu*b(5,1)
      x(6) = 2*mu*b(6,1)

      end subroutine fortran_foo1 
makefile
FF=gfortran
CC=gcc

JL_SHARE = $(shell julia -e 'print(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia"))')
CFLAGS   += $(shell $(JL_SHARE)/julia-config.jl --cflags)
CXXFLAGS += $(shell $(JL_SHARE)/julia-config.jl --cflags)
LDFLAGS  += $(shell $(JL_SHARE)/julia-config.jl --ldflags)
LDLIBS   += $(shell $(JL_SHARE)/julia-config.jl --ldlibs)

all: main

main: main.o c_layer.o
	$(FF) -o $@ $^ $(LDFLAGS) $(LDLIBS)

%.o: %.f
	$(FF) -c $< -o $@ -g

%.o: %.c
	$(CC) -c $(CFLAGS) $< -o $@ -g

.PHONY: main clean

clean:
	rm -f main *.o
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

Any other recommendation is welcome :smiley:

Thank you very much in advance for the help!

2 Likes

First of all, welcome to Julia.

To approach this problem, I think we need to simplify and just focus on the Julia portion first. There are two things that I would immediately consider.

  1. I would recommend using unsafe_wrap rather than unsafe_load. This way you do not have to duplicate the data as you are doing now by loading data via pointer into a Julia array. unsafe_wrap allows you to create a Julia array that points at existing memory.

  2. Your code is not in a module with precompile statements. This means the time that you are measuring includes compilation. For your Fortran code, did you include the time it took to compile the program? I am guessing not. Julia uses modules to cache compilation.

My recommendation is to first write rewrite the Fortran function main.f completely in Julia, and then optimize that. Once we have a performant Julia program, we can work on executing that efficiently and passing data to Julia from Fortran.

3 Likes

Thank you for the quick response.

I did not use unsafe_wrap because one pointer stores several objects in my case. In the example I called them c1, c2, ..., but in reality would be strain, stress, .... Is it possible to use unsafe_wrap to do this?

Thanks! I will read the post. At the moment, the julia function is ran one time before entering the “benchmark” loop, is this not enough?

I did not understood this part. Do you mean doing a Julia callback Fortran? with ccall.

1 Like

Yes. Basically, I would address the contiguous memory block as a single Julia array using unsafe_wrap. Then I would use @view or view to obtain the SubArrays of that Julia array as you need.

This is not sufficient since you are no longer within the same Julia session. The compiled code is only cached in memory in this case. Without a module, Julia does not have a way of caching the compiled code to disk.

I mean let’s write this entire example in only Julia to make sure that we have an example of performant Julia code. This forum will be most useful in helping you to do that.

Once we have a fast Julia function, then we can figure out how to call it efficiently from Fortran.

1 Like

Using unsafe_wrap was a big improvement. But I did not see any difference between making it into a Julia module or adding precompile statements.

Before:

 julia run:    9.5559997558593750      seconds
 fortran run:   0.37900000810623169      seconds

Using unsafe_wrap and @views:

 julia run:    4.8889999389648438      seconds
 fortran run:   0.35600000619888306      seconds

Do you know what other improvement can I make to make loading/storing have better performance? unsafe_wrap/store! takes around 60% of the time in the Julia code.

How are you loading the Julia code? You should use import or using to include the module that is on your LOAD_PATH rather than using include to load the code.

https://docs.julialang.org/en/v1/manual/modules/#Standalone-using-and-import

Could you please post an updated version of your Julia code?

Yes, I am using using, for that I also had to change the file c_layer.c to be able to do this.

Here are the updated files:

c_layer.c
#include <julia.h>
JULIA_DEFINE_FAST_TLS()

#include <stdio.h>
#include <stdint.h>

typedef void (*foo_ptr)(double *, double **, double ***, double *, double **, int *);
foo_ptr c_foo1_ = NULL;

int julia_initialize_(){
    jl_init();
    printf("Start Julia\n");

    return 0;
}

int julia_exit_(){
    jl_atexit_hook(0);
    printf("Exit Julia\n");

    return 0;
}

int julia_load_cfunctions_(){
    printf("Loading Julia @cfunctions\n");

    /* import Julia module FortranCallJulia */
    jl_eval_string("push!(LOAD_PATH,\".\")");
    jl_eval_string("using FortranCallJulia");

    /* get Julia function */
    jl_value_t *boxed_pointer = jl_eval_string("@cfunction(foo1, Cvoid, (Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cint},))");
    c_foo1_ = (foo_ptr) jl_unbox_voidpointer(boxed_pointer);

    printf("Finish loading Julia @cfunction\n");

    return 0;
}

int cwrapper_foo1_(double *a, double **b, double ***c, double *x, double **y, int *n){
    c_foo1_(a, b, c, x, y, n);

    return 0;
}
The file cfunction.jl\ replaced for FortranCallJulia.jl
module FortranCallJulia

export julia_foo1, foo1

function foo1(
    A::Ptr{Cdouble}, B::Ptr{Cdouble}, C::Ptr{Cdouble},          # input pointers
    X::Ptr{Cdouble}, Y::Ptr{Cdouble},                           # output pointers
    N::Ptr{Cint}
)

    # Loading parameters
    # *************************************************************************************
    n = unsafe_load(N, 1)

    a1 = unsafe_load(A, 1); a2 = unsafe_load(A, 2)

    b = unsafe_wrap(Array{Float64, 2}, B, (6, 2), own=false)
    b1 = @views b[:, 1]; b2 = @views b[:, 2]

    c = unsafe_wrap(Array{Float64, 3}, C, (3, 3, 4), own=false)
    c1 = @views c[:, :, 1]; c2 = @views c[:, :, 2]; c3 = @views c[:, :, 3]; c4 = @views c[:, :, 4]

    # Calculations
    # *************************************************************************************
    E = a1
    ν = a2
    λ = E * ν / ((1 + ν) * (1 - 2 * ν))
    μ = E / (2 * (1 + ν))

    y = zeros(Float64, 6, 6)
    for i = 1:3
        y[i,i] = 2 * μ + λ
        y[i+3,i+3] = μ
    end # i
    y[1,2] = λ
    y[1,3] = λ
    y[2,1] = λ
    y[2,3] = λ
    y[3,1] = λ
    y[3,2] = λ

    x = zeros(Float64, 6)
    x[1] = (2 * μ + λ) * b1[1] + λ * b1[2] + λ * b1[3]
    x[2] = (2 * μ + λ) * b1[2] + λ * b1[1] + λ * b1[3]
    x[3] = (2 * μ + λ) * b1[3] + λ * b1[2] + λ * b1[1]
    x[4] = 2 * μ * b1[4] 
    x[5] = 2 * μ * b1[5] 
    x[6] = 2 * μ * b1[6] 
    
    # Storing parameters
    # *************************************************************************************
    for i = 1:6
        unsafe_store!(X, x[i], i)
    end # i
    k = 0
    for i = 1:6, j = 1:6
        k = k + 1
        unsafe_store!(Y, y[i,j], k)
    end # i, j

    return nothing
end # foo1

const julia_foo1 = @cfunction(
    foo1, 
    Cvoid, 
    (
        Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, 
        Ptr{Cdouble}, Ptr{Cdouble},
        Ptr{Cint},
    )
)

precompile(foo1, (Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, 
    Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cint})
)
#precompile(+, (Float64, Float64))
#precompile(*, (Float64, Float64))
#precompile(unsafe_load, (Ptr{Cdouble}, Int64))
#precompile(unsafe_wrap, (Array{Float64, 3}, Ptr{Cdouble}, Int64, Int64, Int64))
#precompile(unsafe_wrap, (Array{Float64, 2}, Ptr{Cdouble}, Int64, Int64))
end # FortranCallJulia
2 Likes

You could also unsafe_wrap the output arrays X and Y.

Are the arguments for cwrapper_foo1 correct? I believe that even the multidimensional arrays should be single pointers, not double or triple pointers. In both Julia and Fortran, arrays are contiguous sequences of memory. A multidimensional array is not an array of arrays.

You are correct, they should be single pointers. I have incorporate both of your comments now.
I am understanding more how it works. Thank you!

I did not saw any improvement in performance with the last two changes.

Is it possible to use @cfunction on the julia side and call that function from C? I think that way will be more clear.

FortranCallJulia.jl
module FortranCallJulia

export julia_foo1, foo1

function foo1(
    A::Ptr{Cdouble}, B::Ptr{Cdouble}, C::Ptr{Cdouble},          # input pointers
    X::Ptr{Cdouble}, Y::Ptr{Cdouble},                           # output pointers
    N::Ptr{Cint}
)

    # Loading parameters
    # *************************************************************************************
    n = unsafe_load(N, 1)

    a1 = unsafe_load(A, 1)
    a2 = unsafe_load(A, 2)

    b = unsafe_wrap(Array{Float64, 2}, B, (6, 2), own=false)
    b1 = @view b[:, 1]
    b2 = @view b[:, 2]

    c = unsafe_wrap(Array{Float64, 3}, C, (3, 3, 4), own=false)
    c1 = @view c[:, :, 1]
    c2 = @view c[:, :, 2]
    c3 = @view c[:, :, 3]
    c4 = @view c[:, :, 4]

    x = unsafe_wrap(Array{Float64, 1}, X, 6, own=false)
    y = unsafe_wrap(Array{Float64, 2}, Y, (6, 6), own=false)

    # Calculations
    # *************************************************************************************
    E = a1
    ν = a2
    λ = E * ν / ((1 + ν) * (1 - 2 * ν))
    μ = E / (2 * (1 + ν))

    for i = 1:3
        y[i,i] = 2 * μ + λ
        y[i+3,i+3] = μ
    end # i
    y[1,2] = λ
    y[1,3] = λ
    y[2,1] = λ
    y[2,3] = λ
    y[3,1] = λ
    y[3,2] = λ

    x[1] = (2 * μ + λ) * b1[1] + λ * b1[2] + λ * b1[3]
    x[2] = (2 * μ + λ) * b1[2] + λ * b1[1] + λ * b1[3]
    x[3] = (2 * μ + λ) * b1[3] + λ * b1[2] + λ * b1[1]
    x[4] = 2 * μ * b1[4] 
    x[5] = 2 * μ * b1[5] 
    x[6] = 2 * μ * b1[6] 

    return nothing
end # foo1

const julia_foo1 = @cfunction(
    foo1, 
    Cvoid, 
    (
        Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, 
        Ptr{Cdouble}, Ptr{Cdouble},
        Ptr{Cint},
    )
)

precompile(foo1, (Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, 
    Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cint})
)
#precompile(+, (Float64, Float64))
#precompile(*, (Float64, Float64))
#precompile(unsafe_load, (Ptr{Cdouble}, Int64))
#precompile(unsafe_wrap, (Array{Float64, 3}, Ptr{Cdouble}, Int64, Int64, Int64))
#precompile(unsafe_wrap, (Array{Float64, 2}, Ptr{Cdouble}, Int64, Int64))
end # FortranCallJulia
c_layer.c
#include <julia.h>
JULIA_DEFINE_FAST_TLS()

#include <stdio.h>
#include <stdint.h>

typedef void (*foo_ptr)(double *, double *, double *, double *, double *, int *);
foo_ptr c_foo1_ = NULL;

int julia_initialize_(){
    jl_init();
    printf("Start Julia\n");

    return 0;
}

int julia_exit_(){
    jl_atexit_hook(0);
    printf("Exit Julia\n");

    return 0;
}

int julia_load_cfunctions_(){
    printf("Loading Julia @cfunctions\n");

    /* import Julia module FortranCallJulia */
    jl_eval_string("push!(LOAD_PATH,\".\")");
    jl_eval_string("using FortranCallJulia");

    /* get Julia function */
    jl_value_t *boxed_pointer = jl_eval_string("@cfunction(foo1, Cvoid, (Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cint},))");
    c_foo1_ = (foo_ptr) jl_unbox_voidpointer(boxed_pointer);

    printf("Finish loading Julia @cfunction\n");

    return 0;
}

int cwrapper_foo1_(double *a, double *b, double *c, double *x, double *y, int *n){
    c_foo1_(a, b, c, x, y, n);

    return 0;
}
1 Like

Yes. All it produces is a C function pointer. You may need to use it in the module __init__, however, and store it in a const Ref{Ptr{nothing}}. I’m not sure you actually need cwrapper_foo_1_.

First I wanted to try to just make this a Julia problem, so we had something to optimize entirely within Julia:

julia> A = [1000, 0.30]
2-element Vector{Float64}:
 1000.0
    0.3

julia> B = zeros(6,2); B[1:3] = [0.02, -0.006, -0.006]; B
6×2 Matrix{Float64}:
  0.02   0.0
 -0.006  0.0
 -0.006  0.0
  0.0    0.0
  0.0    0.0
  0.0    0.0

julia> C = zeros(3,3,4)
3×3×4 Array{Float64, 3}:
[:, :, 1] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 2] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 3] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 4] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> X = zeros(6); Y = zeros(6,6);

julia> N = 10^7
10000000

julia> foo1_jl(A, B, C, X, Y, N)
0.0

julia> X
6-element Vector{Float64}:
 20.0
  0.0
  0.0
  0.0
  0.0
  0.0

julia> Y
6×6 Matrix{Float64}:
 1346.15    576.923   576.923    0.0      0.0      0.0
  576.923  1346.15    576.923    0.0      0.0      0.0
  576.923   576.923  1346.15     0.0      0.0      0.0
    0.0       0.0       0.0    384.615    0.0      0.0
    0.0       0.0       0.0      0.0    384.615    0.0
    0.0       0.0       0.0      0.0      0.0    384.615

julia> @benchmark foo1_jl($A, $B, $C, $X, $Y, $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.443 ns (0.00% GC)
  median time:      34.462 ns (0.00% GC)
  mean time:        35.627 ns (0.00% GC)
  maximum time:     57.976 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> 1e9*1e-7
100.0

julia> foo1_benchmark(a, b, c, x, y, n) = for i=1:n; foo1_jl(a, b, c, x, y, n); end;

julia> @time foo1_benchmark(A, B, C, X, Y, N)
  0.374192 seconds (13.95 k allocations: 908.005 KiB, 2.78% compilation time)

Now having the function completely in Julia, I wanted to take a look at how it compiled, and I noticed pretty quickly that it was doing out of bounds checking for every indexing operation. As you can see in references to oob in the LLVM output:

@code_llvm output
julia> @code_llvm foo1_jl(A, B, C, X, Y, N)
;  @ REPL[273]:1 within `foo1_jl'
define double @julia_foo1_jl_2911({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, {}* nonnull align 16 dereferenceable(40) %3, {}* nonnull align 16 dereferenceable(40) %4, i64 signext %5) {
top:
  %gcframe290 = alloca [8 x {}*], align 16
  %gcframe290.sub = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe290, i64 0, i64 0
  %6 = bitcast [8 x {}*]* %gcframe290 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(64) %6, i8 0, i32 64, i1 false)
  %7 = alloca { [1 x [1 x i64]], i64 }, align 8
  %8 = alloca { [1 x [1 x i64]], i64 }, align 8
  %9 = alloca { [1 x [1 x i64]], [1 x [1 x i64]], i64 }, align 8
  %10 = alloca { [1 x [1 x i64]], [1 x [1 x i64]], i64 }, align 8
  %11 = alloca { [1 x [1 x i64]], [1 x [1 x i64]], i64 }, align 8
  %12 = alloca { [1 x [1 x i64]], [1 x [1 x i64]], i64 }, align 8
  %13 = alloca [1 x i64], align 8
  %14 = alloca [1 x i64], align 8
  %15 = alloca [1 x i64], align 8
  %16 = alloca [1 x i64], align 8
  %17 = alloca [1 x i64], align 8
  %18 = alloca [1 x i64], align 8
  %19 = alloca { {}*, { [1 x [1 x i64]], i64 }, i64, i64 }, align 8
  %20 = alloca { {}*, { [1 x [1 x i64]], i64 }, i64, i64 }, align 8
  %21 = alloca { {}*, { [1 x [1 x i64]], i64 }, i64, i64 }, align 8
  %22 = alloca { {}*, { [1 x [1 x i64]], i64 }, i64, i64 }, align 8
  %23 = alloca { {}*, { [1 x [1 x i64]], i64 }, i64, i64 }, align 8
  %24 = alloca { {}*, { [1 x [1 x i64]], i64 }, i64, i64 }, align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #7
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -32768
;  @ REPL[273]:2 within `foo1_jl'
; ┌ @ array.jl:801 within `getindex'
   %25 = bitcast [8 x {}*]* %gcframe290 to i64*
   store i64 24, i64* %25, align 16
   %26 = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe290, i64 0, i64 1
   %27 = bitcast i8* %ptls_i8 to i64*
   %28 = load i64, i64* %27, align 8
   %29 = bitcast {}** %26 to i64*
   store i64 %28, i64* %29, align 8
   %30 = bitcast i8* %ptls_i8 to {}***
   store {}** %gcframe290.sub, {}*** %30, align 8
   %31 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
   %32 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %31, i64 0, i32 1
   %33 = load i64, i64* %32, align 8
   switch i64 %33, label %idxend2 [
    i64 0, label %oob
    i64 1, label %oob1
  ]

  ...

oob:                                              ; preds = %top
; └
;  @ REPL[273]:2 within `foo1_jl'
; ┌ @ array.jl:801 within `getindex'
   %113 = alloca i64, align 8
   store i64 1, i64* %113, align 8
   call void @jl_bounds_error_ints({}* %0, i64* nonnull %113, i64 1)
   unreachable

oob1:                                             ; preds = %top
; └
;  @ REPL[273]:3 within `foo1_jl'
; ┌ @ array.jl:801 within `getindex'
   %114 = alloca i64, align 8
   store i64 2, i64* %114, align 8
   call void @jl_bounds_error_ints({}* %0, i64* nonnull %114, i64 1)
   unreachable

This suggested that we should indicate to Julia that it should consider the getindex operations as inbounds as follows:

foo1_jl with @inbounds
julia> function foo1_jl(a, b, c, x, y, n)
           @inbounds begin
           a1 = a[1]
           a2 = a[2]
           b1 = @view b[:, 1]
           b2 = @view b[:, 2]
           c1 = @view c[:,:,1]
           c2 = @view c[:,:,2]
           c3 = @view c[:,:,3]
           c4 = @view c[:,:,4]
           end
           
           # Calculations
           # *************************************************************************************
           E = @inbounds a[1]
           ν = @inbounds a[2]
           λ = E * ν / ((1 + ν) * (1 - 2 * ν))
           μ = E / (2 * (1 + ν))

           @inbounds for i = 1:3
               y[i,i] = 2 * μ + λ
               y[i+3,i+3] = μ
           end # i
           @inbounds begin
           y[1,2] = λ
           y[1,3] = λ
           y[2,1] = λ
           y[2,3] = λ
           y[3,1] = λ
           y[3,2] = λ

           x[1] = (2 * μ + λ) * b1[1] + λ * b1[2] + λ * b1[3]
           x[2] = (2 * μ + λ) * b1[2] + λ * b1[1] + λ * b1[3]
           x[3] = (2 * μ + λ) * b1[3] + λ * b1[2] + λ * b1[1]
           x[4] = 2 * μ * b1[4] 
           x[5] = 2 * μ * b1[5] 
           x[6] = 2 * μ * b1[6]
           end 
       end
foo1_jl (generic function with 3 methods)
julia> @benchmark foo1_jl($A, $B, $C, $X, $Y, $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.973 ns (0.00% GC)
  median time:      14.991 ns (0.00% GC)
  mean time:        15.094 ns (0.00% GC)
  maximum time:     36.786 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @time foo1_benchmark(A, B, C, X, Y, N)
  0.157990 seconds (13.94 k allocations: 907.380 KiB, 6.36% compilation time)

This halves the time from 0.374192 seconds without @inbounds to 0.157990 seconds with @inbounds.

The compiled code looks cleaner as well:

@code_llvm output with @inbounds
julia> @code_llvm foo1_jl(A, B, C, X, Y, N)
;  @ REPL[277]:1 within `foo1_jl'
define double @julia_foo1_jl_2963({}* nonnull align 16 dereferenceable(40) %0, {}* nonnull align 16 dereferenceable(40) %1, {}* nonnull align 16 dereferenceable(40) %2, {}* nonnull align 16 dereferenceable(40) %3, {}* nonnull align 16 dereferenceable(40) %4, i64 signext %5) {
top:
;  @ REPL[277]:15 within `foo1_jl'
; ┌ @ array.jl:801 within `getindex'
   %6 = bitcast {}* %0 to double**
   %7 = load double*, double** %6, align 8
   %8 = load double, double* %7, align 8
; └
;  @ REPL[277]:16 within `foo1_jl'
; ┌ @ array.jl:801 within `getindex'
   %9 = getelementptr inbounds double, double* %7, i64 1
   %10 = load double, double* %9, align 8
; └
;  @ REPL[277]:17 within `foo1_jl'
; ┌ @ float.jl:332 within `*'
   %11 = fmul double %8, %10
; └
; ┌ @ promotion.jl:321 within `+' @ float.jl:326
   %12 = fadd double %10, 1.000000e+00
; └
; ┌ @ promotion.jl:322 within `*' @ float.jl:332
   %13 = fmul double %10, 2.000000e+00
; └
; ┌ @ promotion.jl:323 within `-' @ float.jl:329
   %14 = fsub double 1.000000e+00, %13
; └
; ┌ @ float.jl:332 within `*'
   %15 = fmul double %12, %14
; └
; ┌ @ float.jl:335 within `/'
   %16 = fdiv double %11, %15
; └
;  @ REPL[277]:18 within `foo1_jl'
; ┌ @ promotion.jl:322 within `*' @ float.jl:332
   %17 = fmul double %12, 2.000000e+00
; └
; ┌ @ float.jl:335 within `/'
   %18 = fdiv double %8, %17
; └
;  @ REPL[277]:21 within `foo1_jl'
; ┌ @ promotion.jl:322 within `*' @ float.jl:0
   %19 = fmul double %18, 2.000000e+00
; └
; ┌ @ float.jl within `+'
   %20 = fadd double %19, %16
; └
; ┌ @ array.jl within `setindex!'
   %21 = bitcast {}* %4 to {}**
   %22 = getelementptr inbounds {}*, {}** %21, i64 3
   %23 = bitcast {}** %22 to i64*
   %24 = load i64, i64* %23, align 8
   %25 = bitcast {}* %4 to double**
   %26 = load double*, double** %25, align 8
; └
; ┌ @ array.jl:841 within `setindex!'
   store double %20, double* %26, align 8
; └
;  @ REPL[277]:22 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %27 = mul i64 %24, 3
   %28 = add i64 %27, 3
   %29 = getelementptr inbounds double, double* %26, i64 %28
   store double %18, double* %29, align 8
; └
;  @ REPL[277]:21 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %30 = add nuw nsw i64 %24, 1
   %31 = getelementptr inbounds double, double* %26, i64 %30
   store double %20, double* %31, align 8
; └
;  @ REPL[277]:22 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %32 = shl i64 %24, 2
   %33 = add i64 %32, 4
   %34 = getelementptr inbounds double, double* %26, i64 %33
   store double %18, double* %34, align 8
; └
;  @ REPL[277]:21 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %35 = shl nuw i64 %24, 1
   %36 = add nuw i64 %35, 2
   %37 = getelementptr inbounds double, double* %26, i64 %36
   store double %20, double* %37, align 8
; └
;  @ REPL[277]:22 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %38 = mul i64 %24, 5
   %39 = add i64 %38, 5
   %40 = getelementptr inbounds double, double* %26, i64 %39
   store double %18, double* %40, align 8
; └
;  @ REPL[277]:25 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %41 = getelementptr inbounds double, double* %26, i64 %24
   store double %16, double* %41, align 8
; └
;  @ REPL[277]:26 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %42 = getelementptr inbounds double, double* %26, i64 %35
   store double %16, double* %42, align 8
; └
;  @ REPL[277]:27 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %43 = getelementptr inbounds double, double* %26, i64 1
   store double %16, double* %43, align 8
; └
;  @ REPL[277]:28 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %44 = or i64 %35, 1
   %45 = getelementptr inbounds double, double* %26, i64 %44
   store double %16, double* %45, align 8
; └
;  @ REPL[277]:29 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %46 = getelementptr inbounds double, double* %26, i64 2
   store double %16, double* %46, align 8
; └
;  @ REPL[277]:30 within `foo1_jl'
; ┌ @ array.jl:841 within `setindex!'
   %47 = add nuw i64 %24, 2
   %48 = getelementptr inbounds double, double* %26, i64 %47
   store double %16, double* %48, align 8
; └
;  @ REPL[277]:32 within `foo1_jl'
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %49 = bitcast {}* %1 to double**
   %50 = load double*, double** %49, align 8
   %51 = load double, double* %50, align 8
; └
; ┌ @ float.jl:332 within `*'
   %52 = fmul double %20, %51
; └
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %53 = getelementptr inbounds double, double* %50, i64 1
   %54 = load double, double* %53, align 8
; └
; ┌ @ float.jl:332 within `*'
   %55 = fmul double %16, %54
; └
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %56 = getelementptr inbounds double, double* %50, i64 2
   %57 = load double, double* %56, align 8
; └
; ┌ @ float.jl:332 within `*'
   %58 = fmul double %16, %57
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
   %59 = fadd double %52, %55
   %60 = fadd double %59, %58
; └
; ┌ @ array.jl:839 within `setindex!'
   %61 = bitcast {}* %3 to double**
   %62 = load double*, double** %61, align 8
   store double %60, double* %62, align 8
; └
;  @ REPL[277]:33 within `foo1_jl'
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %63 = load double, double* %53, align 8
; └
; ┌ @ float.jl:332 within `*'
   %64 = fmul double %20, %63
; └
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %65 = load double, double* %50, align 8
; └
; ┌ @ float.jl:332 within `*'
   %66 = fmul double %16, %65
; └
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %67 = load double, double* %56, align 8
; └
; ┌ @ float.jl:332 within `*'
   %68 = fmul double %16, %67
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
   %69 = fadd double %64, %66
   %70 = fadd double %69, %68
; └
; ┌ @ array.jl:839 within `setindex!'
   %71 = getelementptr inbounds double, double* %62, i64 1
   store double %70, double* %71, align 8
; └
;  @ REPL[277]:34 within `foo1_jl'
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %72 = load double, double* %56, align 8
; └
; ┌ @ float.jl:332 within `*'
   %73 = fmul double %20, %72
; └
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %74 = load double, double* %53, align 8
; └
; ┌ @ float.jl:332 within `*'
   %75 = fmul double %16, %74
; └
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %76 = load double, double* %50, align 8
; └
; ┌ @ float.jl:332 within `*'
   %77 = fmul double %16, %76
; └
; ┌ @ operators.jl:560 within `+' @ float.jl:326
   %78 = fadd double %73, %75
   %79 = fadd double %78, %77
; └
; ┌ @ array.jl:839 within `setindex!'
   %80 = getelementptr inbounds double, double* %62, i64 2
   store double %79, double* %80, align 8
; └
;  @ REPL[277]:35 within `foo1_jl'
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %81 = getelementptr inbounds double, double* %50, i64 3
   %82 = load double, double* %81, align 8
; └
; ┌ @ operators.jl:560 within `*' @ float.jl:332
   %83 = fmul double %19, %82
; └
; ┌ @ array.jl:839 within `setindex!'
   %84 = getelementptr inbounds double, double* %62, i64 3
   store double %83, double* %84, align 8
; └
;  @ REPL[277]:36 within `foo1_jl'
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %85 = getelementptr inbounds double, double* %50, i64 4
   %86 = load double, double* %85, align 8
; └
; ┌ @ operators.jl:560 within `*' @ float.jl:332
   %87 = fmul double %19, %86
; └
; ┌ @ array.jl:839 within `setindex!'
   %88 = getelementptr inbounds double, double* %62, i64 4
   store double %87, double* %88, align 8
; └
;  @ REPL[277]:37 within `foo1_jl'
; ┌ @ subarray.jl:309 within `getindex' @ array.jl:801
   %89 = getelementptr inbounds double, double* %50, i64 5
   %90 = load double, double* %89, align 8
; └
; ┌ @ operators.jl:560 within `*' @ float.jl:332
   %91 = fmul double %19, %90
; └
; ┌ @ array.jl:839 within `setindex!'
   %92 = getelementptr inbounds double, double* %62, i64 5
   store double %91, double* %92, align 8
; └
  ret double %91
}

Thanks, I will try to do that later.

I have added @inbounds to the function foo1.

The calculations were just as an example, in reality the calcs will use the package Tensors.jl or a similar pkg to take advantage of the notation, that is one of the main reason to use julia, and the function will end up looking similar to foo1_Tensors.

I also made the same function, purely in julia called foo1_jl_Tensors and I printed @code_llvm output but I do not understand it well enough to know how to make it better.

And the benchmarks I get the following:

julia> A_ptr = pointer(A)
...

julia> @benchmark foo1($A_ptr, $B_ptr, $C_ptr, $X_ptr, $Y_ptr, $N_ptr)
BenchmarkTools.Trial: 
  memory estimate:  320 bytes
  allocs estimate:  4
  --------------
  minimum time:     161.804 ns (0.00% GC)
  median time:      407.530 ns (0.00% GC)
  mean time:        486.977 ns (6.70% GC)
  maximum time:     9.709 μs (92.69% GC)
  --------------
  samples:          10000
  evals/sample:     759

julia> @benchmark foo1_Tensors($A_ptr, $B_ptr, $C_ptr, $X_ptr, $Y_ptr, $N_ptr)
BenchmarkTools.Trial: 
  memory estimate:  816 bytes
  allocs estimate:  6
  --------------
  minimum time:     589.864 ns (0.00% GC)
  median time:      954.068 ns (0.00% GC)
  mean time:        1.227 μs (7.22% GC)
  maximum time:     61.149 μs (97.00% GC)
  --------------
  samples:          10000
  evals/sample:     177

julia> @benchmark foo1_jl($A, $B, $C, $X, $Y, $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.815 ns (0.00% GC)
  median time:      26.397 ns (0.00% GC)
  mean time:        32.397 ns (0.00% GC)
  maximum time:     3.134 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark foo1_jl_Tensors($A, $B, $C, $X, $Y, $N)
BenchmarkTools.Trial: 
  memory estimate:  496 bytes
  allocs estimate:  2
  --------------
  minimum time:     423.879 ns (0.00% GC)
  median time:      608.126 ns (0.00% GC)
  mean time:        784.921 ns (2.72% GC)
  maximum time:     50.334 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> @time foo1_benchmark(A_ptr, B_ptr, C_ptr, X_ptr, Y_ptr, N_ptr)
  7.250160 seconds (40.04 M allocations: 2.983 GiB, 13.44% gc time, 0.43% compilation time)

julia> @time foo1_Tensors_benchmark(A_ptr, B_ptr, C_ptr, X_ptr, Y_ptr, N_ptr)
 12.049414 seconds (60.04 M allocations: 7.602 GiB, 6.84% gc time, 0.23% compilation time)

julia> @time foo1_jl_benchmark(A, B, C, X, Y, N)
  0.241849 seconds (40.02 k allocations: 2.577 MiB, 16.51% compilation time)

julia> @time foo1_jl_Tensors_benchmark(A, B, C, X, Y, N)
  7.250791 seconds (20.04 M allocations: 4.622 GiB, 2.89% gc time, 0.42% compilation time)
foo1
function foo1(
    A::Ptr{Cdouble}, B::Ptr{Cdouble}, C::Ptr{Cdouble},          # input pointers
    X::Ptr{Cdouble}, Y::Ptr{Cdouble},                           # output pointers
    N::Ptr{Cint}
)
    # Loading parameters
    # *************************************************************************************
    @inbounds begin
    n = unsafe_load(N, 1)

    a1 = unsafe_load(A, 1)
    a2 = unsafe_load(A, 2)

    b = unsafe_wrap(Array{Float64, 2}, B, (6, 2), own=false)
    b1 = @view b[:, 1]
    b2 = @view b[:, 2]

    c = unsafe_wrap(Array{Float64, 3}, C, (3, 3, 4), own=false)
    c1 = @view c[:, :, 1]
    c2 = @view c[:, :, 2]
    c3 = @view c[:, :, 3]
    c4 = @view c[:, :, 4]

    x = unsafe_wrap(Array{Float64, 1}, X, 6, own=false)
    y = unsafe_wrap(Array{Float64, 2}, Y, (6, 6), own=false)
    end

    # Calculations
    # *************************************************************************************
    @inbounds begin
    E = a1
    ν = a2
    λ = E * ν / ((1 + ν) * (1 - 2 * ν))
    μ = E / (2 * (1 + ν))
    for i = 1:3
        y[i,i] = 2 * μ + λ
        y[i+3,i+3] = μ
    end # i
    y[1,2] = λ
    y[1,3] = λ
    y[2,1] = λ
    y[2,3] = λ
    y[3,1] = λ
    y[3,2] = λ

    x[1] = (2 * μ + λ) * b1[1] + λ * b1[2] + λ * b1[3]
    x[2] = (2 * μ + λ) * b1[2] + λ * b1[1] + λ * b1[3]
    x[3] = (2 * μ + λ) * b1[3] + λ * b1[2] + λ * b1[1]
    x[4] = 2 * μ * b1[4] 
    x[5] = 2 * μ * b1[5] 
    x[6] = 2 * μ * b1[6] 
    end

    return nothing
end # foo1
foo1_Tensors
using Tensors
function foo1_Tensors(
    A::Ptr{Cdouble}, B::Ptr{Cdouble}, C::Ptr{Cdouble},          # input pointers
    X::Ptr{Cdouble}, Y::Ptr{Cdouble},                           # output pointers
    N::Ptr{Cint}
)
    # Loading parameters
    # *************************************************************************************
    @inbounds begin
    n = unsafe_load(N, 1)

    a1 = unsafe_load(A, 1)
    a2 = unsafe_load(A, 2)

    b = unsafe_wrap(Array{Float64, 2}, B, (6, 2), own=false)
    b1 = @view b[:, 1]
    b2 = @view b[:, 2]

    c = unsafe_wrap(Array{Float64, 3}, C, (3, 3, 4), own=false)
    c1 = @view c[:, :, 1]
    c2 = @view c[:, :, 2]
    c3 = @view c[:, :, 3]
    c4 = @view c[:, :, 4]

    x = unsafe_wrap(Array{Float64, 1}, X, 6, own=false)
    y = unsafe_wrap(Array{Float64, 2}, Y, (6, 6), own=false)
    end

    # Calculations
    # *************************************************************************************
    I = one(SymmetricTensor{2,3})
    Iˢʸᵐ = one(SymmetricTensor{4,3})
    ε = fromvoigt(SymmetricTensor{2,3}, b1, offdiagscale=2.0)
    E = a1
    ν = a2

    λ = E * ν / ((1 + ν) * (1 - 2 * ν))
    μ = E / (2 * (1 + ν))
    
    Cc = λ * I ⊗ I + 2 * μ * Iˢʸᵐ
    σ = Cc ⊡ ε

    @inbounds begin
    x[:] = tovoigt(σ)
    y[:,:] = tovoigt(Cc)
    end

    return nothing
end # foo1_Tensors
foo1_jl_Tensors
function foo1_jl_Tensors(a, b, c, x, y, n)
    # Loading parameters
    # *************************************************************************************
    @inbounds begin
    a1 = @inbounds a[1]
    a2 = @inbounds a[2]

    b1 = @view b[:, 1]
    b2 = @view b[:, 2]

    c1 = @view c[:, :, 1]
    c2 = @view c[:, :, 2]
    c3 = @view c[:, :, 3]
    c4 = @view c[:, :, 4]
    end

    # Calculations
    # *************************************************************************************
    I = one(SymmetricTensor{2,3})
    Iˢʸᵐ = one(SymmetricTensor{4,3})
    ε = fromvoigt(SymmetricTensor{2,3}, b1, offdiagscale=2.0)
    E = a1
    ν = a2

    λ = E * ν / ((1 + ν) * (1 - 2 * ν))
    μ = E / (2 * (1 + ν))
    
    Cc = λ * I ⊗ I + 2 * μ * Iˢʸᵐ
    σ = Cc ⊡ ε

    @inbounds begin
    x[:] = tovoigt(σ)
    y[:,:] = tovoigt(Cc)
    end

    return nothing
end # foo1_jl_Tensors

I made small benchmark inside foo1 and foo1_Tensors to know the time it takes to unpack the pointers and in the calculations. These were the results:

foo1:
  unpacking pointers: 151.547 ns (4 allocations: 320 bytes)
  calculations: 21.737 ns (0 allocations: 0 bytes)

foo1_Tensors:
  unpacking pointers: 150.472 ns (4 allocations: 320 bytes)
  calculations: 414.740 ns (2 allocations: 496 bytes)

By using the package Tensors.jl can generate an overhead?
Or it is just how the functions are implemented?

I hope it is not confusing. Thank you for all your help.

You could also use @code_native and compare the assembly generated from Julia to the assembly generated by the Fortran compiler.

Regarding Tensors.jl, it might be better to create a new topic to ask specifically about that. I’ll ping @fredrikekre and @kristoffer.carlsson for you though since they are the major committers to that package.

1 Like

Just a comment here, even in a module with precompile statements, julia is not doing ahead of time compilation. All that gets cached is julia’s intermediate representation. That IR still needs to be typed and optimized, and then passed to LLVM for compilation to native code, which does not get cached.

To eliminate compilation overhead between julia sessions, you’ll need to use GitHub - JuliaLang/PackageCompiler.jl: Compile your Julia Package which will let you bake full compilation results of your precompile statements into a sysimage.

5 Likes