Hello all, I am new to Julia and have been making progress with its performance utilities. I can usually get results, or at least understand what is going on, but not in this case.

The following code:

```
@btime circshift!(rickshift, rick0, shift[iz,iy,ix,it])
@btime mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
@btime Threads.@threads for i in 1:nT
    Dd = view(dd, :, i)
    Pd = view(pd, :, i)
    circshift!(Dd, Pd, Tgrid[i])
end
```

returns the following (I am using 4 threads throughout):

```
200.706 ns (2 allocations: 48 bytes)
8.112 μs (2 allocations: 928 bytes)
29.826 μs (520 allocations: 22.14 KiB)
```

EDIT: I defined `rick0` as a sparse vector earlier in the code, with `rickshift = similar(rick0)`. `shift` is a 4-dimensional array of size `nz,ny,nx,nT` containing `Float32` values, `dd` is defined as `dd = view(d, :, ir, :)` (as is also the case in the next snippet), `pd = zeros(nt, nT)`, and `Tshift = -49:50`.

The timings above imply that the following function

```
#size(d) = nt, nr, nT
#size(m) = nz,ny,nx,nT
function Gnew!(d, m)
    for ir in 1:Nr
        for ix in 1:nx, iy in 1:ny, iz in 1:nz
            circshift!(rickshift, rick0, shift[iz,iy,ix,ir])
            mul_alt2!(pd, m[iz,iy,ix, :], rickshift)
            Threads.@threads for i in 1:nT
                Dd = view(dd, :, i)
                Pd = view(pd, :, i)
                circshift!(Dd, Pd, Tshift[i])
            end
        end
    end
end
```

would run in about 20 minutes for `nx=ny=nz=50`, `nT=100`, `Nr=250` (39e-6 s × 50×50×50×250 / 60 ≈ 20.3 min),
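To make the estimate reproducible, here is the same arithmetic as a small script. It assumes that one pass through the innermost loop body costs exactly the sum of the three `@btime` results above and that nothing else adds overhead; the variable names are only illustrative.

```julia
# Per-iteration cost: sum of the three @btime results, in seconds
t_inner = 200.706e-9 + 8.112e-6 + 29.826e-6

# Number of innermost iterations: nz * ny * nx * Nr
n_iter = 50 * 50 * 50 * 250

est_seconds = t_inner * n_iter
est_minutes = est_seconds / 60

println("estimated run time: ", round(est_minutes, digits=1), " min")
```

Using the exact sum (about 38.1 μs) instead of the rounded 39 μs gives a slightly smaller figure, still close to 20 minutes.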

but when I run it using the following:

```
D= zeros(nt, nr, nT)
m_init= randn(nx,ny,nz,nT)
t1= time()
Gnew!(D, m_init)
time()- t1
```

the output is 4375 seconds (about 73 minutes).
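One thing that may be worth checking when comparing `@btime` numbers with a wall-clock measurement: the first call to a function includes compilation, and the Julia performance tips recommend passing arrays as function arguments rather than reading non-constant globals. The sketch below only illustrates that pattern for the threaded loop; `shift_columns!` is a hypothetical helper name, the array sizes are made up, and I am not claiming this accounts for the gap.

```julia
# Hypothetical helper: the per-column shift loop with every array passed
# in as an argument instead of being read from global scope.
function shift_columns!(dd::AbstractMatrix, pd::AbstractMatrix, shifts)
    Threads.@threads for i in axes(pd, 2)
        # Shift column i of pd circularly and store it in column i of dd
        circshift!(view(dd, :, i), view(pd, :, i), shifts[i])
    end
    return dd
end

nt, nT = 64, 100            # illustrative sizes, not the ones from the post
pd = rand(nt, nT)
dd = zeros(nt, nT)
Tshift = -49:50

shift_columns!(dd, pd, Tshift)                  # warm-up call, includes compilation
t = @elapsed shift_columns!(dd, pd, Tshift)     # timed second call
println("second call took ", t, " s")
```

Timing the second call separately keeps compilation time out of the measurement.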

`mul_alt2!(...)` is a function that gives the matrix product of two vectors, one of them a sparse vector, where neither is stored as a matrix. It is defined as

```
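using SparseArrays, LinearAlgebra  # SparseVector and BLAS

function mul_alt2!(C::Matrix, X::Vector, A::SparseVector)
    # For every stored index i of A: C[i, :] += A[i] * X
    @inbounds for i in A.nzind
        cc = view(C, i, :)
        BLAS.axpy!(A[i], X, cc)
    end
end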
```
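For comparison, here is a minimal sketch of the same accumulation written as plain loops over the stored entries of the sparse vector, without `BLAS`. The name `mul_outer_sketch!` is hypothetical, and I have not benchmarked it against `mul_alt2!`. It walks `C` column by column, which matches Julia's column-major layout, and it accepts any `AbstractVector` for `X`, so a `view` could be passed instead of a copied slice.

```julia
using SparseArrays

# Hypothetical alternative: C[i, j] += A[i] * X[j] for every stored index i of A
function mul_outer_sketch!(C::AbstractMatrix, X::AbstractVector, A::SparseVector)
    size(C) == (length(A), length(X)) ||
        throw(DimensionMismatch("C must have size (length(A), length(X))"))
    inds, vals = A.nzind, A.nzval      # stored indices and their values
    @inbounds for j in eachindex(X)    # outer loop over columns of C
        xj = X[j]
        for k in eachindex(inds)       # inner loop over stored rows only
            C[inds[k], j] += vals[k] * xj
        end
    end
    return C
end

A = sparsevec([2, 4], [1.0, -2.0], 5)
X = [1.0, 2.0, 3.0]
C = zeros(5, 3)
mul_outer_sketch!(C, X, A)
```

Like `mul_alt2!`, this adds to whatever `C` already holds, so `C` has to be zeroed first if a fresh product is wanted.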

Since `A` and `X` in the definition above are vectors and not matrices, I could not find a suitable `BLAS` function, or any other efficient function, that does not require reshaping the vectors into arrays first, and that reshaping step is itself not efficient.

I hope I have posted enough information. In case I haven’t, please let me know. Also if you have some suggestions, do share them. Thanks for your time!