I’m slowly learning some of the tricks that can make `julia` code quicker. Coming from `python`, I still find it strange to write a loop instead of something vectorised, but it seems that sometimes the loop is the quicker option.

I’ve found this example where I’m just normalising each vector by its norm when passing in a 2D array of many 3D vectors (one vector per row), i.e. `cvec_1 = rand(50000,3)`.

I started with a broadcasted function and this was relatively quick:

```
using BenchmarkTools

function normalize_by_row(arr)
    # divide each row by its Euclidean norm
    arr ./ sqrt.(sum(arr.^2, dims=2))
end

@btime nrm = normalize_by_row($cvec_1);
  898.900 μs (10 allocations: 2.94 MiB)
```

But I also wanted to check the looped possibilities, so I have tried a few different versions using what I’ve picked up about `@views` and `@inbounds`:

```
function normalize_by_row_v2(arr)
    norms = similar(arr)
    for i in axes(arr,1)
        @views norms[i,:] = arr[i,:] / sqrt(sum(arr[i,:].^2))
    end
    return norms
end

function normalize_by_row_v3(arr)
    norms = Vector{Float64}(undef, size(arr)[1])
    for i in axes(arr,1)
        @views norms[i] = sqrt(sum(arr[i,:].^2))
    end
    return arr ./ norms
end

function normalize_by_row_v4(arr)
    norms = Vector{Float64}(undef, size(arr)[1])
    temp = arr.^2
    for i in axes(arr,1)
        @views norms[i] = sqrt(sum(temp[i,:]))
    end
    return arr ./ norms
end

function normalize_by_row_v5(arr)
    norms = Vector{Float64}(undef, size(arr)[1])
    temp = arr.^2
    for i in axes(arr,1)
        @inbounds @views norms[i] = sqrt(sum(temp[i,:]))
    end
    return arr ./ norms
end
```
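Before comparing timings, it seems worth confirming that the looped versions return the same thing as the broadcasted one. The snippet below is only a minimal sketch of such a check: `normalize_ref`, `normalize_loop` and the smaller input `small` are hypothetical names introduced here for illustration, with bodies copied from `normalize_by_row` and `normalize_by_row_v5`.

```julia
# Broadcasted reference (same body as normalize_by_row)
normalize_ref(arr) = arr ./ sqrt.(sum(arr .^ 2, dims=2))

# Looped variant (same logic as normalize_by_row_v5)
function normalize_loop(arr)
    norms = Vector{Float64}(undef, size(arr, 1))
    temp = arr .^ 2
    for i in axes(arr, 1)
        @inbounds @views norms[i] = sqrt(sum(temp[i, :]))
    end
    return arr ./ norms
end

small = rand(1000, 3)  # hypothetical smaller test input, one vector per row

# Both versions should agree up to floating-point tolerance
@assert normalize_ref(small) ≈ normalize_loop(small)

# Every normalised row should have unit length
row_lengths = sqrt.(sum(normalize_loop(small) .^ 2, dims=2))
@assert all(isapprox.(row_lengths, 1.0))
```

If both assertions pass, the timing differences reflect only how the work is done and not a difference in output.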

I guess I’ve had varying degrees of success: the first looped version (v2) is both slow and memory hungry, while the final looped version (v5) is nearly 20% faster than the broadcasted one:

```
julia> @btime nrm2 = normalize_by_row_v2($cvec_1);
4.016 ms (110166 allocations: 13.03 MiB)
julia> @btime nrm3 = normalize_by_row_v3($cvec_1);
1.999 ms (55086 allocations: 7.56 MiB)
julia> @btime nrm4 = normalize_by_row_v4($cvec_1);
782.800 μs (6 allocations: 2.94 MiB)
julia> @btime nrm5 = normalize_by_row_v5($cvec_1);
748.200 μs (6 allocations: 2.94 MiB)
```

The final one seems to be pretty good, but I’m wondering whether there are any further improvements that could be made to such a function.

I also thought I could use `LoopVectorization` and `@turbo` in v5 of the function, but it doesn’t work and throws an error:

```
ERROR: ArgumentError: invalid index: VectorizationBase.MM{4, 1, Int64}<1, 2, 3, 4> of type VectorizationBase.MM{4, 1, Int64}
```

So can I get any faster? Is there some other important `julia` concept that I’ve missed?

Thanks in advance!