I am new to Julia, and I am confused about the performance of some simple matrix code.
Format 1
C=0
A=[2 4 1; 4 1 4; 4 2 3]
@time for i=1:10000000
    global A=A/i
    A=A+A'
    global C=C+sum(A)
end
C
3.792766 seconds (60.00 M allocations: 3.576 GiB, 3.94% gc time)
It takes 3.79s.
Format 2
Following the documentation, I avoided global variables by moving the loop into a function:
A=[2.0 4 1; 4 1 4; 4 2 3]
function loop_over_global(A::Array{Float64,2})
    c=0.0
    for i=1:10000000
        A=A/i
        A=A.+A'
        c+=sum(A)
    end
    return c
end
@time loop_over_global(A)
1.594670 seconds (20.04 M allocations: 2.982 GiB, 10.59% gc time)
It takes 1.59s.
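The timing above reports about two allocations per iteration, which matches `A/i` and `A.+A'` each returning a new matrix. As a minimal sketch of how those temporaries might be avoided without writing element loops, the variant below reuses one preallocated buffer. The names `loop_inplace!` and `B` are hypothetical, the iteration count is passed in as `n`, and I have not timed this version.

```julia
# Hypothetical in-place variant of Format 2: same arithmetic, one reusable buffer.
# Note: the caller's matrix is overwritten, so pass a copy if it is needed later.
function loop_inplace!(A::Matrix{Float64}, n::Integer)
    B = similar(A)          # scratch buffer, allocated once outside the loop
    c = 0.0
    for i in 1:n
        A ./= i             # scale in place instead of creating A/i
        B .= A .+ A'        # write A + A' into B (B is a different array than A)
        A, B = B, A         # swap the local bindings so A names the new value
        c += sum(A)
    end
    return c
end

A0 = [2.0 4 1; 4 1 4; 4 2 3]
c_inplace = loop_inplace!(copy(A0), 1000)
```

Writing the sum into a separate buffer `B` is deliberate: the destination never overlaps the array that `A'` wraps, so the broadcast should not need a defensive copy.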
Format 3
Replacing the built-in matrix operations with explicit loops:
A=[2.0 4 1; 4 1 4; 4 2 3]
function loop_over_global(A::Array{Float64,2})
    c=0.0
    for i=1:10^7
        for k=1:3
            for j=1:3
                A[j,k]=A[j,k]/i
            end
        end
        
        for k=1:3
            for j=k:3
                A[j,k]=A[j,k]+A'[j,k]
            end
        end
        
        for k=1:3
            for j=1:k-1
                A[j,k]=A[k,j]
            end
        end
        
        c+=sum(A)
    end
    return c
end
@time loop_over_global(A)
0.442143 seconds (82.71 k allocations: 4.293 MiB)
It takes 0.44s.
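As a sanity check that the three loop nests compute the same update as `A/i` followed by `A+A'`, a single step can be compared against the matrix expression. This is only a sketch: `step_loops!` is a hypothetical helper that isolates one iteration of the loop body above, with `A'[j,k]` written as the equivalent `A[k,j]`.

```julia
# Hypothetical helper: one iteration of the Format 3 loop body, done in place.
function step_loops!(A::Matrix{Float64}, i::Integer)
    for k in 1:3, j in 1:3
        A[j, k] /= i            # scale every element by 1/i
    end
    for k in 1:3, j in k:3
        A[j, k] += A[k, j]      # diagonal and lower triangle: A + A'
    end
    for k in 1:3, j in 1:k-1
        A[j, k] = A[k, j]       # mirror the lower triangle into the upper one
    end
    return A
end

A0 = [2.0 4 1; 4 1 4; 4 2 3]
expected = (A0 / 2) .+ (A0 / 2)'        # matrix form of the same step for i = 2
actual = step_loops!(copy(A0), 2)
same = actual == expected
```

If `same` is true, both forms do the same arithmetic per step, so any difference in run time would come from how the work is carried out rather than from different results.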
Fortran code
    program main
    integer(kind=8):: ip,i,j
    real(kind=8):: A(3,3),c,t1,t2
    c=0.0
    A(1,1) = 2; A(1,2) = 4; A(1,3) = 1;
    A(2,1) = 4; A(2,2) = 1; A(2,3) = 4;
    A(3,1) = 4; A(3,2) = 2; A(3,3) = 3;
    call cpu_time(t1)
    do ip=1,10000000
        A=A/ip
        A=A+transpose(A)
        do i=1,3
            do j=1,3
                c=c+A(j,i)
            enddo
        enddo
    enddo
    call cpu_time(t2)
    write(*,*) c,t2-t1
    end
Using IVF2011, it takes 0.25s.
I am confused by the performance difference between Format 2 and Format 3. Can the Julia code be improved further?