Julia is Eight times Slower than Go on Numerical Scheme

Here is a run of the Julia code

conjulia -- Energy Conserving IRK Scheme

y(0)=[1.0, 1.0, 1.0]

    p       n              h         E(100)        dE(100)          error
    7     128  7.8125000e-01  3.0000000e+00  0.0000000e+00  3.8059634e-02
    8     256  3.9062500e-01  3.0000000e+00 -3.5527137e-15  7.3773381e-03
    9     512  1.9531250e-01  3.0000000e+00 -2.2204460e-15  4.9289848e-04
   10    1024  9.7656250e-02  3.0000000e+00  5.3290705e-15  3.2058762e-05
   11    2048  4.8828125e-02  3.0000000e+00  4.4408921e-16  2.0240981e-06
   12    4096  2.4414062e-02  3.0000000e+00  7.1054274e-15  1.2682855e-07
   13    8192  1.2207031e-02  3.0000000e+00 -1.6431301e-14  7.9318318e-09
   14   16384  6.1035156e-03  3.0000000e+00  6.4837025e-14  4.9582805e-10
   15   32768  3.0517578e-03  3.0000000e+00  2.0428104e-14  3.0986253e-11
   16   65536  1.5258789e-03  3.0000000e+00 -1.0658141e-14  1.9332709e-12

y(100)=[-0.11882343279003697, 0.20699682299721942, -1.7155271222248394]

Total execution time 1166.090384056 seconds.

real    21m38.138s
user    20m51.230s
sys 0m15.350s

and here is a run of the Go program

congo -- Energy Conserving IRK Scheme

y(0)=[1, 1, 1]

    p       n              h         E(100)        dE(100)          error
    7     128  7.8125000e-01  3.0000000e+00  0.0000000e+00  3.8059634e-02
    8     256  3.9062500e-01  3.0000000e+00 -3.5527137e-15  7.3773381e-03
    9     512  1.9531250e-01  3.0000000e+00 -4.4408921e-16  4.9289848e-04
   10    1024  9.7656250e-02  3.0000000e+00  8.4376950e-15  3.2058762e-05
   11    2048  4.8828125e-02  3.0000000e+00  3.9968029e-15  2.0240981e-06
   12    4096  2.4414062e-02  3.0000000e+00  5.7731597e-15  1.2682855e-07
   13    8192  1.2207031e-02  3.0000000e+00 -1.6431301e-14  7.9318324e-09
   14   16384  6.1035156e-03  3.0000000e+00  5.9063865e-14  4.9582719e-10
   15   32768  3.0517578e-03  3.0000000e+00  1.8651747e-14  3.0986469e-11
   16   65536  1.5258789e-03  3.0000000e+00 -1.0658141e-14  1.9335909e-12

y(100)=[-0.11882343279003703, 0.20699682299721947, -1.7155271222248394]

Total execution time 148.85671734809875 seconds.

real    2m29.261s
user    2m26.410s
sys     0m1.820s

For reference the source codes for the two programs are

#  conserve.jl -- Energy Conserving IRK Scheme

using Printf, LinearAlgebra

s3o6=sqrt(3)/6
a=[1/4 1/4-s3o6; 1/4+s3o6 1/4]
b=[1/2,1/2]
c=[1/2-s3o6,1/2+s3o6]

function euler(t,y,h)
    return y+h*fF(t,y)
end

function rk2(t,y,h)
    k1=fF(t,y)
    k2=fF(t,y+h*2/3*k1)
    return y+h*(1/4*k1+3/4*k2)
end

function fF(t,y)
    F=zeros(3)
F[1] = sin(t)*y[2]*y[3] - y[1]*y[2]*y[3]
F[2] = (y[1]*y[3])/20 - sin(t)*y[1]*y[3]
F[3] = -(y[1]*y[2])/20 + y[1]^2*y[2]
    return F
end

function fG(t,y,h,k)
    G=zeros(6)
G[1] = k[1] - sin(t + h*c[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3]) + (h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
G[2] = k[2] - sin(t + h*c[1])*(-(h*(a[1, 1]*k[1] + a[1, 2]*k[4])) - y[1])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3]) - ((h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3]))/20
G[3] = k[3] + ((h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2]))/20 - (h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])^2*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])
G[4] = k[4] - sin(t + h*c[2])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3]) + (h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
G[5] = k[5] - sin(t + h*c[2])*(-(h*(a[2, 1]*k[1] + a[2, 2]*k[4])) - y[1])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3]) - ((h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3]))/20
G[6] = k[6] + ((h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2]))/20 - (h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])^2*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])
    return G
end

function fDG(t,y,h,k)
    DG=zeros(6,6)
DG[1, 1] = 1 + h*a[1, 1]*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
DG[1, 2] = -(sin(t + h*c[1])*h*a[1, 1]*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])) + h*a[1, 1]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
DG[1, 3] = -(sin(t + h*c[1])*h*a[1, 1]*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])) + h*a[1, 1]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])
DG[1, 4] = h*a[1, 2]*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
DG[1, 5] = -(sin(t + h*c[1])*h*a[1, 2]*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])) + h*a[1, 2]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
DG[1, 6] = -(sin(t + h*c[1])*h*a[1, 2]*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])) + h*a[1, 2]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])
DG[2, 1] = -(h*a[1, 1]*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3]))/20 + sin(t + h*c[1])*h*a[1, 1]*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
DG[2, 2] = 1
DG[2, 3] = -(sin(t + h*c[1])*h*a[1, 1]*(-(h*(a[1, 1]*k[1] + a[1, 2]*k[4])) - y[1])) - (h*a[1, 1]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1]))/20
DG[2, 4] = -(h*a[1, 2]*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3]))/20 + sin(t + h*c[1])*h*a[1, 2]*(h*(a[1, 1]*k[3] + a[1, 2]*k[6]) + y[3])
DG[2, 5] = 0
DG[2, 6] = -(sin(t + h*c[1])*h*a[1, 2]*(-(h*(a[1, 1]*k[1] + a[1, 2]*k[4])) - y[1])) - (h*a[1, 2]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1]))/20
DG[3, 1] = (h*a[1, 1]*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2]))/20 - 2*h*a[1, 1]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])
DG[3, 2] = (h*a[1, 1]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1]))/20 - h*a[1, 1]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])^2
DG[3, 3] = 1
DG[3, 4] = (h*a[1, 2]*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2]))/20 - 2*h*a[1, 2]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])*(h*(a[1, 1]*k[2] + a[1, 2]*k[5]) + y[2])
DG[3, 5] = (h*a[1, 2]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1]))/20 - h*a[1, 2]*(h*(a[1, 1]*k[1] + a[1, 2]*k[4]) + y[1])^2
DG[3, 6] = 0
DG[4, 1] = h*a[2, 1]*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
DG[4, 2] = -(sin(t + h*c[2])*h*a[2, 1]*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])) + h*a[2, 1]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
DG[4, 3] = -(sin(t + h*c[2])*h*a[2, 1]*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])) + h*a[2, 1]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])
DG[4, 4] = 1 + h*a[2, 2]*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
DG[4, 5] = -(sin(t + h*c[2])*h*a[2, 2]*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])) + h*a[2, 2]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
DG[4, 6] = -(sin(t + h*c[2])*h*a[2, 2]*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])) + h*a[2, 2]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])
DG[5, 1] = -(h*a[2, 1]*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3]))/20 + sin(t + h*c[2])*h*a[2, 1]*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
DG[5, 2] = 0
DG[5, 3] = -(sin(t + h*c[2])*h*a[2, 1]*(-(h*(a[2, 1]*k[1] + a[2, 2]*k[4])) - y[1])) - (h*a[2, 1]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1]))/20
DG[5, 4] = -(h*a[2, 2]*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3]))/20 + sin(t + h*c[2])*h*a[2, 2]*(h*(a[2, 1]*k[3] + a[2, 2]*k[6]) + y[3])
DG[5, 5] = 1
DG[5, 6] = -(sin(t + h*c[2])*h*a[2, 2]*(-(h*(a[2, 1]*k[1] + a[2, 2]*k[4])) - y[1])) - (h*a[2, 2]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1]))/20
DG[6, 1] = (h*a[2, 1]*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2]))/20 - 2*h*a[2, 1]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])
DG[6, 2] = (h*a[2, 1]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1]))/20 - h*a[2, 1]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])^2
DG[6, 3] = 0
DG[6, 4] = (h*a[2, 2]*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2]))/20 - 2*h*a[2, 2]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])*(h*(a[2, 1]*k[2] + a[2, 2]*k[5]) + y[2])
DG[6, 5] = (h*a[2, 2]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1]))/20 - h*a[2, 2]*(h*(a[2, 1]*k[1] + a[2, 2]*k[4]) + y[1])^2
DG[6, 6] = 1
    return DG
end

function irk(t,y,h)
    k1=fF(t+h*c[1],rk2(t,y,h*c[1]))
    k2=fF(t+h*c[2],rk2(t,y,h*c[2]))
    Ki=[k1;k2]
    lastone=false
    for i=1:10
        Gi=fG(t,y,h,Ki)
        DGi=fDG(t,y,h,Ki)
        DK=-DGi\Gi
        Ki+=DK
        if lastone
            break
        end
        if norm(DK)<=norm(Ki)*10e-10
            lastone=true
        end
    end
    k1=Ki[1:3]
    k2=Ki[4:6]
    return y+h*(b[1]*k1+b[2]*k2)
end

function solve(t0,y0,h,n)
    yn=copy(y0)
    for j=1:n
        tn=t0+(j-1)*h
        yn=irk(tn,yn,h)
    end
    return yn
end

function main(args)
    @printf("conjulia -- Energy Conserving IRK Scheme\n\n")
    t0=0; T=100
    y0=[1.0,1,1]
    if length(args)==3
        for j=1:3
            y0[j]=parse(Float64,args[j])
        end
    end
    println("y($t0)=$y0\n")
    einit=y0'*y0
    yold=zeros(3)
    @printf("%5s %7s %14s %14s %14s %14s\n",
        "p","n","h","E(100)","dE(100)","error")
    yn=copy(y0)
    n=64
    for p=6:16
        h=T/n
        yn=solve(t0,y0,h,n)
        if p>6
            @printf("%5d %7d %14.7e %14.7e %14.7e %14.7e\n",
            p,n,h,yn'*yn,yn'*yn-einit,norm(yold-yn))
        end
        flush(stdout)
        yold=copy(yn)
        n*=2
    end
    println("\ny($T)=$yn")
end

tsec=@elapsed main(ARGS)
println("\nTotal execution time ",tsec," seconds.")

and

/*  congo.go -- Energy Conserving IRK Scheme */

package main
import ("fmt"; "os"; "time"; "math"; "strconv";
    "unsafe"; . "gonum.org/v1/gonum/mat")

var (s3o6 float64; a [2][2]float64; b,c [2]float64)
func doinit(){
    s3o6=math.Sqrt(3.)/6
    a=[2][2]float64{{1./4, 1./4-s3o6}, {1./4+s3o6, 1./4}}
    b=[2]float64{1./2,1./2}
    c=[2]float64{1./2-s3o6,1./2+s3o6}
}

var tictime float64
func tic() {
    now:=time.Now()
    tictime=float64(now.Unix())+1.0E-9*float64(now.Nanosecond())
}
func toc() float64 {
    now:=time.Now()
    return float64(now.Unix())+1.0E-9*float64(now.Nanosecond())-tictime
}

func sin(t float64) float64 {
    return math.Sin(t)
}

func prvec(yn *VecDense) {
    y:=(*yn).RawVector().Data
    fmt.Printf("[")
    for j:=0;j<len(y);j++ {
        fmt.Printf("%g%s",y[j],
            func() string{if j<len(y)-1 { return ", "
                } else { return "]" }}())
    }
}

func fF(t float64,yv *VecDense) *VecDense {
    var F [3]float64
    y:=(*yv).RawVector().Data
F[0] = sin(t)*y[1]*y[2] - y[0]*y[1]*y[2]
F[1] = (y[0]*y[2])/20 - sin(t)*y[0]*y[2]
F[2] = -(y[0]*y[1])/20 + math.Pow(y[0],2)*y[1]
    return NewVecDense(3,F[:])
}

func fG(t float64,yv *VecDense,h float64,kv *VecDense) *VecDense {
    var G [6]float64
    y:=(*yv).RawVector().Data
    k:=(*kv).RawVector().Data
G[0] = k[0] - sin(t + h*c[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2]) + (h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
G[1] = k[1] - sin(t + h*c[0])*(-(h*(a[0][0]*k[0] + a[0][1]*k[3])) - y[0])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2]) - ((h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2]))/20
G[2] = k[2] + ((h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1]))/20 - math.Pow(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0],2)*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])
G[3] = k[3] - sin(t + h*c[1])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2]) + (h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
G[4] = k[4] - sin(t + h*c[1])*(-(h*(a[1][0]*k[0] + a[1][1]*k[3])) - y[0])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2]) - ((h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2]))/20
G[5] = k[5] + ((h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1]))/20 - math.Pow(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0],2)*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])
    return NewVecDense(6,G[:])
}

func fDG(t float64,yv *VecDense,h float64,kv *VecDense) *Dense {
    var DG [6][6]float64
    y:=(*yv).RawVector().Data
    k:=(*kv).RawVector().Data
DG[0][0] = 1 + h*a[0][0]*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
DG[0][1] = -(sin(t + h*c[0])*h*a[0][0]*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])) + h*a[0][0]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
DG[0][2] = -(sin(t + h*c[0])*h*a[0][0]*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])) + h*a[0][0]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])
DG[0][3] = h*a[0][1]*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
DG[0][4] = -(sin(t + h*c[0])*h*a[0][1]*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])) + h*a[0][1]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
DG[0][5] = -(sin(t + h*c[0])*h*a[0][1]*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])) + h*a[0][1]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])
DG[1][0] = -(h*a[0][0]*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2]))/20 + sin(t + h*c[0])*h*a[0][0]*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
DG[1][1] = 1
DG[1][2] = -(sin(t + h*c[0])*h*a[0][0]*(-(h*(a[0][0]*k[0] + a[0][1]*k[3])) - y[0])) - (h*a[0][0]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0]))/20
DG[1][3] = -(h*a[0][1]*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2]))/20 + sin(t + h*c[0])*h*a[0][1]*(h*(a[0][0]*k[2] + a[0][1]*k[5]) + y[2])
DG[1][4] = 0
DG[1][5] = -(sin(t + h*c[0])*h*a[0][1]*(-(h*(a[0][0]*k[0] + a[0][1]*k[3])) - y[0])) - (h*a[0][1]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0]))/20
DG[2][0] = (h*a[0][0]*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1]))/20 - 2*h*a[0][0]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])
DG[2][1] = (h*a[0][0]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0]))/20 - h*a[0][0]*math.Pow(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0],2)
DG[2][2] = 1
DG[2][3] = (h*a[0][1]*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1]))/20 - 2*h*a[0][1]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0])*(h*(a[0][0]*k[1] + a[0][1]*k[4]) + y[1])
DG[2][4] = (h*a[0][1]*(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0]))/20 - h*a[0][1]*math.Pow(h*(a[0][0]*k[0] + a[0][1]*k[3]) + y[0],2)
DG[2][5] = 0
DG[3][0] = h*a[1][0]*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
DG[3][1] = -(sin(t + h*c[1])*h*a[1][0]*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])) + h*a[1][0]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
DG[3][2] = -(sin(t + h*c[1])*h*a[1][0]*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])) + h*a[1][0]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])
DG[3][3] = 1 + h*a[1][1]*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
DG[3][4] = -(sin(t + h*c[1])*h*a[1][1]*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])) + h*a[1][1]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
DG[3][5] = -(sin(t + h*c[1])*h*a[1][1]*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])) + h*a[1][1]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])
DG[4][0] = -(h*a[1][0]*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2]))/20 + sin(t + h*c[1])*h*a[1][0]*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
DG[4][1] = 0
DG[4][2] = -(sin(t + h*c[1])*h*a[1][0]*(-(h*(a[1][0]*k[0] + a[1][1]*k[3])) - y[0])) - (h*a[1][0]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0]))/20
DG[4][3] = -(h*a[1][1]*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2]))/20 + sin(t + h*c[1])*h*a[1][1]*(h*(a[1][0]*k[2] + a[1][1]*k[5]) + y[2])
DG[4][4] = 1
DG[4][5] = -(sin(t + h*c[1])*h*a[1][1]*(-(h*(a[1][0]*k[0] + a[1][1]*k[3])) - y[0])) - (h*a[1][1]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0]))/20
DG[5][0] = (h*a[1][0]*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1]))/20 - 2*h*a[1][0]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])
DG[5][1] = (h*a[1][0]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0]))/20 - h*a[1][0]*math.Pow(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0],2)
DG[5][2] = 0
DG[5][3] = (h*a[1][1]*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1]))/20 - 2*h*a[1][1]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0])*(h*(a[1][0]*k[1] + a[1][1]*k[4]) + y[1])
DG[5][4] = (h*a[1][1]*(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0]))/20 - h*a[1][1]*math.Pow(h*(a[1][0]*k[0] + a[1][1]*k[3]) + y[0],2)
DG[5][5] = 1
    return NewDense(6,6,(*[36]float64)(unsafe.Pointer(&DG[0][0]))[:])
}

func euler(t float64,y *VecDense,h float64) *VecDense {
    var w VecDense
    w.AddScaledVec(y,h,fF(t,y))
    return &w
}

func rk2(t float64,y *VecDense,h float64) *VecDense {
    k1:=fF(t,y)
    var xi1,xi2 VecDense
    xi1.AddScaledVec(y,h*2/3,k1)
    k2:=fF(t,&xi1)
    xi1.AddScaledVec(k1,3.0,k2)
    xi2.AddScaledVec(y,h/4,&xi1)
    return &xi2
}

func irk(t float64,y *VecDense,h float64) *VecDense {
    k1:=fF(t+h*c[0],rk2(t,y,h*c[0]))
    k2:=fF(t+h*c[1],rk2(t,y,h*c[1]))
    Ki:=NewVecDense(6,nil)
    copy((*Ki).RawVector().Data[0:3],(*k1).RawVector().Data)
    copy((*Ki).RawVector().Data[3:6],(*k2).RawVector().Data)
    lastone:=false
    for i:=0;i<10;i++ {
        Gi:=fG(t,y,h,Ki)
        DGi:=fDG(t,y,h,Ki)
        var DK VecDense
        DK.SolveVec(DGi,Gi)
        Ki.SubVec(Ki,&DK)
        if lastone {
            break
        }
        if Norm(&DK,2)<=Norm(Ki,2)*10e-10 {
            lastone=true
        }
    }
    copy((*k1).RawVector().Data,(*Ki).RawVector().Data[0:3])
    copy((*k2).RawVector().Data,(*Ki).RawVector().Data[3:6])
    var xi1,xi2 VecDense
    xi1.AddScaledVec(k1,b[1]/b[0],k2)
    xi2.AddScaledVec(y,h*b[0],&xi1)
    return &xi2
}

func solve(t0 float64,y0 *VecDense,h float64,n int) *VecDense{
    ycopy:=*y0; yn:=&ycopy
    for j:=0;j<n;j++ {
        tn:=t0+float64(j)*h
        yn=irk(tn,yn,h)
    }
    return yn
}

func main() {
    tic()
    doinit()
    fmt.Printf("congo -- Energy Conserving IRK Scheme\n\n")
    t0:=0.0; T:=100.0
    y0v:=[]float64{1,1,1}
    if len(os.Args)==4 {
        for j:=0;j<3;j++ {
            y0v[j],_=strconv.ParseFloat(os.Args[j+1],64)
        }
    }
    y0:=NewVecDense(3,y0v)
    fmt.Printf("y(%g)=",0.0); prvec(y0); fmt.Printf("\n\n")
    einit:=Dot(y0,y0)
    yold:=NewVecDense(3,nil)
    fmt.Printf("%5s %7s %14s %14s %14s %14s\n",
        "p","n","h","E(100)","dE(100)","error")
    var yn *VecDense
    n:=64
    for p:=6;p<=16;p++ {
        h:=T/float64(n)
        yn=solve(t0,y0,h,n)
        var er VecDense
        er.SubVec(yold,yn)
        ner:=Norm(&er,2)
        if p>6 {
            fmt.Printf("%5d %7d %14.7e %14.7e %14.7e %14.7e\n",
            p,n,h,Dot(yn,yn),Dot(yn,yn)-einit,ner)
        }
        *yold=*yn
        n*=2
    }
    fmt.Printf("\ny(%g)=",T); prvec(yn); fmt.Printf("\n")
    fmt.Printf("\nTotal execution time %g seconds.\n",toc())
    os.Exit(0)
}

I had expected the run times to be similar or even that Julia would be faster than Go. However, Julia is eight times slower.

The above timings were made with 32-bit ARM and Julia 1.6.0. A similar performance differential appears on 64-bit ARM and 64-bit AMD architectures.

Does anyone know what’s going on?

I would recommend a careful read of the performance tips, in particular the section on global variables. s3o6, a, b, and c should be const to avoid type instability. You may also want to use the @inbounds annotation to reduce bounds-checking overhead, or StaticArrays.jl to speed up operations on small (~ <100 elements) arrays.

10 Likes

I suspect you are using static arrays in go but heap allocated arrays in Julia. Try using StaticArrays.jl and you’ll probably speed the Julia version up an order of magnitude

8 Likes

I went ahead and applied my (minimal) suggestions - code is here.

Before:

y(100)=[-0.11882343279003697, 0.20699682299721942, -1.7155271222248394]

Total execution time 25.8274428 seconds.

After:

y(100)=[-0.11882343279003697, 0.20699682299721942, -1.7155271222248394]

Total execution time 0.4131553 seconds.
30 Likes

Just declaring a,b,c and d as global const reduced the time by a factor of 10 in my machine (19s vs 1.9s).

16 Likes

You’ll also want to rethink about using that scheme in practice: that form of iteration on that method is going to be slow. To 1e-10 it’s much better to use something higher order like GitHub - SciML/IRKGaussLegendre.jl: Implicit Runge-Kutta Gauss-Legendre 16th order (Julia)

5 Likes

The present time-stepping scheme is the 4th order Gauss-Legendre. The 1e-10 tolerance is not the desired accuracy of the solution but used for the Newton iterations which solve the implicit scheme to machine precision in order to exactly preserve the energy of the conservative dynamical system being integrated.

The goal is to illustrate how one can obtain an engineering tolerance approximation that still preserves energy without needing to compute a high precision solution. In particular, all approximations in the time resolution study preserve energy to within rounding error even the inaccurate ones.

However, I’m admittedly a beginner with IRK methods.

I have looked through your code for the 8-stage IRK scheme. Since I’m integrating a system of three ODE’s my understanding is that this would lead to 24 variables that need to be implicitly solved to machine precision at each time step. The system isn’t particularly stiff, but the idea is to relax accuracy requirements as much as possible while continuing to exactly preserve energy.

I’ll see what I can do with the code you linked. These mathematical pointers and ideas are greatly appreciated. Thanks!

That’s a simple speedup. Thanks for testing this!

1 Like

It looks like I didn’t fully understand the performance implications of const and other things to the JIT. As I had just watched the conference talk

by Erik Engheim, I may have been lulled into thinking everything would just work out without additional “annotations” which seems not the case.

Again, I appreciate all the helpful comments and friendly community. It looks like I’ll soon be able to know Julia as my friend. Thanks!

2 Likes

I’d definitely suggest reading Performance Tips · The Julia Language if you want to learn more about high-performance Julia. The very first rule is about avoiding (non-constant) global variables, and I think that covers 90% of new user performance issues that I’ve seen. Fortunately, it’s easy to fix and easy to avoid once you know to pay attention.

9 Likes

You are right that the global const is the first thing in performance tips. I think the reason it trips people up so frequently is because that mistake is so simple that they immediately focus on the other performance tips. It’s also astonishing how much of a difference it makes compared to other languages.

Suppose I wanted to run my program checking a parameterized family of different IRK methods for a certain aspect of performance. In this case, I might want to change the global variables a, b and c in an outside loop and then perform the same test for each choice. I think this means a, b and c can no longer be const globals (since I want to keep changing them in the outside loop).

What would be the work around? Do I allocate a, b and c locally and then pass them in as arguments to all the functions? Could I retain performance by using nested function definitions (all inside my main function) or closures?

Yes, that (you may use a struct or a tuple to organize things). You could also use closures, or functors. Here many alternatives were dicussed: Define data-dependent function, various ways

1 Like

You might want to look into alternative ManifoldProjection schemes and such for preserving energy. Doing one Newton iteration each time might still be more than you need. Also, at that level it can be quasi-Newton where you save pre-factorized DGi. Anyways, there’s a ton you can do to the algorithm here so this in general isn’t an efficient way to compute this, and that’s disregarding the fact that the implementation is allocating a lot and such.

1 Like

You can change the values of the contents of a, b, and c to have a family of parameters. It doesn’t matter that they are defined as constants.

In that link you can see:

Note that “constant-ness” does not extend into mutable containers; only the association between a variable and its value is constant. If x is an array or dictionary (for example) you can still modify, add, or remove elements.

The important thing about defining a container (as an array) as a constant is that the compiler can be certain that the type of the variable will not change.

In your case you have global const b=[1/2,1/2]. This makes the compiler aware that the variable b will always be a container of type typeof(b) = Vector{Float64}. However the contents of the container can be changed:

julia> global const b = [1/2,1/2]
2-element Vector{Float64}:
 0.5
 0.5

julia> typeof(b)
Vector{Float64} (alias for Array{Float64, 1})

julia> b[1] = 1/3
0.3333333333333333

julia> b
2-element Vector{Float64}:
 0.3333333333333333
 0.5

So you can indeed change the contents of a and b without problems, as long as you put objects of the same type in there, and you will not lose performance.

However, because b will always be a container of objects of type Float64 (because you defined it as a constant), you cannot put a String into it:

julia> b[2] = "tata"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...
Stacktrace:
 [1] setindex!(A::Vector{Float64}, x::String, i1::Int64)
   @ Base ./array.jl:839
 [2] top-level scope
   @ REPL[32]:1

This constant-ness of the type of the object is what allows the compiler to produce efficient code.

4 Likes

Woohoo! Being able to change a, b and c even though declared as global constants means I can have my cake and eat it too. Thanks, this just saved me from all of the workarounds I was considering!

1 Like

I think you are right about the method and how it could be improved. This code was the result of a 1-hour lecture demonstration for a numerical analysis course (including the crazy CAS stuff in Mathematica to create the functions fG and fDG).

Since this is only a 6x6 matrix, it’s not clear to me whether switching to a quasi-Newton’s method would be more efficient, especially as the goal is to solve for the k1 and k2 in the IRK step to machine precision in order to preserve energy.

Do you have a good reference for manifold-projection schemes? My impression is that methods like IRK Gauss-Legendre also work well when a dissipative perturbation is added to the dynamics. Again thanks for the ideas.

My main concern was the programming language, but I’ve learned much more!

Damn right!

Most of the implicit IVP codes I know about compute and factor Jacobians only when the number of Newton iterations per time step exceeds 3. I think the number 3 came from Gear’s code decades ago. You adjust time step and order for stability and accuracy and fiddle with the nonlinear solver to avoid Jacobian work.

My recollection is that I need about 3 to 4 Newton iterations to solve k1 and k2 to machine precision. If I don’t, then energy is not conserved to machine precision, which was the point in the first place.

In particular, the tradeoff in the precision of the computation of k1 and k2 relates not just to the precision of the resulting approximation y, but also to the precision with which energy is conserved. From what I understand, by using a conservative IRK scheme, these two aspects of the approximate solution can be decoupled.

Just be careful because while that can be efficient, it makes the code harder to debug and less portable, since the behaviour of your functions will depend on the global variables. Parameters, closures, etc, are safer in that way.

5 Likes

Could it be that your time steps are too long and/or your predictor is having trouble? Four Newtons/time step is not bad if there is a reason.

In many, but not all, cases one would reduce the time step if the nonlinear solver was taking too many steps.