Clean Numerical Simulation (Taylor Series Method)

Hi all, very new to Julia, and not in academia BTW. I have been investigating the opinions and findings in this paper, and I thought I would contribute some Julia code which performs the aforementioned simulation to potentially ludicrous levels of accuracy, in fewer than 60 lines:

#!/usr/bin/env julia

using Printf

function main()
    setprecision(trunc(Int, parse(Int, ARGS[1]) * log(10.0) / log(2.0)))
    order = parse(Int, ARGS[2])
    h = parse(BigFloat, ARGS[3])
    steps = parse(Int, ARGS[4])
    x = parse(BigFloat, ARGS[5])
    y = parse(BigFloat, ARGS[6])
    z = parse(BigFloat, ARGS[7])
    Ļƒ = parse(BigFloat, ARGS[8])
    Ļ = parse(BigFloat, ARGS[9])
    Ī² = parse(BigFloat, ARGS[10]) / parse(BigFloat, ARGS[11])
    D0 = parse(BigFloat, "0.0")
    cx = [D0 for _ = 1 : order + 1]
    cy = [D0 for _ = 1 : order + 1]
    cz = [D0 for _ = 1 : order + 1]
    wx = [D0 for _ = 1 : order]
    wy = [D0 for _ = 1 : order]
    wz = [D0 for _ = 1 : order]
    for step = 1 : steps
        @printf "%0.9e %0.9e %0.9e %0.5e\n" x y z (step - 1) * h
        cx[1] = x
        cy[1] = y
        cz[1] = z
        for k = 1 : order
            wx[k] = cx[k]
            wy[k] = cy[k]
            wz[k] = cz[k]
            cx[k+1] = Ļƒ * (wy[k] - wx[k]) / k
            wxz = D0
            for j = 1 : k
                wxz += wx[j] * wz[k-j+1]
            end
            cy[k+1] = (Ļ * wx[k] - wxz - wy[k]) / k
            wxy = D0
            for j = 1 : k
                wxy += wx[j] * wy[k-j+1]
            end
            cz[k+1] = (wxy - Ī² * wz[k]) / k
        end
        x = cx[order+1]
        y = cy[order+1]
        z = cz[order+1]
        for i = order : -1 : 1
            x = x * h + cx[i]
            y = y * h + cy[i]
            z = z * h + cz[i]
        end
    end
end

main()

Can anyone suggest any speedups? Would this be any use as a benchmark? BTW here is an invocation:

time -p ./tsm-lorenz.jl 160 100 .01 30001 -15.8 -17.48 35.64 10 28 8 3 >/tmp/data-Julia

[code edited to eliminate four useless lines!]

1 Like

This is very nice, compact code.
In https://github.com/JuliaDiff/TaylorSeries.jl and especially GitHub - PerezHz/TaylorIntegration.jl: ODE integration using Taylor's method, and more, in Julia, this has been generalised to use the Taylor method to solve ā€œarbitraryā€ ODEs.

2 Likes

Thank you, very kind! It is amazing to me how little code this method requires, I suppose because the integrator and the model are so tightly, er, integrated :wink:

I have been looking into the packages you mentioned to an extent, but had problems getting everything installed under the 0.7s . . . anyway I was already getting involved in the inner workings of the method so I just kept going with my own stuff. I now have a reasonably useful set of functions at my disposal that I implemented myself.

I mentioned benchmarking, since my code has no dependencies apart from Printf, and does not use accelerated maths hardware. I have observed that this code runs in around 150% of the time for an equivalent c implementation, so if I can use any trickery to close the gap, that would be interesting!

Incidentally, under the same code conditions, Python 3 (using Decimal) is 3 to 4 times slower than Julia.

What do you mean by an equivalent C formulation? Using MPFR directly?

I believe that Juliaā€™s current implementation of BigFloat requires creating new objects in each calculation, which you can avoid to some extent with optimised MPFR calls. This could explain some of the performance gap.

Yes. I ported the code to c (which I am not familiar with either) as I also wanted to see if the c code (which uses code generation) linked from this paper was efficient. In the event mine and theirs are indistinguishable speed-wise, so both are probably nearly optimum for MPFR.

And I canā€™t believe I didnā€™t combine those last three loops in the Julia code - now 55 lines!

Anyway, if you are curious, here is the c version, which I compiled with:

gcc -Wall -o tsm-lorenz tsm-lorenz.c -lm -lmpfr -s

(I havenā€™t experimented with optimization yet)

#include<math.h>
#include<stdio.h>
#include<stdlib.h>
#include<mpfr.h>

mpfr_t *jet (int size) {
    mpfr_t *j = (mpfr_t *)malloc(sizeof(mpfr_t) * size);
    for (int i = 0; i < size; i++) {
        mpfr_init(j[i]);
    }
    return j;
}

int main(int argc, char **argv) {
    const int BASE = 10;
    long order, nsteps;
    mpfr_t t, x, y, z, s, r, b, h, t1, t2, t3, *w1, *w2, *w3, *cx, *cy, *cz;
    /* initialize from command arguments */
    mpfr_set_default_prec(strtod(argv[1], NULL) * log(10.0) / log(2.0));
    order = strtol(argv[2], NULL, BASE);
    mpfr_init_set_str(h, argv[3], BASE, GMP_RNDN);
    nsteps = strtol(argv[4], NULL, BASE);
    mpfr_init(t1);
    mpfr_init(t2);
    mpfr_init(t3);
    mpfr_init_set_str(x, argv[5], BASE, GMP_RNDN);
    mpfr_init_set_str(y, argv[6], BASE, GMP_RNDN);
    mpfr_init_set_str(z, argv[7], BASE, GMP_RNDN);
    mpfr_init_set_str(t, "0.0", BASE, GMP_RNDN);
    mpfr_init_set_str(s, argv[8], BASE, GMP_RNDN);
    mpfr_init_set_str(r, argv[9], BASE, GMP_RNDN);
    mpfr_init_set_str(b, argv[10], BASE, GMP_RNDN);
    mpfr_init_set_str(t1, argv[11], BASE, GMP_RNDN);
    mpfr_div(b, b, t1, GMP_RNDN);
    /* initialize the derivative and temporary jets */
    cx = jet(order + 1);
    cy = jet(order + 1);
    cz = jet(order + 1);
    w1 = jet(order);
    w2 = jet(order);
    w3 = jet(order);
    /* main loop */
    for (long step = 1; step < nsteps + 1; step++) {
        /* print a line of output */
        mpfr_out_str(stdout, BASE, 10, x, GMP_RNDN);
        printf(" ");
        mpfr_out_str(stdout, BASE, 10, y, GMP_RNDN);
        printf(" ");
        mpfr_out_str(stdout, BASE, 10, z, GMP_RNDN);
        printf(" ");
        mpfr_out_str(stdout, BASE, 6, t, GMP_RNDN);
        fprintf(stdout, "\n");
        /* compute the taylor coefficients */
        mpfr_set(cx[0], x, GMP_RNDN);
        mpfr_set(cy[0], y, GMP_RNDN);
        mpfr_set(cz[0], z, GMP_RNDN);
        for (int k = 0; k < order; k++) {
            mpfr_set(w1[k], cx[k], GMP_RNDN);
            mpfr_set(w2[k], cy[k], GMP_RNDN);
            mpfr_set(w3[k], cz[k], GMP_RNDN);
            /* x dot */
            mpfr_sub(t3, w2[k], w1[k], GMP_RNDN);
            mpfr_mul(t3, t3, s, GMP_RNDN);
            mpfr_div_si(cx[k + 1], t3, k + 1, GMP_RNDN);
            /* y dot */
            mpfr_mul(t3, r, w1[k], GMP_RNDN);
            mpfr_set_zero(t2, 1);
            for (int j = 0; j < (k + 1); j++) {
                mpfr_mul(t1, w1[k-j], w3[j], GMP_RNDN);
                mpfr_add(t2, t1, t2, GMP_RNDN);
            }
            mpfr_sub(t3, t3, t2, GMP_RNDN);
            mpfr_sub(t3, t3, w2[k], GMP_RNDN);
            mpfr_div_si(cy[k + 1], t3, k + 1, GMP_RNDN);
            /* z dot */
            mpfr_set_zero(t2, 1);
            for (int j = 0; j < (k + 1); j++) {
                mpfr_mul(t1, w1[k-j], w2[j], GMP_RNDN);
                mpfr_add(t2, t1, t2, GMP_RNDN);
            }
            mpfr_mul(t3, b, w3[k], GMP_RNDN);
            mpfr_sub(t3, t2, t3, GMP_RNDN);
            mpfr_div_si(cz[k + 1], t3, k + 1, GMP_RNDN);
        }
        /* sum the series using Horner's method and advance one step */
        mpfr_set(x, cx[order], GMP_RNDN);
        mpfr_set(y, cy[order], GMP_RNDN);
        mpfr_set(z, cz[order], GMP_RNDN);
        for (int j = order - 1; j >= 0; j--) {
            mpfr_mul(x, x, h, GMP_RNDN);
            mpfr_add(x, x, cx[j], GMP_RNDN);
            mpfr_mul(y, y, h, GMP_RNDN);
            mpfr_add(y, y, cy[j], GMP_RNDN);
            mpfr_mul(z, z, h, GMP_RNDN);
            mpfr_add(z, z, cz[j], GMP_RNDN);
        }
        mpfr_mul_si(t, h, step, GMP_RNDN);
    }
    exit(0);
}

And that is why we Julia instead of Cā€¦

3 Likes

Harsh! Well I use the fastest one :wink:

The sequence of calls in the c version is the quickest, by experiment. I wonder if Julia uses mpfr_fma() under the covers, that is a very slow call and could account for much of the speed difference.

But after your comment I couldnā€™t help responding with some c propoganda :wink:

$ ls -lAh tsm-lorenz
-rwxrwxr-x 1 ian ian 11K Aug  6 22:30 tsm-lorenz

$ ldd tsm-lorenz
	linux-vdso.so.1 =>  (0x00007ffda833a000)
	libmpfr.so.4 => /usr/lib/x86_64-linux-gnu/libmpfr.so.4 (0x00007f9d5bb46000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9d5b77c000)
	libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f9d5b4fc000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f9d5bdab000)

and from top:

  PID  PPID USER     TTY       PR  NI    RES %MEM S  %CPU P   TIME COMMAND                                                                                                                                                                                                     

20295  7497 ian      pts/9     20   0   0.8m  0.0 R  99.7 3   0:05                                  `- ./tsm-lorenz 160 100 .01 30001 -15.8 -17.48 35.64 10 28 8 3                                                                                                             

20297  7497 ian      pts/9     20   0 211.2m  2.7 R 100.0 2   0:04                                  `- ./julia ./tsm-lorenz.jl 160 100 .01 30001 -15.8 -17.48 35.64 10 28 8 3

I want to restate that this is probably the reason for the major performance discrepancy. Each MPFR number is allocated on the heap (which takes time) and in your C code you reuse these numbers, i.e., your numbers are mutable, whereas in Julia numbers are by default immutable and thus each operation generates a new MPFR number (which has to be allocated again). A fair comparison would be to also use the imperative MPFR API in your Julia code. So far this has nothing to do with C vs. Julia but rather how to use the MPFR library most efficiently.

2 Likes

This thread is not about c vs Julia per se, I would really like to user Julia for running this stuff!

So, is it realistic to cache used MPFR instances within Julia and re-use them? That is certainly the strategy recommended by the MPFR docs.

BTW searched the internet for ā€œimperative MPFR APIā€ and nothing came up, what is it?

OK found the source. I suspect those calls would make the Julia code look worse than the c I posted :wink:

The immutability condition would seem to rule out any improvement I guess.

Is this an issue that devs would care enough about to address?