Julia functions clearer than C++/Fortran (examples sought)

The micro benchmarks are pretty good examples of this, especially matrix operations.

4 Likes

It probably has more to do with the efficiency of the random number generator than with the actual code.
If you used an equally efficient RNG in the C++ code, it would be at least as fast as Julia (given a decent compiler).

Julia should shine in cases where it is easier to give the compiler optimization hints about complex problems than it is in other languages (Fortran / C / C++). In code this simple there shouldn’t be any advantage for a low-level language over Julia, or vice versa.

3 Likes

My only gripe with comparing C/C++'s efficacy/syntax to Julia’s for matrix operations is the following: yes, calling cblas from C is rough, but that comparison is also a little contrived in my opinion. For example, you can pretty easily download Armadillo, Eigen, etc., and get simple access to nice linear algebra and statistical routines with a single include. Performance-wise, I’ve found Julia to be mostly similar to Armadillo in C++, and the syntax isn’t much worse.

I think it’s just really hard to compare C/C++ against Julia in a way that makes Julia look “exceptionally” favorable. The exceptions are large chunks of code and the development experience. Not thinking about pointers or types and still getting a working prototype in half an hour is awesome. Spending half a day tuning and getting performance gains “good enough” for production. Revisiting code and not swimming through hundreds of classes to understand it, etc. It’s also an amazing stepping stone for Python/R programmers to learn more computer science (my experience). But it’s hard to quantify that via an example.

I originally submitted something really simple to try to show this. But, I had to delete it when someone showed me how C++11 and beyond really make C++ syntax flexible and readable. The example was just as elegant with basically no imports…

C++ also shines for interop with Windows’ “hosted” languages, as well as for gluing things together at the system level. E.g., gluing Julia to C# is rough, while gluing C++ to C# is pretty darn easy. (On Linux I can’t comment.) Not that you can’t do this with Julia, but right now it’s more work.

Again, I think Julia has advantages over C/C++ in my daily workflow. But I think it’ll be hard to give really solid small examples where Julia is truly better than modern, well-written C/C++. The situation is more obvious with large code bases, but hard to quantify.

5 Likes

Good point. I think this kind of analysis is valid in general. It is hard to compare two pieces of code written in two different languages if both are clear, elegant, and take advantage of the best features of each language. It is hard to compare performance if both are written and compiled following each language’s best practices, and it is hard to make a point with that kind of comparison. What really differs is the workflow leading to the results and the possibilities each language offers.

For example, if one knows every compiler flag in advance, knows the latest syntax features, and has already installed all the libraries, there is no clear reason why Fortran would be inconvenient. One may or may not like the explicit declaration of every variable, but even that has its upsides. However, if I have a fresh computer and have to write anything using a linear algebra routine or a data frame, or parallelize something from scratch, it is much, much more practical to do that in Julia than in Fortran. If that comes without a performance loss and with a clear syntax, so much the better. And being able to write and test code dynamically is better still.
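
To make the “fresh computer” point concrete, here is a minimal sketch of that workflow (DataFrames.jl is the usual registered package; LinearAlgebra and Threads ship with Julia itself):

using Pkg
Pkg.add("DataFrames")   # one-time setup; the package manager handles everything

using LinearAlgebra, DataFrames

A = rand(100, 100)
vals = eigvals(Symmetric(A + A'))   # dense linear algebra, no build system involved
df = sort(DataFrame(eigenvalue = vals), :eigenvalue)
first(df, 5)                        # the five smallest eigenvalues

# Parallelizing something is a macro away (start Julia with `julia -t auto`):
using Base.Threads: @threads, nthreads, threadid
acc = zeros(nthreads())
@threads for i in 1:10^6
    acc[threadid()] += 1 / i^2
end
sum(acc)   # ≈ π^2 / 6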

6 Likes

What you can probably demonstrate:

  • Nice Base functionality for many routine operations that you would have to implement manually in C++ (like splitting strings and so on), plus a lot of high-level functionality from plenty of packages (tables, plotting systems, etc.).

  • Nice types like named tuples and fixed-size stack-allocated arrays (in addition to the default dynamic arrays) whose size is not bound to a compile-time constant, and so on.

  • A rich and consistent type system. Take AbstractArrays, for example: AFAIK there is no abstract array interface in C++ that is used consistently across libraries, so that containers from different libraries can be composed with no wrappers (see the second sketch after this list).

  • No barrier between compile time and run time, so you can write @generated functions and macros (I won’t even compare them to the C++ preprocessor, which is much more limited).

  • Introspection / homoiconicity: for example, you can simply get type information, convert strings to symbols, iterate over the fields of a structure, and write generic serialization routines that work for all types (see the first sketch below).
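
The first and last points fit in a few REPL lines. A small sketch (the Point struct and the to_namedtuple helper are made-up names, purely for illustration):

split("a,b,c", ",")        # Base string handling: ["a", "b", "c"]
nt = (x = 1.0, y = 2.0)    # a named tuple
Symbol("x")                # converting a string to a symbol

struct Point               # a made-up example type
    x::Float64
    y::Float64
end

fieldnames(Point)          # introspection: (:x, :y)

# A generic "serializer" for any struct, built from reflection alone:
to_namedtuple(obj::T) where {T} =
    NamedTuple{fieldnames(T)}(ntuple(i -> getfield(obj, i), fieldcount(T)))

to_namedtuple(Point(1.0, 2.0))   # (x = 1.0, y = 2.0)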
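
And a sketch of the AbstractArray point: containers from unrelated libraries compose generically, with no wrapper types (SparseArrays is a standard library):

using LinearAlgebra, SparseArrays

A = sprand(100, 100, 0.1)    # a sparse matrix
v = view(rand(200), 1:100)   # a view is also an AbstractArray
A * v                        # generic matrix-vector product, no wrappers needed
dot(v, v)                    # generic linear algebra works on any AbstractArray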

4 Likes

Sairus7:
A rich and consistent type system. Take AbstractArrays, for example: AFAIK there is no abstract array interface in C++ that is used consistently across libraries, so that containers from different libraries can be composed with no wrappers.

I thought the same from having some experience with C++03, but C++11’s template system is pretty darn solid. Maybe there’s stuff it can’t handle, but from what I saw it’s pretty flexible. My point isn’t that such an interface does exist, but that someone could pop in and go “well, if you do this, this, and this, now it does exist.”

Are arrays stack-allocated?! I think the default ones aren’t, and probably not even when initialized with a size taken from a constant variable (the array itself is still dynamic).

That said, what I think you have in mind is StaticArrays.jl (and similar packages). I believe one of its selling points is that its arrays get stack-allocated, which is how it gets its speed (for non-large arrays), though the syntax to set them up is a bit heavier than in C.

[I’m not sure whether (named) tuples are stack-allocated.]
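
For reference, a minimal StaticArrays.jl sketch, assuming stack allocation is indeed the point at issue (BenchmarkTools.jl is only there to make the zero-allocation behavior visible):

using StaticArrays, BenchmarkTools

v = SVector(1.0, 2.0, 3.0)   # immutable, fixed-size; small SVectors live on the stack
f(x) = x .* 2 .+ x           # broadcasting returns another SVector
@btime f($v)                 # typically a few ns, 0 allocations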

You mean with DotNET.jl?

Yes, calling from C seems simple on the C side, though you need to decorate things on the C# side.

I didn’t look into whether there’s a simpler way for C or C++; is it for sure simpler than DotNET.jl?

Is DotNET.jl stable? I never saw it back when that issue was plaguing my life. Or maybe I did, ran it, and found it didn’t work; hard to say.

Yes, arrays are dynamically sized by default, but in Julia you can also have stack-allocated arrays of arbitrary fixed length. In C++ you cannot do that.

1 Like

Of course C++ will be slower; Mersenne Twister is big and slow. By using one of the simpler RNGs from the <random> header, like minstd_rand, I got C++ approximately 2x faster than the Julia code on my machine. Before the change, C++ was about 10x slower than Julia on my machine.

Julia’s default RNG is Mersenne Twister, which is presumably the reason for that comparison. We have faster RNGs as well, though.
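
For example, on Julia 1.7 and later, where the Random standard library also provides Xoshiro (timings are machine-dependent, so none are claimed here):

using Random

# Xoshiro256++ is available from Julia 1.7, where it became the default RNG:
mt = MersenneTwister(1)
xo = Xoshiro(1)

x = Vector{Float64}(undef, 10^7)
@time rand!(mt, x);   # bulk fill with Mersenne Twister (dSFMT)
@time rand!(xo, x);   # bulk fill with Xoshiro; scalar rand(xo) gains even more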

2 Likes

Thank you for the info. Indeed, Julia uses it through this C library:

http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/#dSFMT

An example of a faster RNG in Julia would be VectorizedRNG.jl:

using VectorizedRNG
calc_pi(M) = 4 * count(rand()^2 + rand()^2 < 1 for i=1:M) / M
calc_pi_vec(M) = 4 * count(rand(local_rng())^2 + rand(local_rng())^2 < 1 for i=1:M) / M

@time calc_pi(10^9)       # 11.632509 seconds
@time calc_pi_vec(10^9)   #  6.440428 seconds
2 Likes

It’s worth remembering that what is slow is mostly accessing the default global RNG (this should change in Julia 1.7); MersenneTwister itself is not that slow. Your calc_pi example above takes 16.633077 seconds for me, but only 3.798340 seconds when I use an explicit RNG:

calc_pi(M) = (rng=MersenneTwister(); 4 * count(rand(rng)^2 + rand(rng)^2 < 1 for i=1:M) / M)
4 Likes

Sorry if this is getting off-topic.

The fastest implementation I could get is:

using Random

function calc_pi(M)
    rng = MersenneTwister()
    batch_length = 2^12
    n_batches = fld(M, batch_length)
    rest = rem(M, batch_length)
    # Reusable buffer: filling it in bulk hits the RNG's fast array path.
    batch = Vector{Float64}(undef, 2 * batch_length)
    hits = 0
    for i in 1:n_batches
        rand!(rng, batch)
        @inbounds for j in 1:batch_length
            r1 = batch[2j - 1]
            r2 = batch[2j]
            hits += r1 * r1 + r2 * r2 < 1
        end
    end

    # Handle the remaining M % batch_length samples.
    if rest > 0
        resize!(batch, 2 * rest)
        rand!(rng, batch)
        for j in 1:rest
            r1 = batch[2j - 1]
            r2 = batch[2j]
            hits += r1 * r1 + r2 * r2 < 1
        end
    end

    return 4 * hits / M
end

@time println(calc_pi(10^9))

This takes only 2 s on my laptop, but it is no longer any clearer than the C++ version.

1 Like

Maybe not allowed by the rules of the game but…

calc_pi(M) = (rng=MersenneTwister(); 4 * count(rand(rng)^2 + rand(rng)^2 < 1 for i=1:M) / M)

function calc_pi_parallel(M)
    nt = Threads.nthreads()
    pis = zeros(nt)
    Threads.@threads for i in 1:nt
        pis[Threads.threadid()] += calc_pi(ceil(Int, M / nt))
    end
    return sum(pis) / nt  # average value
end

Your version on my machine:

2.181 s (13 allocations: 83.64 KiB)

Parallel version above:

800.606 ms (98 allocations: 119.89 KiB)

Very readable as well.

It’s not pretty, but you can use alloca() in C/C++ to get a stack-allocated array whose size is chosen at runtime:

melis@juggle 12:55:~$ cat talloc.cc 
#include <alloca.h>
#include <stdlib.h>
#include <cstdio>

int sum(int *arr, int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
        res += arr[i];
    return res;
}

int main(int argc, char *argv[])
{
    int N = atoi(argv[1]);

    // Allocates on the caller's stack
    int *arr = (int*) alloca( sizeof(int) * N );

    for (int i = 0; i < N; i++) arr[i] = i;
    printf("%d\n", sum(arr, N));
}

melis@juggle 12:54:~$ ./talloc 10
45
1 Like

local_rng() apparently doesn’t like generators.

julia> function calc_pi_vec(M, rng)
           s = 0;
           for m ∈ 1:M
               s += (rand(rng)^2 + rand(rng)^2) < 1
           end
           4s/M
       end
calc_pi_vec (generic function with 2 methods)

julia> @time calc_pi_vec(10^9, Random.default_rng())
  2.535990 seconds
3.141568076

julia> @time calc_pi_vec(10^9, local_rng())
  2.104100 seconds
3.141531412

julia> calc_pi_vec_gen(M, rng) = 4 * count(rand(rng)^2 + rand(rng)^2 < 1 for i=1:M) / M
calc_pi_vec_gen (generic function with 2 methods)

julia> @time calc_pi_vec_gen(10^9, Random.default_rng())
  2.534998 seconds
3.141514884

julia> @time calc_pi_vec_gen(10^9, local_rng())
  4.033642 seconds
3.14165232

I’m starting to try to use llvmcall less often.
Using llvmcall means you basically have to mark everything @inline, or it won’t inline.

Still, I’m surprised how close these are in terms of performance. When evaluated on a vector:

julia> using VectorizedRNG

julia> x = Vector{Float64}(undef, 1024);

julia> @benchmark rand!($(local_rng()), $x)
samples: 10000; evals/sample: 728; memory estimate: 0 bytes; allocs estimate: 0
ns

 (173.72 - 174.21]  ██████████████████████████████ 8719
 (174.21 - 174.7 ]   0
 (174.7  - 175.19]  ▏13
 (175.19 - 175.68]  ████1135
 (175.68 - 176.17]  ▍90
 (176.17 - 176.66]  ▏14
 (176.66 - 177.15]  ▏3
 (177.15 - 177.64]  ▏1
 (177.64 - 178.14]  ▏2
 (178.14 - 178.63]  ▏2
 (178.63 - 179.12]  ▏1
 (179.12 - 179.61]  ▏3
 (179.61 - 180.1 ]  ▏3
 (180.1  - 180.59]  ▏4
 (180.59 - 218.79]  ▏10

                  Counts

min: 173.718 ns (0.00% GC); mean: 174.014 ns (0.00% GC); median: 173.783 ns (0.00% GC); max: 218.786 ns (0.00% GC).

julia> @benchmark rand!($(Random.default_rng()), $x)
samples: 10000; evals/sample: 178; memory estimate: 0 bytes; allocs estimate: 0
ns

 (632.1 - 637.6]  ▏1
 (637.6 - 643.0]   0
 (643.0 - 648.4]  ██████████████████████████████ 5793
 (648.4 - 653.8]  ███████████████████3645
 (653.8 - 659.3]  ██▊523
 (659.3 - 664.7]  ▏13
 (664.7 - 670.1]  ▏1
 (670.1 - 675.6]  ▏5
 (675.6 - 681.0]  ▏2
 (681.0 - 686.4]  ▏3
 (686.4 - 691.8]  ▏2
 (691.8 - 697.3]  ▏1
 (697.3 - 702.7]   0
 (702.7 - 708.1]  ▏1
 (708.1 - 781.1]  ▏10

                  Counts

min: 632.140 ns (0.00% GC); mean: 648.378 ns (0.00% GC); median: 647.208 ns (0.00% GC); max: 781.112 ns (0.00% GC).

It’d be good to add more API around VectorizedRNG’s vector mode evaluation.

Eventually, I’d like to add LoopVectorization support, where rand turns into calls using VectorizedRNG.

I’d really like an alloca in Julia.

2 Likes