Improving ccall speed for many calls

Hi!

I am calling a shared library (a model written in Fortran) using ccall, and I need to call the model many thousands of times. The code is quite fast, but I am wondering if there is a way to speed it up, since the current setup is rather slow for some applications. Is the shared library somehow loaded every time I use ccall? If so, does that take extra time and require memory allocations that could be avoided, perhaps with some sort of preloading?

If it helps to answer my question, you can find the code of the shared library here. The Fortran subroutine I am calling is in FSM.f90. This file contains the Julia code where I use ccall. And of course, you can find my whole Julia package here.

I hope to convince my colleagues that Julia is the way to go, but then I need to improve the speed of this code a bit, since it is currently not much faster than our existing Matlab setup. Any help greatly appreciated :slight_smile:
Jan

No.

If the code is mostly calling C code, I wouldn’t expect it to be faster than the Matlab code, assuming both use the same C library. It’ll be much easier to write (from scratch), though.

Thanks for the answer.

For the Matlab setup, I am passing data by writing and reading binary files, since it is more difficult to call shared libraries from Matlab than from Julia. However, in Matlab I run the model for more time steps and grid points at once. Thus, in Matlab I am passing much larger chunks of data back and forth per call than in Julia, where I run the model for one time step and one grid point in each call. The Fortran code is fairly big and contains quite a large number of constants etc. that need to be initialized. Does it take time to initialize the shared library each time it is called? Would you expect the Julia version to get faster if I passed larger chunks of data at once and reduced the number of calls instead?

No, calling into the shared library does not incur any initialization time. ccall has negligible overhead and can thus even be used for scalar operations.

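For illustration, a minimal scalar ccall might look as follows (this sketch assumes a Unix-like system where the C math library is available under the name "libm"):

# hypot from the C math library; the library is dlopen'ed once and the
# symbol cached on first use, so later calls compile down to a plain
# native function call.
r = ccall((:hypot, "libm"), Float64, (Float64, Float64), 3.0, 4.0)  # returns 5.0
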
Why do you pass different chunk sizes from Matlab and Julia?

I’m not sure it’ll really matter though, given that ccall’s overhead is so low.

I agree with:

I would suspect that with a large enough batch size, MATLAB is likely close to “optimal”. I took a look at your code and noticed that it’s pretty much all calling out to Fortran. I wouldn’t be surprised if the majority of the time is spent in that Fortran code, and in that case both the MATLAB code and the Julia code should run in about the same time.

But yeah, writing a ccall is much easier than writing some MEX glue :slight_smile:.

Julia’s speed vs. MATLAB really shines when the performance-critical code is written in the high-level language itself. If you translate that Fortran code to Julia, I would suspect you’d see a <2x difference in runtime, and you’d get things like compatibility with arbitrary-precision numbers almost for free. I don’t think either of those would be true for MATLAB.

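As a toy illustration of that last point (the function name here is made up), one generic Julia definition runs unchanged at arbitrary precision:

scale(x, a) = a .* x          # one generic definition
scale(rand(3), 2.0)           # Vector{Float64}
scale(big.(rand(3)), big(2))  # Vector{BigFloat}, same code, no changes
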
I tested this overhead in CxxWrap by dividing an array of 50M Float64 numbers by 2 in a loop. Comparing pure Julia with a version where a function is ccalled for each number, I see a factor-of-2 slowdown, which is good considering how little work this function does:

Timings:
Pure Julia test:
  0.061723 seconds (4 allocations: 160 bytes)
ccall test:
  0.092434 seconds (4 allocations: 160 bytes)

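For reference, here is a rough Julia sketch of this kind of per-element comparison. The actual benchmark lives in CxxWrap and calls a C++ function; as a stand-in, this sketch ccalls ldexp from the C math library, since ldexp(x, -1) equals x / 2 (the library name "libm" assumes a Unix-like system):

function half_julia!(out, x)
    @inbounds for i in eachindex(x)
        out[i] = x[i] / 2
    end
end

function half_ccall!(out, x)
    @inbounds for i in eachindex(x)
        out[i] = ccall((:ldexp, "libm"), Float64, (Float64, Cint), x[i], -1)
    end
end

x = rand(50_000_000); out = similar(x)
half_julia!(out, x); half_ccall!(out, x)  # warm up to exclude compilation
@time half_julia!(out, x)
@time half_ccall!(out, x)
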
So in most cases the overhead should only be an issue if the called function is almost trivial, and in that case it is usually easy to rewrite it in Julia.

The benefit of inlining is, AFAIU, often not so much avoiding the overhead of the function call itself as the extra optimization opportunities it gives the compiler.

Thanks for all replies, and sorry for my rather late answer. Too busy.

I ran some tests calling different Fortran subroutines that multiply a number or a vector by 2. You can find details about those tests here.

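For context, the Julia side of such a test might look roughly like this. This is a sketch, not the exact code from the package: it assumes gfortran’s default name mangling (the subroutine array_routine becomes the symbol array_routine_), an argument order of input array, output array, length, a 64-bit integer for the length, and that the library can be found as array_routine.so; Fortran passes all arguments by reference:

function multiply_by_two!(out::Vector{Float64}, x::Vector{Float64})
    n = Ref{Int64}(length(x))
    # Fortran expects references, hence the Ptr/Ref argument types
    ccall((:array_routine_, "array_routine.so"), Cvoid,
          (Ptr{Float64}, Ptr{Float64}, Ref{Int64}),
          x, out, n)
    return out
end
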
The overhead for using ccall and Fortran is approximately a factor of 2 compared to a pure Julia function. So I am convinced that the overhead of ccall is small, as you all pointed out above. One big plus for Julia :slight_smile:.

To tknopp: You asked “Why do you pass different chunk sizes from Matlab and Julia?” I did this because it takes less time to write a given amount of data to one file than to split the same amount across many files. In Julia, I do not pass data between Fortran and Julia using files, so there it does not matter.

Thanks again for all replies. Great community.
Jan

Did you run the functions twice? The first time will include compilation time. I would expect multiplying a number by 2 to be almost instant after the first call.

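A minimal illustration of that effect:

f(x) = 2x
@time f(1.0)  # first call: includes JIT compilation time
@time f(1.0)  # second call: essentially instant
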
Yes, I ran the functions twice. And I did not multiply only a single number by 2; I multiplied a vector with 1,000,000 entries, which is why the results are not instantaneous, I think. I guess you already noticed the link to the Julia package with my tests, including the README, in my last post, but if not, you can find them here:

https://github.com/jmgnve/ccall_test.jl

You should add -march=native to the gfortran compile flags in your deps/build.jl. This allows gfortran to emit SIMD-vectorized code that fits your CPU architecture (-O3 does not imply any CPU architecture settings); otherwise it defaults to slower scalar code. Most likely the speed difference then goes away, since Julia by default instructs LLVM to use the same ‘native’ setting when optimization is enabled (as it is by default).

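In deps/build.jl this could be as simple as adding the flag to the compile command; a hypothetical sketch (the actual file names and flags in the package may differ):

# build the shared library with architecture-specific SIMD instructions enabled
run(`gfortran -O3 -march=native -shared -fPIC array_routine.f90 -o array_routine.so`)
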
To see the difference, you can use objdump -d array_routine.so to show the disassembled code, which is essentially the same as what you would see from @code_native in Julia.

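For the Julia side of the comparison, any function will do; this one is just a made-up stand-in mirroring the Fortran routine:

double_it!(out, x) = (out .= 2 .* x)  # stand-in for the pure Julia version
@code_native double_it!(zeros(10), rand(10))
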
Also, -O3 can lead to suboptimal code when the optimizer doesn’t have enough information about how functions are actually used and, as a result, makes too aggressive assumptions. This is in particular a problem for shared libraries that iterate over arrays, so sometimes -O2 gives faster code. Be sure to check both settings.

Addendum:
To clarify, the above is with regard to the comment ‘The array function is much slower, perhaps due to memory allocation in the fortran subroutine.’
So, here’s the disassembly of the Fortran code using -O2 (left) and -O2 -march=native (right):

0000000000000610 <array_routine_>:                              0000000000000610 <array_routine_>:
 610:   48 8b 0a                mov    (%rdx),%rcx               610:   48 8b 0a                mov    (%rdx),%rcx
 613:   b8 01 00 00 00          mov    $0x1,%eax                 613:   b8 01 00 00 00          mov    $0x1,%eax
 618:   48 85 c9                test   %rcx,%rcx                 618:   48 85 c9                test   %rcx,%rcx
 61b:   48 8d 51 01             lea    0x1(%rcx),%rdx            61b:   48 8d 51 01             lea    0x1(%rcx),%rdx
 61f:   7e 20                   jle    641 <array_routine_+0x    61f:   7e 20                   jle    641 <array_routine_+0x
 621:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)                 621:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 628:   f2 0f 10 44 c7 f8       movsd  -0x8(%rdi,%rax,8),%xmm |  628:   c5 fb 10 44 c7 f8       vmovsd -0x8(%rdi,%rax,8),%xmm
 62e:   f2 0f 58 c0             addsd  %xmm0,%xmm0            |  62e:   c5 fb 58 c0             vaddsd %xmm0,%xmm0,%xmm0
 632:   f2 0f 11 44 c6 f8       movsd  %xmm0,-0x8(%rsi,%rax,8 |  632:   c5 fb 11 44 c6 f8       vmovsd %xmm0,-0x8(%rsi,%rax,8
 638:   48 83 c0 01             add    $0x1,%rax                 638:   48 83 c0 01             add    $0x1,%rax
 63c:   48 39 d0                cmp    %rdx,%rax                 63c:   48 39 d0                cmp    %rdx,%rax
 63f:   75 e7                   jne    628 <array_routine_+0x    63f:   75 e7                   jne    628 <array_routine_+0x
 641:   f3 c3                   repz retq                        641:   f3 c3                   repz retq

And it appears that -O3 -march=native is actually significantly faster on an AVX-capable machine.

These results (the factor of 2) are completely in line with the benchmark I run in CxxWrap.jl, which is pretty much identical except that I divide by 2 :slight_smile: A factor of 2 for such a simple function is very good; on par, I think, with calling a non-inlinable function on each array element in C.

Thanks for all the tips. I tried the different compiler settings on my machine, but they don’t seem to affect the execution time very much. The ccall variants remain approximately a factor of 2 slower than pure Julia. That slowdown is okay for my applications.