External processes called by Julia locked to 100% CPU use, even when run in parallel?

Am I doing something wrong / is this a known issue / what are the workarounds?
Below I added some sample C++ code.
All it does is make a bunch of calls to sin, to burn enough time that I can watch system resources in top.

$ g++ -Ofast -fopenmp -shared -fPIC piecewiseshared_thread_test.cpp -o pw_thread_test.so
$ g++ -Ofast -shared -fPIC piecewiseshared_thread_test.cpp -o pw_thread_test_off.so
julia> path = "/home/chris/Documents/progwork/cpp/"
"/home/chris/Documents/progwork/cpp/"

julia> function intsothreadtestf(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10^8)
           ccall(f, Int32, (Float32, Float32, Int64), l, u, n)
       end
intsothreadtestf (generic function with 4 methods)

julia> libcpp = Libdl.dlopen(path*"pw_thread_test.so")
Ptr{Void} @0x0000000002d4f1e0

julia> const intd = Libdl.dlsym(libcpp, :integratef)
Ptr{Void} @0x00007fc646719ab0

julia> binomial(16,2)
120

julia> @time intsothreadtestf(intd)
  0.443687 seconds (1.62 k allocations: 96.805 KiB)
120

julia> @time intsothreadtestf(intd)
  0.450230 seconds (4 allocations: 160 bytes)
120

julia> const libcppoff = Libdl.dlopen(path*"pw_thread_test_off.so")
Ptr{Void} @0x0000000002dccec0

julia> intdoff = Libdl.dlsym(libcppoff, :integratef)
Ptr{Void} @0x00007fc63e682840

julia> @time intsothreadtestf(intdoff)
  0.431182 seconds (4 allocations: 160 bytes)
0

julia> @time intsothreadtestf(intdoff)
  0.431196 seconds (4 allocations: 160 bytes)
0

Performance is similar with and without OpenMP, and watching top showed Julia jumping to just 100.3% CPU on occasion, even with the loop size cranked to 10^9.
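Cranking the loop size just means passing a larger n to the same wrapper, e.g. the following call (illustrative, output omitted):

julia> @time intsothreadtestf(intd, 0f0, Float32(π), 10^9)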

Versus compiling executables:

$ g++ -Ofast -fopenmp piecewise.cpp -o pw_thread_test
$ g++ -Ofast piecewise.cpp -o pw_nothread_test
$ time ./pw_thread_test
2
real	0m0.033s
user	0m0.391s
sys	0m0.004s
$ time ./pw_nothread_test
2
real	0m0.222s
user	0m0.222s
sys	0m0.000s

While the non-threaded executable's time is about the same as all of the ccalls, the threaded executable was several times faster.
As expected, its user time is also many times higher than real time (0.391 s of user time packed into 0.033 s of wall time is roughly twelve cores' worth of work).

With 10^9 evaluations and Float64 instead of the 10^8 and Float32 used above, ccalling the threaded version from Julia was about twice as fast as the non-threaded one, even though it still never passed 100.3% CPU.
The return value of 120 (the sum of the OpenMP thread IDs 0 through 15, i.e. binomial(16,2)) clearly shows the expected number of threads were created, and the sometimes-better performance (fewer cache misses, à la Optimal Number of Workers for Parallel Julia?) makes it look to me like the threads spawned by the ccalled library are being capped to a single core, GIL-style?
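For reference, the Float64 runs just ccall the integrated symbol exported by the same library. A minimal sketch of that wrapper (the intsothreadtestd and intd64 names are only for illustration):

julia> const intd64 = Libdl.dlsym(libcpp, :integrated)

julia> function intsothreadtestd(f, l::Float64 = 0.0, u::Float64 = Float64(π), n::Int64 = 10^9)
           ccall(f, Int32, (Float64, Float64, Int64), l, u, n)
       end

julia> @time intsothreadtestd(intd64)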

Anyone familiar with this sort of thing?

#include <cmath>
#include <algorithm>
#include <omp.h>

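// Recursive pairwise summation of sin(n_i * dx) over n consecutive indices starting at l.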
template<class T>
T pairint(int l, int n, T dx){
    T sum_x ;
    int n_i ;
    if ( n < 20 ){
        int u = l + n - 1 ;
        sum_x = 0.0 ;
        for( n_i = l; n_i <= u; ++n_i){
            sum_x = sum_x + std::sin( n_i * dx ) ;
        }
    } else {
        n_i = n / 2 ;
        sum_x = pairint(l, n_i, dx) + pairint(l + n_i, n - n_i, dx) ;
    }
    return sum_x ;
}

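// Straightforward array sum; not actually used by integrate() in this test.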
template<class T>
T vec_sum(T v[], int l){
    T sum = 0 ;
    for(int i = 0; i < l; i++){
        sum = sum + v[i] ;
    }
    return sum ;
}

extern "C" int integratef(const float, const float, const long);
extern "C" int integrated(const double, const double, const long);

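// Splits the index range into nthread chunks and sums them with an OpenMP parallel-for
// reduction. The integral estimate (sinx) is discarded; the return value is the sum of the
// OpenMP thread IDs, 0 + 1 + ... + (nthread - 1) = binomial(nthread, 2), as a check that
// all threads ran (it stays 0 when compiled without -fopenmp).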
template<class T>
int integrate(const T l, const T u, const long n_lim)
{

    const int       nthread = 16;
    const T         dx = (u - l) / n_lim;
    int             nrange[nthread+1];

    T sinx = 0.5 * ( std::sin(l) + std::sin(u) ) ;
    for (int i = 0; i <= nthread; ++i){
        nrange[i] = 1 + (n_lim-1) * i / nthread;
    }

    int             thread_num_total = 0;


    #if defined(_OPENMP)
        omp_set_num_threads(nthread);
    #endif

    #pragma omp parallel for reduction(+:sinx,thread_num_total)
    for (int i = 0; i < nthread; ++i){
        sinx += pairint(nrange[i], nrange[i+1]-nrange[i], dx);
        #if defined(_OPENMP)
            thread_num_total += omp_get_thread_num();
        #endif
    }

    sinx = dx * sinx ;

    return thread_num_total;
}

int integratef(const float l, const float u, const long n_lim){
    return integrate(l, u, n_lim);
}
int integrated(const double l, const double u, const long n_lim){
    return integrate(l, u, n_lim);
}

I have been bitten by Julia locking other processes down to a single core again, this time when using RCall to call rstan.

Launching 8 chains in parallel, each chain runs at only 12.5% CPU.
I was using rstan so that I don't have to bother installing CmdStan, and because I've had a few hard drives fail over the past few years, the fact that CmdStan saves samples to disk instead of RAM is another point towards rstan.
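For context, the chains get launched along these lines (a minimal sketch; model.stan and stan_data are placeholders, not my actual exam code):

julia> using RCall

julia> @rput stan_data   # placeholder for data already assembled on the Julia side

julia> R"""
       library(rstan)
       fit <- stan(file = "model.stan", data = stan_data, chains = 8, cores = 8)
       """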

I was originally intending to use tpapp's fantastic-looking DynamicHMC, but this is for a take-home exam due in just a few hours, so I don't have time at the moment to learn something new (or to wait for 8 chains to run on a single core, lol). I do intend to move to it in the future.

EDIT:
This is also the case when using Stan.jl.
Has anyone else reproduced this, or encountered this before?

EDIT:
A few hours ago, I noticed that two totally separate Julia processes (separate instances of the Julia REPL launched from the command line) were each running at 50%, split on the same logical core.
Not even separate logical cores on the same physical core (which would already be bad), but the very same logical core!
This is on a 16-core/32-thread CPU, where the other 15 cores were idle.
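If anyone wants to poke at this, my hunch is that it is worth comparing against the process's CPU affinity mask. On Linux that can be read from inside Julia like so (just a diagnostic sketch, assuming /proc is available; an unrestricted process on this machine should report 0-31):

julia> for line in eachline("/proc/self/status")
           startswith(line, "Cpus_allowed_list") && println(line)
       end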

Finally, trying this example on a different computer:
Everything worked fine; I could not reproduce the problem.

Definitely not a Julia problem, but something more hardware/OS/driver specific.