Am I doing something wrong / is this a known issue / what are the workarounds?

Below I added some sample C++ code. All it does is make a bunch of calls to `sin` to burn enough time that I can watch system resources in `top`.

```
$ g++ -Ofast -fopenmp -shared -fPIC piecewiseshared_thread_test.cpp -o pw_thread_test.so
$ g++ -Ofast -shared -fPIC piecewiseshared_thread_test.cpp -o pw_thread_test_off.so
```

```
julia> path = "/home/chris/Documents/progwork/cpp/"
"/home/chris/Documents/progwork/cpp/"
julia> function intsothreadtestf(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10^8)
           ccall(f, Int32, (Float32, Float32, Int64), l, u, n)
       end
intsothreadtestf (generic function with 4 methods)
julia> libcpp = Libdl.dlopen(path*"pw_thread_test.so")
Ptr{Void} @0x0000000002d4f1e0
julia> const intd = Libdl.dlsym(libcpp, :integratef)
Ptr{Void} @0x00007fc646719ab0
julia> binomial(16,2)
120
julia> @time intsothreadtestf(intd)
0.443687 seconds (1.62 k allocations: 96.805 KiB)
120
julia> @time intsothreadtestf(intd)
0.450230 seconds (4 allocations: 160 bytes)
120
julia> const libcppoff = Libdl.dlopen(path*"pw_thread_test_off.so")
Ptr{Void} @0x0000000002dccec0
julia> intdoff = Libdl.dlsym(libcppoff, :integratef)
Ptr{Void} @0x00007fc63e682840
julia> @time intsothreadtestf(intdoff)
0.431182 seconds (4 allocations: 160 bytes)
0
julia> @time intsothreadtestf(intdoff)
0.431196 seconds (4 allocations: 160 bytes)
0
```
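
For reference, the 120 returned by the threaded build is exactly what 16 distinct OpenMP thread IDs sum to:

```
# integratef returns the sum of omp_get_thread_num() over its 16 chunks, so if
# each chunk runs on a different thread the result is 0 + 1 + ... + 15.
sum(0:15)        # 120
binomial(16, 2)  # also 120 -- the shortcut used in the session above
```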

Performance is similar, and watching `top` showed Julia jumping to just `100.3%` CPU on occasion with the loop size cranked to 10^9.
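
One sanity check would be to ask the OpenMP runtime directly, from the same session, what thread limit it sees (a sketch, assuming the .so is linked against GNU libgomp, which is what `g++ -fopenmp` links by default):

```
# Sketch: libgomp is shared process-wide, so this queries the same runtime the
# ccalled library uses. Expect the core count before the first integratef call,
# and 16 after it has run omp_set_num_threads(16).
libgomp = Libdl.dlopen("libgomp.so.1")
maxthreads = Libdl.dlsym(libgomp, :omp_get_max_threads)
ccall(maxthreads, Cint, ())
```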

Versus compiling executables:

```
$ g++ -Ofast -fopenmp piecewise.cpp -o pw_thread_test
$ g++ -Ofast piecewise.cpp -o pw_nothread_test
$ time ./pw_thread_test
2
real 0m0.033s
user 0m0.391s
sys 0m0.004s
$ time ./pw_nothread_test
2
real 0m0.222s
user 0m0.222s
sys 0m0.000s
```

While the non-threaded version’s time was about the same as all the ccalls, the threaded version was several times faster. As expected, the threaded run’s user time is also many times higher than its real time.

With 10^9 evaluations and Float64 instead of the 10^8 and Float32 used in the examples above, ccalling the threaded version from Julia was about twice as fast as the non-threaded one, even though it still never passed 100.3% CPU.
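
For completeness, the Float64 run goes through the `integrated` entry point exported by the same .so, roughly like this (the wrapper name is made up, but the signature matches the C++ below):

```
# Double-precision variant of the wrapper above; integrated(double, double, long)
# is exported by pw_thread_test.so alongside integratef.
const intd64 = Libdl.dlsym(libcpp, :integrated)
function intsothreadtestd(f, l::Float64 = 0.0, u::Float64 = Float64(π), n::Int64 = 10^9)
    ccall(f, Int32, (Float64, Float64, Int64), l, u, n)
end
@time intsothreadtestd(intd64)
```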

The return value shows that the expected number of threads was created, and the sometimes-better performance (fewer cache misses, a la Optimal Number of Workers for Parallel Julia?) makes it seem to me like the ccalled threads are capped, GIL-style?

Anyone familiar with this sort of thing?

```
#include <cmath>
#include <algorithm>
#include <omp.h>

// Recursive pairwise summation of sin(n_i * dx) for n_i = l, ..., l + n - 1.
template<class T>
T pairint(int l, int n, T dx){
    T sum_x ;
    int n_i ;
    if ( n < 20 ){
        int u = l + n - 1 ;
        sum_x = 0.0 ;
        for( n_i = l; n_i <= u; ++n_i){
            sum_x = sum_x + std::sin( n_i * dx ) ;
        }
    } else {
        n_i = n / 2 ;
        sum_x = pairint(l, n_i, dx) + pairint(l + n_i, n - n_i, dx) ;
    }
    return sum_x ;
}

// Unused helper: plain sequential sum of a C array.
template<class T>
T vec_sum(T v[], int l){
    T sum = 0 ;
    for(int i = 0; i < l; i++){
        sum = sum + v[i] ;
    }
    return sum ;
}

extern "C" int integratef(const float, const float, const long);
extern "C" int integrated(const double, const double, const long);

// Integrates sin over [l, u] with n_lim points split across nthread chunks.
// Returns the sum of the OpenMP thread IDs that ran the chunks
// (0 + 1 + ... + 15 = 120 when all 16 threads participate, 0 without OpenMP).
template<class T>
int integrate(const T l, const T u, const long n_lim)
{
    const int nthread = 16;
    const T dx = (u - l) / n_lim;
    int nrange[nthread+1];
    T sinx = 0.5 * ( std::sin(l) + std::sin(u) ) ;
    for (int i = 0; i <= nthread; ++i){
        nrange[i] = 1 + (n_lim-1) * i / nthread;
    }
    int thread_num_total = 0;
#if defined(_OPENMP)
    omp_set_num_threads(nthread);
#endif
    #pragma omp parallel for reduction(+:sinx,thread_num_total)
    for (int i = 0; i < nthread; ++i){
        sinx += pairint(nrange[i], nrange[i+1]-nrange[i], dx);
#if defined(_OPENMP)
        thread_num_total += omp_get_thread_num();
#endif
    }
    sinx = dx * sinx ;
    return thread_num_total;
}

int integratef(const float l, const float u, const long n_lim){
    return integrate(l, u, n_lim);
}

int integrated(const double l, const double u, const long n_lim){
    return integrate(l, u, n_lim);
}
```