Vectorized Yeppp definitely improved exp a lot.
But there may be some performance left on the table for serial evaluation too.
There’ve been a lot of performance optimizations released to glibc
recently for single precision.
Saw this on Phoronix:
https://sourceware.org/git/?p=glibc.git;a=commit;h=fe596486d694e657413d0d4c5a04598674ff71b1
https://sourceware.org/git/?p=glibc.git;a=commit;h=ac817e083b37a5c25d05cde8bde302d7a93ffc5e
These could translate to a big gain for things like single precision ffts.
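One way to check whether the glibc routines themselves got faster on a given system is to ccall sinf from libm directly and compare against Julia's sin; a quick sketch (the libm.so.6 library name assumes Linux with glibc):
using BenchmarkTools
csinf(x::Float32) = ccall((:sinf, "libm.so.6"), Float32, (Float32,), x)   # glibc's single-precision sin
@benchmark csinf(0.23f0)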
I think what follows may be an example of these future performance improvements, but I'm not really sure. When I try to locate [the file names from the link], I don't see any in what look like the right places for them to be the cause, but I'm short on alternative explanations.
julia> @benchmark sin(0.23f0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.548 ns (0.00% GC)
median time: 4.559 ns (0.00% GC)
mean time: 4.564 ns (0.00% GC)
maximum time: 17.363 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
So 10 million evaluations ought to take roughly 45 ms.
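(That is, just scaling the per-call minimum:)
4.548e-9 * 10_000_000 * 1e3   # 4.548 ns per call × 10^7 calls ≈ 45.5 ms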
A function using pairwise summation to add up sin evaluated at 10 million different points, with explicit SIMD vectorization (which bought about 2 ms, and may be a substantial chunk of the time not spent in sin itself):
julia> @benchmark integrate($serial)
BenchmarkTools.Trial:
memory estimate: 400 bytes
allocs estimate: 4
--------------
minimum time: 43.200 ms (0.00% GC)
median time: 43.604 ms (0.00% GC)
mean time: 43.678 ms (0.00% GC)
maximum time: 44.549 ms (0.00% GC)
--------------
samples: 115
evals/sample: 1
Cutting to the chase: I compiled translations into shared libraries (all with -Ofast -march=native -shared -fPIC). This was g++ 8.0 (trunk):
julia> @benchmark intsotest($intf)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 24.057 ms (0.00% GC)
median time: 24.263 ms (0.00% GC)
mean time: 24.336 ms (0.00% GC)
maximum time: 25.292 ms (0.00% GC)
--------------
samples: 206
evals/sample: 1
This only happens with -Ofast. At -O3, the results are comparable to those that follow, so it may just be an improved sin_fast-type function.
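One way to check would be to compare which libm symbols the -Ofast and -O3 builds actually import; a sketch (the library names here are placeholders, not the ones I used):
for lib in ("libintegrate_Ofast.so", "libintegrate_O3.so")
    println(lib, ":")
    run(`nm -D --undefined-only $lib`)   # look for sinf vs. any alternative entry point
end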
Other C++ compilers (including older g++) and Fortran were slower than Julia:
julia> @benchmark intsotest($intf72) #g++ 7-2
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 53.198 ms (0.00% GC)
median time: 53.866 ms (0.00% GC)
mean time: 54.567 ms (0.00% GC)
maximum time: 77.023 ms (0.00% GC)
--------------
samples: 92
evals/sample: 1
julia> @benchmark intsotest($intclang) #Clang++ 5.0.1
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 117.949 ms (0.00% GC)
median time: 118.069 ms (0.00% GC)
mean time: 118.180 ms (0.00% GC)
maximum time: 119.378 ms (0.00% GC)
--------------
samples: 43
evals/sample: 1
julia> @benchmark intsotestf(intfort1) #gfortran - 8.0 trunk
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 51.365 ms (0.00% GC)
median time: 51.784 ms (0.00% GC)
mean time: 51.851 ms (0.00% GC)
maximum time: 53.317 ms (0.00% GC)
--------------
samples: 97
evals/sample: 1
julia> @benchmark intsotestf(intfort72) #gfortran 7-2
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 52.060 ms (0.00% GC)
median time: 52.273 ms (0.00% GC)
mean time: 52.369 ms (0.00% GC)
maximum time: 53.613 ms (0.00% GC)
--------------
samples: 96
evals/sample: 1
The shared library from g++ 8.0, though, was a decent chunk larger (sizes in bytes):
20504 - g++ 8.0
16856 - g++ 7.2
17944 - clang++ 5.0.1
10688 - gfortran 8.0
8704 - gfortran 7.2
Could it be inlining sin? No idea; something different is definitely going on. I didn't see any such difference for double precision, and I'm still not totally sure about the cause.
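For the double-precision comparison, the analogue of the Float32 wrapper at the bottom is enough; roughly this, reusing the libcpp handle defined there:
intd = Libdl.dlsym(libcpp, :integrated)   # double-precision entry point from the C++ code below
function intsotestd(f, l::Float64 = 0.0, u::Float64 = Float64(π), n::Int64 = 10_000_000)
    ccall(f, Float64, (Float64, Float64, Int64), l, u, n)
end
@benchmark intsotestd($intd)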
PS. Compiling with -fopenmp produced threaded executables, but the shared libraries ran serially. I'll look into that a little more. But using Threads.@threads in Julia:
julia> @benchmark integrate($threads)
BenchmarkTools.Trial:
memory estimate: 1.34 KiB
allocs estimate: 53
--------------
minimum time: 3.540 ms (0.00% GC)
median time: 4.035 ms (0.00% GC)
mean time: 5.471 ms (0.00% GC)
maximum time: 10.514 ms (0.00% GC)
--------------
samples: 915
evals/sample: 1
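(Those numbers of course depend on how many threads Julia was started with; I'm assuming something like JULIA_NUM_THREADS=16 here, which you can check with:)
Threads.nthreads()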
Code (Julia first, then the C++ and Fortran translations):
using SIMD

abstract type mode end
struct serial <: mode end
struct threads <: mode end

# Recursive pairwise summation of sin(n_i * dx) for n_i = l, ..., l+n-1.
# The serial Float32 method handles leaves of 16-32 points with two explicit
# 8-wide SIMD vectors plus a scalar loop for the first n-16 points.
function pairint(l, n, dx::Float32, ::Type{serial})
    if n < 33
        u = l + n - 17
        sum_x = sum( Vec{8,Float32}( ( Base.Cartesian.@ntuple 8 i -> sin((u+i) * dx) ) ) +
                     Vec{8,Float32}( ( Base.Cartesian.@ntuple 8 i -> sin((u+8+i) * dx) ) ) )
        for n_i ∈ l:u
            sum_x = sum_x + sin( n_i * dx )
        end
    else
        n_i = n ÷ 2
        sum_x = pairint(l, n_i, dx, serial) + pairint(l + n_i, n - n_i, dx, serial)
    end
    sum_x
end

# Fallback method (used by the threaded version): plain scalar leaves of < 20 points.
function pairint(l, n, dx, ::Any)
    if n < 20
        u = l + n - 1
        sum_x = zero(dx)
        for n_i ∈ l:u
            sum_x = sum_x + sin( n_i * dx )
        end
    else
        n_i = n ÷ 2
        sum_x = pairint(l, n_i, dx, threads) + pairint(l + n_i, n - n_i, dx, threads)
    end
    sum_x
end

function vec_sum(v)
    out = zero(eltype(v))
    for vᵢ ∈ v
        out += vᵢ
    end
    out
end

# Composite trapezoid rule for sin over [0, π] with 10^7 intervals,
# split into nthread chunks that are each summed pairwise.
function integrate(::Type{M} = serial, ::Type{T} = Float32) where {M<:mode, T}
    n_lim = 10_000_000
    nthread = 16
    a = acos( - one(T) )
    dx = (a - zero(T)) / n_lim
    sinx_v = Vector{T}(nthread)
    nrange = Vector{Int}(nthread+1)
    sinx = T(0.5) * ( sin(zero(T)) + sin(a) )
    for i ∈ 0:nthread
        nrange[i+1] = 1 + (n_lim-1) * i ÷ nthread
    end
    if M == threads
        Threads.@threads for i ∈ 1:nthread
            sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx, M)
        end
    else
        for i ∈ 1:nthread
            sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx, M)
        end
    end
    sinx = dx * ( vec_sum(sinx_v) + sinx )
    sinx
end
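For reference, the benchmarks above call this with the mode passed as a type:
integrate(serial)    # single-threaded; should come out ≈ 2, i.e. ∫₀^π sin(x) dx
integrate(threads)   # the Threads.@threads version timed above
The C++ translation: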
#include <cmath>
#include <algorithm>

// Recursive pairwise summation of sin(n_i * dx) for n_i = l, ..., l+n-1.
template<class T>
T pairint(int l, int n, T dx){
    T sum_x ;
    int n_i ;
    if ( n < 20 ){
        int u = l + n - 1 ;
        sum_x = 0.0 ;
        for( n_i = l; n_i <= u; ++n_i){
            sum_x = sum_x + std::sin( n_i * dx ) ;
        }
    } else {
        n_i = n / 2 ;
        sum_x = pairint(l, n_i, dx) + pairint(l + n_i, n - n_i, dx) ;
    }
    return sum_x ;
}

template<class T>
T vec_sum(T v[], int l){
    T sum = 0 ;
    for(int i = 0; i < l; i++){
        sum = sum + v[i] ;
    }
    return sum ;
}

extern "C" float integratef(const float, const float, const long);
extern "C" double integrated(const double, const double, const long);

// Composite trapezoid rule over [l, u], split into nthread chunks.
template<class T>
T integrate(const T l, const T u, const long n_lim)
{
    const int nthread = 16;
    const T dx = (u - l) / n_lim;
    T sinx_v[nthread];
    int nrange[nthread+1];
    T sinx = 0.5 * ( std::sin(l) + std::sin(u) ) ;
    for (int i = 0; i <= nthread; ++i){
        nrange[i] = 1 + (n_lim-1) * i / nthread;
    }
    #pragma omp parallel for
    for (int i = 0; i < nthread; ++i){
        sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx);
    }
    sinx = dx * ( vec_sum(sinx_v, nthread) + sinx ) ;
    return sinx;
}

float integratef(const float l, const float u, const long n_lim){
    return integrate(l, u, n_lim);
}
double integrated(const double l, const double u, const long n_lim){
    return integrate(l, u, n_lim);
}
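This was compiled with the flags mentioned above, along these lines (the source and output file names here are placeholders; the gfortran module further down was built analogously):
run(`g++ -Ofast -march=native -shared -fPIC integrate.cpp -o sharedlib.so`)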
ccall wrapper:
libcpp = Libdl.dlopen("path/to/sharedlib.so")
intf = Libdl.dlsym(libcpp, :integratef)
function intsotest(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10_000_000)
    ccall(f, Float32, (Float32, Float32, Int64), l, u, n)
end
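As a sanity check on any of these, the result should be very close to ∫₀^π sin(x) dx = 2:
intsotest(intf)   # composite trapezoid with 10^7 intervals, so ≈ 2.0f0
The Fortran translation: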
module integral
implicit none
integer, parameter :: nthread = 16

contains

! Composite trapezoid rule over [l, u] with n_lim intervals,
! split into nthread chunks that are each summed pairwise.
! Arguments are passed by reference to match the ccall wrapper below.
function integrate(l, u, n_lim) result(sinx)
    real, intent(in) :: l, u
    integer(8), intent(in) :: n_lim
    real :: dx, sinx
    real, dimension(nthread) :: sinx_v
    integer, dimension(0:nthread) :: nrange
    integer :: i
    dx = (u - l) / n_lim
    sinx = 0.5 * ( sin(l) + sin(u) )
    do i = 0, nthread
        nrange(i) = 1 + (n_lim-1) * i / nthread
    end do
    !$OMP parallel do
    do i = 1, nthread
        sinx_v(i) = pairint(nrange(i-1), nrange(i)-nrange(i-1), dx)
    end do
    !$OMP end parallel do
    sinx = dx * ( sinx + sum(sinx_v) )
end function integrate

! Recursive pairwise summation of sin(n1 * dx) for n1 = l, ..., l+n-1.
recursive function pairint(l, n, dx) result(sum_x)
    implicit none
    integer, intent(in) :: l, n
    real, intent(in) :: dx
    integer :: n1
    real :: sum_x
    if ( n < 20 ) then
        sum_x = 0.0
        do n1 = l, l + n - 1
            sum_x = sum_x + sin( n1 * dx )
        end do
    else
        n1 = n / 2
        sum_x = pairint(l, n1, dx) + pairint(l + n1, n - n1, dx)
    end if
end function pairint

end module integral
ccall wrapper:
libfort1 = Libdl.dlopen("path/to/sharedlib.so")
intfort1 = Libdl.dlsym(libfort1, :__integral_MOD_integrate)
function intsotestf(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10_000_000)
    ccall(f, Float32, (Ptr{Float32}, Ptr{Float32}, Ptr{Int64}), &l, &u, &n)
end