Dot function

I tried the updated code but it’s not returning the correct result. I’m not familiar with BLAS and can’t see what the issue is right away… Just FYI, I may work on it more later tonight after dinner :wink:

julia> function logistic_regression(Y, X, w, iterations, step = -1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', 1.0, X, buffer, step, w)
           end
           w
       end
logistic_regression (generic function with 2 methods)

julia> w = ones(2); logistic_regression(Y, X, w, 1000)
2-element Array{Float64,1}:
 -75254.3 
  -4665.47

The correct result should be as follows:

julia> function logistic_regression(Y, X, w, iterations)
           for i in 1:iterations
               w -= (((1.0 ./ (1.0 + exp.(-Y .* (X * w))) - 1.0) .* Y)' * X)'
           end
           w
       end
logistic_regression (generic function with 2 methods)

julia> w = ones(2); logistic_regression(Y, X, w, 1000)
2-element Array{Float64,1}:
 1.00003
 1.0    

There are definitely some things that can be added. I’ve recommended before that @. should also apply @views, so that @. a .= b[1:5] + c[6:10] is allocation-free (well, on v0.7 with the view optimizations). As for doing something with internal matrix multiplications, that would require pulling a temporary array from somewhere. Something like ParallelAccelerator.jl places the macro on the whole function so it can do those kinds of optimizations, but it’s not a single-line thing.
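
In the meantime, the manual workaround is to pull the views out yourself and broadcast into the destination. A minimal sketch with made-up arrays:

julia> a = zeros(5); b = rand(10); c = rand(10);

julia> bv = view(b, 1:5); cv = view(c, 6:10);

julia> @. a = bv + cv   # fused broadcast straight into a; the slices are never copied (and the view wrappers stop allocating with the v0.7 optimizations)

Having @. insert @views would let the one-liner form do this automatically.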

But IMO I prefer for that to be explicit, because then you notice things like: you really should be passing in the cache variables here if you’re calling this in a loop.

Oops, okay. First thing I noticed:

julia> (((1.0 ./ (1.0 .+ e .^ (-Y .* (X * w))) - 1.0) .* Y)' * X)'
2-element Array{Float64,1}:
 -75255.3 
  -4666.47

Fixed function:

function logistic_regression!(w, Y, X, iterations, step = 1.0)
    Xw = similar(w, size(X,1))
    buffer = similar(w, length(Y));
    for i in 1:iterations
        A_mul_B!(Xw, X, w)
        @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
        Base.LinAlg.BLAS.gemv!('T', -1.0, X, buffer, step, w)
    end
    w
end
julia> w = ones(2); logistic_regression!(w, Y, X, 1000)
2-element Array{Float64,1}:
 1.00003
 1.0   
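
For anyone else puzzling over the sign flip: gemv!(tA, α, A, x, β, y) follows the usual BLAS convention y := α*A'x + β*y when tA == 'T'. With the default step = 1.0, the fixed call

Base.LinAlg.BLAS.gemv!('T', -1.0, X, buffer, 1.0, w)   # w .= w .- X'*buffer, in place

is the in-place equivalent of the reference one-liner’s update, whereas the earlier call with α = 1.0 and β = -1.0 was computing X'*buffer .- w instead.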

or

julia> using SugarBLAS

julia> function logistic_regression!(w, Y, X, iterations, step = 1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
               SugarBLAS.@gemv! w -= step * X' * buffer
           end
           w
       end
logistic_regression! (generic function with 2 methods)

julia> w = ones(2); logistic_regression!(w, Y, X, 1000)
2-element Array{Float64,1}:
 1.00003
 1.0    

julia> @benchmark logistic_regression!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     42.497 ms (0.00% GC)
  median time:      44.273 ms (0.00% GC)
  mean time:        45.524 ms (0.00% GC)
  maximum time:     72.097 ms (0.00% GC)
  --------------
  samples:          110
  evals/sample:     1

That would be really nice. I’d prefer that over automatic threading, and I imagine that’d be really straightforward.

I think I’m likely to start using SugarBLAS instead of calling the linear algebra routines directly. That is an easy, straightforward change.
Not sure if @blas also dispatches Symmetric to symm, etc.
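
If it doesn’t, the hand-written wrappers are still available; for example, a symmetric matrix-vector product would go through symv! (S, x, y here are just hypothetical placeholders):

Base.LinAlg.BLAS.symv!('U', 1.0, S, x, 0.0, y)   # y = S*x, reading only the upper triangle of S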

I agree with your emphasis on predictability.

I also haven’t had any success with ParallelAccelerator, but it has been a long time since I gave it much of a look. With things like DataFlow.jl and Cassette.jl, there’s definitely a lot of potential to develop macros that optimize / generate new functions.

What’s the updated timing BTW @tk3369?

I definitely agree, and I could have worded my comment better. What I meant was just that fairly well-performing Numpy code often looks pretty elegant when it’s a task Numpy was intended to do. Naïve Julia code tends to look much nicer, but once you start optimizing it, it can lose some of that prettiness.

But Julia also gives you the tools to abstract away all the ugliness into a DSL and then you get back to beautiful code.

I was really excited about Numba when I first encountered it, but it’s a real pain when you want to do something other than a microbenchmark. As soon as I tried to integrate Numba into a real project, it became a headache of sifting through the most unintelligible stack traces I’ve ever seen, eventually giving up and just successively removing parts of functions until I found the offending parts. Often it was something as dumb as the fact that Numba can’t handle strings at all, so if you do anything involving a string in Numba code, it won’t compile and you lose all the performance. Numba speeds up such an incredibly small subset of Python that I’d be pretty surprised to see it seriously used anywhere other than very niche applications.

Performance Summary

| Method | Min (ms) | Mean (ms) |
| --- | --- | --- |
| Julia (original one-line version) | 65 | 70 |
| Julia (BLAS) | 47 | 48 |
| Julia (SugarBLAS) | 47 | 50 |
| Julia (BLAS + Yeppp.exp) | 22 | 28 |
| Python (Numpy + NumbaJIT) | 22 | 22 |
| Python (Numpy) | 41 | 42 |

Julia (Original One-line Version)

julia> function logistic_regression_original(Y, X, w, iterations)
           for i in 1:iterations
               w[:] -= (((1.0 ./ (1.0 + e .^ (-Y .* (X * w))) - 1.0) .* Y)' * X)'
           end
           w
       end
logistic_regression_original (generic function with 1 method)

julia> @benchmark logistic_regression_original($Y, $X, $w, 1000)
BenchmarkTools.Trial: 
  memory estimate:  86.58 MiB
  allocs estimate:  10000
  --------------
  minimum time:     65.381 ms (11.83% GC)
  median time:      70.496 ms (14.63% GC)
  mean time:        72.215 ms (14.13% GC)
  maximum time:     97.503 ms (12.28% GC)
  --------------
  samples:          70
  evals/sample:     1

Julia (BLAS)

julia> function logistic_regression!(w, Y, X, iterations, step = 1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
               Base.LinAlg.BLAS.gemv!('T', -1.0, X, buffer, step, w)
           end
           w
       end
logistic_regression! (generic function with 2 methods)

julia> @benchmark logistic_regression!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     47.381 ms (0.00% GC)
  median time:      48.398 ms (0.00% GC)
  mean time:        49.053 ms (0.00% GC)
  maximum time:     57.163 ms (0.00% GC)
  --------------
  samples:          102
  evals/sample:     1

Julia (SugarBLAS)

julia> function logistic_regression_sugar!(w, Y, X, iterations, step = 1.0)
                  Xw = similar(w, size(X,1))
                  buffer = similar(w, length(Y));
                  for i in 1:iterations
                      A_mul_B!(Xw, X, w)
                      @. buffer = ((1.0 / (1.0 + exp(-Y * Xw)) - 1.0) * Y)
                      SugarBLAS.@gemv! w -= step * X' * buffer
                  end
                  w
              end
logistic_regression_sugar! (generic function with 2 methods)

julia> @benchmark logistic_regression_sugar!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     47.428 ms (0.00% GC)
  median time:      49.689 ms (0.00% GC)
  mean time:        51.410 ms (0.00% GC)
  maximum time:     77.602 ms (0.00% GC)
  --------------
  samples:          98
  evals/sample:     1

Julia (BLAS + Yeppp.exp)

julia> function logistic_regression_yeppp!(w, Y, X, iterations, step = 1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               buffer = ((1.0 ./ (1.0 .+ Yeppp.exp(-Y .* Xw)) .- 1.0) .* Y)
               Base.LinAlg.BLAS.gemv!('T', -1.0, X, buffer, step, w)
           end
           w
       end
logistic_regression_yeppp! (generic function with 2 methods)
julia>  @benchmark logistic_regression_yeppp!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  49.34 MiB
  allocs estimate:  4002
  --------------
  minimum time:     22.082 ms (14.64% GC)
  median time:      28.052 ms (18.37% GC)
  mean time:        29.457 ms (19.72% GC)
  maximum time:     51.326 ms (12.51% GC)
  --------------
  samples:          170
  evals/sample:     1

Python (Numpy + NumbaJIT)

In [9]: import numba
   ...: 
   ...: @numba.jit
   ...: def logistic_regression(Y, X, w, iterations):
   ...:     for i in range(iterations):
   ...:         w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
   ...:     return w
   ...: 

In [10]: %timeit logistic_regression(Y2, X2, w, 1000)
22 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python (Numpy)

In [13]: def logistic_regression2(Y, X, w, iterations):
    ...:     for i in range(iterations):
    ...:         w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
    ...:     return w
    ...: 

In [14]: %timeit logistic_regression2(Y2, X2, w, 1000)
42.1 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I see. So it is the vectorized exponential that makes a big difference here. There are native Julia LibMs being developed and we should take this into account.

Yes, it has become clear. Thanks, all!

Okay, this is getting pretty ugly, but…


julia> using Yeppp

julia> function logistic_regressiony!(w, Y, X, iterations, step = 1.0)
           Xw = similar(w, size(X,1))
           buffer = similar(w, length(Y));
           for i in 1:iterations
               A_mul_B!(Xw, X, w)
               @. buffer = - Y * Xw
               Yeppp.exp!(buffer)
               @. buffer = ((1.0 / (1.0 + buffer) - 1.0) * Y)
               SugarBLAS.@gemv! w -= step * X' * buffer
           end
           w
       end
logistic_regressiony! (generic function with 2 methods)

julia> @benchmark logistic_regressiony!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     11.529 ms (0.00% GC)
  median time:      12.197 ms (0.00% GC)
  mean time:        12.268 ms (0.00% GC)
  maximum time:     15.005 ms (0.00% GC)
  --------------
  samples:          408
  evals/sample:     1

Mind testing that?

Voila! You’ve gotten an :aplus:

julia> function logistic_regressiony!(w, Y, X, iterations, step = 1.0)
                  Xw = similar(w, size(X,1))
                  buffer = similar(w, length(Y));
                  for i in 1:iterations
                      A_mul_B!(Xw, X, w)
                      @. buffer = - Y * Xw
                      Yeppp.exp!(buffer)
                      @. buffer = ((1.0 / (1.0 + buffer) - 1.0) * Y)
                      SugarBLAS.@gemv! w -= step * X' * buffer
                  end
                  w
              end
logistic_regressiony! (generic function with 2 methods)

julia> @benchmark logistic_regressiony!($w, $Y, $X, 1000)
BenchmarkTools.Trial: 
  memory estimate:  25.25 KiB
  allocs estimate:  2
  --------------
  minimum time:     13.344 ms (0.00% GC)
  median time:      13.592 ms (0.00% GC)
  mean time:        14.022 ms (0.00% GC)
  maximum time:     34.792 ms (0.00% GC)
  --------------
  samples:          357
  evals/sample:     1

Including the above results:

| Method | Min (ms) | Mean (ms) |
| --- | --- | --- |
| Julia (original one-line version) | 65 | 70 |
| Julia (BLAS) | 47 | 48 |
| Julia (SugarBLAS) | 47 | 50 |
| Julia (BLAS + Yeppp.exp) | 22 | 28 |
| Julia (SugarBLAS + Yeppp + additional optimization) | 13 | 14 |
| Python (Numpy + NumbaJIT) | 22 | 22 |
| Python (Numpy) | 41 | 42 |

I think a macro plus the proposed lazy broadcast implementation could allow for a simple way to write a statement like that with an internal cache and some in-place vectorized functions. The lazy broadcast could be a big win for this stuff with a little bit of work.
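
Even without the lazy machinery, a macro could in principle lower the one-liner to a plain broadcast! into a hoisted cache; roughly, with buffer, Y, and Xw as in the BLAS versions above:

broadcast!((y, xw) -> (1.0 / (1.0 + exp(-y * xw)) - 1.0) * y, buffer, Y, Xw)

which is just the @. buffer = ... line written out by hand. A lazy broadcast representation would mainly make it easier for such a macro to see the whole expression before deciding where the cache goes.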


Wonder how it compares to the logistic regression in GLM.jl.

Found a post about how Python FFT code was uglified to gain performance. Would be interesting to see how that journey could play out in the case of Julia :wink:

http://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/


You get a lot more features than Float64 support too.


I hope somebody posts numbers comparing that to the final “uglified” Python+Numba code!

GLM uses IRLS for logistic regression, but it can use different solvers (QR or Cholesky). It might be better to benchmark this implementation against one using Optim.jl.
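
For anyone who wants to try that, a rough sketch of the Optim.jl route, using the same ±1 labels and letting Optim fall back to its default gradient approximation (the objective name is made up):

using Optim

neg_loglik(w) = sum(log1p.(exp.(-Y .* (X * w))))   # logistic negative log-likelihood
res = optimize(neg_loglik, ones(2), LBFGS())
Optim.minimizer(res)

This should need far fewer iterations than the 1000 fixed gradient steps above, although each iteration does more work.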

When I came to Julia I naturally wanted to try GLM, coming from an R background.

Vectorized Yeppp definitely improved exp a lot, but there may be some performance left on the table for serial evaluation too.

There’ve been a lot of performance optimizations released to glibc recently for single precision.
Saw this on Phoronix:
https://sourceware.org/git/?p=glibc.git;a=commit;h=fe596486d694e657413d0d4c5a04598674ff71b1
https://sourceware.org/git/?p=glibc.git;a=commit;h=ac817e083b37a5c25d05cde8bde302d7a93ffc5e

These could translate to a big gain for things like single precision ffts.

I think what follows may be an example of those future performance improvements, but I’m not really sure. When I try locate on [the file names from the link], I don’t see any in what look like the right places for that to be the cause, but I’m short on alternative explanations.

julia> @benchmark sin(0.23f0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.548 ns (0.00% GC)
  median time:      4.559 ns (0.00% GC)
  mean time:        4.564 ns (0.00% GC)
  maximum time:     17.363 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

So 10 million evaluations ought to take roughly 45 ms.
A function using pairwise summation to add up sin evaluated at 10 million different points, with explicit SIMD vectorization (which bought about 2 ms, possibly a substantial chunk of the time not spent in sin):

julia> @benchmark integrate($serial)
BenchmarkTools.Trial: 
  memory estimate:  400 bytes
  allocs estimate:  4
  --------------
  minimum time:     43.200 ms (0.00% GC)
  median time:      43.604 ms (0.00% GC)
  mean time:        43.678 ms (0.00% GC)
  maximum time:     44.549 ms (0.00% GC)
  --------------
  samples:          115
  evals/sample:     1

Cutting to the chase: I compiled the translations to shared libraries (all with -Ofast -march=native -shared -fPIC); this was g++ 8.0 (trunk):

julia> @benchmark intsotest($intf)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.057 ms (0.00% GC)
  median time:      24.263 ms (0.00% GC)
  mean time:        24.336 ms (0.00% GC)
  maximum time:     25.292 ms (0.00% GC)
  --------------
  samples:          206
  evals/sample:     1

This only happens with -Ofast. At -O3, results are comparable to those that follow, so it may just be an improved sin_fast-type function.
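
For reference, Julia’s closest analogue to that libm swap is the @fastmath macro (the timings above use the default sin):

@fastmath sin(0.23f0)   # rewrites the call to Base.FastMath.sin_fast(0.23f0)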

Other C++ compilers, including older g++, and Fortran were slower than Julia:

julia> @benchmark intsotest($intf72) #g++ 7-2
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     53.198 ms (0.00% GC)
  median time:      53.866 ms (0.00% GC)
  mean time:        54.567 ms (0.00% GC)
  maximum time:     77.023 ms (0.00% GC)
  --------------
  samples:          92
  evals/sample:     1

julia> @benchmark intsotest($intclang) #Clang++ 5.0.1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     117.949 ms (0.00% GC)
  median time:      118.069 ms (0.00% GC)
  mean time:        118.180 ms (0.00% GC)
  maximum time:     119.378 ms (0.00% GC)
  --------------
  samples:          43
  evals/sample:     1

julia> @benchmark intsotestf(intfort1) #gfortran - 8.0 trunk
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.365 ms (0.00% GC)
  median time:      51.784 ms (0.00% GC)
  mean time:        51.851 ms (0.00% GC)
  maximum time:     53.317 ms (0.00% GC)
  --------------
  samples:          97
  evals/sample:     1

julia> @benchmark intsotestf(intfort72) #gfortran 7-2
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     52.060 ms (0.00% GC)
  median time:      52.273 ms (0.00% GC)
  mean time:        52.369 ms (0.00% GC)
  maximum time:     53.613 ms (0.00% GC)
  --------------
  samples:          96
  evals/sample:     1

g++ 8.0’s shared libraries were a decent chunk larger, though:

20504 - g++ 8.0
16856 - g++ 7.2
17944 - clang++ 5.0.1
10688 - gfortran - 8.0
8704 - gfortran - 7.2

Could it be inlining sin? No idea. Something different is definitely going on.

I didn’t see any such difference for double precision.
Still not totally sure about the cause.

PS.
Compiling with -fopenmp made threaded executables, but the shared libraries stayed serial.
I’ll look into that a little more. But, using Threads.@threads in Julia:

julia> @benchmark integrate($threads)
BenchmarkTools.Trial: 
  memory estimate:  1.34 KiB
  allocs estimate:  53
  --------------
  minimum time:     3.540 ms (0.00% GC)
  median time:      4.035 ms (0.00% GC)
  mean time:        5.471 ms (0.00% GC)
  maximum time:     10.514 ms (0.00% GC)
  --------------
  samples:          915
  evals/sample:     1

Code

using SIMD
abstract type mode end
struct serial <: mode end
struct threads <: mode end
function pairint(l, n, dx::Float32, ::Type{serial})
    if n < 33
        u = l + n - 17
        sum_x = sum( Vec{8,Float32}( ( Base.Cartesian.@ntuple 8 i -> sin((u+i) * dx) ) ) +
                     Vec{8,Float32}( ( Base.Cartesian.@ntuple 8 i -> sin((u+8+i) * dx) ) ) )
        for n_i ∈ l:u
            sum_x = sum_x + sin( n_i * dx )
        end
    else
        n_i = n ÷ 2
        sum_x = pairint(l, n_i, dx, serial) + pairint(l + n_i, n - n_i, dx, serial)
    end
    sum_x
end

function pairint(l, n, dx, ::Any)
    if n < 20
        u = l + n - 1
        sum_x = zero(dx)
        for n_i ∈ l:u
            sum_x = sum_x + sin( n_i * dx )
        end
    else
        n_i = n ÷ 2
        sum_x = pairint(l, n_i, dx, threads) + pairint(l + n_i, n - n_i, dx, threads)
    end
    sum_x
end

function vec_sum(v)
    out = zero(eltype(v))
    for vᵢ ∈ v
        out += vᵢ
    end
    out
end

function integrate(::Type{M} = serial, ::Type{T} = Float32) where {M<:mode, T}
    n_lim = 10_000_000
    nthread = 16
    a = acos( - one(T) )
    dx = (a - zero(T)) / n_lim

    sinx_v = Vector{T}(nthread)
    nrange = Vector{Int}(nthread+1)

    sinx = T(0.5) * ( sin(zero(T)) + sin(a) )
    for i ∈ 0:nthread
        nrange[i+1] = 1 + (n_lim-1) * i ÷ nthread
    end
    
    if M == threads
        Threads.@threads for i ∈ 1:nthread
            sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx, M)
        end
    else
        for i ∈ 1:nthread
            sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx, M)
        end
    end

    sinx = dx * ( vec_sum(sinx_v) + sinx )
    sinx
end

The C++ translation:

#include <cmath>
#include <algorithm>


template<class T>
T pairint(int l, int n, T dx){
    T sum_x ;
    int n_i ;
    if ( n < 20 ){
        int u = l + n - 1 ;
        sum_x = 0.0 ;
        for( n_i = l; n_i <= u; ++n_i){
            sum_x = sum_x + std::sin( n_i * dx ) ;
        }
    } else {
        n_i = n / 2 ;
        sum_x = pairint(l, n_i, dx) + pairint(l + n_i, n - n_i, dx) ;
    }
    return sum_x ;
}

template<class T>
T vec_sum(T v[], int l){
    T sum = 0 ;
    for(int i = 0; i < l; i++){
        sum = sum + v[i] ;
    }
    return sum ;
}

extern "C" float integratef(const float, const float, const long);
extern "C" double integrated(const double, const double, const long);

template<class T>
T integrate(const T l, const T u, const long n_lim)
{

    const int       nthread = 16;
    const T         dx = (u - l) / n_lim;
    T               sinx_v[nthread];
    int             nrange[nthread+1];


    T sinx = 0.5 * ( std::sin(l) + std::sin(u) ) ;
    for (int i = 0; i <= nthread; ++i){
        nrange[i] = 1 + (n_lim-1) * i / nthread;
    }

    #pragma omp parallel for
    for (int i = 0; i < nthread; ++i){
        sinx_v[i] = pairint(nrange[i], nrange[i+1]-nrange[i], dx);
    }

    sinx = dx * ( vec_sum(sinx_v, nthread) + sinx ) ;

    return sinx;
}

float integratef(const float l, const float u, const long n_lim){
    return integrate(l, u, n_lim);
}
double integrated(const double l, const double u, const long n_lim){
    return integrate(l, u, n_lim);
}

ccall wrapper:

libcpp = Libdl.dlopen("path/to/sharedlib.so")
intf = Libdl.dlsym(libcpp, :integratef)
function intsotest(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10_000_000)
    ccall(f, Float32, (Float32, Float32, Int64), l, u, n)
end

The Fortran translation:

module integral
implicit none

integer, parameter                  ::  nthread = 16

contains
    function integrate(l, u, n_lim) result(sinx)
        real, intent(in), value             ::  l, u
        integer(8), intent(in), value       ::  n_lim
        real                                ::  dx, sinx
        real, dimension(nthread)            ::  sinx_v                                        
        integer, dimension(0:nthread)       ::  nrange
        integer                             ::  i

        dx = (u - l) / n_lim
        sinx = 0.5 * ( sin(l) + sin(u) )

        do i = 0, nthread
            nrange(i) = 1 + (n_lim-1) * i / nthread
        end do

        !$OMP parallel do
        do i = 1, nthread
            sinx_v(i) = pairint(nrange(i-1), nrange(i)-nrange(i-1), dx)
        end do
        !$OMP end parallel do

        sinx = dx * ( sinx + sum(sinx_v) )

    end function integrate


    recursive function pairint(l, n, dx) result(sum_x)
        implicit none
        integer, intent(in)                         :: l, n
        real,    intent(in)                         :: dx
        integer                                     :: n1
        real                                        :: sum_x

        if ( n < 20 ) then
            sum_x = 0.0
            do n1 = l, l + n - 1
                sum_x = sum_x + sin( n1 * dx )
            end do
        else
            n1 = n / 2
            sum_x = pairint(l, n1, dx) + pairint(l + n1, n - n1, dx)
        end if
    end function pairint
        
    
end module integral

ccall wrapper:

libfort1 = Libdl.dlopen("path/to/sharedlib.so")
intfort1 = Libdl.dlsym(libfort1, :__integral_MOD_integrate)
function intsotestf(f, l::Float32 = 0f0, u::Float32 = Float32(π), n::Int64 = 10_000_000)
    ccall(f, Float32, (Ptr{Float32}, Ptr{Float32}, Ptr{Int64}), &l, &u, &n)
end
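
(On newer Julia versions the & syntax inside ccall is gone; an equivalent wrapper would pass the by-reference Fortran arguments as Refs, e.g. ccall(f, Float32, (Ptr{Float32}, Ptr{Float32}, Ptr{Int64}), Ref(l), Ref(u), Ref(n)).)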

I think the algorithm being used here is also IRLS, isn’t it? w is the weight vector, which is getting updated at each iteration.

I’m not sure why the number of iterations being run is 1000. Usually IRLS converges very quickly - 4 or 5 iterations is not uncommon.
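
For comparison, an IRLS/Newton step for logistic regression looks roughly like this (a sketch only, assuming the same ±1 labels as the code above, mapped to 0/1; this is not what the benchmarks run):

function logistic_irls(Y, X, w, iterations = 5)
    y = (Y .+ 1.0) ./ 2.0                      # map ±1 labels to 0/1
    for i in 1:iterations
        p = 1.0 ./ (1.0 .+ exp.(-X * w))        # predicted probabilities
        W = p .* (1.0 .- p)                     # IRLS weights
        w += (X' * (W .* X)) \ (X' * (y .- p))  # weighted least-squares (Newton) step
    end
    w
end

Each step solves a small weighted least-squares problem, which is why a handful of iterations is usually enough, at the cost of a linear solve per iteration.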

On looking more closely at the code I take back what I wrote. w is the coefficient vector.