1.0 annoyances and Matlab comparison

As well as, notably, eigenvalue-related functions.

I guess this is a case of cultural divergence between different users. It’s impossible for me to think of linear algebra as somehow not “general purpose”.

As I’ve said, having most things already available as operators while hiding the eigenvalue and special-matrix functionality is a compromise I can definitely live with. Losing the operators would be unthinkable, and please, for the love of god, don’t ever make us do using ComplexNumbers or do a using to get access to \pi (not that anyone has suggested this, fortunately).


Thank you so much for your reply. So, how do you interpret these results? They are for the matrix_statistics benchmark, as I said in my post, with t = 100 and n = 256. I tried to be especially kind to our ancestors, because I see some people today claiming that they beat Fortran in numeric code. I used ifort with MKL (both are free).

  JuliaPro (MKL):
  5.309406 seconds (3.33 k allocations: 3.516 GiB, 9.87% gc time)
  (0.013036323190621493, 0.013258036813688603)

  Julia (with @fastmath):
  2.301355 seconds (4.33 k allocations: 2.930 GiB, 13.66% gc time)
  (0.011631471885818373,0.011698168162769071)
  
  MATLAB (MKL):
  Elapsed time is 4.699835 seconds.
  0.013145870492325  0.014041357198792
  
  The Dinosaur (Fortran with MKL): 
  Time : 0.75700 seconds
  0.011068232  0.010917814

No, almost all the linear algebra (*, /, BLAS, LAPACK, etc.) is in LinearAlgebra. Base includes arrays but no linear algebra.
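A minimal illustration of the split (on 0.7 and later):

A = rand(3, 3)
b = rand(3)

A * A      # works without any import: the matrix methods extend Base's *
A \ b      # likewise
inv(A)     # likewise

# eigvals(A)   # UndefVarError: eigvals is exported by LinearAlgebra, not Base

using LinearAlgebra
eigvals(A)     # now visible, along with dot, norm, I, Diagonal, eigen, svd, ...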


Since you are the king of the FFT, I’ll take this opportunity to show you a piece of old code from my research which lends itself very well to vectorization. I tried many “ugly” tricks, such as in-place operations, A_mul_B!, rand!, and replacing many vectorized functions with explicit loops. In the end I got only a 3x speed-up, at the cost of making the code totally unreadable and no longer expressive of the problem at hand.

What I mean is that the choice between loopy and vectorized code depends on the problem at hand. In my case, I was simulating the transmission of data packets using the OFDM technique. This kind of code cannot easily be rewritten as scalar loops, and it allocates many temporary arrays because of the vertical and horizontal concatenation of arrays.

Sorry the code is not so short:

function MyPTS(n)
    N, L, r = 64, 4, 0  
    p = [1, -1, im, -im]  # phase factor possible values
    B = complex(zeros(256,4))
    for a in p, b in p, c in p, d in p
        r += 1
        B[r,:] = [a, b, c, d] # all possible combinations
    end
    papro        = zeros(n) 
    papr_min     = zeros(n)

    for i = 1:n   
        # Generate OFDM symbols: (N = 64)
        ofdm_symbol = rand([1+im, 1-im, -1+im, -1-im], N) 
        
        # calculate  papr of original ofdm (L = 256)
        time_signal = ifft([ofdm_symbol[1:32]; zeros((L-1)*N); ofdm_symbol[33:64]])
        meano = mean(abs2.(time_signal))
        peako = maximum(abs2.(time_signal))
        papro[i] = 10log10(peako/meano) 
        
        # Partition OFDM Symbol (L = 64)
        P1 = [ ofdm_symbol[1:16]; zeros(48) ] 
        P2 = [ zeros(16); ofdm_symbol[17:32]; zeros(32) ] 
        P3 = [ zeros(32); ofdm_symbol[33:48]; zeros(16) ] 
        P4 = [ zeros(48); ofdm_symbol[49:64] ] 
        
        # Transform Pi's to Time Domain (L = 256)
        Pt1 = ifft( [ P1[1:32]; zeros((L-1)*N); P1[33:64] ] ) 
        Pt2 = ifft( [ P2[1:32]; zeros((L-1)*N); P2[33:64] ] ) 
        Pt3 = ifft( [ P3[1:32]; zeros((L-1)*N); P3[33:64] ] ) 
        Pt4 = ifft( [ P4[1:32]; zeros((L-1)*N); P4[33:64] ] ) 

        Pt_B = B * [Pt1.'; Pt2.'; Pt3.'; Pt4.']
        Pt_B_abs = abs2.(Pt_B)
        all_means = mean(Pt_B_abs, 2)
        all_max = maximum(Pt_B_abs, 2)       
       
        all_papr = 10*log10.(all_max ./ all_means) 
        papr_min[i] = minimum(all_papr)
    end
    println(mean(papro)," ", mean(papr_min))
end

MyPTS(100) # warmup
for i = 1:3
    @time MyPTS(10_000)
end

True, my observations are along the same lines, but in our case it’s not so much about the compiler as about using the same optimized library routines. Most of the time is spent in matrix multiplications, powers, and randn, and I used the same external library functions for these. See the results in my reply to @stevengj.

Oh yeah.

https://docs.julialang.org/en/latest/stdlib/LinearAlgebra/#Linear-Algebra-1

That’s a lot more than I thought.

The new Moore’s Law of computer language design is: "learn from the Julia developers and community how vibrant and welcoming a community around a language can be!"

Thank you so much @StefanKarpinski for your valuable replies, and thank you all for the helpful discussions. This awesome community is the best part of Julia, IMO.

True, but as I said, a tiny matrix like this is a special case: one can unroll all the operations by hand to remove any overhead and obtain the best performance. Small matrices have their own design considerations, which may not apply to larger ones. See, for example, the latest 2018 Intel MKL library improvements for small matrices.
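For concreteness, here is a minimal sketch of the kind of hand-unrolling I mean, for the 2x2 case (illustrative only; packages like StaticArrays.jl generate this kind of fully unrolled code automatically):

# Hypothetical hand-unrolled 2x2 multiply: no loops, no allocations,
# every product written out. Column-major layout: A = (a11, a21, a12, a22).
function mul2x2(A::NTuple{4,Float64}, B::NTuple{4,Float64})
    (A[1]*B[1] + A[3]*B[2],
     A[2]*B[1] + A[4]*B[2],
     A[1]*B[3] + A[3]*B[4],
     A[2]*B[3] + A[4]*B[4])
end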


I’m a little confused. Matrix operations work without using LinearAlgebra; the operators seem to behave normally even without it. Thankfully, even inv works without it. Are you saying that these are somehow not using BLAS or something like that?

EDIT: Please disregard what I say about ifort here. I should have used -fast instead of -Ofast. I will test with the correct flag later.

I’ve routinely found ifort to be a factor of 2-3 faster than gfortran. So comparisons between Julia and Fortran are only fair if the same Fortran compiler is used.

It seems highly variable. I’ve been writing Fortran for a consulting project where the user is dedicated to MATLAB. Fortran seemed a lot more interesting, so I decided to work in that.

I’m now on a computer with a 3.6 GHz i7:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libmkl_rt
  LAPACK: libmkl_rt
  LIBM: libimf
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Now…

$ ifort -Ofast -xHost -shared -fPIC cholesky.F90 normal.F90 -o foster.so
$ gfortran-7 -Ofast -march=native -shared -fPIC cholesky.F90 normal.F90 -o gfoster.so
julia> using Compat, BenchmarkTools

julia> libfort1 = Libdl.dlopen("/home/celrod/Documents/ProbCol/foster.so");
julia> intfort1 = Libdl.dlsym(libfort1, :probcol_mp_pc2dfoster_circle_);
julia> libgfort1 = Libdl.dlopen("/home/celrod/Documents/ProbCol/gfoster.so");
julia> gfort1 = Libdl.dlsym(libgfort1, :__probcol_MOD_pc2dfoster_circle);

julia> function intsotestf!(Pc::Ref{Float64}, f,r1,v1,cov1,r2,v2,cov2,HBR::Ref{Float64})
           ccall(f, Cvoid, (Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}, Ref{Float64}), Pc, r1, v1, cov1, r2, v2, cov2, HBR)
           Pc
       end
intsotestf! (generic function with 1 method)

###Defined inputs

julia> intsotestf!(Pc, intfort1, r1, v1, cov1, r2, v2, cov2, rHBR)
Base.RefValue{Float64}(2.7060234765689587e-5)

julia> intsotestf!(Pc, gfort1, r1, v1, cov1, r2, v2, cov2, rHBR)
Base.RefValue{Float64}(2.706023476568956e-5)

julia> @benchmark intsotestf!($Pc, $intfort1, $r1, $v1, $cov1, $r2, $v2, $cov2, $rHBR)
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.831 μs (0.00% GC)
  median time:      1.840 μs (0.00% GC)
  mean time:        1.874 μs (0.00% GC)
  maximum time:     3.204 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark intsotestf!($Pc, $gfort1, $r1, $v1, $cov1, $r2, $v2, $cov2, $rHBR)
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  6
  --------------
  minimum time:     631.165 ns (0.00% GC)
  median time:      635.118 ns (0.00% GC)
  mean time:        651.982 ns (1.55% GC)
  maximum time:     8.420 μs (87.74% GC)
  --------------
  samples:          10000
  evals/sample:     170

I also translated all the code to Julia:

julia> pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,HBR)
2.7060234765689807e-5

julia> @benchmark pc2dfoster_circle($r1,$v1,$cov1,$r2,$v2,$cov2,$HBR)
BenchmarkTools.Trial: 
  memory estimate:  672 bytes
  allocs estimate:  11
  --------------
  minimum time:     1.835 μs (0.00% GC)
  median time:      1.873 μs (0.00% GC)
  mean time:        1.986 μs (1.25% GC)
  maximum time:     135.751 μs (92.27% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark erf(2.0)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.628 ns (0.00% GC)
  median time:      44.743 ns (0.00% GC)
  mean time:        46.014 ns (0.00% GC)
  maximum time:     98.411 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

Julia built with libopenlibm and OpenBLAS:

julia> pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,HBR)
2.7060234765689814e-5

julia> @benchmark pc2dfoster_circle($r1,$v1,$cov1,$r2,$v2,$cov2,$HBR)
BenchmarkTools.Trial: 
  memory estimate:  672 bytes
  allocs estimate:  11
  --------------
  minimum time:     700.407 ns (0.00% GC)
  median time:      714.893 ns (0.00% GC)
  mean time:        768.352 ns (3.42% GC)
  maximum time:     8.002 μs (81.71% GC)
  --------------
  samples:          10000
  evals/sample:     145

julia> @benchmark erf(2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.302 ns (0.00% GC)
  median time:      34.449 ns (0.00% GC)
  mean time:        37.365 ns (0.00% GC)
  maximum time:     115.258 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

Performance is broadly comparable.
My Julia implementation was a direct translation, without any extra performance decorators like @fastmath (fast math is turned on in both Fortran compilations via -Ofast) or @inbounds (Fortran does not check bounds by default); a sketch of what adding them would look like follows the code below.

using SpecialFunctions
function gdensity(w, hbr2=1.0, x0=2.0, u11=1.9979, u12=-0.230327, u22=2.61677)
    nu12w = -u12*w
    radical = sqrt(hbr2 - abs2(u11*w-x0))
    exp(-abs2(w)/2)*( erf((nu12w+radical)*u22) - erf((nu12w-radical)*u22) )
end
gdensity(w, hbr2, x0, u::AbstractArray) = gdensity(w, hbr2, x0, u[1], u[2], u[3])
function gaussianarea(hbr, x0, U)
    n = ( 0.09226835946330202, 0.273662990072083, 0.4457383557765383, 0.6026346363792564, 0.7390089172206591, 0.8502171357296142, 0.9324722294043558, 0.9829730996839018 )
    w = ( 0.036704932945846216,0.035454990477090234,0.032997670831211, 0.029416655081518854,0.02483389042441457,0.01940543741387547, 0.01331615550646369,0.006773407894247478 )
    hbr2 = abs2(hbr)
    s = hbr / U[1]
    mid = x0 / U[1]
    ga = w[1] * ( gdensity(mid-s*n[1], hbr2,x0,U) + gdensity(mid+s*n[1], hbr2,x0,U) )    
    for i = 2:length(n)
        ga += w[i] * ( gdensity(mid-s*n[i], hbr2,x0,U) + gdensity(mid+s*n[i], hbr2,x0,U) )
    end
    ga * s
end
function normalize!(x)
    x0 = norm(x)
    x ./= x0
end
function pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,hbr)
    Cp = r1 .- r2
    x0 = norm(Cp)
    y = v1 - v2
    z = cross(Cp,y)
    normalize!(y)
    normalize!(z)   
    combinecov!(Cp, cov1, cov2, y, z)
    cholesky2x2!(Cp)
    Cp[3] = 1/(Cp[3]*√2)
    gaussianarea(hbr, x0, Cp)
end
function invert2x2triangle!(Ui, U)
    t11 = 1 / U[1]
    t12 = - U[2] * t11 / U[3]
    Ui[1] = t11
    Ui[2] = t12
    Ui[3] = 1 / U[3]
    Ui
end
function combinecov!(Cp, cov1, cov2, y, z)
    cov = cov1 .+ cov2

    x = cross(y, z)
    a = x[1]*cov[1] + x[2]*cov[2] + x[3]*cov[4]
    b = x[1]*cov[2] + x[2]*cov[3] + x[3]*cov[5]
    c = x[1]*cov[4] + x[2]*cov[5] + x[3]*cov[6]

    Cp[1] = a*x[1] + b*x[2] + c*x[3]
    Cp[2] = a*z[1] + b*z[2] + c*z[3]

    a = z[1]^2*cov[1] + 2*z[1]*z[2]*cov[2] + 2*z[1]*z[3]*cov[4] 
    b = z[2]^2*cov[3] + 2*z[2]*z[3]*cov[5] + z[3]^2*cov[6]

    Cp[3] = a + b
end
function cholesky2x2!(S)
    s11 = sqrt(S[1])
    t12 = S[2] / s11

    S[1] = s11
    S[2] = t12
    S[3] = sqrt( S[3] - t12^2 )
end
###data
r1      = [378.39559, 4305.721887, 5752.767554];
v1      = [2.360800244, 5.580331936, -4.322349039];
r2      = [374.5180598, 4307.560983, 5751.130418];
v2      = [-5.388125081, -3.946827739, 3.322820358];
cov1 = [44.5757544811362, 81.6751751052616, 158.453402956163, -67.8687662707124, -128.616921644857, 105.490542562701];
cov2 = [2.31067077720423, 1.69905293875632, 1.24957388457206, -1.4170164577661, -1.04174164279599, 0.869260558223714];
HBR     = 0.020;
pc2dfoster_circle(r1,v1,cov1,r2,v2,cov2,HBR)

I would be happy to share the Fortran code too if anyone is willing to look it over.
Maybe I’m missing something and making blunders I’m unaware of, but Julia seems broadly comparable.
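For what it’s worth, here is a rough, hypothetical sketch of what those decorators would look like, applied to a loop in the style of gaussianarea above (not part of the translation; n and w are passed in here just to keep the sketch short):

# @inbounds drops the bounds checks that Fortran omits by default;
# @fastmath allows the reassociations that -Ofast already permits in Fortran.
function gaussianarea_fast(hbr, x0, U, n, w)
    hbr2 = abs2(hbr)
    s = hbr / U[1]
    mid = x0 / U[1]
    ga = 0.0
    @inbounds @fastmath for i in eachindex(n)
        ga += w[i] * ( gdensity(mid - s*n[i], hbr2, x0, U) +
                       gdensity(mid + s*n[i], hbr2, x0, U) )
    end
    ga * s
end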

No, all the standard matrix multiplication implementations are in stdlib/LinearAlgebra/src/matmul.jl:

julia> @which A*A
*(A::Union{AbstractArray{T,1}, AbstractArray{T,2}} where T, B::Union{AbstractArray{T,1}, AbstractArray{T,2}} where T) in LinearAlgebra at /Users/solver/Projects/julia7-branch/usr/share/julia/site/v0.7/LinearAlgebra/src/matmul.jl:59

I think the confusion is that Base still loads LinearAlgebra, just not into Main’s namespace. So using LinearAlgebra simply brings the exported names into Main.


Before anyone asks what the point of this is, it allows more of the implementation to be moved out of base Julia in the future. Since people must explicitly load LinearAlgebra, their code won’t break when we do that.


The great thing about Julia is that I don’t have to do it manually, just use

and in most cases I can leave my code generic.


Regarding uninitialized, my complaint here is that I feel Julia is deviating from being a user-friendly language toward being a developer-friendly one. In all languages that I’m aware of, declaring an array means that it is uninitialized by default. I’d have hoped Julia could say A = Array(m,n), meaning a double-precision m×n array by default with the element type as an optional parameter, or simply A = Float64{m,n} and v = Int{n}, with the obvious meaning of both.

I mean, simplicity is king: since almost all programs contain these declarations, we should try very hard to come up with the most elegant syntax possible for such constructs. I know the hard technical challenges behind these choices (long discussions on GitHub), but if this is the last resort, maybe saying nil, none, or undef would be less wordy than uninitialized (Julia already has nothing). I don’t have a solid proposal, but this issue could be discussed further before it’s too late.

So why is B = A * 2 allowed? Doesn’t this mean inconsistency? MATLAB, for example, says that “addition and subtraction are the same for arrays and matrices, but that multiplicative operations are different” (MATLAB Primer, page 55 (2-11)), and there .+ and .- are not allowed.

Re array constructors, I think this is only about the Array constructors, which are not supposed to be called directly in most cases; you just do fill(0.0, m, n) or similar.
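A few of the usual alternatives, for illustration:

zeros(3, 4)            # 3×4 Matrix{Float64} of zeros
fill(0.0, 3, 4)        # the same, with an explicit fill value
fill(1 + 0im, 5)       # works for any element value/type
trues(3, 4)            # BitArray of true
[i + j for i in 1:3, j in 1:4]   # comprehension when values depend on the indices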

Regarding 2*A, there’s a fundamental ambiguity between arrays as containers and arrays as linear algebra objects. I think the general plan is that non-dotted operators are for algebraic operations, and dotted operators are for container operations. A + 1 does not make algebraic sense; 2*A does.
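Concretely (on 0.7/1.0):

A = [1 2; 3 4]

2 * A      # fine: scaling a matrix is an algebraic operation
A + A      # fine: matrix addition
A .+ 1     # fine: elementwise container operation

# A + 1    # MethodError: adding a scalar to a matrix is not an algebraic
#          # operation; use A .+ 1 instead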

julia> const nil = uninitialized
Uninitialized()

julia> Array{Int}(nil, 1)
1-element Array{Int64,1}:
 140550451331952

I think the important point that’s missing in this discussion is that uninitialized is supposed to be a bit unwieldy, because you’re not supposed to be using it unless you know what you’re doing. The best way to initialize an array is to pick whatever value makes sense for you (e.g. fill(0, m, n)).


This makes perfect sense to me, but as a new user to Julia, using this method as sort of the default technique to initialize an array would not have occurred to me. If that is the intended usage, perhaps the introductory material should make that clearer.

There isn’t always such a readily available value when your function operates on generic types.
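For example (a sketch, with a made-up helper name, using the uninitialized spelling under discussion here):

# Generic code: the output element type comes from f, so there is no obvious
# value to fill with up front.
function apply_all(f, xs::AbstractVector)
    y1 = f(xs[1])
    out = Vector{typeof(y1)}(uninitialized, length(xs))
    out[1] = y1
    for i in 2:length(xs)
        out[i] = f(xs[i])
    end
    out
end

apply_all(sin, [0.0, 0.5, 1.0])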

I consider uninitialized a wart that resulted from a compromise. I will learn to live with it, but I am not sure that we should come up with ex post rationalizations along the lines that it is ugly because not using it is somehow good for you. I counted 200+ uses in Base, and I am sure there will be thousands in packages.


Perhaps we should rename it to junk after all.


There’s nothing ex post about that rationalization. We spent many hours on triage talking about the name, and the fact that you’re not supposed to be really using it was absolutely considered (because if you were it would be too long). With respect to it showing up all over the place in base, that’s not particularly surprising because

a) It used to be the default in 0.6, so doing anything else would have required auditing every use of the array constructor to see if there was something better to do there (or if it was irrelevant to performance), which is probably a good thing to do, but wasn’t worth the time for the 1.0 cycle

and

b) Base has a lot of performance sensitive code and the people working on it fall squarely in the “know what they’re doing” category.

Frankly, we wouldn’t have any way of getting uninitialized memory at all if doing zero initialization by default wasn’t such a performance problem.
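To make the trade-off concrete, a rough micro-benchmark sketch (using the uninitialized spelling discussed in this thread):

using BenchmarkTools
@btime zeros(10^6);                           # allocates and zero-fills
@btime Array{Float64}(uninitialized, 10^6);   # allocates only, no fill pass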
