Parallelizing for loop in the computation of a gradient

question

#1

I’ve ported some code into Julia from a Cython implementation that may be found here. The Cython code is parallelized and currently runs ~0.4x faster on my machine than my best Julia implementation.

The function where most of the time is spent can be found here. The most expensive part of the routine is the for loop at line 46, which takes most of the function’s computational time and usually runs for thousands or millions of iterations. I believe it can be parallelized, but my attempts with @parallel and Threads.@threads have turned out a lot slower.

Some benchmarking shows me the following:

Non-parallelized code
BenchmarkTools.Trial:
memory estimate: 1.09 MiB
allocs estimate: 11

minimum time: 254.846 ms (0.00% GC)
median time: 282.302 ms (0.00% GC)
mean time: 285.344 ms (0.00% GC)
maximum time: 319.025 ms (0.00% GC)

samples: 18
evals/sample: 1

Threads.@threads with JULIA_NUM_THREADS = 2 * NUM_CORES
BenchmarkTools.Trial:
memory estimate: 675.00 MiB
allocs estimate: 39622415

minimum time: 4.212 s (0.00% GC)
median time: 4.459 s (0.00% GC)
mean time: 4.459 s (0.00% GC)
maximum time: 4.707 s (0.00% GC)

samples: 2
evals/sample: 1

@parallel with nprocs() = NUM_CORES
BenchmarkTools.Trial:
memory estimate: 1.11 MiB
allocs estimate: 390

minimum time: 7.044 s (0.00% GC)
median time: 7.044 s (0.00% GC)
mean time: 7.044 s (0.00% GC)
maximum time: 7.044 s (0.00% GC)

samples: 1
evals/sample: 1

Here is the implementation with Threads.@threads
Here is the implementation with @parallel

Any ideas on how to speed up the code and/or properly use @parallel or Threads.@threads would be appreciated!

To run the code in the gists, you can use the following data and parameters:

data =  reshape([0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.4466; 0.42653; 0.40647; 0.3864; 0.36633; 0.34627; 0.3262; 0.30613; 0.28606; 0.26667; 0.26667; 0.26667; 0.26667; 0.38788; 0.5697; 0.75152; 0.86667; 0.86667; 0.86667; 0.86667; 0.86667; 0.86667; 0.86667; 0.74545; 0.56364; 0.38182; 0.2747; 0.2988; 0.32289; 0.34699; 0.37108; 0.39518; 0.41928; 0.44337; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.45552; 0.42207; 0.38863; 0.35518; 0.32174; 0.28829; 0.25485; 0.2214; 0.18796; 0.15452; 0.16027; 0.24108; 0.32189; 0.4; 0.50101; 0.60202; 0.70303; 0.69313; 0.63283; 0.57253; 0.51223; 0.45193; 0.39162; 0.33333; 0.31313; 0.29293; 0.27273; 0.26131; 0.25328; 0.24525; 0.23722; 0.22918; 0.22115; 0.21312; 0.20509; 0.22694; 0.30774; 0.38855; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.37239; 0.23098; 0.089562; 0.015608; 0.06243; 0.10925; 0.15608; 0.2029; 0.24972; 0.29654; 0.34337; 0.39019; 0.43701; 0.46667; 0.46667; 0.46667; 0.46667; 0.43322; 0.39978; 0.36633; 0.33289; 0.29944; 0.266; 0.23255; 0.19911; 0.16566; 0.13333; 0.21414; 0.29495; 0.37576; 0.46734; 0.56835; 0.66936; 0.71323; 0.65293; 0.59263; 0.53233; 0.47203; 0.41173; 0.35142; 0.31987; 0.29966; 0.27946; 0.26399; 0.25596; 0.24793; 0.23989; 0.23186; 0.22383; 0.2158; 0.20776; 0.2; 0.28081; 0.36162; 0.44242; 0.46667; 0.46667; 0.58481; 0.76203; 0.93333; 0.6068; 0.35402; 0.2936; 0.37441; 0.45522; 0.53333; 0.56232; 0.5913; 0.54631; 0.46577; 0.38523; 0.3047; 0.22416; 0.30884; 0.46667; 0.46667; 0.46667; 0.46667; 
0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.46667; 0.4466; 0.42653; 0.40647; 0.3864; 0.36633; 0.34627; 0.3262; 0.30613; 0.28606; 0.26667; 0.26667; 0.26667; 0.26667; 0.38788; 0.5697; 0.75152; 0.86667; 0.86667; 0.86667; 0.86667; 0.86667; 0.86667; 0.86667; 0.74545; 0.56364; 0.38182; 0.2747; 0.2988; 0.32289; 0.34699; 0.37108; 0.39518; 0.41928; 0.44337; 0.46667; 0.46667; 0.46667; 0.46667], 267,1)

function label(data::Array{Float64,2})
    # This is a function for testing purposes
    n = size(data, 1)
    triplets = zeros(Int64, n*binomial(n-1, 2), 3)
    counter = 0
    D = distances(data, size(data, 1))::Array{Float64,2}
    mistake = false
    for k = 1:n, j = 1:k-1, i = 1:n
        if i != j && i != k
            if D[i,j] < D[i,k]
                counter += 1
                @inbounds triplets[counter,:] = [i, j, k]
            elseif D[i,j] > D[i,k]
                counter += 1
                @inbounds triplets[counter,:] = [i, k, j]
            end
        end
    end
    return triplets[1:counter,:]
end

function distances(X::Array{Float64,2}, no_objects::Int64)
    D = zeros(Float64, no_objects, no_objects)

    for j = 1:no_objects, i = j:no_objects
        @inbounds D[i,j] = norm(X[i,:] - X[j,:])
    end

    return D + D'
end


triplets = label(data)
alpha = 2.0
lambda = 0.0
X = 0.0001*randn(size(data))
no_triplets = size(triplets,1)
no_objects = maximum(triplets)
no_dims = 1

Then you can call the function from the gist.

EDIT: Fixes to the pasted code for it to run as expected.


#2

Your label function doesn’t run. Is it just missing an end somewhere?


#3

I added an extra end to get the code to run, but your data sample is a 1-D array while label() expects a 2D array (I assumed you actually want an Nx1 matrix). It also tries to call a distances() function which isn’t defined.

In any case, my first reaction from glancing at the code is that it would probably be more cache-friendly to store your triplets data in the opposite orientation. triplets is currently Nx3, but you typically access the elements triplets[i, 1], triplets[i, 2] and triplets[i, 3] sequentially. Julia is column-major, so you want your innermost iteration to run along the first axis (the opposite of numpy). A 3xN matrix is therefore likely to perform better, but I would personally use a Vector{SVector{3, Int}} from StaticArrays.jl. A Vector{SVector{3, Int}} has exactly the same memory size and layout as a 3xN matrix of Ints, but I think a collection of triples is represented more clearly that way than as a matrix. Using a Vector of SVectors would also likely let you change your code from:

for t in 1:no_triplets
        @inbounds triplets_A = triplets[t, 1]
        @inbounds triplets_B = triplets[t, 2]
        @inbounds triplets_C = triplets[t, 3]
...

to just

for triplet in triplets
   # <do stuff directly with triplet[1] etc.>
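To make the layout claim concrete, here is a minimal sketch using StaticArrays.jl on a recent Julia (the toy triplet values are made up for illustration, not taken from the actual data):

```julia
using StaticArrays

# Hypothetical toy data: three triplets stored as a vector of static 3-vectors
triplets = [SVector(1, 2, 3), SVector(2, 3, 1), SVector(3, 1, 2)]

# The memory layout matches a 3xN Int matrix, so reinterpreting is allocation-free
M = reshape(reinterpret(Int, triplets), 3, :)

# Iterate over whole triplets instead of indexing rows of a matrix
s = sum(t -> t[1] + t[2] + t[3], triplets)
```

Each `t` in the loop is a stack-allocated SVector, so there is no per-iteration heap allocation the way there is with `triplets[t, :]` slices of a matrix.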

#4

You’re right, sorry about that. This is the correct function:

function label(data::Array{Float64,2})
    # This is a function for testing purposes
    n = size(data, 1)
    triplets = zeros(Int64, n*binomial(n-1, 2), 3)
    counter = 0
    D = distances(data, size(data, 1))::Array{Float64,2}
    mistake = false
    for k = 1:n, j = 1:k-1, i = 1:n
        if i != j && i != k
            if D[i,j] < D[i,k]
                counter += 1
                @inbounds triplets[counter,:] = [i, j, k]
            elseif D[i,j] > D[i,k]
                counter += 1
                @inbounds triplets[counter,:] = [i, k, j]
            end
        end
    end
    return triplets[1:counter,:]
end

#5

Again, the distances function is undefined and label takes a matrix but you give it a vector. Can’t you just run this in a fresh REPL to see that it works?


#6

You mean column-major :slight_smile: (just to avoid confusion)


#7

Regarding Threads.@threads, I was playing around with that myself this weekend and was also surprised by slowdowns and huge increases in allocations compared to the non-threaded version. In my case it turned out to be because of the old nemesis, https://github.com/JuliaLang/julia/issues/15276, as @threads creates a closure internally.

You should compare the @code_warntype for the non-parallelized version to what you get with Threads.@threads. If you’re seeing that variable types aren’t properly inferred anymore, the issue is likely to be what I described above. One of the standard workarounds, which worked in my case, is to use a let block, as described in https://github.com/JuliaLang/julia/issues/15276#issuecomment-318598339. On 0.6.2 however, part of the issue for me was that one of the variables created inside @threads (range) is also used in a closure. This has been fixed in master: https://github.com/JuliaLang/julia/pull/24688.
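For reference, a minimal sketch of that let-block workaround (a hypothetical reduced example, not the actual ∇tSTE code): rebinding the captured variables in a let block keeps them from being boxed inside the closure that Threads.@threads creates.

```julia
function threaded_sum(xs::Vector{Float64})
    partial = zeros(Threads.nthreads())  # one accumulator per thread
    # Rebind xs and partial so the closure created by @threads captures
    # freshly bound locals instead of boxed variables (issue #15276)
    let xs = xs, partial = partial
        Threads.@threads for i in eachindex(xs)
            @inbounds partial[Threads.threadid()] += xs[i]
        end
    end
    return sum(partial)
end
```

The per-thread accumulator also avoids a data race that a single shared accumulator would have.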


#8

Argh, whoops :stuck_out_tongue:


#9

Here is the distances function

function distances(X::Array{Float64,2}, no_objects::Int64)
    D = zeros(Float64, no_objects, no_objects)

    for j = 1:no_objects, i = j:no_objects
        @inbounds D[i,j] = norm(X[i,:] - X[j,:])
    end

    return D + D'
end

#10

Thanks @rdeits for the comments. When porting from Cython I tried both row-major and column-major approaches, and the column-major one gave a significant improvement, as expected.

I didn’t know about static arrays, and your comment definitely makes sense. triplets is truly a set containing triples of objects. I’ll check how representing it that way changes performance.


#11

@tkoolen, I called the function using @code_warntype, and get the same output for the threaded and non-parallelized versions:

Non-parallelized

Variables:
  #self#::Embeddings.#∇tSTE
  X::Array{Float64,2}
  no_objects::Int64
  no_dims::Int64
  no_triplets::Int64
  triplets::Array{Int64,2}
  λ::Float64
  α::Float64

Body:
  begin
      return $(Expr(:invoke, MethodInstance for #∇tSTE#5(::Bool, ::Function, ::Array{Float64,2}, ::Int64, ::Int64, ::Int64, ::Array{Int64,2}, ::Float64, ::Float64), :(Embeddings.#∇tSTE#5), true, :(#self#), :(X), :(no_objects), :(no_dims), :(no_triplets), :(triplets), :(λ), :(α)))
  end::Tuple{Float64,Array{Float64,2}}

Threaded

Variables:
  #self#::Embeddings.#∇tSTE
  X::Array{Float64,2}
  no_objects::Int64
  no_dims::Int64
  no_triplets::Int64
  triplets::Array{Int64,2}
  λ::Float64
  α::Float64

Body:
  begin
      return $(Expr(:invoke, MethodInstance for #∇tSTE#5(::Bool, ::Function, ::Array{Float64,2}, ::Int64, ::Int64, ::Int64, ::Array{Int64,2}, ::Float64, ::Float64), :(Embeddings.#∇tSTE#5), true, :(#self#), :(X), :(no_objects), :(no_dims), :(no_triplets), :(triplets), :(λ), :(α)))
  end::Tuple{Float64,Array{Float64,2}}

I’m using Julia v0.6.2, not master. How/where can I use the let keyword? I’m not sure which of the variables isn’t being inferred by the compiler in the loop for t in 1:no_triplets. Thanks for your help.


#12

∇tSTE is a keyword argument function. The @code_warntype output you show is just for the little bridge function that Julia 0.6 creates automatically, which sorts the keyword arguments before calling the ‘real’ function. The easiest way around this is to temporarily remove the keyword argument.
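As a hypothetical illustration (not the actual ∇tSTE signature), a positional stand-in for the keyword-argument method lets @code_warntype reach the body that does the real work:

```julia
# Keyword version: @code_warntype f(1.0) only shows the small
# kwarg-sorting bridge function, not the inferred body
f(x; α = 2.0) = α * x

# Positional stand-in for inspection: @code_warntype f_positional(1.0, 2.0)
# shows the inferred types of the actual computation
f_positional(x, α) = α * x
```

Both methods compute the same thing; the positional one exists only so the inspection tools see past the bridge.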


#13

Ok, I found two problems with the @threads implementation: the types of the variables C and range are not inferred. (I’m not sure where range is used, as it’s not defined by me; I guess it’s the range over which the for loop iterates?) This is the output of @code_warntype.

Variables:
  #self# <optimized out>
  X::Array{Float64,2}
  no_objects::Int64
  no_dims::Int64
  no_triplets::Int64
  triplets::Array{Int64,2}
  λ::Float64
  α::Float64
  i@_9::Int64
  #temp#@_10::Int64
  k@_11::Int64
  #temp#@_12::Int64
  k@_13::Int64
  #temp#@_14::Int64
  i@_15::Int64
  #temp#@_16::Int64
  j::Int64
  #temp#@_18::Int64
  n::Int64
  #temp#@_20::Int64
  i@_21::Int64
  #temp#@_22::Int64
  #5@_23 <optimized out>
  P::Array{Float64,1}
  C@_25::Core.Box
  ∇C::Array{Float64,2}
  sum_X::Array{Float64,1}
  K::Array{Float64,2}
  Q::Array{Float64,2}
  A_to_B::Array{Float64,1}
  A_to_C::Array{Float64,1}
  constant::Float64
  triplets_A::Array{Int64,1}
  triplets_B::Array{Int64,1}
  triplets_C::Array{Int64,1}
  range::Core.Box
  threadsfor_fun::Embeddings.##122#threadsfor_fun#7{Array{Float64,2},Int64,Array{Int64,2},Array{Float64,1},Array{Float64,2},Array{Float64,2},Array{Float64,2},Array{Float64,1},Array{Float64,1},Float64,Array{Int64,1},Array{Int64,1},Array{Int64,1}}
  #temp#@_38::Bool
  T <optimized out>
  shape <optimized out>
  iter <optimized out>
  C@_42::Array{Float64,2}
  keeps@_43::Tuple{Tuple{Bool,Bool}}
  Idefaults@_44::Tuple{Tuple{Int64,Int64}}
  #temp#@_45 <optimized out>
  keeps@_46 <optimized out>
  Idefaults@_47 <optimized out>
  #temp#@_48 <optimized out>
  keep@_49::Tuple{Bool,Bool}
  Idefault@_50::Tuple{Int64,Int64}
  #temp#@_51 <optimized out>
  ind1@_52 <optimized out>
  keep@_53::Tuple{Bool}
  Idefault@_54::Tuple{Int64}
  #temp#@_55 <optimized out>
  ind1@_56 <optimized out>
  keep@_57 <optimized out>
  Idefault@_58 <optimized out>
  #temp#@_59 <optimized out>
  I_1 <optimized out>
  val_1::Float64
  result::Float64
  I@_63 <optimized out>
  i#724::Int64
  I@_65 <optimized out>
  n#723::Int64
  i#722 <optimized out>
  #temp#@_68::CartesianIndex{1}
  r#721 <optimized out>
  A_1 <optimized out>
  keep_1::Tuple{Bool,Bool}
  Idefault_1::Tuple{Int64,Int64}
  start::CartesianIndex{1}
  stop <optimized out>
  #5@_75 <optimized out>
  #temp#@_76::CartesianIndex{1}
  #temp#@_77 <optimized out>
  #temp#@_78::Float64
  #temp#@_79::Float64

Body:
  begin
      C@_25::Core.Box = $(Expr(:new, :(Core.Box)))
      range::Core.Box = $(Expr(:new, :(Core.Box)))
      NewvarNode(:(threadsfor_fun::Embeddings.##122#threadsfor_fun#7{Array{Float64,2},Int64,Array{Int64,2},Array{Float64,1},Array{Float64,2},Array{Float64,2},Array{Float64,2},Array{Float64,1},Array{Float64,1},Float64,Array{Int64,1},Array{Int64,1},Array{Int64,1}}))
      SSAValue(0) = Tuple{Float64,Array{Float64,2}}
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl nthreads 19
      SSAValue(32) = (Base.Threads.cglobal)(:jl_n_threads, Base.Threads.Cint)::Ptr{Int32}
      SSAValue(33) = (Base.pointerref)(SSAValue(32), 1, 1)::Int32
      # meta: pop location
      $(Expr(:inbounds, :pop))
      P::Array{Float64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :((Base.sext_int)(Int64, SSAValue(33))::Int64), 0)) # line 127:
      $(Expr(:inbounds, false))
      # meta: location broadcast.jl broadcast 455
      # meta: location broadcast.jl broadcast_c 313
      # meta: location broadcast.jl broadcast_indices 48
      # meta: location broadcast.jl broadcast_indices 52
      # meta: location abstractarray.jl indices 64
      SSAValue(38) = (Base.arraysize)(X::Array{Float64,2}, 1)::Int64
      SSAValue(39) = (Base.arraysize)(X::Array{Float64,2}, 2)::Int64
      # meta: pop location
      # meta: pop location
      # meta: pop location
      SSAValue(128) = (Base.select_value)((Base.slt_int)(SSAValue(38), 0)::Bool, 0, SSAValue(38))::Int64
      SSAValue(129) = (Base.select_value)((Base.slt_int)(SSAValue(39), 0)::Bool, 0, SSAValue(39))::Int64 # line 316:
      # meta: location broadcast.jl broadcast_t 268
      C@_42::Array{Float64,2} = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), Array{Float64,2}, 0, SSAValue(128), 0, SSAValue(129), 0)) # line 269:
      # meta: location broadcast.jl map_newindexer 125 # line 126:
      # meta: location broadcast.jl newindexer 108
      # meta: location broadcast.jl broadcast_indices 48
      # meta: location broadcast.jl broadcast_indices 52
      # meta: location abstractarray.jl indices 64
      SSAValue(57) = (Base.arraysize)(X::Array{Float64,2}, 1)::Int64
      SSAValue(58) = (Base.arraysize)(X::Array{Float64,2}, 2)::Int64
      # meta: pop location
      # meta: pop location
      # meta: pop location
      SSAValue(80) = (Base.select_value)((Base.slt_int)(SSAValue(57), 0)::Bool, 0, SSAValue(57))::Int64
      SSAValue(81) = (Base.select_value)((Base.slt_int)(SSAValue(58), 0)::Bool, 0, SSAValue(58))::Int64
      # meta: location broadcast.jl shapeindexer 111 # line 112:
      SSAValue(70) = (Core.tuple)((Base.and_int)((Base.and_int)((1 === 1)::Bool, (1 === 1)::Bool)::Bool, (SSAValue(129) === SSAValue(81))::Bool)::Bool)::Tuple{Bool}
      SSAValue(71) = (Core.tuple)(1)::Tuple{Int64}
      keep@_53::Tuple{Bool} = SSAValue(70)
      Idefault@_54::Tuple{Int64} = SSAValue(71)
      # meta: pop location
      # meta: pop location
      SSAValue(82) = (Core.tuple)((Base.and_int)((Base.and_int)((1 === 1)::Bool, (1 === 1)::Bool)::Bool, (SSAValue(128) === SSAValue(80))::Bool)::Bool, (Core.getfield)(keep@_53::Tuple{Bool}, 1)::Bool)::Tuple{Bool,Bool}
      SSAValue(83) = (Core.tuple)(1, (Core.getfield)(Idefault@_54::Tuple{Int64}, 1)::Int64)::Tuple{Int64,Int64}
      keep@_49::Tuple{Bool,Bool} = SSAValue(82)
      Idefault@_50::Tuple{Int64,Int64} = SSAValue(83)
      # meta: pop location
      SSAValue(119) = (Core.tuple)(keep@_49::Tuple{Bool,Bool})::Tuple{Tuple{Bool,Bool}}
      SSAValue(120) = (Core.tuple)(Idefault@_50::Tuple{Int64,Int64})::Tuple{Tuple{Int64,Int64}}
      keeps@_43::Tuple{Tuple{Bool,Bool}} = SSAValue(119)
      Idefaults@_44::Tuple{Tuple{Int64,Int64}} = SSAValue(120) # line 270:
      # meta: location broadcast.jl _broadcast! 141
      # meta: location broadcast.jl # line 147:
      keep_1::Tuple{Bool,Bool} = (Base.getfield)(keeps@_43::Tuple{Tuple{Bool,Bool}}, 1)::Tuple{Bool,Bool} # line 148:
      Idefault_1::Tuple{Int64,Int64} = (Base.getfield)(Idefaults@_44::Tuple{Tuple{Int64,Int64}}, 1)::Tuple{Int64,Int64} # line 149:
      # meta: location simdloop.jl # line 66:
      # meta: location multidimensional.jl simd_outer_range 246
      start::CartesianIndex{1} = $(Expr(:new, CartesianIndex{1}, :((Core.tuple)(1)::Tuple{Int64})))
      # meta: pop location
      # meta: location multidimensional.jl start 206
      unless (Base.slt_int)(SSAValue(129), (Base.getfield)((Core.getfield)(start::CartesianIndex{1}, :I)::Tuple{Int64}, 1)::Int64)::Bool goto 81 # line 207:
      # meta: location multidimensional.jl + 111
      SSAValue(92) = (Core.tuple)((Base.add_int)(SSAValue(129), 1)::Int64)::Tuple{Int64}
      # meta: pop location
      #temp#@_76::CartesianIndex{1} = $(Expr(:new, CartesianIndex{1}, SSAValue(92)))
      goto 84
      81:  # line 209:
      #temp#@_76::CartesianIndex{1} = start::CartesianIndex{1}
      84:
      # meta: pop location
      #temp#@_68::CartesianIndex{1} = #temp#@_76::CartesianIndex{1}
      87:
      # meta: location multidimensional.jl done 224
      SSAValue(94) = (Core.getfield)(#temp#@_68::CartesianIndex{1}, :I)::Tuple{Int64}
      SSAValue(95) = (Base.getfield)(SSAValue(94), 1)::Int64
      # meta: pop location
      unless (Base.not_int)((Base.slt_int)(SSAValue(129), SSAValue(95))::Bool)::Bool goto 138
      SSAValue(113) = #temp#@_68::CartesianIndex{1}
      SSAValue(114) = $(Expr(:new, CartesianIndex{1}, :((Core.tuple)((Base.add_int)((Base.getfield)((Core.getfield)(#temp#@_68, :I)::Tuple{Int64}, 1)::Int64, 1)::Int64)::Tuple{Int64})))
      #temp#@_68::CartesianIndex{1} = SSAValue(114) # line 67:
      n#723::Int64 = (Base.add_int)((Base.sub_int)(SSAValue(128), 1)::Int64, 1)::Int64 # line 68:
      unless (Base.slt_int)(0, n#723::Int64)::Bool goto 136 # line 70:
      i#724::Int64 = 0 # line 71:
      NewvarNode(:(val_1::Float64))
      NewvarNode(:(result::Float64))
      105:
      unless (Base.slt_int)(i#724::Int64, n#723::Int64)::Bool goto 134 # line 72:
      SSAValue(124) = (Base.add_int)(i#724::Int64, 1)::Int64
      SSAValue(125) = (Core.getfield)((Core.getfield)(SSAValue(113), :I)::Tuple{Int64}, 1)::Int64 # line 73:
      # meta: location broadcast.jl # line 151:
      SSAValue(117) = (Base.select_value)((Base.getfield)(keep_1::Tuple{Bool,Bool}, 1)::Bool, SSAValue(124), (Base.getfield)(Idefault_1::Tuple{Int64,Int64}, 1)::Int64)::Int64
      SSAValue(118) = (Base.select_value)((Core.getfield)(keep_1::Tuple{Bool,Bool}, 2)::Bool, SSAValue(125), (Core.getfield)(Idefault_1::Tuple{Int64,Int64}, 2)::Int64)::Int64 # line 153:
      $(Expr(:inbounds, true))
      val_1::Float64 = (Base.arrayref)(X::Array{Float64,2}, SSAValue(117), SSAValue(118))::Float64
      $(Expr(:inbounds, :pop)) # line 155:
      result::Float64 = (Base.mul_float)(val_1::Float64, val_1::Float64)::Float64 # line 156:
      $(Expr(:inbounds, true))
      # meta: location multidimensional.jl setindex! 300
      (Base.arrayset)(C@_42::Array{Float64,2}, result::Float64, SSAValue(124), SSAValue(125))::Array{Float64,2}
      # meta: pop location
      $(Expr(:inbounds, :pop))
      # meta: pop location # line 74:
      i#724::Int64 = (Base.add_int)(i#724::Int64, 1)::Int64 # line 75:
      $(Expr(:simdloop))
      132:
      goto 105
      134:  # line 79:
      136:
      goto 87
      138:
      # meta: pop location
      # meta: pop location
      # meta: pop location
      # meta: pop location
      goto 145 # line 321:
      145:
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(5) = $(Expr(:invoke, MethodInstance for _mapreduce(::Base.#identity, ::Base.#+, ::IndexLinear, ::Array{Float64,2}), :(Base._mapreduce), :(Base.identity), :(Base.+), :($(QuoteNode(IndexLinear()))), :(C@_42)))
      SSAValue(7) = (Base.mul_float)(λ::Float64, SSAValue(5))::Float64
      SSAValue(8) = (Base.add_float)(0.0, SSAValue(7))::Float64
      (Core.setfield!)(C@_25::Core.Box, :contents, SSAValue(8))::Float64 # line 128:
      ∇C::Array{Float64,2} = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), Array{Float64,2}, 0, :(no_objects), 0, :(no_dims), 0)) # line 130:
      sum_X::Array{Float64,1} = $(Expr(:invoke, MethodInstance for fill!(::Array{Float64,1}, ::Float64), :(Base.fill!), :($(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :(no_objects), 0))), :((Base.sitofp)(Float64, 0)::Float64))) # line 131:
      SSAValue(177) = no_objects::Int64
      SSAValue(178) = no_objects::Int64
      $(Expr(:inbounds, false))
      # meta: location array.jl zeros 265
      # meta: location array.jl zeros 263
      SSAValue(141) = SSAValue(177)
      SSAValue(142) = SSAValue(178)
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      K::Array{Float64,2} = $(Expr(:invoke, MethodInstance for fill!(::Array{Float64,2}, ::Float64), :(Base.fill!), :($(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), Array{Float64,2}, 0, SSAValue(141), 0, SSAValue(142), 0))), :((Base.sitofp)(Float64, 0)::Float64))) # line 132:
      SSAValue(179) = no_objects::Int64
      SSAValue(180) = no_objects::Int64
      $(Expr(:inbounds, false))
      # meta: location array.jl zeros 265
      # meta: location array.jl zeros 263
      SSAValue(147) = SSAValue(179)
      SSAValue(148) = SSAValue(180)
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      Q::Array{Float64,2} = $(Expr(:invoke, MethodInstance for fill!(::Array{Float64,2}, ::Float64), :(Base.fill!), :($(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), Array{Float64,2}, 0, SSAValue(147), 0, SSAValue(148), 0))), :((Base.sitofp)(Float64, 0)::Float64))) # line 134:
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl nthreads 19
      SSAValue(149) = (Base.Threads.cglobal)(:jl_n_threads, Base.Threads.Cint)::Ptr{Int32}
      SSAValue(150) = (Base.pointerref)(SSAValue(149), 1, 1)::Int32
      # meta: pop location
      $(Expr(:inbounds, :pop))
      A_to_B::Array{Float64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :((Base.sext_int)(Int64, SSAValue(150))::Int64), 0)) # line 135:
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl nthreads 19
      SSAValue(152) = (Base.Threads.cglobal)(:jl_n_threads, Base.Threads.Cint)::Ptr{Int32}
      SSAValue(153) = (Base.pointerref)(SSAValue(152), 1, 1)::Int32
      # meta: pop location
      $(Expr(:inbounds, :pop))
      A_to_C::Array{Float64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), Array{Float64,1}, 0, :((Base.sext_int)(Int64, SSAValue(153))::Int64), 0)) # line 136:
      constant::Float64 = (Base.div_float)((Base.add_float)(α::Float64, (Base.sitofp)(Float64, 1)::Float64)::Float64, α::Float64)::Float64 # line 138:
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl nthreads 19
      SSAValue(155) = (Base.Threads.cglobal)(:jl_n_threads, Base.Threads.Cint)::Ptr{Int32}
      SSAValue(156) = (Base.pointerref)(SSAValue(155), 1, 1)::Int32
      # meta: pop location
      $(Expr(:inbounds, :pop))
      triplets_A::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, :((Base.sext_int)(Int64, SSAValue(156))::Int64), 0)) # line 139:
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl nthreads 19
      SSAValue(158) = (Base.Threads.cglobal)(:jl_n_threads, Base.Threads.Cint)::Ptr{Int32}
      SSAValue(159) = (Base.pointerref)(SSAValue(158), 1, 1)::Int32
      # meta: pop location
      $(Expr(:inbounds, :pop))
      triplets_B::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, :((Base.sext_int)(Int64, SSAValue(159))::Int64), 0)) # line 140:
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl nthreads 19
      SSAValue(161) = (Base.Threads.cglobal)(:jl_n_threads, Base.Threads.Cint)::Ptr{Int32}
      SSAValue(162) = (Base.pointerref)(SSAValue(161), 1, 1)::Int32
      # meta: pop location
      $(Expr(:inbounds, :pop))
      triplets_C::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, :((Base.sext_int)(Int64, SSAValue(162))::Int64), 0)) # line 144:
      SSAValue(181) = (Base.select_value)((Base.sle_int)(1, no_dims::Int64)::Bool, no_dims::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_12::Int64 = 1
      226:
      unless (Base.not_int)((#temp#@_12::Int64 === (Base.add_int)(SSAValue(181), 1)::Int64)::Bool)::Bool goto 249
      SSAValue(182) = #temp#@_12::Int64
      SSAValue(183) = (Base.add_int)(#temp#@_12::Int64, 1)::Int64
      k@_11::Int64 = SSAValue(182)
      #temp#@_12::Int64 = SSAValue(183)
      SSAValue(184) = (Base.select_value)((Base.sle_int)(1, no_objects::Int64)::Bool, no_objects::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_10::Int64 = 1
      234:
      unless (Base.not_int)((#temp#@_10::Int64 === (Base.add_int)(SSAValue(184), 1)::Int64)::Bool)::Bool goto 247
      SSAValue(185) = #temp#@_10::Int64
      SSAValue(186) = (Base.add_int)(#temp#@_10::Int64, 1)::Int64
      i@_9::Int64 = SSAValue(185)
      #temp#@_10::Int64 = SSAValue(186) # line 145:
      $(Expr(:inbounds, true))
      SSAValue(13) = (Base.add_float)((Base.arrayref)(sum_X::Array{Float64,1}, i@_9::Int64)::Float64, (Base.mul_float)((Base.arrayref)(X::Array{Float64,2}, i@_9::Int64, k@_11::Int64)::Float64, (Base.arrayref)(X::Array{Float64,2}, i@_9::Int64, k@_11::Int64)::Float64)::Float64)::Float64
      (Base.arrayset)(sum_X::Array{Float64,1}, SSAValue(13), i@_9::Int64)::Array{Float64,1}
      $(Expr(:inbounds, :pop))
      245:
      goto 234
      247:
      goto 226
      249:  # line 148:
      SSAValue(187) = (Base.select_value)((Base.sle_int)(1, no_objects::Int64)::Bool, no_objects::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_18::Int64 = 1
      253:
      unless (Base.not_int)((#temp#@_18::Int64 === (Base.add_int)(SSAValue(187), 1)::Int64)::Bool)::Bool goto 342
      SSAValue(188) = #temp#@_18::Int64
      SSAValue(189) = (Base.add_int)(#temp#@_18::Int64, 1)::Int64
      j::Int64 = SSAValue(188)
      #temp#@_18::Int64 = SSAValue(189)
      SSAValue(190) = (Base.select_value)((Base.sle_int)(1, no_objects::Int64)::Bool, no_objects::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_16::Int64 = 1
      261:
      unless (Base.not_int)((#temp#@_16::Int64 === (Base.add_int)(SSAValue(190), 1)::Int64)::Bool)::Bool goto 340
      SSAValue(191) = #temp#@_16::Int64
      SSAValue(192) = (Base.add_int)(#temp#@_16::Int64, 1)::Int64
      i@_15::Int64 = SSAValue(191)
      #temp#@_16::Int64 = SSAValue(192) # line 149:
      $(Expr(:inbounds, true))
      SSAValue(18) = (Base.add_float)((Base.arrayref)(sum_X::Array{Float64,1}, i@_15::Int64)::Float64, (Base.arrayref)(sum_X::Array{Float64,1}, j::Int64)::Float64)::Float64
      (Base.arrayset)(K::Array{Float64,2}, SSAValue(18), i@_15::Int64, j::Int64)::Array{Float64,2}
      $(Expr(:inbounds, :pop)) # line 150:
      SSAValue(193) = (Base.select_value)((Base.sle_int)(1, no_dims::Int64)::Bool, no_dims::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_14::Int64 = 1
      275:
      unless (Base.not_int)((#temp#@_14::Int64 === (Base.add_int)(SSAValue(193), 1)::Int64)::Bool)::Bool goto 338
      SSAValue(194) = #temp#@_14::Int64
      SSAValue(195) = (Base.add_int)(#temp#@_14::Int64, 1)::Int64
      k@_13::Int64 = SSAValue(194)
      #temp#@_14::Int64 = SSAValue(195) # line 155:
      $(Expr(:inbounds, true))
      SSAValue(21) = (Base.add_float)((Base.arrayref)(K::Array{Float64,2}, i@_15::Int64, j::Int64)::Float64, (Base.mul_float)((Base.mul_float)((Base.sitofp)(Float64, -2)::Float64, (Base.arrayref)(X::Array{Float64,2}, i@_15::Int64, k@_13::Int64)::Float64)::Float64, (Base.arrayref)(X::Array{Float64,2}, j::Int64, k@_13::Int64)::Float64)::Float64)::Float64
      (Base.arrayset)(K::Array{Float64,2}, SSAValue(21), i@_15::Int64, j::Int64)::Array{Float64,2}
      $(Expr(:inbounds, :pop)) # line 156:
      $(Expr(:inbounds, true))
      SSAValue(164) = (Base.add_float)((Base.sitofp)(Float64, 1)::Float64, (Base.div_float)((Base.arrayref)(K::Array{Float64,2}, i@_15::Int64, j::Int64)::Float64, α::Float64)::Float64)::Float64
      $(Expr(:inbounds, false))
      # meta: location intfuncs.jl literal_pow 208
      $(Expr(:inbounds, false))
      # meta: location math.jl ^ 701
      SSAValue(165) = (Base.sitofp)(Float64, -1)::Float64
      # meta: location math.jl ^ 699
      SSAValue(168) = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), SSAValue(164), 0, SSAValue(165), 0, :($(Expr(:llvmcall)))))
      SSAValue(169) = (Base.add_float)(SSAValue(164), SSAValue(165))::Float64
      # meta: location math.jl nan_dom_err 300
      unless (Base.and_int)((Base.ne_float)(SSAValue(168), SSAValue(168))::Bool, (Base.not_int)((Base.ne_float)(SSAValue(169), SSAValue(169))::Bool)::Bool)::Bool goto 301
      #temp#@_78::Float64 = (Base.Math.throw)($(QuoteNode(DomainError())))::Union{}
      goto 303
      301:
      #temp#@_78::Float64 = SSAValue(168)
      303:
      # meta: pop location
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(22) = #temp#@_78::Float64
      (Base.arrayset)(Q::Array{Float64,2}, SSAValue(22), i@_15::Int64, j::Int64)::Array{Float64,2}
      $(Expr(:inbounds, :pop)) # line 157:
      $(Expr(:inbounds, true))
      SSAValue(171) = (Base.add_float)((Base.sitofp)(Float64, 1)::Float64, (Base.div_float)((Base.arrayref)(K::Array{Float64,2}, i@_15::Int64, j::Int64)::Float64, α::Float64)::Float64)::Float64
      SSAValue(170) = (Base.div_float)((Base.add_float)(α::Float64, (Base.sitofp)(Float64, 1)::Float64)::Float64, (Base.sitofp)(Float64, -2)::Float64)::Float64
      $(Expr(:inbounds, false))
      # meta: location math.jl ^ 699
      SSAValue(174) = $(Expr(:foreigncall, "llvm.pow.f64", Float64, svec(Float64, Float64), SSAValue(171), 0, SSAValue(170), 0, :($(Expr(:llvmcall)))))
      SSAValue(175) = (Base.add_float)(SSAValue(171), SSAValue(170))::Float64
      $(Expr(:inbounds, false))
      # meta: location math.jl nan_dom_err 300
      unless (Base.and_int)((Base.ne_float)(SSAValue(174), SSAValue(174))::Bool, (Base.not_int)((Base.ne_float)(SSAValue(175), SSAValue(175))::Bool)::Bool)::Bool goto 326
      #temp#@_79::Float64 = (Base.Math.throw)($(QuoteNode(DomainError())))::Union{}
      goto 328
      326:
      #temp#@_79::Float64 = SSAValue(174)
      328:
      # meta: pop location
      $(Expr(:inbounds, :pop))
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(23) = #temp#@_79::Float64
      (Base.arrayset)(K::Array{Float64,2}, SSAValue(23), i@_15::Int64, j::Int64)::Array{Float64,2}
      $(Expr(:inbounds, :pop))
      336:
      goto 275
      338:
      goto 261
      340:
      goto 253
      342:  # line 162:
      # meta: location threadingconstructs.jl # line 28:
      SSAValue(24) = $(Expr(:new, UnitRange{Int64}, 1, :((Base.select_value)((Base.sle_int)(1, no_triplets)::Bool, no_triplets, (Base.sub_int)(1, 1)::Int64)::Int64)))
      (Core.setfield!)(range::Core.Box, :contents, SSAValue(24))::UnitRange{Int64} # line 29:
      threadsfor_fun::Embeddings.##122#threadsfor_fun#7{Array{Float64,2},Int64,Array{Int64,2},Array{Float64,1},Array{Float64,2},Array{Float64,2},Array{Float64,2},Array{Float64,1},Array{Float64,1},Float64,Array{Int64,1},Array{Int64,1},Array{Int64,1}} = $(Expr(:new, Embeddings.##122#threadsfor_fun#7{Array{Float64,2},Int64,Array{Int64,2},Array{Float64,1},Array{Float64,2},Array{Float64,2},Array{Float64,2},Array{Float64,1},Array{Float64,1},Float64,Array{Int64,1},Array{Int64,1},Array{Int64,1}}, :(X), :(no_dims), :(triplets), :(P), :(C@_25), :(∇C), :(K), :(Q), :(A_to_B), :(A_to_C), :(constant), :(triplets_A), :(triplets_B), :(triplets_C), :(range))) # line 67:
      $(Expr(:inbounds, false))
      # meta: location threadingconstructs.jl threadid 10
      SSAValue(176) = $(Expr(:foreigncall, :(:jl_threadid), Int16, svec()))
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(25) = (Base.not_int)(((Base.add_int)((Base.sext_int)(Int64, SSAValue(176))::Int64, 1)::Int64 === 1)::Bool)::Bool
      unless SSAValue(25) goto 360
      #temp#@_38::Bool = SSAValue(25)
      goto 362
      360:
      #temp#@_38::Bool = (Core.getfield)(Base.Threads.in_threaded_loop, :x)::Bool
      362:
      unless #temp#@_38::Bool goto 367 # line 69:
      $(Expr(:invoke, MethodInstance for (::Embeddings.##122#threadsfor_fun#7{Array{Float64,2},Int64,Array{Int64,2},Array{Float64,1},Array{Float64,2},Array{Float64,2},Array{Float64,2},Array{Float64,1},Array{Float64,1},Float64,Array{Int64,1},Array{Int64,1},Array{Int64,1}})(::Bool), :(threadsfor_fun), true))
      goto 382
      367:  # line 71:
      $(Expr(:inbounds, false))
      # meta: location refpointer.jl setindex! 121
      (Core.setfield!)(Base.Threads.in_threaded_loop, :x, true)::Bool
      # meta: pop location
      $(Expr(:inbounds, :pop)) # line 73:
      $(Expr(:foreigncall, :(:jl_threading_run), Ref{Void}, svec(Any), :(threadsfor_fun), 0)) # line 74:
      $(Expr(:inbounds, false))
      # meta: location refpointer.jl setindex! 121
      (Core.setfield!)(Base.Threads.in_threaded_loop, :x, false)::Bool
      # meta: pop location
      $(Expr(:inbounds, :pop))
      382:
      # meta: pop location # line 188:
      SSAValue(196) = (Base.select_value)((Base.sle_int)(1, no_dims::Int64)::Bool, no_dims::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_22::Int64 = 1
      387:
      unless (Base.not_int)((#temp#@_22::Int64 === (Base.add_int)(SSAValue(196), 1)::Int64)::Bool)::Bool goto 410
      SSAValue(197) = #temp#@_22::Int64
      SSAValue(198) = (Base.add_int)(#temp#@_22::Int64, 1)::Int64
      i@_21::Int64 = SSAValue(197)
      #temp#@_22::Int64 = SSAValue(198)
      SSAValue(199) = (Base.select_value)((Base.sle_int)(1, no_objects::Int64)::Bool, no_objects::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#@_20::Int64 = 1
      395:
      unless (Base.not_int)((#temp#@_20::Int64 === (Base.add_int)(SSAValue(199), 1)::Int64)::Bool)::Bool goto 408
      SSAValue(200) = #temp#@_20::Int64
      SSAValue(201) = (Base.add_int)(#temp#@_20::Int64, 1)::Int64
      n::Int64 = SSAValue(200)
      #temp#@_20::Int64 = SSAValue(201) # line 190:
      $(Expr(:inbounds, true))
      SSAValue(31) = (Base.add_float)((Base.neg_float)((Base.arrayref)(∇C::Array{Float64,2}, n::Int64, i@_21::Int64)::Float64)::Float64, (Base.mul_float)((Base.mul_float)((Base.sitofp)(Float64, 2)::Float64, λ::Float64)::Float64, (Base.arrayref)(X::Array{Float64,2}, n::Int64, i@_21::Int64)::Float64)::Float64)::Float64
      (Base.arrayset)(∇C::Array{Float64,2}, SSAValue(31), n::Int64, i@_21::Int64)::Array{Float64,2}
      $(Expr(:inbounds, :pop))
      406:
      goto 395
      408:
      goto 387
      410:  # line 193:
      return (Core.typeassert)((Base.convert)(SSAValue(0), (Core.tuple)((Core.getfield)(C@_25::Core.Box, :contents)::Any, ∇C::Array{Float64,2})::Tuple{Any,Array{Float64,2}})::Tuple{Any,Array{Float64,2}}, SSAValue(0))::Tuple{Float64,Array{Float64,2}}
  end::Tuple{Float64,Array{Float64,2}}

#14

range is created in @threads: https://github.com/JuliaLang/julia/blob/e6b290aa52d01ecfc2e98cfbc7a1034a419164b0/base/threadingconstructs.jl#L29, and has already been wrapped in a let block on Julia master (the PR I referred to earlier). For C, I recommend changing it from a Float64 to a Ref{Float64} and wrapping it in a let block, i.e.:

let C = Ref(Float64(λ * sum(X.^2))) # Initialize cost including l2 regularization cost
    Threads.@threads for t in 1:no_triplets
        # use C[] instead of C
        ...
    end
end

This is all quite unfortunate; I’m really hoping the remaining instances of https://github.com/JuliaLang/julia/issues/15276 will be resolved soon.
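For context, a minimal sketch of that pattern (the toy no_triplets and P here are placeholders, not the actual gradient code; the real loop would be the Threads.@threads triplet loop):

```julia
# Hypothetical stand-ins: the real code iterates over the triplets array.
no_triplets = 1_000
P = fill(0.5, no_triplets)

# `let` gives the closure a fresh binding to capture, so `Cref` stays a
# concrete Ref{Float64} instead of a boxed, re-assignable local.
C = let Cref = Ref(0.0)
    for t in 1:no_triplets      # Threads.@threads in the real code; a single
        Cref[] += -log(P[t])    # shared Ref would then need a lock or
    end                         # per-thread accumulators to be race-free.
    Cref[]                      # a let block returns its last expression
end
```
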


#15

Thanks for the help. I installed Julia v0.7 and checked that it includes the change at line 29 of https://github.com/JuliaLang/julia/blob/e6b290aa52d01ecfc2e98cfbc7a1034a419164b0/base/threadingconstructs.jl#L29 (the let block around range).

Wrapping the loop in let C = Ref(Float64(λ * sum(X.^2))) ... end produced a huge speed-up, though scaling is still sublinear:

Threads.@threads
================
BenchmarkTools.Trial: 
  memory estimate:  1.17 MiB
  allocs estimate:  534
  --------------
  minimum time:     359.284 ms (0.00% GC)
  median time:      432.999 ms (0.00% GC)
  mean time:        438.884 ms (0.00% GC)
  maximum time:     515.517 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

However, the output of C is 0.0 (its initial value). How do I recover the value of C from the let block? Since it’s defined only within that block, I can’t return the variable. Is there a workaround, or something I’m not seeing?

Another way to do this is to define C = zeros(Float64, Threads.nthreads()) .+ λ * sum(X.^2) and then update it inside the loop with C[Threads.threadid()] += -log(P[t]). Afterwards I sum the values in C. This change gets me the following time, and I do obtain the value of C:

Threads.@threads (C = zeros(Float64, Threads.nthreads()) + λ * sum(X.^2))
================
  BenchmarkTools.Trial:
  memory estimate:  1.19 MiB
  allocs estimate:  713
  --------------
  minimum time:     375.732 ms (0.00% GC)
  median time:      477.595 ms (0.00% GC)
  mean time:        486.294 ms (0.00% GC)
  maximum time:     635.674 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1

This approach still shows some Union{}s marked in red in the @code_warntype output, but I can’t identify their source. All the other variables are correctly inferred (no more Core.Box). The output is here: https://gist.github.com/kmundnic/d365d18c94919a344f9635a8405322e3
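The per-thread vector idea described above can be sketched like this (toy data; `partial` and `reg` are illustrative names). One slot per thread avoids the write race, and adding the regularization term once after the reduction avoids counting it Threads.nthreads() times. Note the sketch relies on @threads pinning each iteration to one thread, as it did in 0.6/0.7:

```julia
no_triplets = 10_000
P = fill(0.5, no_triplets)   # placeholder probabilities
reg = 0.0                    # stands in for λ * sum(X.^2)

partial = zeros(Float64, Threads.nthreads())
Threads.@threads for t in 1:no_triplets
    # Each thread writes only its own slot, so no two threads share memory.
    partial[Threads.threadid()] += -log(P[t])
end
C = reg + sum(partial)       # reduce the per-thread sums once at the end
```
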


#16

Using

Julia Version 0.7.0-DEV.2803
Commit 13dc388b11* (2017-12-08 21:44 UTC)
Environment:
  JULIA_NUM_THREADS = 4

I could get a factor-4 speed-up compared to the simple threaded version with this code.
I factored the inner part of the loop out into an extra function and subdivided the 1:no_triplets range into Threads.nthreads() ranges of roughly equal size. I also changed the code a bit compared to your threaded version, since I don’t think your version was thread-safe, in particular the updates to ∇C.

Unfortunately this does not work under 0.6, due to the range bug described by @tkoolen.
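A sketch of that chunking scheme. chunk_ranges and chunk_cost are illustrative names, and the kernel here only accumulates a partial cost; the same one-buffer-per-chunk idea would apply to ∇C (give each chunk its own gradient array and sum them after the loop):

```julia
# Split 1:n into nchunks contiguous ranges of roughly equal size.
function chunk_ranges(n, nchunks)
    len, rem = divrem(n, nchunks)
    ranges = UnitRange{Int}[]
    start = 1
    for c in 1:nchunks
        stop = start + len - 1 + (c <= rem ? 1 : 0)
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Function barrier: the kernel sees concrete argument types even if the
# closure built by @threads is badly inferred.
function chunk_cost(P, r)
    s = 0.0
    for t in r
        s += -log(P[t])
    end
    return s
end

n = 10_000
P = fill(0.5, n)
ranges = chunk_ranges(n, Threads.nthreads())
costs = zeros(length(ranges))
Threads.@threads for c in 1:length(ranges)
    costs[c] = chunk_cost(P, ranges[c])   # each task owns one slot of `costs`
end
C = sum(costs)
```

With this layout, any dynamic dispatch caused by the closure bug happens once per chunk rather than once per triplet.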


#17

You shouldn’t have to do that, right? @threads should do that for you, see https://github.com/JuliaLang/julia/blob/555264e1f005eb390ad22c29c26cf14c662b0677/base/threadingconstructs.jl#L32-L47.


#18

Yeah, but this way I can add an additional function barrier via the explicit kernel function, and even if some types are messed up due to the closure bug, I only pay the dynamic-dispatch cost nthreads() times.

#19

Ah, I see.


#20

This is great, thanks @saschatimme! I tested it in v0.6.2 and 0.7.0-DEV.4052 (commit bc85693b2a (2018-02-21 19:56 UTC)); it works for me in v0.6.2, and runs a bit faster there than on v0.7.0. I’m getting a ~6.5x speed boost with your code using 8 threads:

v0.6.2
=================
BenchmarkTools.Trial:
  memory estimate:  1.13 MiB
  allocs estimate:  1149
  --------------
  minimum time:     41.298 ms (0.00% GC)
  median time:      45.578 ms (0.00% GC)
  mean time:        45.781 ms (0.00% GC)
  maximum time:     52.321 ms (0.00% GC)
  --------------
  samples:          110
  evals/sample:     1
v0.7.0-DEV.4052 (commit bc85693b2a)
=================
BenchmarkTools.Trial:
  memory estimate:  1.15 MiB
  allocs estimate:  1295
  --------------
  minimum time:     47.402 ms (0.00% GC)
  median time:      52.539 ms (0.00% GC)
  mean time:        53.071 ms (0.18% GC)
  maximum time:     62.198 ms (0.00% GC)
  --------------
  samples:          95
  evals/sample:     1

I wonder why it didn’t work for you in v0.6?

It could well be that my implementation was not thread-safe; I’m still learning how to work with threads. Your example is definitely useful. Thanks again!

One thing I noticed is that the outputs of the two functions (serial and threaded) differ starting at the 2nd or 3rd decimal place. I’ll keep looking into that.
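One common cause worth keeping in mind: threading changes the order in which the cost terms are summed, and Float64 addition is not associative, so the two orders round differently. (A difference already in the 2nd decimal is fairly large, though, so a remaining race on C or ∇C is also worth ruling out.) A minimal illustration:

```julia
# Regrouping a floating-point sum changes the rounding.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
println(a == b)   # false: a == 0.6000000000000001, b == 0.6
```
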