Large memory allocation when a loop is threaded, none when run single-threaded

I have a simple volume rendering loop written in Julia. It uses a fixed-size 3D vector type built on the StaticArrays package, derived from FieldVector:

struct Vector3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

const v3f=Vector3D{Float32}

This generally keeps them on the stack, and I have verified that no memory allocation goes on in my functions that use them:

function TraceRay( vPos::Vector3D, vDir::Vector3D )::v3f
    local vCurPos = vPos;
    local vStep = vDir * 0.02f0;
    local vColor = v3f( 0f0, 0f0, 0f0 )
    
    local flOpacity::Float32 = 1.0
    for i in 0:50
        local density = 0.1 * Density( vCurPos )
        local lighting = Lighting( vCurPos )
        vColor += flOpacity * density * lighting
        flOpacity *= (1f0 - density)
        vCurPos += vStep
    end
    return vColor
end
@time begin
    local acc::v3f = v3f( 0, 0, 0 )
    for i in 1:1000
        acc += TraceRay( v3f( i, 0 ,0 ), v3f( 0, 0, 1 ) )
    end
end

No allocations reported!!
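For reference, here is one way to check per-call allocations (a minimal sketch, assuming BenchmarkTools.jl is installed):

using BenchmarkTools

# Interpolating the arguments with $ keeps setup cost out of the measurement;
# an allocation-free TraceRay should report 0 allocations here.
@btime TraceRay($(v3f(1, 0, 0)), $(v3f(0, 0, 1)))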

So I use the TraceRay function to fill an array:

function Render()
    local myimage = Matrix{v3f}(undef, 32, 32 )

    for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
           local flX::Float32 = -.5f0 + x / size( myimage, 1 )
           local flY::Float32 = -.5f0 + y / size( myimage, 2 )
           myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end
@time Render();

One allocation is reported: the initialization of myimage.

I then thread the outer loop:

function RenderThreaded()
    local myimage = Matrix{v3f}(undef, 32, 32 )

    Threads.@threads for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
           local flX::Float32 = -.5f0 + x / size( myimage, 1 )
           local flY::Float32 = -.5f0 + y / size( myimage, 2 )
           myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end

@time RenderThreaded();

Suddenly it’s doing 120K allocations instead of 1!!!
0.044553 seconds (120.35 k allocations: 7.423 MiB, 0.01% compilation time)

The number of mysterious allocations and their total size are not affected by the image size or the number of loop iterations. This happens on every run, so it isn't a one-time startup allocation for the first threaded call.

Any idea what's going on? Note that this system runs with 24 threads, set via the JULIA_NUM_THREADS environment variable.
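(For completeness, a sketch of how the thread count is set here; the environment variable is read at Julia startup:)

# set before launching Julia, or pass --threads 24 on Julia 1.5+
export JULIA_NUM_THREADS=24

julia> Threads.nthreads()
24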

Thanks for any help


It's expected that the allocation count will be greater than 1, simply because threads are now working on shared memory. Do you see a useful speedup? Maybe try LoopVectorization.jl or Polyester.jl?

Can you explain more? All the output arrays are allocated up front and the global variables are const. Is it cloning some data into thread-local storage every time it starts a job, or something like that?

I'm new to Julia, but my job is writing highly threaded numeric C++ code, so my comparison is to something like Intel Threading Building Blocks (or rolling my own thread pool), where the memory overhead of a parallel for is tiny and you can run quite small jobs and still come out ahead.

It gets a decent speedup on large image sizes like 1024x1024; for smaller problems it spends a lot of time in those allocations. I measured a 54x speedup for the threaded version on a dual-socket Xeon with 48 cores/96 threads.

BTW, the number of allocations did not seem to go up with the number of threads - there were roughly the same number on my 12-core Windows box as on my 48-core AWS Linux box.

Hmm, LoopVectorization looks like it is for SIMD vectorization, not threads, but Polyester looks interesting. I will give it a try. Thanks for the tip.

Someone else can explain this better / at a lower level than me, but the gist is:

  1. Multi-threading is almost never completely free; there is some fixed overhead (which may not be an issue, depending on problem size, and maybe Polyester.jl can help).
  2. You can amplify that overhead when you have other sub-optimal behavior, such as false sharing:
julia> function f()
           a = zeros(Int, 10)
           for _ = 1:1000
               for i in eachindex(a)
               #Threads.@threads for i in eachindex(a)
                   a[i] += 1
               end
           end
           a
       end

The serial loop reports (1 allocation: 144 bytes) vs. (21.33 k allocations: 1.918 MiB) for the @threads version.

In your example, because you have 24 threads and only 32 columns, false sharing could reasonably happen at the beginning/end of adjacent columns.
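One way to sidestep that in the toy example above (a sketch, restructured so each task accumulates into a local variable and touches the shared array only once) is:

julia> function f_local()
           a = zeros(Int, 10)
           Threads.@threads for i in eachindex(a)
               acc = 0                  # task-local accumulator
               for _ = 1:1000
                   acc += 1
               end
               a[i] += acc              # one write to the shared array per index
           end
           a
       end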

Is this the equivalent of, in C++, accidentally capturing a large data structure by value in a lambda used with a parallel for?

In your example, obviously you'd want to thread the outer loop, not the inner one, but I would expect it to just create a job whose context is the value of i and a reference to a.
Are you saying that instead it somehow decides to clone a for each thread, and then syncs back the values it wrote into a, because it thinks it should?

This almost always means you are running into something funny with the closure created by @threads (see performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub). My rule of thumb: if adding @threads to a for-loop slows down the single-threaded case significantly, you probably have a type-stability issue.
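A common workaround for that kind of issue is a function barrier: pull the loop body into its own function so the closure that @threads builds only captures concretely typed values. A sketch against the code above (RenderColumn! and RenderThreadedBarrier are just illustrative names, not something I have benchmarked here):

# Sketch of the function-barrier workaround; @code_warntype would confirm whether it helps.
function RenderColumn!( myimage, y )
    for x in 1:size( myimage, 1 )
        local flX::Float32 = -.5f0 + x / size( myimage, 1 )
        local flY::Float32 = -.5f0 + y / size( myimage, 2 )
        myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1 ) )
    end
end

function RenderThreadedBarrier()
    local myimage = Matrix{v3f}(undef, 32, 32 )
    Threads.@threads for y in 1:size( myimage, 2 )
        RenderColumn!( myimage, y )
    end
    return reinterpret( RGB{Float32}, myimage )
end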

Kinda. It's probably Threads.@threads makes this function type unstable · Issue #41731 · JuliaLang/julia · GitHub. Can you post a full/standalone reproducer? It is easier to explain what is going wrong if I can show you how I debug this on your code.

What does @code_warntype RenderThreaded() show?


LoopVectorization can thread, too: @turbo threads=true or @tturbo.
However, you'd have to reinterpret the arrays into arrays of primitive types for it to work.
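For example, the reinterpretation step alone might look like this (a sketch; whether @tturbo accepts the full ray-marching body is a separate question):

# View the Matrix{v3f} as a plain Float32 array (the `reshape` form needs Julia 1.6+).
myimage = Matrix{v3f}(undef, 32, 32)
raw = reinterpret(reshape, Float32, myimage)   # a 3x32x32 view of Float32
size(raw)                                      # (3, 32, 32)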


Polyester sure helped. On the 12-core system, generating a 32x32 image:

Threads.@threads is actually slower than serial by a third, and does 120K allocations adding up to 7.8 MiB.

Polyester's @batch with per=thread is >9x the speed of single-threaded and does 124 allocations adding up to 24 KiB.

On the 96-hardware-thread AWS box, for a 128x128 image, it actually gets a 106x speedup. I would guess the superlinear effect comes from making use of 48 L1+L2 caches instead of 1, and maybe a little from NUMA.
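For anyone following along, the change is essentially just the loop macro. A sketch (RenderBatched is an illustrative name, and Polyester.jl needs to be added to the environment):

using Polyester

function RenderBatched()
    local myimage = Matrix{v3f}(undef, 32, 32 )
    # Polyester's @batch replaces Threads.@threads; per=thread spawns one batch per thread.
    @batch per=thread for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
            flX = -.5f0 + Float32(x) / size( myimage, 1 )
            flY = -.5f0 + Float32(y) / size( myimage, 2 )
            myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1 ) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end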

I tried reproducing it:

using Colors
using StaticArrays

struct Vector3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

const v3f=Vector3D{Float32}


function TraceRay( vPos::Vector3D, vDir::Vector3D )::v3f
    local vCurPos = vPos;
    local vStep = vDir * 0.02f0;
    local vColor = v3f( 0f0, 0f0, 0f0 )

    local flOpacity::Float32 = 1.0
    for i in 0:50
        local density = 0.1 #Changed
        local lighting = 0.1 #Changed
        vColor = vColor .+ flOpacity * density * lighting #Changed
        flOpacity *= (1f0 - density)
        vCurPos += vStep
    end
    return vColor
end
function RenderThreaded()
    local myimage = Matrix{v3f}(undef, 32, 32 )

    Threads.@threads for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
            local flX::Float32 = -.5f0 + x / size( myimage, 1 )
            local flY::Float32 = -.5f0 + y / size( myimage, 2 )
            myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end

and ran it on both Julia 1.6 and 1.7:

julia> @time RenderThreaded();
  0.545423 seconds (1.22 M allocations: 73.109 MiB, 10.16% gc time, 99.97% compilation time)

julia> @time RenderThreaded();
  0.000140 seconds (7 allocations: 12.656 KiB)

… Let's not jump to other tools before we understand why @threads is causing allocations.


Hi. I'll assemble a single file to reproduce - I want to try it at work tomorrow on a 128-core/256-thread Windows machine. I'm curious to find out whether Julia has the right thread setup code to work around the Windows "processor group" issue on systems with more than 64 hardware threads.


If you have more than 16 cores, you generally don't want to be on Windows. Windows thread scheduling is mediocre at best, and NTFS can easily cause 10x slowdowns compared to modern filesystems in IO-heavy programs.

Well, I'm in the game industry. Development is generally on Windows, with large teams using in-house tools that scale to a lot more than 16 cores.

I agree that the thread scheduling lacks versatility compared to Linux, and there are specific things that really do scale better there, like the filesystem you mention.
But I wouldn't have wasted the money on a 44-core and a 128-core Windows box if it wasn't worth it :slight_smile:
