Large memory allocation when a loop is threaded, none when run single-threaded

I have a simple volume rendering loop written in Julia. It uses a fixed-size 3D vector type built on the StaticArrays package, derived from FieldVector:

struct Vector3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

const v3f=Vector3D{Float32}

This generally keeps them on the stack, and I have verified that no memory allocation goes on in my functions that use them:

function TraceRay( vPos::Vector3D, vDir::Vector3D )::v3f
    local vCurPos = vPos;
    local vStep = vDir * 0.02f0;
    local vColor = v3f( 0f0, 0f0, 0f0 )
    
    local flOpacity::Float32 = 1.0
    for i in 0:50
        local density = 0.1 * Density( vCurPos )
        local lighting = Lighting( vCurPos )
        vColor += flOpacity * density * lighting
        flOpacity *= (1f0 - density)
        vCurPos += vStep
    end
    return vColor
end
@time begin
    local acc::v3f = v3f( 0, 0, 0 )
    for i in 1:1000
        acc += TraceRay( v3f( i, 0 ,0 ), v3f( 0, 0, 1 ) )
    end
end

No allocations reported!!
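For reference, here is one way to check per-call allocations (a minimal sketch, assuming BenchmarkTools.jl is installed):

using BenchmarkTools

# Interpolating the arguments with $ keeps setup cost out of the measurement;
# an allocation-free TraceRay should report 0 allocations here.
@btime TraceRay($(v3f(1, 0, 0)), $(v3f(0, 0, 1)))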

So I use the TraceRay function to fill an array:

function Render()
    local myimage = Matrix{v3f}(undef, 32, 32 )

    for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
           local flX::Float32 = -.5f0 + x / size( myimage, 1 )
           local flY::Float32 = -.5f0 + y / size( myimage, 2 )
           myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end
@time Render();

One allocation is reported: the initialization of myimage.

I then thread the outer loop:

function RenderThreaded()
    local myimage = Matrix{v3f}(undef, 32, 32 )

    Threads.@threads for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
           local flX::Float32 = -.5f0 + x / size( myimage, 1 )
           local flY::Float32 = -.5f0 + y / size( myimage, 2 )
           myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end

@time RenderThreaded();

Suddenly it’s doing 120K allocations instead of 1!!!
0.044553 seconds (120.35 k allocations: 7.423 MiB, 0.01% compilation time)

The number of mysterious allocations and their total size are not affected by the image size or the number of loop iterations. This happens on every run, so it isn't a one-time startup allocation for the first threaded call.

Any idea what's going on? Note that this system runs with 24 threads, set via the JULIA_NUM_THREADS environment variable.
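(For completeness, a sketch of how the thread count is set here; the environment variable is read at Julia startup:)

# set before launching Julia, or pass --threads 24 on Julia 1.5+
export JULIA_NUM_THREADS=24

julia> Threads.nthreads()
24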

Thanks for any help


It's expected that the allocation count will be greater than 1, simply because threads are now working on shared memory. Do you see a useful speedup? Maybe try LoopVectorization.jl or Polyester.jl?

Can you explain more? All the output arrays are allocated up front and the global variables are const. Is it cloning some data into thread-local storage every time it starts a job, or something like that?

I'm new to Julia, but my job is writing highly threaded numeric C++ code, so my comparison is to something like Intel Threading Building Blocks (or rolling my own thread pool), where the memory overhead of a parallel for is tiny and you can run quite small jobs and still come out ahead.

It gets a decent speedup on large image sizes like 1024x1024; for smaller problems it spends a lot of time in those allocations. I measured a 54x speedup for the threaded version on a dual-socket Xeon with 48 cores/96 threads.

BTW, the number of allocations did not seem to go up with the number of threads - there were roughly the same number on my 12-core Windows box as on my 48-core AWS Linux box.

Hmm, LoopVectorization looks like it is for SIMD vectorization, not threads, but Polyester looks interesting. I will give it a try. Thanks for the tip.

Someone else can explain this better / at a lower level than me, but the gist is:

  1. Multi-threading is almost never completely free; there is some fixed overhead (which may not be an issue, depending on problem size, and maybe Polyester.jl can help).
  2. You can amplify that overhead when you have other sub-optimal behavior, such as false sharing:
julia> function f()
           a = zeros(Int, 10)
           for _ = 1:1000
               for i in eachindex(a)
               #Threads.@threads for i in eachindex(a)
                   a[i] += 1
               end
           end
           a
       end

The serial loop reports (1 allocation: 144 bytes) vs. (21.33 k allocations: 1.918 MiB) for the @threads version.

In your example, because you have 24 threads and only 32 columns, false sharing could reasonably happen at the beginning/end of adjacent columns.
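One way to sidestep that in the toy example above (a sketch, restructured so each task accumulates into a local variable and touches the shared array only once) is:

julia> function f_local()
           a = zeros(Int, 10)
           Threads.@threads for i in eachindex(a)
               acc = 0                  # task-local accumulator
               for _ = 1:1000
                   acc += 1
               end
               a[i] += acc              # one write to the shared array per index
           end
           a
       end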

Is this the equivalent of, in C++, accidentally capturing a large data structure by value in a lambda used with a parallel for?

In your example, obviously you'd want to thread the outer loop, not the inner one, but I would expect it to just create a job whose context is the value of i and a reference to a.
Are you saying that instead it somehow decides to clone a for each thread, and then syncs back the values it wrote into a, because it thinks it should?

This almost always means you are running into something funny with the closure created by @threads (see performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub). My rule of thumb: if adding @threads to a for-loop slows down the single-threaded case significantly, you probably have a type-stability issue.
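A common workaround for that kind of issue is a function barrier: pull the loop body into its own function so the closure that @threads builds only captures concretely typed values. A sketch against the code above (RenderColumn! and RenderThreadedBarrier are just illustrative names, not something I have benchmarked here):

# Sketch of the function-barrier workaround; @code_warntype would confirm whether it helps.
function RenderColumn!( myimage, y )
    for x in 1:size( myimage, 1 )
        local flX::Float32 = -.5f0 + x / size( myimage, 1 )
        local flY::Float32 = -.5f0 + y / size( myimage, 2 )
        myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1 ) )
    end
end

function RenderThreadedBarrier()
    local myimage = Matrix{v3f}(undef, 32, 32 )
    Threads.@threads for y in 1:size( myimage, 2 )
        RenderColumn!( myimage, y )
    end
    return reinterpret( RGB{Float32}, myimage )
end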

Kinda. It's probably Threads.@threads makes this function type unstable · Issue #41731 · JuliaLang/julia · GitHub. Can you post a full/standalone reproducer? It is easier to explain what is going wrong if I can show you how I debug this on your code.

What does @code_warntype RenderThreaded() show?


LoopVectorization can thread, too: @turbo threads=true or @tturbo.
However, you'd have to reinterpret the arrays into arrays of primitive types for it to work.
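For example, the reinterpretation step alone might look like this (a sketch; whether @tturbo accepts the full ray-marching body is a separate question):

# View the Matrix{v3f} as a plain Float32 array (the `reshape` form needs Julia 1.6+).
myimage = Matrix{v3f}(undef, 32, 32)
raw = reinterpret(reshape, Float32, myimage)   # a 3x32x32 view of Float32
size(raw)                                      # (3, 32, 32)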


Polyester sure helped. On the 12-core system, generating a 32x32 image:

Threads.@threads is actually slower than serial by a third, and does 120K allocations adding up to 7.8 MiB.

Polyester's @batch with per=thread is >9x the speed of single-threaded and does 124 allocations adding up to 24 KiB.

On the 96-hardware-thread AWS box, for a 128x128 image, it actually gets a 106x speedup. I would guess the superlinear effect comes from making use of 48 L1+L2 caches instead of 1, and maybe a little from NUMA.
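For anyone following along, the change is essentially just the loop macro. A sketch (RenderBatched is an illustrative name, and Polyester.jl needs to be added to the environment):

using Polyester

function RenderBatched()
    local myimage = Matrix{v3f}(undef, 32, 32 )
    # Polyester's @batch replaces Threads.@threads; per=thread spawns one batch per thread.
    @batch per=thread for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
            flX = -.5f0 + Float32(x) / size( myimage, 1 )
            flY = -.5f0 + Float32(y) / size( myimage, 2 )
            myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1 ) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end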

I tried reproducing it:

using Colors
using StaticArrays

struct Vector3D{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

const v3f=Vector3D{Float32}


function TraceRay( vPos::Vector3D, vDir::Vector3D )::v3f
    local vCurPos = vPos;
    local vStep = vDir * 0.02f0;
    local vColor = v3f( 0f0, 0f0, 0f0 )

    local flOpacity::Float32 = 1.0
    for i in 0:50
        local density = 0.1 #Changed
        local lighting = 0.1 #Changed
        vColor = vColor .+ flOpacity * density * lighting #Changed
        flOpacity *= (1f0 - density)
        vCurPos += vStep
    end
    return vColor
end
function RenderThreaded()
    local myimage = Matrix{v3f}(undef, 32, 32 )

    Threads.@threads for y in 1:size( myimage, 2 )
        for x in 1:size( myimage, 1 )
            local flX::Float32 = -.5f0 + x / size( myimage, 1 )
            local flY::Float32 = -.5f0 + y / size( myimage, 2 )
            myimage[x,y] = TraceRay( v3f(flX, flY, 0 ), v3f( 0, 0, 1) )
        end
    end
    return reinterpret( RGB{Float32}, myimage )
end

and ran it on both Julia 1.6 and 1.7:

julia> @time RenderThreaded();
  0.545423 seconds (1.22 M allocations: 73.109 MiB, 10.16% gc time, 99.97% compilation time)

julia> @time RenderThreaded();
  0.000140 seconds (7 allocations: 12.656 KiB)

… Let's not jump to other tools before we understand why @threads is causing allocations.


Hi. I'll assemble a single file to reproduce - I want to try it at work tomorrow on a 128-core/256-thread Windows machine. I'm curious to find out whether Julia has the right thread setup code to work around the Windows "processor group" issue on systems with more than 64 hardware threads.


If you have more than 16 cores, you generally don't want to be on Windows. Windows thread scheduling is mediocre at best, and NTFS can easily cause 10x slowdowns compared to modern filesystems in IO-heavy programs.

Well, I'm in the game industry. Development is generally on Windows, with large teams using in-house tools that scale to a lot more than 16 cores.

I agree that the thread scheduling lacks versatility compared to Linux, and there are specific things that really do scale better there, like the filesystem you mention.
But I wouldn't have wasted the money on a 44-core and a 128-core Windows box if it wasn't worth it :slight_smile:
