Very different performance on different computers due to garbage collection?

So, I have code that calculates a measure of “openness” or “sprawl” around points in a series of big rasters from the National Land Cover Dataset. I had run this code on my personal laptop (a 2020 ROG G14 with a Ryzen 4800HS) and the whole thing finished more or less in one night for all the years this dataset is available.

I had to run it again with a small correction, so I ran it on my new desktop at my PhD office (Dell Precision 3650 with an Intel Xeon W-1350) and it took FOREVER. This was very strange, so I went back to this old benchmark I made when I was deciding whether to write the code in Python or Julia:

using Images

function count_notocean(arr, mask)
    if size(mask) == masksiz   # full-size window: reuse the precomputed maskindex
        return count(i -> i != 11, arr[maskindex])   # 11 = open water in the NLCD
    else                       # window clipped at the raster edge: recompute the index
        tempindex = findall(i -> i == 0, mask)
        return count(i -> i != 11, arr[tempindex])
    end
end

function count_developed(arr, mask)
    if size(mask) == masksiz
        temp = arr[maskindex]
        return (2*count(i -> i == 21, temp) + 4*count(i -> i == 22, temp) + 6*count(i -> i == 23, temp) + 8*count(i -> i == 24, temp))>>4   #Approximation fully in integers
    else
        tempindex = findall(i -> i == 0, mask) 
        temp = arr[tempindex]
        return (2*count(i -> i == 21, temp) + 4*count(i -> i == 22, temp) + 6*count(i -> i == 23, temp) + 8*count(i -> i == 24, temp))>>4   #Approximation fully in integers
    end
end

function point_sprawl(arr, i, j, mask)
    # Window of the raster centered on (i,j), clipped at the raster edges
    temprows = (max(i-siz,1), min(i+siz, size(arr)[1]))
    tempcols = (max(j-siz,1), min(j+siz, size(arr)[2]))
    temp = arr[temprows[1]:temprows[2], tempcols[1]:tempcols[2]]
    # Matching portion of the circular mask, shifted to stay aligned with the window
    tempmask = mask[(temprows[1]:temprows[2]).-i.+siz.+1, (tempcols[1]:tempcols[2]).-j.+siz.+1]
    developed = count_developed(temp, tempmask)
    notocean = count_notocean(temp, tempmask)
    undeveloped = notocean-developed
    spr = trunc(undeveloped/notocean*255)   # share of undeveloped pixels scaled to 0-255
    return spr
end

function calc_sprawl(arr, mask)
    spr = zeros(UInt8, size(arr)[1], size(arr)[2]).+255   # 255 marks pixels that are never computed
    Threads.@threads for i in siz:(size(arr)[1]-siz)
        Threads.@threads for j in siz:(size(arr)[2]-siz)
            if arr[i,j] == 21 || arr[i,j] == 22 || arr[i,j] == 23 || arr[i,j] == 24   # NLCD developed classes
                spr[i,j] = point_sprawl(arr, i, j, mask)
            else 
                nothing
            end
        end
    end
    return spr    
end

cd("D:/thesis_suburb")

rs = load("code_replication/test/austin.png")
rs = reinterpret(UInt8, rs)
rst = copy(rs)

mask = load("code_replication/luts/circle_mask.png")
mask = reinterpret(UInt8, mask)
const maskindex = findall(i -> i == 0, mask)   # indices of the pixels inside the circle (mask value 0)

const masksiz = size(mask)
const siz = Int(trunc(size(mask)[1]/2))        # half-width of the mask window, in pixels

function main_mask()
    sprawl = calc_sprawl(rst, mask)
end

@time(sprawl_single = main_mask())

All this does is take a test image from the NLCD and compute the “sprawl” index for the pixels whose codes mark them as built-up. The code I’m actually using is much larger, but this is where the performance difference shows up.

This is what I get from the laptop:

julia> @time(sprawl_single = main_mask())
  3.815631 seconds (4.84 M allocations: 6.051 GiB, 49.27% gc time)

And this is what I get from the desktop:

julia> @time(sprawl_single = main_mask())
 52.764730 seconds (4.84 M allocations: 6.051 GiB, 97.21% gc time)

This is Julia 1.7.3 on both computers, run through the REPL via the VSCode extension; the environments are copies of one another through the Manifest, and the files are mirrored through Syncthing. The laptop runs Windows 11 Pro, the desktop Windows 10 Pro for Workstations. Any clues as to why the same simple code might perform so differently on two different machines? Why is the garbage collector acting up on the desktop and taking so much time to do its work?

2 Likes

Maybe one of the machines is using swap space? (Both runs are doing a lot of GC and allocations, so the better fix would be to solve that problem.)

Maybe: add @views to the function definition, as in @views function point_sprawl(...), to get rid of the allocations from the slices. If that is not enough, the other critical line is, it appears, the one where you define tempmask; you seem to be allocating a lot of intermediate arrays there to build the indices.
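For illustration, something like this (just a sketch: the body is the same as yours, only the macro is added, and it still uses the global const siz):

@views function point_sprawl(arr, i, j, mask)
    temprows = (max(i-siz,1), min(i+siz, size(arr)[1]))
    tempcols = (max(j-siz,1), min(j+siz, size(arr)[2]))
    temp = arr[temprows[1]:temprows[2], tempcols[1]:tempcols[2]]                                  # now a view, not a copy
    tempmask = mask[(temprows[1]:temprows[2]).-i.+siz.+1, (tempcols[1]:tempcols[2]).-j.+siz.+1]   # now a view, not a copy
    developed = count_developed(temp, tempmask)
    notocean = count_notocean(temp, tempmask)
    undeveloped = notocean - developed
    return trunc(undeveloped/notocean*255)
end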

Patterns like tempindex = findall(i -> i == 0, mask) followed by arr[tempindex], as in count_notocean, also allocate intermediate arrays that are not necessary. You could do here, for example:

cnt = 0
for k in eachindex(arr, mask)
    if mask[k] == 0 && arr[k] != 11
        cnt += 1
    end
end
return cnt

(you could probably write a one-liner for that, but I’m not sure how much clearer it would be)
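For instance, a possible one-liner along the same lines (an untested sketch):

count(k -> mask[k] == 0 && arr[k] != 11, eachindex(mask, arr))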

1 Like

How much RAM does each machine have? It seems like the office machine is having to GC a lot more often, which likely means it is running out of memory.

Reducing allocations using views is likely your best bet for improving performance in both cases.

Okay, thanks a lot for the @views tip. I wasn’t sure whether the compiler would use views of the arrays or allocate copies, but execution time has been cut to a third, so it’s clear it was allocating. Still, the problem remains; I will look into what could be going on with the swap space.

This is the desktop now:

@time(sprawl_single = main_mask())
 18.826642 seconds (2.42 M allocations: 2.699 GiB, 95.79% gc time)

And this is the laptop:

@time(sprawl_single = main_mask())
  1.707295 seconds (2.42 M allocations: 2.699 GiB, 45.44% gc time)

The thing with tempindex is more complicated: in the “real” code I access the dataset in chunks with a margin on each side, calculate the sprawl index, and write it in windows to a different dataset of the same dimensions. I calculate maskindex once as a const so I never have to compute it again; tempindex is only there for the edge cases, and in the “real” code I actually compute that index in a slightly different way.

The code is not very memory intensive; besides, the laptop has 16 GB and the desktop has 64 GB, both DDR4-3200. It’s such a mystery…

Very strange indeed.

If you want to try to reduce or eliminate most of the allocations (it will speed things up, since ~95% of the time is just cleaning up memory), take a look at this talk: Hunting down allocations with Julia 1.8’s Allocation Profiler | JuliaCon 2022. It’s for Julia 1.8, so you would have to upgrade, but there’s a source-code view that tells you which lines of your code are allocating and how much. This can be very helpful when optimising your code, and it seems worth it here if you have to run the job overnight.
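A rough sketch of what that looks like, assuming you upgrade to 1.8 and add PProf.jl (the package the talk uses to display the report):

using Profile, PProf

# Julia 1.8+: sample a fraction of all allocations while the workload runs
Profile.Allocs.@profile sample_rate=0.1 main_mask()

# Open the interactive report; its source view shows per-line allocation counts
PProf.Allocs.pprof(Profile.Allocs.fetch())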

1 Like

Also take a look at: Common allocation mistakes

1 Like

Alright, so I tried simply adding the @views macro before every one of the small functions there, since I had taken care to be memory safe from the beginning. The results are the same, but now:

This is the desktop:

@time(sprawl_single = main_mask())
  1.383461 seconds (5.84 k allocations: 68.986 MiB)

And the laptop:

@time(sprawl_single = main_mask())
  1.121228 seconds (5.47 k allocations: 68.959 MiB)

It seems the @views macro is very important here, since performance increased massively. I guess the underlying problem still isn’t really solved, though; the garbage collector seems to be extremely slow for some reason on the desktop…

2 Likes

Hmmm… I work for Dell in HPC. I admit to not knowing much about Precision workstations.
I would ask that you review the BIOS settings. When setting up HPC systems we set the BIOS for performance.
Worth asking also if you have hyperthreading enabled on the desktop.

Also, there can be effects on memory performance if you do not fill all the memory channels.
Do you know the number and placement of the DIMMs in your system?

1 Like

So, hyperthreading is enabled on the desktop, and all of these runs have threading enabled in Julia. What’s bizarre is that in the last example, where GC barely enters the picture, the speed difference between the laptop (8-core Zen 2 Ryzen) and the desktop (6-core Rocket Lake) is reasonable, and memory bandwidth should actually be a bottleneck in this code because it reads the raster many, many times.

I have already told the IT people at my university about this, so I will check settings like those later. About the memory: I know it has 2 DIMMs installed out of 4 slots.

For the record, I am also using Python with Dask for parallelization, and under Python the speed of this desktop is what I would expect compared to my laptop and to a remote server I am also using for the project. Benchmarks like Geekbench and UserBenchmark also give the expected results; it’s only Julia’s garbage collector that is messing up :roll_eyes:

If this were a server I would ask for the report from the iDRAC, which shows the slots the memory DIMMs occupy.
I think if you reboot and go into the BIOS you can get this information.
The question for me then is what the optimum configuration with 2 DIMMs is… errr…

I ran Memory Mark from PassMark on both the laptop and the desktop; both have dual-channel DDR4-3200 and the memory read/write and latency results are very similar, with a small advantage for the desktop. This is definitely not an overall performance deficiency of the desktop, but something specific to Julia.

1 Like

Then I am leading myself down the wrong path here. In general, when discussing performance problems I start by looking at the BIOS settings and the hardware layout.
Clearly you have run a benchmark which gives useful information, so I should think of something else!

Try @timev to get a bit more info about GC.

1 Like

To exclude other sources of noise, maybe try running from a plain Julia session (outside VSCode).

(Also, why not just check whether GC.enable(false) before the runs effectively restores the expected performance?)
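Something along these lines (a quick sketch):

GC.enable(false)                     # disable automatic collections
@time sprawl_single = main_mask()
GC.enable(true)                      # re-enable them afterwards
GC.gc()                              # and collect whatever accumulated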

1 Like

Answering both @lmiq and @jeff.bezanson:

Originally I saw the performance issue when calling the Julia executable through Python’s subprocess.call(...) function from an Anaconda environment (I was expecting the script to finish while I had my morning coffee; instead it took all day and froze my desktop in the process).

The behaviour is the same from a Jupyter notebook, from the VSCode extension for Julia, and from the Julia executable itself. I installed Julia 1.8.1 to run the same test, with exactly the same results. Disabling the garbage collector in 1.7.3 gets me this performance on the problematic desktop:

@time(sprawl_single = main_mask())
  1.423722 seconds (4.84 M allocations: 6.051 GiB)

So it’s clearly the garbage collector slowing the code down.

This is what I get if I use @timev on the first unoptimized code that I posted, on my laptop, with GC on:

@timev(sprawl_single = main_mask())
  3.619570 seconds (4.84 M allocations: 6.051 GiB, 50.07% gc time)
elapsed time (ns): 3619569900
gc time (ns):      1812208800
bytes allocated:   6497127696
pool allocs:       4842688
non-pool GC allocs:390
malloc() calls:    392
GC pauses:         7

And this is what I get on the desktop:

@timev(sprawl_single = main_mask())
 51.495278 seconds (4.84 M allocations: 6.051 GiB, 97.23% gc time)
elapsed time (ns): 51495278200
gc time (ns):      50066420000
bytes allocated:   6497131792
pool allocs:       4842763
non-pool GC allocs:390
malloc() calls:    392
GC pauses:         6

As far as I understand, they are doing almost exactly the same things?

2 Likes

I wrote a small variation of the code that runs with some random data, for easier testing, here:

code
using Images

function count_notocean(arr, mask, maskindex)
    masksiz = size(mask)
    if size(mask) == masksiz 
        return count(i -> i !=11, arr[maskindex])
    else 
        tempindex = findall(i -> i == 0, mask) 
        return count(i -> i !=11, arr[tempindex])
    end
end

function count_developed(arr, mask, maskindex)
    masksiz = size(mask)
    if size(mask) == masksiz
        temp = arr[maskindex]
        return (2*count(i -> i == 21, temp) + 4*count(i -> i == 22, temp) + 6*count(i -> i == 23, temp) + 8*count(i -> i == 24, temp))>>4   #Approximation fully in integers
    else
        tempindex = findall(i -> i == 0, mask) 
        temp = arr[tempindex]
    return (2*count(i -> i == 21, temp) + 4*count(i -> i == 22, temp) + 6*count(i -> i == 23, temp) + 8*count(i -> i == 24, temp))>>4   #Approximation fully in integers
    end
end

function point_sprawl(arr, i, j, mask, maskindex)
    siz = trunc(Int,size(mask,1)/2)
    temprows = (max(i-siz,1), min(i+siz, size(arr)[1]))
    tempcols = (max(j-siz,1), min(j+siz, size(arr)[2]))
    temp = arr[temprows[1]:temprows[2], tempcols[1]:tempcols[2]]    
    tempmask =  mask[(temprows[1]:temprows[2]).-i.+siz.+1, (tempcols[1]:tempcols[2]).-j.+siz.+1]
    developed = count_developed(temp, tempmask, maskindex)
    notocean = count_notocean(temp, tempmask, maskindex)
    undeveloped = notocean-developed
    spr = trunc(undeveloped/notocean*255)  
    return spr
end

function calc_sprawl(arr, mask, maskindex)
    siz = Int(trunc(size(mask)[1]/2))
    spr = zeros(UInt8, size(arr)[1], size(arr)[2]).+255
    Threads.@threads for i in siz:(size(arr)[1]-siz)
        Threads.@threads for j in siz:(size(arr)[2]-siz)
            if arr[i,j] == 21 || arr[i,j] == 22 || arr[i,j] == 23 || arr[i,j] == 24 
                spr[i,j] = point_sprawl(arr, i, j, mask, maskindex)
            else 
                nothing
            end
        end
    end
    return spr    
end

function run(N=1000)
  rst = rand(UInt8, N, N)
  mask = rand(UInt8, N, N)
  maskindex = findall(i -> i == 0, mask)
  @time(calc_sprawl(rst, mask, maskindex))
  nothing
end

On my laptop, with N=20_000 it gives:

julia> run(20_000)
  2.153127 seconds (1.32 M allocations: 3.413 GiB, 31.20% gc time, 34.38% compilation time)

julia> run(20_000)
  1.403876 seconds (104 allocations: 3.353 GiB, 10.94% gc time)

With N=30_000 it allocates ~7GB, as in your example, and the second time I tried to run it my laptop crashed.

Anyway, if this reproduces your problem, it may be easier for others to debug.

1 Like

This is interesting since (1) the number of pauses is different (and one fewer in the slower case!), and (2) usually full collections are the slowest thing in GC, but there are none here. Maybe a good next step would be to profile the slow case and look at the results with C=true to see where in GC the time is going.
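Something like this should do it (a sketch; do one warm-up run first so compilation doesn’t dominate the samples):

using Profile

main_mask()                  # warm-up / compilation run
@profile main_mask()         # collect samples on the slow machine
Profile.print(C=true)        # C=true includes the runtime's C frames, so GC internals show up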

2 Likes

Damn, stupid reply from me. I had a case once on Linux, on an SGI UltraViolet.
The Linux kernel OOM killer kicked in to kill a process. However, the OOM killer then had to go off and examine EVERY page in memory to see if it had been touched by this process.
This took a LOOONG time, as the UltraViolet had a terabyte of memory.
The system stopped doing anything and stayed in kernel mode for the time it took to do this check.
By the way, there is now a flag in the Linux OOM killer which says whether or not to do that check.

Are we seeing some difference between Windows 11 and Windows 10 - is the system spending lots of time examining pages of memory? But then again that is what a garbage collector DOES…

What is the equivalent in Windows of spending time in kernel mode?

1 Like

@lmiq Okay, quick reply: the “mask” is a circular mask used to sample circles with an area of one square kilometer. Each pixel in the dataset is a 30×30 meter square. Therefore, the mask is exactly this:

Code to paste in REPL to get the mask
mask = UInt8[0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff; 
0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff; 
0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff; 
0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff; 
0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff; 
0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff; 
0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff]

The point of the code is that you move point by point over the raster, sampling this specific circular shape and calculating the proportion of unbuilt pixels inside it. If the mask were the same size as the raster being processed AND random, I imagine bad things would happen :sweat_smile:
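For reference, a mask like that could be generated with something along these lines (a sketch only; make_circle_mask is just an illustrative name, and the exact rounding rule at the circle’s boundary may differ from the mask above, so it won’t necessarily reproduce it byte for byte):

function make_circle_mask(; pixel_m = 30, area_m2 = 1_000_000)
    r = sqrt(area_m2 / π) / pixel_m            # circle radius in pixels (≈ 18.8 here)
    n = 2 * floor(Int, r) + 1                  # 37×37 window
    c = (n + 1) / 2                            # center coordinate
    # 0x00 inside the circle, 0xff outside, matching the convention above
    return UInt8[hypot(i - c, j - c) <= r ? 0x00 : 0xff for i in 1:n, j in 1:n]
end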

@jeff.bezanson I ran the same diagnostic on both the laptop and the desktop; this is what I got after a first run for compilation:

HTML file with profiling for the desktop

HTML file with profiling for the laptop

Obviously they look very different, considering #61 is half the time for the laptop and a tiny proportion of time for the desktop, which is mired in garbage collection; but I’m not sure I’m qualified to interpret this…

@johnh I’m obviously not a computer science expert, but my gut tells me it has to be some kind of problem like that, where the OS is interfering with the way Julia works.
I don’t think it has to do with Windows 11 vs 10, though. I tried all of this on yet another computer: I have access to a remote server where I run code that takes a really long time. It’s a virtual machine running on an old server that is split roughly in half; my advisor sometimes uses the other half. It has a Xeon E7-4870 and I get 128 GB of RAM. The code behaves there exactly like on my laptop, only about twice as slow because the machine is old, but the relative time spent in garbage collection is the same. So it seems to be something about this particular computer…

I will go home now; tomorrow I might pester one of my friends to run it on their computer. All the PhD students in the cohort just got identical workstations, the same model and presumably configured identically, so I’ll see whether the problem is reproduced there.