Unexpectedly high memory allocation when running n-body simulation

I wrote an n-body simulation in Julia, but I’m seeing unexpectedly high memory allocation (according to the @time macro) when running the simulation for 10,000 bodies (the input file I’m using is test1e4.julia.txt · GitHub). I tried the usual performance recommendations, replacing array accesses with views and iterating in column-major order, but nothing seems to help.

Here is the code:

using DelimitedFiles

const G = 6.67e-11
const dt = 0.1

function main()
    data = readdlm("../data/test1e4.julia.txt")
    n = size(data, 1)    
    positions::Array{Float64, 2} = data[:, 1:3]'
    velocities::Array{Float64, 2} = data[:, 4:6]'
    masses::Array{Float64, 1} = data[:, 7]

    println("Done reading.")
    
    forces::Array{Float64, 2} = zeros(3, n)
    # sum_forces = @time @sync @distributed (+) for q = 1:n
    @inbounds for q = 1:n
        position_q_view = view(positions, :, q)
        force_q_view = view(forces, :, q)
        for k = 1:n
            if k <= q
                continue
            end
            force_k_view = view(forces, :, k)

            diff::Array{Float64} = position_q_view - view(positions, :, k)
            dist::Float64 = sqrt(sum(diff .^ 2))
            dist_cubed::Float64 = dist ^ 3

            tmp::Float64 = -G * masses[q] * masses[k] / dist_cubed
            force_qk::Array{Float64} = tmp * diff
            force_q_view = force_q_view + force_qk
            force_k_view = force_k_view - force_qk
        end
    end

    @inbounds for q = 1:n
        @views positions[:, q] += (dt * velocities[:, q])
        @views velocities[:, q] += (dt / masses[q] * forces[:, q])
    end
end

Here are the @time runs:

julia> @time main()
Done reading.
 42.568147 seconds (405.27 M allocations: 32.289 GiB, 5.73% gc time)

julia> @time main()
Done reading.
 42.548842 seconds (400.27 M allocations: 32.049 GiB, 6.01% gc time)

I also ran it with --track-allocations=user and these are the results:

julia> analyze_malloc(".")
14-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(0, "./main.jl.mem", 21)
 Coverage.MallocInfo(0, "./main.jl.mem", 37)
 Coverage.MallocInfo(1312, "./main.jl.mem", 8)
 Coverage.MallocInfo(10816, "./main.jl.mem", 13)
 Coverage.MallocInfo(240080, "./main.jl.mem", 17)
 Coverage.MallocInfo(480256, "./main.jl.mem", 10)
 Coverage.MallocInfo(960000, "./main.jl.mem", 20)
 Coverage.MallocInfo(960000, "./main.jl.mem", 38)
 Coverage.MallocInfo(2740389, "./main.jl.mem", 11)
 Coverage.MallocInfo(5440000, "./main.jl.mem", 39)
 Coverage.MallocInfo(16222886, "./main.jl.mem", 9)
 Coverage.MallocInfo(5599440000, "./main.jl.mem", 33)
 Coverage.MallocInfo(10398960000, "./main.jl.mem", 26)
 Coverage.MallocInfo(18398160000, "./main.jl.mem", 32)

I’m quite new to Julia, so I might be overlooking a basic performance issue, but I can’t seem to narrow it down. Thank you for your help.


First of all, it would be better if you posted self-contained examples (we don’t have "../data/test1e4.julia.txt").

Second, part of the problem is view. For various reasons involving memory management and visibility to the garbage collector, view is not a zero-cost abstraction. Sometimes the compiler manages to make it zero-cost, but often it needs to heap-allocate views.
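
For example, here is a minimal sketch (not from the original post) that you can use to check whether views in a hot loop allocate on your Julia version:

function sum_columns(A)
    s = 0.0
    for j in 1:size(A, 2)
        col = view(A, :, j)   # may heap-allocate if the compiler cannot elide the view
        s += sum(col)
    end
    return s
end

A = rand(3, 10_000)
sum_columns(A)         # compile first
@time sum_columns(A)   # then look at the allocation count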

It looks like view(forces, :, k) etc. always has length 3? In that case, you should be using StaticArrays. Then forces::Vector{SVector{3, Float64}}, and forces[k] is a 3-vector that is stack-allocated. Plus, all loops over the components get fully unrolled (possibly using SIMD instructions), and you avoid all the ugly branches.
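
As a minimal illustration (not part of the original code), an SVector behaves like a fixed-size value that never touches the heap:

using StaticArrays

f = SVector(0.0, 0.0, 0.0)   # immutable, stack-allocated 3-vector
g = SVector(1.0, 2.0, 3.0)
f = f + 0.5 * g              # arithmetic produces a new SVector, still without heap allocation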

Third, diff = ... allocates a new vector. AFAIK this is never optimized out by the compiler. You need to put the allocation outside the loop and reuse it (diff .= ...). In your specific case, you should make it an SVector instead; then everything happens on the stack.
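
As a rough sketch of the reuse pattern (sizes and names here are only illustrative):

n = 4
positions = rand(3, n)
diff = zeros(3)                  # allocate the buffer once, outside the loops
for q in 1:n, k in (q + 1):n
    @views diff .= positions[:, q] .- positions[:, k]   # fill in place, no new array
    # ... use diff here ...
end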


Using StaticArrays you could write something like this

using DelimitedFiles
using StaticArrays

const G = 6.67e-11
const dt = 0.1

function main()
    data = readdlm("data.txt")::Array{Float64, 2}
    n = size(data, 1)
    positions  = [SVector{3}(data[i, 1:3]) for i in 1:size(data, 1)]
    velocities = [SVector{3}(data[i, 4:6]) for i in 1:size(data, 1)]
    masses = data[:, 7]

    println("Done reading.")

    forces = fill(zero(SVector{3, Float64}), n)

    @inbounds for q = 1:n
        for k = 1:n
            k <= q && continue

            diff = positions[q] - positions[k]
            dist = sqrt(sum(diff .^ 2))
            dist_cubed = dist ^ 3

            tmp = -G * masses[q] * masses[k] / dist_cubed
            force_qk = tmp * diff
            forces[q] += force_qk
            forces[k] -= force_qk
        end
    end

    @inbounds for q = 1:n
        positions[q] += (dt * velocities[q])
        velocities[q] += (dt / masses[q] * forces[q])
    end
end

which has (probably) good performance

julia> @time main()
Done reading.
  0.740649 seconds (226.56 k allocations: 10.159 MiB)

Also,

            force_q_view = force_q_view + force_qk
            force_k_view = force_k_view - force_qk

is unlikely to do what you want. You probably wanted something like

            force_q_view .+= force_qk
            force_k_view .-= force_qk

to update the original array. Right now it is effectively a no-op: plain assignment in Julia only rebinds the local variable to a newly allocated array; it never writes back into forces.
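
A small illustration of the difference (not from the original post):

A = zeros(3)
v = view(A, :)
v = v .+ 1      # rebinds the local name v to a new Array; A is still all zeros
v = view(A, :)
v .+= 1         # writes through the view; A is now [1.0, 1.0, 1.0]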


@foobar_lv2 Hey, I did post a link to the file I’m using in the original post: test1e4.julia.txt · GitHub. Sorry if I wasn’t clear enough that this is the input file. Also, thanks for explaining the view limitations.

@kristoffer.carlsson Yep, your implementation gives a big performance boost (down to ~2 s on my machine, from 40 s). I’m writing my thesis on parallel programming languages (Chapel, Julia) and whether they could replace existing frameworks like C+OpenMP/MPI, so every performance tip is of great value and I’m really trying to close the gap on C :smile:

Thanks to both of you for introducing me to the StaticArrays package. Coming from the C world, I’m used to working with mostly fixed-size arrays :slight_smile:
