Specific help regarding allocations and performance

Hello!

I am working on a tool in Julia which redistributes an initial set of particles inside a profile, based on SPH (smoothed particle hydrodynamics), to get a more uniform distribution. The algorithm is based on this paper, https://www.researchgate.net/publication/256688486_Particle_packing_algorithm_for_SPH_schemes, for those interested. In pictures this means going from left to right, so that an initial circular distribution inside a square profile becomes homogeneously distributed:

I have uploaded my Julia file and the relevant data file here: GitHub - AhmedSalih3d/ParticlePacking.jl: Implementation of Particle Packing Algorithm in Julia, but what I specifically want help with is the “PackStep” function I made:

#----------------------------------- PackStep -----------------------------------#
# Put one time step of the algorithm in a function
# Inputs are:
#   1. Initial position of particles, pg
#   2. Initial velocity of particles, u
#   3. Number of particles to move   , nCoords (the fixed square is excluded)
function PackStep(pg,u,nCoords)
        # Focus on one particle at a time, iter
        for iter = 1:nCoords
            # Calculate difference between position of particle iter minus all other particles
            rij = map(x -> pg[iter] .- x, pg)

            # Using LinearAlgebra, calculate the Euclidean distance between iter and each point
            RIJ = norm.(rij)

            # Transform it to a normalized distance
            q   = RIJ ./ H

            # For some reason it becomes a 2D array, I prefer it as 1D..
            q   = dropdims(q,dims=2)

            # Find all particles in near vicinity
            indices = q.<=2;

            # Set the influence of the current particle in focus, iter, to zero
            indices[iter] = 0

            # Extract only the normalized distance from the closest particles to iter
            qc  = q[indices]

            # Calculate main body of the Wendland Kernel derivative, dW/dq
            qq3 =  qc .* (qc .- 2).^3;

            # Calculate the actual dW/dq
            Wq  = AD*FAC*qq3

            # Extract the components of the distance,
            # (x_i[1],x_i[2],x_i[3]) - (x_j[1],x_j[2],x_j[3]),
            # which was calculated earlier in "rij"
            x_ij   = first.(rij[indices]);
            z_ij   = last.(rij[indices]);

            # From C++ I have learned to divide beforehand.. maybe it doesn't make sense to do so in Julia
            RIJ1 = 1 ./ RIJ[indices];

            # Calculate actual gradient values for particle iter
            Wgx  = sum(Wq .* (x_ij .* RIJ1) * H1);
            Wgz  = sum(Wq .* (z_ij .* RIJ1) * H1);

            # Extract ux and uz components and make it 1D array
            ux   = dropdims(first.(u),dims=2);
            uz   = dropdims(last.(u),dims=2);

            # Calculate change of velocity for particle iter
            dux  = (-BETA .* Wgx * V - ZETA .* ux[iter])*DT;
            duz  = (-BETA .* Wgz * V - ZETA .* uz[iter])*DT;

            # Calculate displacement of particle iter
            dx   = dux*DT;
            dz   = duz*DT;

            # Set new velocity and displacement
            # NOTE: I am aware I "cheat" since I base calculations
            # on updated values over time, so I halfway through
            # would have 50% updated and 50% non-updated values.
            # Since displacements are so small in this method, it
            # really does not do a big difference, if one sticks
            # to the original method
            u[iter]  = u[iter]  .+ tuple(dux,0,duz);
            pg[iter] = pg[iter] .+ tuple(dx,0,dz);
        end
end

“pg” is an array of tuples which holds the initial circular distribution and the fixed square, “u” is an initial velocity field used to start the algorithm, and “nCoords” is the number of particles in the circular distribution, so that calculations are only done for the particles inside the square, since the square itself is fixed. Performance-wise it is not too bad; for 100 time steps it outputs:

# Time and run the main body
 @time RunAlgorithm()
 14.160436 seconds (14.46 M allocations: 50.042 GiB, 6.42% gc time)

But of course that is way too many allocations and too much memory for my liking. I made my initial implementation using “tuple” since I read that tuples should be more efficient than (small) arrays, but this makes it a bit harder to alter “pg” with in-place operations.
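To make that point concrete, here is a tiny illustration (made-up values) of what I mean by “harder to alter in place”: a tuple is immutable, so an entry in an array of tuples can only be replaced as a whole, never mutated field by field.

pg = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
pg[1] = pg[1] .+ (0.1, 0.0, 0.1)   # fine: builds a new tuple and stores it
# pg[1][1] += 0.1                  # would error, tuples cannot be mutated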

The code in the GitHub repository together with the data file should run seamlessly, since no niche packages are used.

I have read the performance tips section, but I am still struggling to optimize this further without resorting to @threads, parallel computing etc.

Looking forward to hearing any feedback/suggestions.

Kind regards

What’s V? I don’t see it defined anywhere.

Sorry, I didn’t include it since I thought it would make the forum post too long. “V” is a scalar constant; it can be found under the “Some Constants” header. For convenience the whole code is also included below:

# VERSION 1.0
# NAIVE IMPLEMENTATION

# Define used libraries
using DelimitedFiles
using LinearAlgebra
using Plots

# GOAL OF THIS CODE:
#   It runs an iterative algorithm on a set of particle positions given an initial
#   velocity field, with the goal of filling white space. More information here;
#   https://www.researchgate.net/publication/256688486_Particle_packing_algorithm_for_SPH_schemes

# Current Benchmark results, using BenchmarkTools, @btime:
# @btime RunAlgorithm()
# 12.940 s (14464000 allocations: 50.04 GiB)
# And using @benchmark:
#@benchmark RunAlgorithm()
#BenchmarkTools.Trial: 
#memory estimate:  50.04 GiB
#allocs estimate:  14464000
#--------------
#minimum time:     12.401 s (3.89% GC)
#median time:      12.401 s (3.89% GC)
#mean time:        12.401 s (3.89% GC)
#maximum time:     12.401 s (3.89% GC)
#--------------
#samples:          1
#evals/sample:     1


# This code consists of:
#   1. drawSquare: 
#           Generate particles in a square shape. The square is not filled.
#   2. read2DParticleGroup: 
#           Reads a particle distribution and hard-coded use of drawSquare
#   3. Some Constants: 
#           These are hardcoded for these exact values
#   4. PackStep: 
#           One step of the algorithm, which moves the particles towards empty space
#   5. RunAlgorithm():
#           Runs the final algorithm

# Function to draw a square of points
# Inputs are:
#   1. Left corner of square horizontal coordinate, x
#   2. Left corner of square vertical   coordinate, y
#   3. Width       of square                      , s
#   4. Number      of particles on each edge      , n

#---------------------------- drawSquare -----------------------------------------#
function drawSquare(x,y,s,n)
    points = [x   y
              x+s y
              x+s y+s
              x   y+s]
    squarePoints = zeros(4*n,2); 
    
    squarePoints[1:n,1] = range(points[1,1],points[2,1],length=n)
    squarePoints[1:n,2] = range(points[1,2],points[2,2],length=n)
    
    squarePoints[n+1:2*n,1] = range(points[2,1],points[3,1],length=n)
    squarePoints[n+1:2*n,2] = range(points[2,2],points[3,2],length=n)
    
    squarePoints[2*n+1:3*n,1] = range(points[3,1],points[4,1],length=n)
    squarePoints[2*n+1:3*n,2] = range(points[3,2],points[4,2],length=n)
    
    squarePoints[3*n+1:4*n,1] = range(points[4,1],points[1,1],length=n)
    squarePoints[3*n+1:4*n,2] = range(points[4,2],points[1,2],length=n)
    return squarePoints
end

#---------------------------- read2DParticleGroup --------------------------------#
#Function which constructs the relevant initial particle distribution, pg
function read2DParticleGroup(filename::AbstractString)
    coords = readdlm(filename, '\t', Float64, '\n');
    nCoords = size(coords)[1]
    s1 = drawSquare(-1.025,-1.025,2.05,100);
    ns1 = size(s1)[1]

    pg = fill(tuple(0.0,0.0,0.0),nCoords+ns1,1)

    for i = 1:nCoords
        xval = coords[i,1];
        zval = coords[i,2];
        pg[i] = tuple(xval,0,zval)
    end

    for i = 1:ns1
        j = i+nCoords
        xval = s1[i,1];
        zval = s1[i,2];
        pg[j] = tuple(xval,0,zval)
    end

    return pg,nCoords
end
#---------------------------- Some Constants ------------------------------------#
# Set some constants
const H = 0.04
const AD = 348.15;
const FAC = 5/8;
const BETA = 4;
const ZETA = 0.060006;
const V    = 0.0011109;
const DT   = 0.016665;
const H1   = 1/H;

#----------------------------------- PackStep -----------------------------------#
# Put one time step of the algorithm in a function
# Inputs are:
#   1. Initial position of particles, pg
#   2. Initial velocity of particles, u
#   3. Number of particles to move   , nCoords (the fixed square is excluded)
function PackStep(pg,u,nCoords)
        # Focus on one particle at a time, iter
        for iter = 1:nCoords
            # Calculate difference between position of particle iter minus all other particles
            rij = map(x -> pg[iter] .- x, pg)

            # Using LinearAlgebra, calculate the Euclidean distance between iter and each point
            RIJ = norm.(rij)

            # Transform it to a normalized distance
            q   = RIJ ./ H

            # For some reason it becomes a 2D array, I prefer it as 1D..
            q   = dropdims(q,dims=2)

            # Find all particles in near vicinity
            indices = q.<=2;

            # Set the influence of the current particle in focus, iter, to zero
            indices[iter] = 0

            # Extract only the normalized distance from the closest particles to iter
            qc  = q[indices]

            # Calculate main body of the Wendland Kernel derivative, dW/dq
            qq3 =  qc .* (qc .- 2).^3;

            # Calculate the actual dW/dq
            Wq  = AD*FAC*qq3

            # Extract the components of the distance,
            # (x_i[1],x_i[2],x_i[3]) - (x_j[1],x_j[2],x_j[3]),
            # which was calculated earlier in "rij"
            x_ij   = first.(rij[indices]);
            z_ij   = last.(rij[indices]);

            # From C++ I have learned to divide beforehand.. maybe it doesn't make sense to do so in Julia
            RIJ1 = 1 ./ RIJ[indices];

            # Calculate actual gradient values for particle iter
            Wgx  = sum(Wq .* (x_ij .* RIJ1) * H1);
            Wgz  = sum(Wq .* (z_ij .* RIJ1) * H1);

            # Extract ux and uz components and make it 1D array
            ux   = dropdims(first.(u),dims=2);
            uz   = dropdims(last.(u),dims=2);

            # Calculate change of velocity for particle iter
            dux  = (-BETA .* Wgx * V - ZETA .* ux[iter])*DT;
            duz  = (-BETA .* Wgz * V - ZETA .* uz[iter])*DT;

            # Calculate displacement of particle iter
            dx   = dux*DT;
            dz   = duz*DT;

            # Set new velocity and displacement
            # NOTE: I am aware I "cheat" since I base calculations
            # on updated values over time, so I halfway through
            # would have 50% updated and 50% non-updated values.
            # Since displacements are so small in this method, it
            # really does not do a big difference, if one sticks
            # to the original method
            u[iter]  = u[iter]  .+ tuple(dux,0,duz);
            pg[iter] = pg[iter] .+ tuple(dx,0,dz);
        end
end

#---------------------------- RunAlgorithm ------------------------------------#
function RunAlgorithm()
    # Generate initial data and number of data points
    pg,nCoords = read2DParticleGroup("CircleGridUniform.txt");

    # Generate initial velocity field
    u = map(x -> x.*tuple(1,0.0,-1).*50,pg);
    u[nCoords+1:end] .= tuple.(0.0,0.0,0.0)

    for t = 1:100
        PackStep(pg,u,nCoords)
    end

    # Drop final result
    plot(dropdims(first.(pg),dims=2),dropdims(last.(pg),dims=2), 
        seriestype = :scatter, 
        title = "My Scatter Plot", 
        aspect_ratio=1)
end

# Time and run the main body
@time RunAlgorithm()

In that case, you should probably write (-BETA .* Wgx .* V .- ZETA .* ux[iter]).*DT; instead of (-BETA .* Wgx * V - ZETA .* ux[iter])*DT; since this will cause everything to be fused and prevent 3 allocations per loop.
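To illustrate the fusion point with a made-up example (arrays a and b, not from your code): every un-dotted operation in an expression materializes its own temporary array, whereas a fully dotted expression is fused into a single loop producing one result array.

a = rand(1000); b = rand(1000)
c = (-4 .* a) * 0.5 - 2 .* b     # * and - are not dotted, so extra temporaries are built
d = -4 .* a .* 0.5 .- 2 .* b     # fully dotted: fused into one pass, one allocation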

So I did that now, basically like this:

function PackStep(pg,u,nCoords)
        # Focus on one particle at a time, iter
        for iter = 1:nCoords
            # Calculate difference between position of particle iter minus all other particles
            rij = map(x -> pg[iter] .- x, pg)

            # Using LinearAlgebra, calculate the Euclidean distance between iter and each point
            RIJ = norm.(rij)

            # Transform it to a normalized distance
            q   = RIJ ./ H

            # For some reason it becomes a 2D array, I prefer it as 1D..
            q   = dropdims(q,dims=2)

            # Find all particles in near vicinity
            indices = q.<=2;

            # Set the influence of the current particle in focus, iter, to zero
            indices[iter] = 0

            # Extract only the normalized distance from the closest particles to iter
            qc  = q[indices]

            # Calculate main body of the Wendland Kernel derivative, dW/dq
            qq3 =  qc .* (qc .- 2).^3;

            # Calculate the actual dW/dq
            Wq  = AD.*FAC.*qq3

            # Extract the components of the distance,
            # (x_i[1],x_i[2],x_i[3]) - (x_j[1],x_j[2],x_j[3]),
            # which was calculated earlier in "rij"
            x_ij   = first.(rij[indices]);
            z_ij   = last.(rij[indices]);

            # From C++ I have learned to divide beforehand.. maybe it doesn't make sense to do so in Julia
            RIJ1 = 1 ./ RIJ[indices];

            # Calculate actual gradient values for particle iter
            Wgx  = sum(Wq .* (x_ij .* RIJ1) .* H1);
            Wgz  = sum(Wq .* (z_ij .* RIJ1) .* H1);

            # Extract ux and uz components and make it 1D array
            ux   = dropdims(first.(u),dims=2);
            uz   = dropdims(last.(u),dims=2);

            # Calculate change of velocity for particle iter
            dux  = (-BETA .* Wgx .* V .- ZETA .* ux[iter]).*DT;
            duz  = (-BETA .* Wgz .* V .- ZETA .* uz[iter]).*DT;

            # Calculate displacement of particle iter
            dx   = dux.*DT;
            dz   = duz.*DT;

            # Set new velocity and displacement
            # NOTE: I am aware I "cheat" since I base calculations
            # on updated values over time, so I halfway through
            # would have 50% updated and 50% non-updated values.
            # Since displacements are so small in this method, it
            # really does not do a big difference, if one sticks
            # to the original method
            u[iter]  = u[iter]  .+ tuple(dux,0,duz);
            pg[iter] = pg[iter] .+ tuple(dx,0,dz);
        end
end

And I also used some more dots for Wgx, Wgz and Wq. This now gives the result:

# Time and run the main body
 @time RunAlgorithm()
 16.441760 seconds (13.90 M allocations: 49.933 GiB, 7.18% gc time)

Slightly slower, but fewer allocations.

Kind regards

Your coding style is very allocation-heavy, with temporary arrays created at almost every line, with no re-use.

Off the top of my head: can you re-write PackStep as a double loop, avoiding the temporary arrays?

I have always been under the impression that vectorized array operations are better than for loops, even in Julia - which is why I did it the way I did, but I see your point. I can write it as a double loop and test it; that will take a bit of time.

Kind regards

No, that is completely wrong. Looping is almost always the fastest way.
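As a rough illustration (a made-up micro-benchmark, not your exact code), compare a version that builds a temporary array with a plain loop:

using BenchmarkTools

xs = rand(10_000)

dist_alloc(xs, i) = sum(abs.(xs[i] .- xs))   # allocates a temporary array on every call

function dist_loop(xs, i)                    # same result, no allocations
    s = 0.0
    @inbounds for x in xs
        s += abs(xs[i] - x)
    end
    return s
end

# @btime dist_alloc($xs, 1)
# @btime dist_loop($xs, 1)    # 0 allocations, usually noticeably faster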

Thank you for highlighting that for me. I think I got a bit too carried away with the idea of dot notation, broadcasting etc., and with the influence from Matlab. I wrote the loop pretty much exactly as I had done it earlier in C++:

function PackStep(pg,u,nCoords)
    for i = 1:nCoords
        global Wgx = 0;
        global Wgz = 0;
        p_i   = pg[i];

        for j = 1:(nCoords+400)
            if (i != j)
                p_j = pg[j];
                rij = p_i .- p_j;
                RIJ = norm(rij);
                q   = RIJ/H;
                if (q <= 2)
                    qq3 = q*(q-2)^3;
                    Wq  = AD * FAC * qq3;
                    
                    x_ij = rij[1];
                    z_ij = rij[3];

                    Wgx += Wq * (x_ij/RIJ) * H1;
                    Wgz += Wq * (z_ij/RIJ) * H1;
                end
            end
        end

        dux = (-BETA * Wgx * V - ZETA * u[i][1])*DT;
        duz = (-BETA * Wgz * V - ZETA * u[i][3])*DT;

        dx  = dux*DT;
        dz  = duz*DT;

        u[i]  =   u[i] .+ (dux, 0, duz)
        pg[i] =  pg[i] .+ (dx,  0, dz)
    end
end

And the result is now:

# Time and run the main body
  @time RunAlgorithm()
  6.603412 seconds (24.75 M allocations: 464.756 MiB, 0.64% gc time)

And it produces the same distribution as before:

(image: the resulting particle distribution)

And now it is in my opinion even easier to work with.

If anyone has further suggestions for improving this, please let me know. Right now it is only about 1 second slower than the C++ code I made earlier, which is quite impressive.

Kind regards

Why did you mark these as globals? You should avoid globals, in general.

Also, it looks like they should be floats, that is 0.0 instead of 0.
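The float part matters for type stability: starting the accumulator at 0 (an Int) and then adding Float64 values means its type changes inside the loop, which the compiler has to model as a Union. A minimal sketch (hypothetical function names):

function sum_unstable(xs)
    s = 0            # Int to start with
    for x in xs
        s += x       # becomes Float64 after the first iteration
    end
    return s         # inferred as Union{Float64, Int64}
end

function sum_stable(xs)
    s = 0.0          # Float64 from the start
    for x in xs
        s += x
    end
    return s         # always Float64
end

# @code_warntype sum_unstable(rand(10))   # flags the Union return type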

Thanks again - and you are right, it should be a float. The reason for using “global” initially was that, when testing, I found the following. If I used global:

6.128403 seconds (24.79 M allocations: 466.441 MiB, 0.63% gc time)

And then if I removed it:

8.382073 seconds (75.47 k allocations: 3.075 MiB)

A lot fewer allocations and much less memory, but about 25% slower. Applying your note about using a float:

Without global:

5.616380 seconds (75.47 k allocations: 3.075 MiB)

With global:

6.032726 seconds (24.79 M allocations: 466.479 MiB, 0.65% gc time)

So now it is both faster and uses less memory without globals! :slight_smile: I have also used the @inbounds macro on the for loops.

Next step is to try using @threads, I guess.

Kind regards

Before that, try adding @avx to the outer loop (from the LoopVectorization package). It might give you a free factor of 2 or 3 by using SIMD instructions more aggressively.

Okay, I will try that. Currently I have used @threads, but that makes it important to allocate temporary arrays to store the final results in, to ensure that I calculate properties based on the same time step for all particles. Therefore the code now looks like:

function PackStep(pg,u,nCoords,nOver)
    nTot = nCoords + nOver;
    u_tmp  = fill(tuple(0.0,0.0,0.0),(nTot,));
    pg_tmp = fill(tuple(0.0,0.0,0.0),(nTot,));
    @inbounds @threads for i = 1:nCoords
        Wgx = 0.0;
        Wgz = 0.0;
        p_i   = pg[i];

        @inbounds for j = 1:nTot
            if (i != j)
                p_j = pg[j];
                rij = p_i .- p_j;
                RIJ = norm(rij);
                RIJ1 = 1.0 / RIJ;
                q   = RIJ/H;
                if (q <= 2)
                    qq3 = q*(q-2)^3;
                    Wq  = AD * FAC * qq3;  
                    
                    x_ij = rij[1];
                    z_ij = rij[3];

                    Wgx += Wq * (x_ij * RIJ1) * H1;
                    Wgz += Wq * (z_ij * RIJ1) * H1;
                end
            end
        end

        # Define here since I use it a lot
        u_i = u[i];

        dux = (-BETA * Wgx * V - ZETA * u_i[1])*DT;
        duz = (-BETA * Wgz * V - ZETA * u_i[3])*DT;

        dx  = dux*DT;
        dz  = duz*DT;

        u_tmp[i]  =   u_i .+ (dux, 0.0, duz)
        pg_tmp[i] =  pg[i] .+ (dx,  0.0, dz)
    end

    @inbounds @threads for i = 1:nCoords
        u[i]  = u_tmp[i];
        pg[i] = pg_tmp[i];
    end

end
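A note for anyone reproducing this: @threads only helps if Julia was started with more than one thread, e.g. by setting the JULIA_NUM_THREADS environment variable before launching. This can be checked with:

using Base.Threads
nthreads()   # number of threads available to @threads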

And the results are, in my opinion, quite impressive now:

@benchmark include("particle.jl")
  2.253426 seconds (217.18 k allocations: 25.918 MiB)
  1.662854 seconds (217.11 k allocations: 25.916 MiB, 0.45% gc time)
  1.627950 seconds (217.09 k allocations: 25.917 MiB, 0.95% gc time)
  1.780504 seconds (217.11 k allocations: 25.916 MiB, 0.37% gc time)
  1.937437 seconds (217.14 k allocations: 25.920 MiB)
  2.045384 seconds (217.12 k allocations: 25.916 MiB, 0.38% gc time)
  1.794380 seconds (217.14 k allocations: 25.917 MiB, 0.35% gc time)
  1.884522 seconds (217.14 k allocations: 25.917 MiB)
BenchmarkTools.Trial: 
  memory estimate:  56.53 MiB
  allocs estimate:  840294
  --------------
  minimum time:     1.974 s (0.32% GC)
  median time:      2.071 s (0.34% GC)
  mean time:        2.089 s (0.34% GC)
  maximum time:     2.222 s (0.35% GC)
  --------------
  samples:          3
  evals/sample:     1

Kind regards, going to test your suggestion

I am getting a lot of different errors when trying to use @avx on the outermost for loop, i.e. for i = 1:nCoords. At the if statements in particular, it says:

ERROR: LoadError: LoadError: "Don't know how to handle expression:\nif i != j\n   

Is @avx difficult to use with if statements such as mine, or am I doing something very wrong?

EDIT: I have also removed all @threads and @inbounds

Kind regards

Oh, I missed that. @avx won’t work here simply since it can’t deal with you not wanting the diagonal.

I see, maybe one can rewrite it in such a way that the diagonal does not contribute in the end?
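One idea (just an untested sketch, and it assumes splitting pg into plain coordinate vectors px and pz, since as far as I understand @avx works on flat arrays of numbers rather than arrays of tuples) would be to make the inner loop branchless, so that the diagonal and far-away particles simply contribute zero:

px_i = px[i]; pz_i = pz[i]
Wgx = 0.0; Wgz = 0.0
for j = 1:nTot
    xij  = px_i - px[j]
    zij  = pz_i - pz[j]
    RIJ  = sqrt(xij^2 + zij^2)
    q    = RIJ * H1
    # 1.0 inside the kernel support and off the diagonal, 0.0 otherwise
    mask = ifelse((RIJ > 0.0) & (q <= 2.0), 1.0, 0.0)
    RIJ1 = 1.0 / (RIJ + eps())        # eps() avoids 0/0 on the diagonal
    Wq   = AD * FAC * q * (q - 2.0)^3
    Wgx += mask * Wq * xij * RIJ1 * H1
    Wgz += mask * Wq * zij * RIJ1 * H1
end

No idea yet whether @avx accepts this as written, but at least the i != j branch is gone.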

Nevertheless, I found @simd and @fastmath too. They give a bit of performance, but should I just put @fastmath in front of every calculation?

Kind regards