Fastest way to calculate a rasterised Voronoi diagram without a GPU

Context:
Hi everyone, I’m using Turing.jl with NUTS to infer the parameters of a chaotic ODE system. Because of the chaos, inference on the particle positions becomes more and more difficult as the time window of the inference gets longer, since the paths stray further and further away. Hence I am doing inference on the density of particles over time instead.

The “natural” way of calculating densities (the one no reviewer will complain about) is to use a Voronoi diagram: compute the area of the cell around each particle and invert it to get the density around that particle. So each time Turing evaluates the log probability, it solves the ODE, calculates a Voronoi diagram on a grid at each dt, and then I perform the additional step of subsampling the density grid.
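For concreteness, a minimal sketch of that last step, assuming the per-colour pixel counts from the rasterised diagram (the count_colours vector in the code below) stand in for cell areas; the helper name and the grid spacing dx are mine:

# Sketch: turn per-colour pixel counts from the rasterised diagram into a
# density estimate per particle. count_colours[c] is the number of grid cells
# owned by particle c; dx (hypothetical) is the side length of one grid cell.
function cell_densities(count_colours, dx)
    areas = count_colours .* dx^2   # approximate Voronoi cell areas
    return 1 ./ areas               # density ≈ 1 / area around each particle
end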

Disclaimer: This is for my Master’s thesis, if I end up using any code written by anyone else then I will credit them in the acknowledgements section unless they would prefer not to be mentioned.

Problem:
To calculate rasterised Voronoi diagrams (grids in which each cell is coloured/marked with the colour of its closest seed) I am using the jump flood algorithm. Unfortunately, while this is fast, it is meant to be run in parallel over all pixels. Since I am running this on CPUs, I’m not sure how best to take advantage of that.

Additionally, my code (below) is not optimised for memory or speed, just to work. I’d like to reduce the memory allocated each iteration, as this is seriously slowing down the inference. I would highly appreciate any contributions to making this code faster. An easy win might be changing the datatype I’m using to store seeds from Set to something else, but I’m not sure.

function distance(x, y)
    sqrt((x[1] - y[1])^2 + (x[2] - y[2])^2)
end

function jump_flood_voronoi!(grid, original_seeds; count_colours = zeros(Int64, length(original_seeds)))
    N, M = size(grid)
    colours = 1:length(original_seeds)
    # Per-colour bookkeeping: cells currently owned by each colour, cells newly
    # claimed this pass, and cells lost to another colour this pass.
    seeds = [Set{Tuple{Int64,Int64}}() for _ in colours]
    new_seeds = [Set{Tuple{Int64,Int64}}() for _ in colours]
    seeds_to_delete = [Set{Tuple{Int64,Int64}}() for _ in colours]
    for (colour, seed) in enumerate(original_seeds)
        grid[seed...] = colour
        push!(seeds[colour], seed)
        count_colours[colour] += 1
    end
    # For correctness, many jump flood algorithms run additional passes
    base_nsteps = ceil(Int, log2(max(M, N))) - 1
    extra_steps = 2
    extra_stepsizes = (2, 1)
    step = 2^base_nsteps
    ncompleted_steps = 0
    while ncompleted_steps < base_nsteps + extra_steps
        for colour in colours
            original_seed_coords = original_seeds[colour]
            for seed in seeds[colour]
                for direction in ((1, 1), (1, 0), (1, -1), (0, 1), (0, -1), (-1, 1), (-1, 0), (-1, -1))
                    i, j = Int.(seed .+ (step .* direction))
                    if all(1 .<= (i, j) .<= (N, M))
                        if grid[i, j] == colour
                            continue
                        elseif grid[i, j] == 0
                            grid[i, j] = colour
                            push!(new_seeds[colour], (i, j))
                            count_colours[colour] += 1
                        else
                            current_colour = grid[i, j]
                            current_dist = distance((i, j), original_seeds[current_colour])
                            new_dist = distance((i, j), original_seed_coords)
                            if new_dist < current_dist
                                grid[i, j] = colour
                                push!(new_seeds[colour], (i, j))
                                count_colours[colour] += 1
                                push!(seeds_to_delete[current_colour], (i, j))
                                count_colours[current_colour] -= 1
                            end
                        end
                    end
                end
            end
        end
        for colour in colours
            union!(seeds[colour], new_seeds[colour])
            setdiff!(seeds[colour], seeds_to_delete[colour])
        end
        new_seeds .= (Set{Tuple{Int64,Int64}}() for _ in colours)
        seeds_to_delete .= (Set{Tuple{Int64,Int64}}() for _ in colours)
        ncompleted_steps += 1
        if ncompleted_steps > base_nsteps
            step = extra_stepsizes[ncompleted_steps-base_nsteps]
        else
            step ÷= 2
        end
    end
    return nothing
end

Here’s an example of the code running:

using Plots

grid = zeros(Int64, 1000, 1000)
original_seeds = reinterpret(reshape, Tuple{Int64,Int64}, rand(1:1000, 2, 100))

jump_flood_voronoi!(grid, original_seeds)
heatmap(grid, aspect_ratio = 1.0, colour = palette(:viridis), legend = :none, size = (500, 500))

Currently, my benchmark says:

julia> @benchmark jump_flood_voronoi!(grid, original_seeds)
BenchmarkTools.Trial: 4204 samples with 1 evaluation.
 Range (min … max):  784.421 μs … 16.627 ms  ┊ GC (min … max):  0.00% … 91.92%
 Time  (median):     898.941 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.174 ms ±  1.133 ms  ┊ GC (mean ± σ):  16.61% ± 14.92%

  █▇▄▅▄▂                                                       ▁
  ███████▇▄▆▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄▇██▇▅▅▃▁▁▃ █
  784 μs        Histogram: log(frequency) by time      7.33 ms <

 Memory estimate: 1.59 MiB, allocs estimate: 10828.

Your benchmark results look misleading - I think you need to zero out the grid each time for a fair run.

The optimal scheme will depend on the grid size and point density. For the values in your example, the algorithm in this recent paper, with KDTrees from the NearestNeighbors package for point tests, looks promising.

For a really big grid with high density, I suspect that the close connection to an elliptic PDE points to multigrid methods.

Both approaches can make good use of multithreading on a CPU (or several, for large problems). Regrettably I can’t provide code, but perhaps my suggestions will lead you or other readers in a fruitful direction.


To reduce the memory allocation, maybe you can pre-allocate seeds, new_seeds and seeds_to_delete outside the function, and unroll these lines into explicit for loops, since the right-hand side of each expression allocates a new array every time the function is called.

seeds = [Set{Tuple{Int64,Int64}}() for _ in colours]
new_seeds = [Set{Tuple{Int64,Int64}}() for _ in colours]
seeds_to_delete = [Set{Tuple{Int64,Int64}}() for _ in colours]
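For example, a minimal sketch of such pre-allocation, with a hypothetical VoronoiWorkspace holding the three vectors of sets, which are emptied in place between calls instead of being re-created:

# Hypothetical workspace, allocated once and reused across log-probability calls
struct VoronoiWorkspace
    seeds::Vector{Set{Tuple{Int,Int}}}
    new_seeds::Vector{Set{Tuple{Int,Int}}}
    seeds_to_delete::Vector{Set{Tuple{Int,Int}}}
end

VoronoiWorkspace(ncolours) = VoronoiWorkspace(
    [Set{Tuple{Int,Int}}() for _ in 1:ncolours],
    [Set{Tuple{Int,Int}}() for _ in 1:ncolours],
    [Set{Tuple{Int,Int}}() for _ in 1:ncolours])

# Clear the sets in place rather than allocating fresh ones on every call
function reset!(ws::VoronoiWorkspace)
    foreach(empty!, ws.seeds)
    foreach(empty!, ws.new_seeds)
    foreach(empty!, ws.seeds_to_delete)
    return ws
end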

In fact, there is a really fast Julia package for generating bounded Voronoi diagrams together with the area of each Voronoi cell, namely VoronoiCells.jl. Maybe this can help to improve the performance.
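Roughly like this (a sketch following the VoronoiCells.jl README, so please double-check the current API):

using VoronoiCells, GeometryBasics

# Exact Voronoi tessellation of random points in the unit square, plus the
# area of every cell (which could then be inverted to get densities)
rect = Rectangle(Point2(0.0, 0.0), Point2(1.0, 1.0))
points = [Point2(rand(), rand()) for _ in 1:100]
tess = voronoicells(points, rect)
areas = voronoiarea(tess)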

Benchmarking with

@btime jump_flood_voronoi!(grid, original_seeds) setup=(
    grid = zeros(Int64, 1000, 1000); 
    original_seeds = reinterpret(reshape, Tuple{Int64,Int64}, rand(1:1000, 2, 100)))

yields

  1.236 s (32131 allocations: 715.41 MiB)

Profiling with

using Profile
Profile.clear()
grid = zeros(Int64, 1000, 1000)
original_seeds = reinterpret(reshape, Tuple{Int64,Int64}, rand(1:1000, 2, 100))
jump_flood_voronoi!(grid, original_seeds)
grid = zeros(Int64, 1000, 1000)
original_seeds = reinterpret(reshape, Tuple{Int64,Int64}, rand(1:1000, 2, 100))
@profile jump_flood_voronoi!(grid, original_seeds)
Juno.profiler()

shows the profiler’s flame graph (screenshot not reproduced here).

I just read the Wikipedia entry on the jump flooding algorithm, and the complexity of your implementation doesn’t seem right to me (judging by the depth of the for-loop nesting…).


NeutralLandscapes.jl has random discrete Voronoi using NearestNeighbors.jl. You could use that without the randomly sampled elements.

https://github.com/EcoJulia/NeutralLandscapes.jl/blob/301605bb8f58729604bbbeafef94aceb238bf805/src/algorithms/nnelement.jl#L21-L43

I’m not sure how fast it will be for your use-case.
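The core of that approach is a per-pixel nearest-seed query against a KD-tree; stripped down to just that (a rough sketch using NearestNeighbors.jl directly, not the package’s exact code), it would look something like:

using NearestNeighbors

# Rasterised Voronoi by querying the nearest seed for every pixel.
# seeds is a vector of (i, j) tuples; the KD-tree makes each query O(log n).
function kdtree_voronoi!(grid, seeds)
    coords = [Float64(s[d]) for d in 1:2, s in seeds]   # 2 × n matrix of seed coordinates
    tree = KDTree(coords)
    for j in axes(grid, 2), i in axes(grid, 1)
        idx, _ = nn(tree, Float64[i, j])   # index of the closest seed
        grid[i, j] = idx
    end
    return grid
end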

I have some trouble understanding why you need a rasterised Voronoi diagram. Are you building density statistics and are fine with the granularity, or do you need to sort/act on certain local densities and just need a fast method?

I didn’t follow this very attentively, but in case it helps, GMT (via GDAL) can compute Voronoi grids and I think it uses multi-threading (but I’m not sure). See Voronoi Visualization - #4 by joa-quim

Thank you, both of your suggestions provide paths for me to look into!

Yes, I hate allocating inside the function, but I don’t think I can know the size of these arrays before the function is called. I only know how large they could possibly be, and it seems wasteful to allocate such a large array (that said, if I’m allocating it each time anyway, it’s probably still better to do so).

I was using VoronoiCells and then rasterising the result before, but it didn’t play well with Turing due to its type system.

For my use case I do eventually require the granularity, i.e. eventually I will need* to rasterise the result of a Voronoi diagram area calculation in order to do inference on the values of the generated grid. For this reason I am skipping the costly calculation of the exact Voronoi diagram and just doing it rasterised from the start.

This was my thinking, at least. The complexity of Fortune’s sweep is only O(n log n), which is attractive, but I thought that, given how intricate the algorithm is, if I tried to implement it, did a bad job, and it turned out to be slower than what I already had, then I’d be a sad cookie. I’ve looked into using packages that calculate an exact Voronoi diagram, but they don’t like being differentiated by ForwardDiff, and so aren’t usable with Hamiltonian Monte Carlo.

*need is probably too strong, but at this point I’m just doing what I’m told and shelving any bright ideas that would take me more than a few weeks to explore fully.

There is also a fast distance transform implementation in JuliaImages.
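A sketch of how that could give the rasterised diagram directly, assuming feature_transform from ImageMorphology.jl (it returns, for every pixel, the CartesianIndex of the nearest true pixel in a boolean mask):

using ImageMorphology   # part of the JuliaImages ecosystem

function feature_transform_voronoi(N, M, seeds)
    mask = falses(N, M)
    for s in seeds
        mask[s...] = true            # mark the seed pixels
    end
    ft = feature_transform(mask)     # nearest seed position for every pixel
    # Translate nearest-seed coordinates back into colour indices
    colour_of = Dict(CartesianIndex(s...) => c for (c, s) in enumerate(seeds))
    return [colour_of[ft[I]] for I in CartesianIndices(mask)]
end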

Aren’t the most efficient methods typically fast sweep algorithms like Meijster? They are O(n).

That algorithm ended up being really nice and easy to implement, giving a 10x speedup over my old algorithm:

function divide_and_conquer_voronoi!(grid, original_seeds, count_colours = zeros(Int64, length(original_seeds)))
    N, M = size(grid)
    corners = ((1, 1), (N, 1), (1, M), (N, M))
    closest_colours = map(corner -> findmin(seed -> distance(corner, seed), original_seeds)[2], corners)
    # If all four corners share the same nearest seed then, by convexity of
    # Voronoi cells, the whole rectangle lies in that cell and can be filled.
    if all(y -> y == closest_colours[1], closest_colours)
        grid .= closest_colours[1]
        count_colours[closest_colours[1]] += N * M
    else
        N₁ = ceil(Int, N / 2)
        M₁ = ceil(Int, M / 2)
        offsets = ((0, 0), (N₁, 0), (0, M₁), (N₁, M₁))
        tl_grid = view(grid, 1:N₁, 1:M₁)
        bl_grid = view(grid, N₁+1:N, 1:M₁)
        tr_grid = view(grid, 1:N₁, M₁+1:M)
        br_grid = view(grid, N₁+1:N, M₁+1:M)
        for (i, split_grid) in enumerate((tl_grid, bl_grid, tr_grid, br_grid))
            if all(size(split_grid) .> 0)
                new_seeds = map(seed -> seed .- offsets[i], original_seeds)
                divide_and_conquer_voronoi!(split_grid, new_seeds, count_colours)
            end
        end
    end
    return nothing
end

Scrap that, it’s slower despite the smaller memory allocation:

grid = zeros(Int64, 100, 100)
points = Tuple(Tuple.(rand(1:100, 2) for i in 1:10))
@btime divide_and_conquer_voronoi!(grid, points);
grid = zeros(Int64, 100, 100);
@btime jump_flood_voronoi!(grid, points);
468.952 μs (1 allocation: 160 bytes)
54.396 μs (862 allocations: 128.12 KiB)

If you can estimate the size of the array, maybe sizehint! can help.
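For instance (the capacity of 10_000 is just a guess at an upper bound):

s = Set{Tuple{Int,Int}}()
sizehint!(s, 10_000)   # reserve hash-table capacity up front to avoid repeated rehashing as the set grows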

I probably misjudged. Nevertheless I don’t understand the need for the sets here. A naive implementation of the mentioned Wikipedia algorithm

function jfa!(grid, seeds)
    for (color, (x, y)) in enumerate(seeds)
        grid[x, y] = convert(eltype(grid), color)
    end
    k = max(size(grid)...)
    @inbounds while k > 1
        k = k ÷ 2 + k % 2   # ceiling division: halve the step size each pass
        for y in 1:size(grid, 2), x in 1:size(grid, 1)
            for j in (-k, 0, k), i in (-k, 0, k)
                if 1 <= y + j <= size(grid, 2) && 1 <= x + i <= size(grid, 1)
                    colorq = grid[x + i, y + j]
                    if colorq !== 0
                        colorp = grid[x, y]
                        if colorp === 0
                            grid[x, y] = colorq
                        else
                            distps = (seeds[colorp][1] - x)^2 + (seeds[colorp][2] - y)^2
                            distqs = (seeds[colorq][1] - x)^2 + (seeds[colorq][2] - y)^2
                            if distps > distqs
                                grid[x, y] = colorq
                            end
                        end
                    end
                end
            end
        end
    end
end

and benchmark via

using BenchmarkTools, Random

@btime jump_flood_voronoi!(grid, original_seeds) setup=(
    Random.seed!(42);
    grid = zeros(Int64, 1000, 1000);
    original_seeds = reinterpret(reshape, Tuple{Int64,Int64}, rand(1:1000, 2, 100)))

@btime jfa!(grid, seeds) setup=(
    Random.seed!(42);
    grid = zeros(Int, 1000, 1000);
    seeds = [(rand(1:size(grid, 1)), rand(1:size(grid, 2))) for i in 1:100]) 

shows

  1.135 s (32012 allocations: 692.84 MiB)
  211.997 ms (0 allocations: 0 bytes)

Edit: fixed memory access order, added @inbounds


Thanks for the reply, that’s fantastic, so I’ll replace my implementation with that one. Looking back at my code, it looks like I completely misunderstood how it’s supposed to loop.

Is there any hope of @simd or the @turbo macro from LoopVectorization.jl giving some extra speedup here? It shouldn’t matter what order the pixels are evaluated in.
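For context, one way I could imagine making the pixel order truly irrelevant (and the loop safe to thread) is a double-buffered pass that reads from one grid and writes to another, so threads never see cells written in the same pass; a rough sketch (jfa_pass! is just a placeholder name):

using Base.Threads

# One jump-flood pass with separate read/write buffers: each pixel's result
# depends only on src, so the columns can be split across threads without racing.
function jfa_pass!(dst, src, seeds, k)
    N, M = size(src)
    @threads for y in 1:M
        for x in 1:N
            best = src[x, y]
            for j in (-k, 0, k), i in (-k, 0, k)
                (1 <= x + i <= N && 1 <= y + j <= M) || continue
                c = src[x + i, y + j]
                c == 0 && continue
                if best == 0 ||
                   (seeds[c][1] - x)^2 + (seeds[c][2] - y)^2 <
                   (seeds[best][1] - x)^2 + (seeds[best][2] - y)^2
                    best = c
                end
            end
            dst[x, y] = best
        end
    end
    return dst
end

The two buffers would be swapped between passes, at the cost of a second grid of memory compared with the in-place version.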

The only improvement I found was using @fastmath in a refactored

function distance(p, q)
    @fastmath (p[1] - q[1])^2 + (p[2] - q[2])^2
end

On a related note, you wrote that the divide and conquer version is slower despite the smaller memory allocation.

Are you sure? My benchmarks indicate otherwise, and for larger grids and seed counts something like

        Threads.@threads for i in 1:4
            if all(size(sub_grids[i]) .> 0)
                sub_seeds = map(seed -> seed .- offsets[i], seeds)
                dac!(sub_grids[i], sub_seeds)
            end
        end

starts to show speedup.

For future reference: after some offline discussion with @jacobusmmsmit we came up with three implementations:

  • jump flood algorithm
  • divide and conquer from here (thanks to @Ralph_Smith)
  • divide and conquer reducing seeds
function distance(p, q)
    @fastmath sqrt((p[1] - q[1])^2 + (p[2] - q[2])^2)
end
function dacx_aux!(grid, seeds, depth)
    if all(size(grid) .> 0)
        center = size(grid) .÷ 2 .+ size(grid) .% 2
        # A seed can only own a pixel of this sub-grid if its distance to the
        # centre is at most the current minimum plus the grid diagonal.
        min_dist = findmin(seed -> distance(center, seed[2]), seeds)[1]
        seeds_dist = filter(seed -> distance(center, seed[2]) <= min_dist + distance((0, 0), size(grid)), seeds)
        dacx!(grid, seeds_dist, depth - 1)
    end
    grid
end
function dacx!(grid, seeds, depth=1)
    N, M = size(grid)
    corners = ((1, 1), (N, 1), (1, M), (N, M))
    nearest_colors = map(corner -> findmin(seed -> distance(corner, seed[2]), seeds)[2], corners)
    color = seeds[nearest_colors[1]][1]
    if all(nearest_color -> seeds[nearest_color][1] == color, nearest_colors[2:end])
        grid .= convert(eltype(grid), color)
    else
        Nd = N ÷ 2 + N % 2
        Md = M ÷ 2 + M % 2
        offsets = ((0, 0), (Nd, 0), (0, Md), (Nd, Md))
        tl_grid = @view grid[1:Nd, 1:Md]
        bl_grid = @view grid[Nd+1:N, 1:Md]
        tr_grid = @view grid[1:Nd, Md+1:M]
        br_grid = @view grid[Nd+1:N, Md+1:M]
        sub_grids = (tl_grid, bl_grid, tr_grid, br_grid)
        if depth > 0
            Threads.@threads for i in 1:4
                thread_seeds = map(seed -> (seed[1], seed[2] .- offsets[i]), seeds)
                dacx_aux!(sub_grids[i], thread_seeds, depth - 1)
            end
        else
            for i in 1:4
                for j in 1:length(seeds)
                    seeds[j] = (seeds[j][1], seeds[j][2] .- offsets[i])
                end
                dacx_aux!(sub_grids[i], seeds, depth - 1)
                for j in 1:length(seeds)
                    seeds[j] = (seeds[j][1], seeds[j][2] .+ offsets[i])
                end
            end
        end
    end
    grid
end

with the benchmark

@btime jfa!(grid, seeds) setup=(
    Random.seed!(42);
    grid = zeros(Int, 1000, 1000);
    seeds = [(rand(1:size(grid, 1)), rand(1:size(grid, 2))) for i in 1:100])

@btime dac!(grid, seeds) setup=(
    Random.seed!(42);
    grid = zeros(Int, 1000, 1000);
    seeds = [(rand(1:size(grid, 1)), rand(1:size(grid, 2))) for i in 1:100])

@btime dacx!(grid, seeds) setup=(
    Random.seed!(42);
    grid = zeros(Int, 1000, 1000);
    seeds = [(i, (rand(1:size(grid, 1)), rand(1:size(grid, 2)))) for i in 1:100]) 

yielding

  252.239 ms (0 allocations: 0 bytes)
  30.039 ms (35 allocations: 11.77 KiB)
  2.787 ms (59807 allocations: 6.96 MiB)

The benchmark is unfair insofar as the divide and conquer algorithms are easy to parallelize, which we exploit here to use 4 cores instead of one.
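Note that Threads.@threads only helps if Julia itself was started with more than one thread, e.g.:

# start Julia with 4 threads (or set the JULIA_NUM_THREADS environment variable)
# $ julia --threads 4
Threads.nthreads()   # check how many threads are available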

Edit: reduced allocations…


It just occurred to me that it makes a lot of sense to supply a benchmark for the baseline naive implementation:

naive(grid, sites) = map(
    cell -> findmin(seed -> distance(cell, seed), sites)[2],
    CartesianIndices(grid)
)

which is pretty cute as a one-liner but unfortunately gets quickly overtaken by the more sophisticated algorithms:

Setup: gridsize = (200, 200), nseeds = 100
Results:
"Naive" -> 2.335 ms (2 allocations: 31.33 KiB)
"JFA" -> 1.090 ms (0 allocations: 0 bytes)
"DACX" -> 522.026 μs (6492 allocations: 1.61 MiB)

See DiscreteVoronoi.jl for fast Julia implementations of a variety of methods.