Calculating convex hulls through LP

Hello, I am currently working on a problem where I need to calculate multiple nested convex hulls (peeling a large dataset layer by layer), as the following image shows:

Two things to note:

  • I don’t need adjacency information; simply knowing the vertices is enough
  • Obviously this example has two dimensions, but I’m working with machine learning datasets with higher dimensions

Originally, I was using the N-dimensional quick hull algorithm implemented in LazySets.jl, but recently I’ve been trying to tackle this problem through linear programming.
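
For context, the core call the function below wraps is LazySets.convex_hull, which takes a vector of points and returns only those points that are hull vertices. A tiny 2-D illustration with made-up points:

using LazySets

points = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
LazySets.convex_hull(points)   # returns the four corners; the interior point [0.5, 0.5] is dropped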

For reference, here is my solution using the Quick Hull algorithm:

using Polyhedra, LazySets

function calculate_convex_hulls_qh(X)
    N, D = size(X)
    
    Xpoints = [X[i,:] for i in 1:N]
    Xindices = [i for i in eachindex(Xpoints)]
    n_hulls = 1

    Xhullindices = Vector{Vector{Int64}}()

    println("Calculating first hull...")

    # first convex hull
    hull = LazySets.convex_hull(Xpoints)
    n_hulls += 1

    # index mapping the points
    hull_indices = [Xindices[findfirst(i -> Xpoints[i] == hull[hi], Xindices)] for hi in eachindex(hull)]
    push!(Xhullindices, hull_indices)

    # remove already added indices from index list, create new list with remaining points
    # original list of point is needed for the index mapping
    Xindices = [i for i in Xindices if !(i in hull_indices)]
    X_remaining = Xpoints[Xindices]

    # while there are more remaining points than D
    # at least D+1 points are necessary for the hulls
    while length(Xindices) > D
        # create another hull
        println("Calculating hull $(n_hulls)...")
        hull = LazySets.convex_hull(X_remaining)

        # index map
        hull_indices = [Xindices[findfirst(i -> Xpoints[i] == hull[hi], Xindices)] for hi in eachindex(hull)]
        push!(Xhullindices, hull_indices)

        # remove the points from the remaining set and index set
        Xindices = [i for i in Xindices if !(i in hull_indices)]
        X_remaining = Xpoints[Xindices]

        n_hulls += 1
    end

    # if there are any remaining points add them to the last set
    if length(Xindices) > 0
        println("Adding remaining points...")
        append!(Xhullindices[end], Xindices)
    end

    # return indices of each hull and indices which are not vertices (inside last set)
    return Xhullindices, Xindices
end
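
A quick usage sketch of the interface, with random data just for illustration (X is an N×D matrix, one point per row):

X = rand(500, 2)
hulls, non_vertices = calculate_convex_hulls_qh(X)
println("found $(length(hulls)) hulls")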

I’ve tried two LP solutions so far:

Solution 1

The first one is simply going through every point and seeing if it can be formed as the convex combination of every point in the set:

\min_{\alpha \geq 0} \; \alpha_i
s.t.
\sum_{j=1}^{N} \alpha_j = 1
X^T \alpha = x_i

When checking the point x_i, I try to minimize its own weight \alpha_i in the convex combination. If the optimal objective value is above zero, x_i cannot be written as a convex combination of the other points alone, so it must be a vertex.

Then, I simply go through every point, update the right hand side of the constraint with x_i and mark the vertices. This way, I have every vertex of the first hull.

After that, I fix the contributions \alpha_i of every found vertex at zero, check the inner points again, mark the vertices of the second hull, etc. until I have every hull.
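
To make the check concrete, here is a minimal standalone sketch of testing a single point (the is_vertex name and the explicit tolerance are just for illustration, not part of the tuned version below):

using JuMP, GLPK

# minimal sketch: is the i-th row of X a vertex of the convex hull of all rows?
function is_vertex(X, i; tol = 1e-8)
    N = size(X, 1)
    model = Model(GLPK.Optimizer)
    set_silent(model)
    @variable(model, α[1:N] >= 0)                       # convex combination weights
    @constraint(model, sum(α) == 1)
    @constraint(model, permutedims(X) * α .== X[i, :])  # the combination must reproduce x_i
    @objective(model, Min, α[i])                        # minimize the point's own weight
    optimize!(model)
    return objective_value(model) > tol                 # positive weight needed ⇒ vertex
end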

Below is my implementation for this solution:

using JuMP, GLPK

function calculate_convex_hulls_lp_v1(X)
    N, D = size(X)

    # building initial model
    model = JuMP.Model(GLPK.Optimizer)
    set_silent(model)

    # list of vector of indices for each hull (outer -> inner)
    hulls_idx_vector = Vector{Vector{Int64}}()
    free_points = ones(N) # indices free = 1; indices that are in a hull = 0

    # general convex combination variables
    @variable(model, α[1:N] >= 0)
    @constraint(model, sum(α) == 1)

    Xtp = permutedims(X)
    @constraint(model, cc,  Xtp*α .== view(X,1,:))

    # until we can't make more hulls
    n_hulls = 1
    first_free_point = 1
    while true
        # going through points to make a hull
        hull_indices = Vector{Int64}()
        println("Hull $(n_hulls)...")
        
        for i in first_free_point:N
            # skipping points already in a hull
            if (free_points[i] == 0) continue end

            # set point constraint
            set_normalized_rhs.(cc,view(X,i,:))
            @objective(model, Min, α[i])

            # solving point
            JuMP.optimize!(model)

            if (objective_value(model) > 0)
                # point is vertex
                push!(hull_indices, i)
                free_points[i] = 0
            end
        end

        # quit if there are too few points in this hull (no progress)
        if length(hull_indices) <= D
            println("No more hulls can be formed")
            free_points[hull_indices] .= 1
            break
        end

        # storing hull
        push!(hulls_idx_vector, hull_indices)
        n_hulls += 1

        # updating first free point
        first_free_point = findfirst(==(1), free_points)

        # locking points in hull to 0 (can't be used)
        fix.(α[hull_indices], 0; force=true)

        # quit if there are no points left (it's over)
        if isnothing(first_free_point) println("All points are in hulls"); break end
    end

    # create list of non vertices and add to last hull
    non_vertex_points = [i for i in 1:N if free_points[i] == 1]
    append!(hulls_idx_vector[end], non_vertex_points)

    return hulls_idx_vector, non_vertex_points
end

Solution 2

The second solution is based on the same convex combination problem, but it attempts to work with fewer variables by going through the points in the following way:

  • Start with every convex combination variable \alpha fixed at 0
  • Unfix the variable \alpha_i for the point i you’re going to check
  • Solve the optimization problem
    • If the objective value is above zero, mark the point i as ‘vertex candidate’
    • If it is zero, fix the variable \alpha_i back to 0

This gives a group of ‘vertex candidates’. Every true vertex is definitely marked, but so are some extra points that aren’t vertices (the first point checked, for example, always gets in). We can then go through the candidates again and recheck them to find the true vertices, this time leaving only the other vertex candidates unfixed.
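
Since this relies on JuMP’s fix/unfix machinery, here is the pattern in isolation (a toy three-variable model, just to show the calls used below):

using JuMP, GLPK

# toy model illustrating the fix/unfix bookkeeping
m = Model(GLPK.Optimizer)
@variable(m, a[1:3] == 0)    # every weight starts fixed at 0
unfix(a[2])                  # free the weight of the point being checked
set_lower_bound(a[2], 0)     # fixing removed the bound, so restore a[2] ≥ 0
# ... solve and inspect; if the point turns out not to be a candidate:
fix(a[2], 0; force = true)   # lock it at 0 again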

Below is my implementation for this algorithm:

using JuMP, GLPK
using LinearAlgebra, Statistics  # for norm and mean

function calculate_convex_hulls_lp_v2(X)
    N, D = size(X)

    # ranking points (to go through them in a 'smarter' way, outermost points are looked at first)
    Xavg = vec(mean(X, dims=1))
    Xdistances = [norm(X[i, :] - Xavg) for i in 1:size(X, 1)]
    Xranked = sortperm(Xdistances;rev=true)

    # building initial model
    model = JuMP.Model(GLPK.Optimizer)
    set_silent(model)

    # list of vector of indices for each hull (outer -> inner)
    hulls_idx_vector = Vector{Vector{Int64}}()
    free_points = ones(N) # indices free = 1; indices that are in a hull = 0

    # general convex combination variables
    # start with all of them locked at 0
    @variable(model, α[1:N] == 0)
    @constraint(model, sum(α) == 1)

    Xtp = permutedims(X)
    @constraint(model, cc,  Xtp*α .== view(X,1,:))

    # until we can't make more hulls
    n_hulls = 1
    first_free_ranked_point = 1
    while true
        println("Hull $(n_hulls)...")

        # first pass = going through points to get candidates
        vertex_candidates = Vector{Int64}()
        
        for i_rank in first_free_ranked_point:N
            # skipping points already in a hull
            i = Xranked[i_rank]
            if (free_points[i] == 0) continue end

            # set point constraint and objective
            set_normalized_rhs.(cc,view(X,i,:))
            @objective(model, Min, α[i])

            # solving point
            JuMP.optimize!(model)
            if (!is_solved_and_feasible(model))
                # point is vertex candidate
                # free it up for use
                push!(vertex_candidates, i)
                unfix(α[i])
                set_lower_bound(α[i], 0)
            end
        end

        # second pass = go through candidates to find true vertices
        hull_indices = Vector{Int64}()
        for i in vertex_candidates
            # set point constraint and objective
            set_normalized_rhs.(cc,view(X,i,:))
            @objective(model, Min, α[i])

            # solving point
            JuMP.optimize!(model)
            if (objective_value(model) > 0)
                # point is vertex
                push!(hull_indices, i)
                free_points[i] = 0
            else
                # point is not vertex (fix it)
                fix(α[i], 0; force=true)
            end
        end

        # quit if there are too few points in this hull (no progress)
        if length(hull_indices) <= D
            println("No more hulls can be formed")
            free_points[hull_indices] .= 1
            break
        end

        # storing hull
        push!(hulls_idx_vector, hull_indices)
        n_hulls += 1

        # updating first free point
        first_free_ranked_point = findfirst(i -> free_points[i] == 1, Xranked)

        # locking points in hull back to 0 (can't be used)
        fix.(α[hull_indices], 0; force=true)

        # quit if there are no points left (it's over)
        if isnothing(first_free_ranked_point) println("All points are in hulls"); break end
    end

    # create list of non vertices and add to last hull
    non_vertex_points = [i for i in 1:N if free_points[i] == 1]
    append!(hulls_idx_vector[end], non_vertex_points)

    return hulls_idx_vector, non_vertex_points
end

The question

Basically, I’m looking into ways of further optimizing this problem, since it can take quite a while to compute on large datasets.

I have benchmarked these solutions using the UCI Online News Popularity dataset, and here are some things I’ve already noted:

  • The second solution tends to win when there are a lot of points and not that many dimensions, but it gets really slow when the number of dimensions increases. Some reference times (N points, D dimensions):

    • N=2000, D=5: First solution took 4 s, second took 2.6 s (QuickHull = 3.1 s)

    • N=4000, D=5: First solution took 23.7 s, second took 13.2 s (QuickHull = 16.8 s)

    • N=2000, D=30: First solution took 15.4 s, second took 28.1 s (QuickHull = 20.5 s)

    • N=4000, D=30: First solution took 70.3 s, second took 114.9 s (QuickHull = 88.1 s)

  • GLPK beats HiGHS pretty consistently. This is curious to me since, from what I’ve read, HiGHS is widely considered to be the better linear solver, so I wonder if something in how I’m using it can be changed to get better results out of it. For reference, here are the results using HiGHS for the first solution:

    • N=2000, D=5: 9.4 s (GLPK = 4 s)

    • N=4000, D=5: 44.4 s (GLPK = 23.7 s)

    • N=2000, D=30: 17.7 s (GLPK = 15.4 s)

    • N=4000, D=30: 78.4 s (GLPK = 70.3 s)

While this certainly wasn’t a very precise benchmark and I’d take the exact times with a grain of salt, these patterns have held up across quite a few runs of the test.

Anyway, I’d appreciate any suggestions on how to tune these current solutions (or any new approach I could check out). Thanks!


Hi @joaoqcouto, welcome to the forum :smile:

The JuMP code looks reasonable. I don’t have much to add. HiGHS in general should be more efficient than GLPK. But these are very small LPs solved many times, so maybe HiGHS is doing something that typically benefits larger models while GLPK just solves them directly.

It’s hard to know without a log from HiGHS or the exact dataset.
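
For example, something along these lines keeps the HiGHS log visible and turns off presolve, which may or may not help for LPs this small (just a sketch):

using JuMP, HiGHS

# sketch: keep the HiGHS log so you can see where the time goes per solve,
# and try disabling presolve since the LPs are tiny and solved many times
model = JuMP.Model(HiGHS.Optimizer)
# (don't call set_silent, or use unset_silent(model) to re-enable the log)
set_optimizer_attribute(model, "presolve", "off")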