Primal-Dual Hybrid Gradient for L1 regularization

Greetings everyone,

I was looking for a way to solve the following problem using julia:
min_{f\geq0} \; ||Kf-s||_2^2 +\alpha ||f||_1

There’s this paper (section 2.2 specifically), where the authors use a Primal-Dual Hybrid Gradient algorithm to solve it.

It seems that this package: FirstOrderLP.jl, would be able to handle these problems, however, I’ve got no idea how to use it, and can’t seem to find any detailed documentation for it.

Any insights or resources would be appreciated!

2 Likes

If you have an MPS file, the suggested usage is: GitHub - google-research/FirstOrderLp.jl: Experimental first-order solvers for linear and quadratic programming.

There isn’t a JuMP interface for FirstOrderLP.jl.

If you just want to solve the problem, you could use:

using JuMP, Ipopt
m, n = 20, 10
K, s, a = rand(m, n), rand(m), 0.5
model = Model(Ipopt.Optimizer)
@variable(model, f[1:n] >= 0)
@objective(model, Min, sum((K * f .- s).^2) + a * sum(abs.(f)))
optimize!(model)
@assert is_solved_and_feasible(model)
value.(f)

But a better formulation is probably:

model = Model(Ipopt.Optimizer)
@variables(model, begin
    f[1:n] >= 0
    residuals[1:m]
    l1_terms[1:n] >= 0
end)
@constraints(model, begin
    residuals .== K * f .- s
    l1_terms .>= f
    l1_terms .>= -f
end)
@objective(model, Min, sum(residuals.^2) + a * sum(l1_terms))
optimize!(model)
@assert is_solved_and_feasible(model)
value.(f)
2 Likes

Thank you very much for your reply, @odow!

Unfortunately, Ipopt doesn’t seem to converge in cases where K is large (10<m<60 , n\approx10,000 ).
The first formulation in your post does not even begin to run, and the second just keeps going on forever.
Not sure whether there’s another open-source solver in JuMP able to handle matrices of this size.

Would there be any “standard” ways of writing MPS files through julia?
My goal is to implement this solver in a package I’m working on, so I’d like to keep things as streamlined as possible.

While it is mainly designed for nonlinear domains (that is costs defined on Riemannian Manifolds) – Manopt.jl does have a Chambolle-Pock algorithm Chambolle-Pock · Manopt.jl – which is closely related to PDHGMP, see this paper; the caveat here is that it seems to have the K in the L1 term and not in the data term (i.e. we use a variable g = Kf and have a corresponding term in the L1). So with that it might also not fit too well, especially because the algorithm there is a bit more technical (since it is more general) and for you a few things simplify.

But maybe looking also for a Chambolle-Pock implementation might help as well as an idea.

1 Like

FirstOrderLp is specialized for linear programming problems; it cannot handle the L1 regularization problem. ProximalAlgorithms.jl might be another place to look.

For a textbook version of PDHG, I suspect most people write it themselves because it’s about 10 lines of code.

1 Like

Thank you both for your replies!

I reckon ProximalAlgorithms.jl might be the way to go here.
Perhaps something like the following:

using ProximalAlgorithms, ProximalOperators, LinearAlgebra

K = rand(100, 100)
s = rand(100)
α = 1

function cost(f)
    r = K * f - s
    return  dot(r, r)
end

reg = ProximalOperators.NormL1(α)

ffb = ProximalAlgorithms.FastForwardBackward()

solution, iterations = ffb(x0=zeros(size(K,2)), f=cost, g=reg)

However, I’m not sure how the f \geq 0 would be implemented, any ideas?

Also, this uses the FastForwardBackward() algorithm instead of ChambollePock(). I’m not sure how to write the primal-dual form of the problem using the ProximalAlgorithms.jl notation.

1 Like

Update:
As @miles.lubin suggested, writing the code down is about 10 lines long:

using LinearAlgebra

function PDHGM(K, s, α; tol=1e-6, τ=0.1, σ=0.1)

    B = inv(I + τ * α * K' * K)
    Y = zeros(size(K, 2))
    Ỹ = deepcopy(Y)
    f = ones(size(K, 2))
    f̃ = deepcopy(f)
    f_prev = deepcopy(f)
    ε = tol + 1

    while ε > tol

        Ỹ .= Y + σ * f̃
        Y .= Ỹ ./ max.(1, abs.(Ỹ))
        f .= B * (f̃ - τ .* Y + τ * α * K' * s)
        f .= max.(0, f)
        f̃ .= 2 .* f .- f_prev

        ε = norm(f - f_prev) / norm(f_prev)
        f_prev .= f
    end
    return f
end

f = PDHGM(rand(10, 100), rand(10), 1)

Basically just copied what’s reported here (paper cited in original post).
This seems to converge, can’t tell whether it’s accurate/efficient or not, suggestions welcome!

Please note that the above solves \frac{a}{2}||Kf-s||_2^2+||f||_1, so increasing/decreasing α would have the opposite effect compared to the original formulation.

1 Like

For both ProximalAlgorithms, it should be fine setting the cost of a negative f to +\infty if the is still something that troubles you.

but if your own one works fine, that should work well, I am not sure one can avoid the deepcopys, but you only do them in the beginning, so that should be fine as well.

1 Like

FastForwardBackward is more like PDHG, as far as I see, but both are quite similar anyways.

I’m a bit curious here, what would be the optimal way to implement this in code?


As for the deepcopies, it’s just a matter of initializing the parameters and allocating memory to them. Perhaps using Vector{Float64}(undef,n) might be slightly better here? Not sure whether it makes an actual difference or not.

Why not just copy?

Admittedly, that’s just a silly habit of mine.

Had some issues in the past where copy did not work as I would anticipate (within some nested local scopes), and replacing it by deepcopy “magically” fixed the issue. Never managed to figure out why, and I’ve been a bit reluctant to use copy() ever since.

Still find it difficult to fully understand the differences between the two. Would it be significant in this instance?

I think, for this kind of formulation of the problem a faster solution will be based on Coordinate Descent:

Concerning the “how” – maybe your proximal maps do that automatically, i.e. keeping the f nonnegative. However, if they do not, your gradient steps probably don’t, then you have to check.

So the cost might not be the main point here but computing the (inner) optimization problem of a constrained proximal map.
You can of course also just be lucky that f \geq 0 is in your cases automatically fulfilled. But that would really just be luck.

I was working on a blog post on deepcopy so I figured I’d put it up in case it’s helpful: Why you might avoid `deepcopy` in Julia | EPH.

How much accuracy do you need? If 6-8 figures will do the job, you might try replacing ||f||_1 with a smooth L1 norm

function sL1(x::AbstractVector{T}, mu = 1.e-7) where {T}
    snrm = sum(Cabs.(x, mu))
    return snrm
end

function Cabs(x, mu)
    p = 4.0 * mu * mu
    Cabs = sqrt(x * x + p)
    return Cabs
end

where Cabs a smoothed absolute value function. Setting mu=1.e-7 works for me. Once that’s done, using BFGS from Optimization.jl should perform very well.

1 Like

@RoyiAvital, thanks a lot for the link, these benchmarks are very useful!
Is there a standard way to implement the coordinate descent method?
There’s CoordinateDescent.jl, but I was wondering if there are any alternatives I should be aware of, before trying it.

No luck unfortunately! I’ll try to dig into the documentation of ProximalAlgorithms a bit more to see if there’s an obvious way to do it.

Thanks a lot for the link Eric!
In the present case, I can’t tell why copy would be better than deepcopy. After all, the copied vector should be a fully separate entity (afaik). I’m only using deepcopy as a lazy way to create a new vector of the same size. Please correct me if I’m wrong.

@ctkelley Thanks for the suggestion,but would you please elaborate on what’s the difference between these two norms?
Also, what would the code look like using your example?

They should have exactly the same results in this case, and therefore I’d skip deepcopy since it is slower and unnecessary here. For reviewers deepcopy can be an indication of other issues so it’s better to not trip their warning heuristics unnecessarily :slight_smile:. Or keep it, if you have been tripped up by aliasing issues before and it gives some sense of safety.

I just wanted to answer your question about what the differences are and why to use one or the other.

1 Like

Look at the MATLAB code on the link.
Converting it into Julia will be straight forward.
Julia will allow even better memory handling.

I don’t think a package is needed for this specific case.

1 Like

Understood, I’ll give it a go and post it here soon.
Just to clarify, would we be able to implement the f\geq0 constraint in this case?

All clear, thanks again!

1 Like