Optimizing code for NMF (Non-negative Matrix Factorization)

Problem

I am trying to implemented majorization-minimization (MM) algorithm to solve the NMF problem

\min_{\mathbf{V},\mathbf{W}}\Vert \mathbf{X}-\mathbf{VW}\Vert_F^2=\sum_i\sum_j(x_{ij}-\sum_k v_{ik}w_{kj})^2

The optimal \mathbf{V} and \mathbf{W} are found using iterative updates

\begin{array}{c}{v_{i k}^{(t+1)}=v_{i k}^{(t)} \frac{\sum_{j} x_{i j} w_{k j}^{(t)}}{\sum_{j} b_{i j}^{(t)} w_{k j}^{(t)}}, \quad \text { where } b_{i j}^{(t)}=\sum_{k} v_{i k}^{(t)} w_{k j}^{(t)}} \\ {w_{k j}^{(t+1)}=w_{k j}^{(t)} \frac{\sum_{i} x_{i j} v_{i k}^{(t+1)}}{\sum_{i} b_{i j}^{(t+1 / 2)} v_{i k}^{(t+1)}}, \quad \text { where } b_{i j}^{(t+1 / 2)}=\sum_{k} v_{i k}^{(t+1)} w_{k j}^{(t)}}\end{array}

with termination criterion

\frac{\left|L^{(t+1)}-L^{(t)}\right|}{\left|L^{(t)}\right|+1} \leq 10^{-4}

The implementation is fairly easy and it could indeed converge but turns out to be extremely slow and memory-consuming for a moderate scale problem.

Note, in the following, V and W sent into the function are initial guesses and X is the matrix we are trying to approximate using MM algorithm. r is the hyperparameter used to control the granularity of approximation.

using LinearAlgebra

function nnmf(X::Matrix{T}, 
              r::Integer;
              maxiter::Integer=1000, 
              tol::Number=1e-4,
              V::Matrix{T}=rand(T, size(X, 1), r),
              W::Matrix{T}=rand(T, r, size(X, 2))) where T <: AbstractFloat
    L = 0
    row, col = size(X)
    for iter in 1:maxiter
        # Step 1
        b = V * W
        # Step 2
        V_new = copy(V)
        for k in 1:r
            for i in 1:row
                weight = (X[i, :]' * W[k, :]) / (b[i, :]' * W[k, :])
                V_new[i, k] *= weight
            end
        end
        # Step 3
        b_new = V_new * W
        # Step 4
        W_new = copy(W)
        for j in 1:col
            for k in 1:r
                weight = (X[:, j]' * V_new[:, k]) / (b_new[:, j]' * V_new[:, k])
                W_new[k, j] *= weight
            end
        end
        # update V, W
        V, W = V_new, W_new;
        L_new = norm(X - V * W)^2
        rel_diff = abs(L_new - L) / (abs(L) + 1)
        if rel_diff <= tol
            break
        end
        # update L
        L = L_new
        println("Iteration: $iter, Relative Difference: $rel_diff")
    end
    return V, W
end

I am pretty new to programming in Julia, could someone if anything above could be improved?

Edit

Thank you for all of your advice, I followed them and get significant performance increase in terms of time (from 150+s to about 7s). At the same time, the code becomes elegant when as much as matrix multiplication is use. However, the memory allocation is still concerning (8.25GB).

I read the performance tips section in official doc. It seems that pre-allocating memory could help. But I am not sure how to do.

Could anyone provide some pointers? Thank you in advance.
image

function nnmf(X::Matrix{T}, 
              r::Integer;
              maxiter::Integer=1000, 
              tol::Number=1e-4,
              V::Matrix{T}=rand(T, size(X, 1), r),
              W::Matrix{T}=rand(T, r, size(X, 2))) where T <: AbstractFloat
    L = 0
    row, col = size(X)
    V_new = zeros(eltype(V), row, r)
    W_new = zeros(eltype(W), r, col)
    for iter in 1:maxiter  
        # Step 0: evaluate last step's result
        res = V * W
        L_new = norm(X - res)^2
        rel_diff = abs(L_new - L) / (abs(L) + 1)
        if rel_diff <= tol
            break
        end
        # update L
        L = L_new
        # Step 1      
        copyto!(V_new, V)
        copyto!(V_new, V_new .* ((X * W') ./ (res * W')))
        # Step 2
        copyto!(W_new, W)
        copyto!(W_new, W_new .* ((X' * V_new) ./ (W' * V_new' * V_new))')
        # Step 3
        # update V, W
        copyto!(V, V_new)
        copyto!(W, W_new)
    end
    return V, W
end


By decreasing order of importance:

Read the “performance tips” section in the manual

Profile your code. See what takes time, don’t guess. Redo this after every modification to your code.

Make sure your code is fine by @code_warntype

Express as many operations as you can as matrix-vector or (better) matrix-matrix multiplications to take advantage of optimized BLAS.

Reuse memory by using in-place operations (eg mul!), and avoid copies of slices with @views.

2 Likes

@views will probably help for the array slicing operations.

What’s the size of X? Is most of the time just being spent in println? What’s a realistic benchmark that anybody could copy-paste and run? Something like

using BenchmarkTools
@btime nnmf(X, r) setup = begin
    X = rand(3, 3)
    r = 10
end

in addition to your current code.

It might be worthwhile to keep the transposed matrices as work buffers and do the copying of data between them between each loop. This way you can always slice into contiguous memory with views. The copying to already allocated memory is cheap if done in one go.

See copyto!

2 Likes

In this particular case I think the largest gain is to be achieved by recognizing matrix-matrix multiplies (eg sum over j of xij wkj is X*W’). It also simplifies the code. Note that things like A = B'*C or mul!(A,B',C) dispatch to an optimized BLAS implementation (in the second case, with no memory allocation)

1 Like

Implementation aside, it’s worth pointing out that the particular algorithm you’ve chosen is itself fairly slow to converge. One of the more popular “fast” alternatives is described in Cichocki, Andrzej, Rafal Zdunek, and Shun-ichi Amari. “Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization.” International Conference on Independent Component Analysis and Signal Separation . Springer, Berlin, Heidelberg, 2007. See also Julia packages https://github.com/JuliaStats/NMF.jl and https://github.com/madeleineudell/LowRankModels.jl.

3 Likes

The dataset is a gray-scale face image dataset of size 2429 \times 391, i.e. \mathbf{X}\in \mathbb{R}^{2429 \times 391} where each row represents a 19\times 19 image.

There does exist a Julia package called NMF.jl that could be used for comparison purpose. But MM algorithm does not seem to be implemented there.

I didn’t check carefully, but isn’t your algorithm the same as the multiplicative update rule?

Funny to see my homework assignment here!
http://hua-zhou.github.io/teaching/biostatm280-2019spring/hw/hw2/hw02.html
Students seem to be working hard to improve their Julia code :joy:

2 Likes

Yes it’s exactly the multiplicative update rule. The algorithm can be derived from the generic majorization-minimization principle, which is a topic in the later part of the course. That’s why it’s called the majorization-minimization (MM) algorithm here.

1 Like