[Convex.jl] Objective function of matrix factorization with missing data

I’m trying to solve a matrix factorization problem with alternating minimization. My data matrix has missing entries, and because of that I cannot simply write my objective function as:

# A is an n x m data matrix
n, m = size(A)
k = 2
X = Variable(n, k)
Y = Variable(k, m)

problem = minimize(sumsquares(A - X*Y), (X*Y)>=0)

I have come up with two ways to address this issue, but one is very slow and uses too much memory, while the other gives NaN results in a lot of experiments. I describe both ways of writing the objective function below:

Approach 1
objective = 0
for o in observations  
    objective += square(A[o...] - (X*Y)[o...])
end

where observations is an Array of tuples, each tuple containing the indices of an entry in matrix A that was actually observed, i.e., non-missing.
This approach uses all the RAM on my computer and is really slow both when calling problem = minimize(objective, constraints) and every time solve!(problem, SCSSolver()) is called.
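For context, such an array can be built along these lines (a sketch; findall plus Tuple is just one way to do it):

# Collect the (i, j) index tuples of all non-missing entries of A.
observations = [Tuple(I) for I in findall(!ismissing, A)]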

Approach 2
B = coalesce.(A, 0.0) # replace missing entries with 0.0 for .* to work
objective = sumsquares(M .* B - M .* (X*Y)) 

where M is a BitArray{2}, with 1s representing observed values and 0s the missing ones.
This approach is faster, I suppose because the DCP problem tree has significantly fewer branches. The issue is that running my program with this approach, using the same data and random seed, often gives me a Y matrix of NaNs or an overall bad approximation.
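For reference, a basic version of the mask can be built directly from A (the one I actually use is a bit more involved, as shown further down):

M = .!ismissing.(A)   # 1 where A was observed, 0 where it is missing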

Does anyone know a better way of writing this problem using Convex.jl?

Could you say what the types of X and Y are? Is X*Y a matrix? And which one is a variable?

Both are variables and their multiplication is a matrix, but since I’m using alternating minimization, one is fixed while the other is being optimized at its respective iteration step. My code is more or less like the example in the documentation: Fixing-and-freeing-variables
The difference is that I’m constraining X*Y to be positive, while in the example both have to be positive individually.
I’m also using regularizers that I omitted, since I thought that part was not relevant for the question:

objective += λ1 * sumsquares(X) + λ2 * norm(vec(Y), 1)
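To give a bit more context, the loop is basically the fix!/free! pattern from that documentation example. Roughly something like this (a sketch, not my exact code; it reuses B, M, λ1 and λ2 from above, and the initial guess and number of outer passes are arbitrary):

# Sketch of the alternating scheme, reusing B and M from Approach 2.
fix!(Y, rand(k, m))                       # arbitrary initial guess for Y, held fixed
objective = sumsquares(M .* B - M .* (X * Y)) +
            λ1 * sumsquares(X) + λ2 * norm(vec(Y), 1)
problem = minimize(objective, (X * Y) >= 0)
for iter in 1:10                          # arbitrary number of outer passes
    solve!(problem, SCSSolver())          # solve for X while Y is fixed
    fix!(X); free!(Y)
    solve!(problem, SCSSolver())          # solve for Y while X is fixed
    free!(X); fix!(Y)
end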

Another way of stating my main question is: what is the optimal way of writing an objective function in Convex.jl when there is missing data, given that logical indexing is not supported?

Ok, thanks for clarifying :slight_smile:. I am not sure, I haven’t dealt with missing data before. If we find a good solution in this thread we could try to upstream it to Convex.jl to do the transformation automatically.

The scalar-indexing way of doing it in Approach 1 will be slow for Convex.jl, as you’ve seen; Approach 2 looks reasonable to me, and it also looks like it should be doing exactly the same thing as Approach 1, so I don’t really understand why it gives you worse results. It seems like the entries that don’t appear in the objective should just get set to zero in the optimization, due to your regularization and the constraints.

Does it help if you constrain them to be zero explicitly? E.g. (I - M).*(X*Y) == 0? In other words, that constraint says that if an entry of the mask isn’t 1, the corresponding element of X*Y must be zero.

edit: fix since it’s matrix mult.
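Concretely, I mean something like this (a sketch, reusing the M from your Approach 2, so 1 .- M is the complement of the mask):

# Extra constraint: entries of X*Y outside the mask are forced to be exactly zero.
constraints = [(X * Y) >= 0, (1 .- M) .* (X * Y) == 0]
problem = minimize(objective, constraints)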

The idea of this problem is for X*Y to “predict” the values that are missing in A, so I don’t really want to enforce that constraint.

I’m trying to achieve something like what is done in the package LowRankModels, but I need to constrain (X*Y) >= 0, which I believe is not currently possible with what they offer.

Ah I see, makes sense. My thought about setting them to zero doesn’t make sense actually.

I still don’t really see why your two approaches would give different results. Are you sure they do? Or do you see why they might?

I’m not sure :frowning:, but testing the first one is really time consuming.

Can you try with smaller examples? (maybe fake data?)
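For example, something along these lines would give a small nonnegative low-rank matrix with some entries hidden (a sketch; the sizes, rank and missing fraction are arbitrary):

using Random
Random.seed!(0)
n, m, k_true = 20, 6, 2
A_full = rand(n, k_true) * rand(k_true, m)            # nonnegative low-rank ground truth
A = convert(Matrix{Union{Missing, Float64}}, A_full)
A[rand(n, m) .< 0.2] .= missing                       # hide roughly 20% of the entries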

I’m going to post here the results with a small matrix and no warmstart.


For data matrix A:

21×6 Array{Union{Missing, Float64},2}:
  96.0   9.6  49.0   missing   missing  0.0
 102.0  12.1  44.0   missing   missing  0.0
 122.0  15.6  35.0   missing   missing  0.0
 102.0  16.2  30.0   missing   missing  0.0
  93.0  13.7  32.0   missing   missing  0.0
  81.0  12.0  34.0   missing   missing  0.0
  73.0  10.5  36.0   missing   missing  0.0
  78.0  11.9  34.0   missing   missing  0.0
  70.0   9.9  35.0   missing   missing  0.0
  58.0   5.0  37.0  8.9       7.0       0.0
  67.0   0.0  33.0  8.3       6.5       0.0
  62.0   5.6  30.0  5.6       4.0       0.0
  73.0   7.2  29.0  5.3       3.8       0.0
  81.0  12.3  23.0  4.4       3.0       0.0
  90.0  21.6  15.0  4.7       2.8       0.0
 270.0  40.8   5.0  6.7       2.0       0.0
 287.0  44.8   5.0   missing   missing  0.0
 223.0  35.7   8.0   missing   missing  0.0
 197.0  31.1  10.0   missing   missing  0.0
 165.0  23.5  17.0   missing   missing  0.0
 131.0  16.4  26.0   missing   missing  0.0

The first approach gives me X*Y:

21×6 Array{Float64,2}:
  91.9092  10.1851   48.6055   10.3582   4.42137   0.0
  99.9766  12.0842   44.194     9.62449  3.98686   0.0
 120.623   14.7878   35.319     7.71501  3.21652   0.0
 102.943   14.4898   30.2468    7.09424  2.64197   0.0
  97.4936  12.6538   32.1014    7.1987   2.86985   0.0
  80.0891  11.0292   34.2599    7.81053  3.01237   0.0
  71.7516   9.81708  36.3294    8.21785  3.19767   0.0
  77.1258  10.8139   34.2516    7.84817  2.99995   0.0
  75.4579   9.61945  35.1918    7.79713  3.14042   0.0
  55.6936   5.81026  37.594     7.96936  3.41595   0.0
  61.7518  -4.87533  33.6647    4.12749  3.76301   0.0
  60.0875   6.36114  30.1838    6.34469  2.76851   0.0
  71.055    7.83318  29.0857    6.13347  2.67479   0.0
  80.8335  11.2827   22.909     5.36107  2.00633   0.0
  99.6884  28.3298   15.2408    7.51133  0.46556   0.0
 273.438   36.5091    4.66263   1.71697  0.489189  0.0
 288.905   38.5283   21.5737    5.45614  1.98183   0.0
 225.538   30.4491    7.96133   2.40695  0.744823  0.0
 200.076   26.8858   10.2254    2.79668  0.946108  0.0
 165.272   20.8932   17.1064    3.86819  1.62118   0.0
 131.29    15.6776   26.0399    5.5661   2.4367    0.0

And the second gives me X*Y:

21×6 Array{Float64,2}:
  95.7742  13.5202   49.1744   6.27984  5.50477  0.0
 101.952   14.6198   44.1296   6.03629  5.04362  0.0
 121.895   17.9682   35.1179   5.82428  4.27753  0.0
 102.408   15.0817   30.0102   4.93315  3.64402  0.0
  44.8543   6.08853  31.8195   3.63489  3.45117  0.0
  81.3184  11.6932   34.0355   4.72284  3.90735  0.0
  73.3056  10.3925   36.0425   4.68064  4.05485  0.0
  78.3754  11.2362   34.025    4.64832  3.88725  0.0
  32.1925   3.4427   34.9445   3.857    3.90078  0.0
  57.9713   6.59883  36.6976   5.03668  4.54108  0.0
  66.5154   2.67207  33.8019   7.65822  6.30199  0.0
  61.793    8.13665  29.2899   4.18252  3.55391  0.0
  73.1097   6.70507  28.7908   6.00971  4.77469  0.0
  81.4703   9.65239  22.6476   5.0037   3.6663   0.0
  91.4754  12.5285   14.7292   4.08452  2.53267  0.0
 269.829   41.7838    5.19122  7.16603  2.26238  0.0
 286.988   44.5498    1.59522  7.31184  2.01625  0.0
 223.213   34.4627    7.99651  6.22027  2.23929  0.0
 197.158   30.3583   10.0101   5.72682  2.27064  0.0
 164.848   25.1419   17.0826   5.4761   2.76404  0.0
 130.688   19.5834   26.1328   5.33518  3.44191  0.0

While the random seed is the same, I don’t know if both results should be identical here, since there are a lot of floating-point operations involved…

The issues start when matrix A is really big.

Huh, those do look fairly different; e.g., the [5,1] entry looks weird in the second case (44 instead of like 80). Are you sure the mask is correct? Also, shouldn’t just B be enough in the objective instead of M .* B? (Since B is already the masked version of A). I guess B ≈ M .* B could be a test to check your mask.

The mask also considers the validation entries, not just the missing values.

The mask is ok. I built it like this:

function train_mask(A, validation_tuples)
    mask = A .!== missing        # true wherever A has an observed value
    for v in validation_tuples
        mask[v...] = 0           # also hide the held-out validation entries
    end
    return mask
end

I think the main difference between the approaches is that in the first one, for the loss function, the solver completely ignores whatever is not in observations, while in the second those terms exist in the function but are always multiplied by 0. I don’t know if, with floats, this could affect things.

So ideally we would want something like the first approach but with the speed of the second. I tried logical indexing, but the package doesn’t support it.

I think that really shouldn’t affect things unless there are infinite terms. Internally, Convex.jl indexes by masking too, which is why it’s slow when you have to do it for each term in the matrix, instead of just once like in your Approach 2.

So I’m not really sure what’s wrong. Maybe with the large problems the solver isn’t converging to tolerance, or thinks the problem is infeasible, etc., causing the NaNs. That wouldn’t explain the difference in the small problem, though.

I would try to make the two solves as comparable as possible for the small problem, between Approach 1 and Approach 2. For example, I’d compare the results after just one iteration of your alternating procedure and try to figure out why they give different results; maybe see how many iterations the solver is doing in each case, etc. And double-check that the starting values are the same too.
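For example, something like this (a sketch; I think max_iters and eps are the relevant SCS options, but double-check the names for the SCS version you’re on):

# Turn on solver output and inspect the reported status after each solve.
solve!(problem, SCSSolver(verbose=true, max_iters=20000, eps=1e-6))
@show problem.status    # check whether this is :Optimal in both approaches
@show problem.optval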


Ok, I’m going to try that. Thank you for trying to help me!


Regarding your question before, the mask also considers the validation entries, so I really do need to do M .* B. I edited my previous reply to that quote because of that.

Ah, makes sense, I didn’t pick up on the validation bit.

The issue is that I was using sumsquares, as I was not aware of what is stated on this page: Eliminating quadratic forms in CVX.

However, after implementing in Matlab+CVX, Python+CVXPY and Julia+Convex.jl, I think that in Julia, having to perform

norm(vec(M .* A - M .* (X*Y)), 2)

vs

cp.Minimize(cp.norm(A[M] - cp.matmul(X, Y)[M]))

in Python, and vs

expression Z
Z = X*Y;
minimize(norm(A(M) - Z(M), 'fro'))

in Matlab, has a significant negative impact on memory consumption for Julia. To the point that, for the same data, I can run the algorithm using Matlab or Python on a computer with 8 GB of RAM, while being unable to do it using Julia on a 16 GB computer.
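In Julia, the objective then looks roughly like this (a sketch, using B, the coalesced copy of A from Approach 2, and the same regularizers as before; whether to also rewrite the sumsquares regularizer with a norm is a separate choice):

# Data-fit term written as a 2-norm instead of sumsquares, following the CVX page above.
objective = norm(vec(M .* B - M .* (X * Y)), 2) +
            λ1 * sumsquares(X) + λ2 * norm(vec(Y), 1)
problem = minimize(objective, (X * Y) >= 0)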
