Does the singular value degeneracy cause the gradient to be NAN?

I recently encountered a situation where the result of a computation was of type NAN. I got the following example after a step-by-step debug:


using Zygote
using OMEinsum 
using LinearAlgebra
function f(T)
    @ein M[ b, e, c, d ] := ( T[ a, b, u, f ] * T[ a, c, u, h ] ) * ( T[ l, e, f, m ] * T[ l, d, h, m ] )
    d = size( T, 1 )
    M = reshape( M, ( d^2, d^2 ) )
    M = ( M + M' ) / 2
    U, S,_ = svd( M )
    @show S
    @ein Coef[] := U[ i, i] 
    return Coef[]
end

function test()

    x0=0.5
    T1 = [0.9988894589661916 2.4141401009672692e-8 -9.21073527124551e-18; 2.414140092879679e-8 1.0113486732236356e-13 -3.254292770737505e-24; -2.4058145031123452e-19 2.4387329449624486e-24 1.002665275075667e-13;;; 2.2369956667600393e-8 0.01177751073439025 0.01177685069366975; 0.011777510734396866 5.692640873553409e-10 2.8460655019662047e-10; 0.011776850693676363 2.846065512560657e-10 -1.9093252656828917e-14;;; 8.318087750543311e-19 -0.011777510733978307 -0.011776850693259231; 0.011777510733971711 9.483301236635998e-19 -2.846256431019167e-10; 0.01177685069325263 2.846256439393433e-10 -1.1084366881805116e-19;;;; 2.2369956656976694e-8 0.011777510734401848 -0.011776850693658138; 0.011777510734408457 5.692640873559015e-10 -2.846065504135412e-10; -0.011776850693664753 -2.8460655126146057e-10 -1.909302979277114e-14;;; 3.130556155863574e-7 -4.861245076481914e-7 -3.8609241904943518e-19; -4.861245076487369e-7 0.0002777279464551575 2.735873352730558e-16; -3.85943379257726e-19 2.7370274942050075e-16 -0.000277696818233557;;; -1.4626012574453234e-21 4.863882701148023e-7 -2.637401479213294e-10; -4.863882701147824e-7 -1.557657977123615e-16 -0.000277712381910125; 2.6374014790165455e-10 0.00027771238191012494 1.5574236952443137e-16;;;; -9.693798775119823e-18 -0.011777510733989914 0.011776850693247628; 0.011777510733983297 9.48331430058368e-19 2.846256428844345e-10; -0.011776850693241015 -2.846256439333898e-10 1.0067199819365289e-19;;; -6.894583731466444e-21 4.863882701145528e-7 2.637401484443907e-10; -4.863882701147811e-7 -1.5594644388542407e-16 0.00027771238191012494; -2.637401486727373e-10 -0.000277712381910125 1.5592068003439878e-16;;; 3.130556136009777e-7 -4.866520174529157e-7 -1.3689486660478041e-19; -4.866520174525782e-7 -0.00027772794648245336 -2.734954860940402e-16; -3.8511372770911903e-19 -2.7374346680089633e-16 0.0002776968182138848]
    T2 = [0.9988894593101187 2.2939233802901028e-8 -9.693961756608732e-18; 2.293923381308961e-8 9.987325892856526e-14 2.2543638292181353e-24; 8.316496036966675e-19 -2.9423333181213116e-24 9.934646468007591e-14;;; -5.406181172607965e-10 0.01177717890675867 -0.011777178906347725; 0.011777178897777966 5.409024037548475e-10 -2.7044258700019684e-10; -0.011777178897367017 -2.704425865497314e-10 -1.7230186046708423e-14;;; -2.0216711167529834e-19 -0.011777178897358671 0.011777178896947715; 0.011777178906339374 5.778757151237687e-20 2.704598168745984e-10; -0.011777178905928417 -2.704598170281596e-10 9.577391798957861e-20;;;; -5.406181981389911e-10 0.011777178906765277 0.011777178906341113; 0.011777178897784579 5.409024037551513e-10 2.704425867714562e-10; 0.01177717889736042 2.7044258656919056e-10 -1.7230395024849602e-14;;; 1.0040895017914501e-13 -1.2767290128062186e-11 9.48493733078004e-19; -1.276729011831356e-11 0.00027771229659764916 -1.559763444041006e-16; 9.484927452570197e-19 -1.557336558187481e-16 -0.0002777122965779568;;; -2.225974728205143e-22 6.393253237300295e-12 -6.374037831597291e-12; -6.393253246930552e-12 2.1176987405798918e-13 0.0002777122965779157; 6.374037831690397e-12 -0.00027771229657791565 -2.1176976089970796e-13;;;; 8.767751521417996e-18 -0.01177717889736528 -0.0117771788969411; 0.011777178906345992 5.778772549179366e-20 -2.7045981710303523e-10; 0.01177717890592182 2.7045981700839744e-10 -1.5242568034864584e-19;;; -9.109651133721204e-21 6.393252389487262e-12 6.37403678243116e-12; -6.3932521876009254e-12 2.1176995330461678e-13 -0.00027771229657791576; -6.374036994040965e-12 0.0002777122965779157 -2.1177010845828176e-13;;; 1.004089309946328e-13 -1.92155038291017e-14 1.0067294462752412e-19; -1.92153018792467e-14 -0.0002777122965778749 1.559594509160548e-16; -1.1084349947529846e-19 1.557346227870929e-16 0.000277712296558182]
    # @show f(T1)
    @show gradient(f,T1)[1]
    # @show f(T2)
    @show gradient(f,T2)[1]
end

test()

The output is:

S = [0.9966733066799104, 0.000553884336002314, 0.0005538843359628737, 3.078119489781907e-7, 9.517013661198881e-14, 8.233208632182651e-20, 3.152181524404255e-20, 2.21461089491521e-20, 3.244767819833425e-21]
(gradient(f, T1))[1] = [0.5930936292861588 7624.065189342431 1923.250385191218; -0.008112132803168928 230688.20549618005 -1.0052478316457001e6; 0.023404018301786112 -1.0023995820129112e6 229874.507193745;;; 10.553837932255085 -8.356714691702897e6 -8.366949201460425e6; 1197.341668354167 -98466.40538184528 -104976.6490006776; 1211.9274984744588 -97231.75609906396 -106199.55496457085;;; -10.554681766014166 8.346222273536322e6 8.345128778829346e6; 8082.5080766834435 -106516.22428477029 -96925.69189064114; 8067.390969392194 -107669.12179149309 -95761.76577622129;;;; 7.781678518198429 8.334538750588702e6 -8.366146786695122e6; -12726.386662478204 -57205.113172444835 62112.68877471877; 12706.25462049855 58335.116218700015 -60976.290836456545;;; -2107.7418497581 3400.377447668149 807.9788817186554; 162.73594169763243 255832.81028894515 365745.04965628346; -57.28095383751074 366084.9237695921 256456.64622777177;;; 2048.776221317851 -3289.9263969847116 -694.2366713737454; -144.70786066088436 366071.46541845205 256132.3438346258; 74.40043057047265 256152.1051356172 
365763.55382267665;;;; -7.769985794854151 -8.344364602271023e6 8.370074122969149e6; 3446.234865028609 -61114.81018706875 58203.677269715816; -3426.620530473898 60021.65478200047 -59289.78009630316;;; 2057.1865482553208 -3517.7005815112448 -695.9658722909976; -149.42500941330974 -365977.8189552523 -256186.14348111482; 44.380215997682626 -256095.37475010278 -365843.398443275;;; -1998.2209269612583 3398.7281161764204 580.0189728918299; 131.3971439572214 -256442.73089178288 -365903.2879046907; -61.49905283810004 -365922.9529728504 -255833.7777953479]
S = [0.9966733077096799, 0.00055388416539868, 0.00055388416539868, 3.0781175853684797e-7, 9.517001850668072e-14, 1.691437704129239e-19, 1.361519953025284e-19, 5.109999329154659e-21, 1.6347206747950515e-21]
(gradient(f, T2))[1] = [NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN]

My judgment may be that the SVD decomposition get singular value degenerate, but it may not be. If so, how do I fix SVD? I don’t understand the underlying SVD implementation very well.

I haven’t used the libraries or really done anything related to differentiable programming in Julia. So you should perhaps take my comment with a grain of salt. But I do know that the one dimensional singular subspaces associated with distinct singular values become discontinuous as functions of the matrix elements when those singular values coalesce to give singular values of higher multiplicities. So in this example it looks like you are trying to compute the gradient of a discontinuous function. I’d expect that to fail just on mathematical grounds, without knowing the libraries or what you are trying to do.

In perturbation theory for the SVD, the usual way to address this is to look at higher dimensional singular subspaces associated with the cluster of equal singular values. That becomes continuous and differentiable again if the perturbations (or your parameterization) preserve the multiplicities. I don’t know if that helps you any. Perhaps not, since you have two examples with one in which the singular values are not as close. It doesn’t look like there is anything in your example that suggests the multiplicities are enforced in any way in your problem.

Yes, I realize that degeneracy of singular values can cause problems in taking derivatives.

So, what I’m really going to do is overload the SVD function and add a perturbation to it, right? How do I write this in code? I don’t know much about the underlying SVD implementation, and since I’m not that expert with Julia, I don’t know what to do.
Could you please show me how to do it? Thank you very much!

Yes, I realize that degeneracy of singular values can cause problems in taking derivatives.

It’s a very big problem: if you have repeated singular values, then the derivatives simply don’t exist. You can also expect the derivatives to become less accurate as you get close to a matrix with repeated singular values. It’s not really obvious to me what output would be better than NaN on your example. Any actual numbers would likely be meaningless and misleading. At least with NaN there’s no chance you will interpret it as a meaningful gradient. Perhaps a warning about nondifferentiability would be helpful.

So, what I’m really going to do is overload the SVD function and add a perturbation to it, right? How do I write this in code? I don’t know much about the underlying SVD implementation, and since I’m not that expert with Julia, I don’t know what to do. Could you please show me how to do it? Thank you very much!

I don’t think perturbing the matrix is likely to help, if I understand you correctly. You would just have gradients for a different matrix, which doesn’t seem like it would be helpful, especially given the likely sensitivity of the derivatives. My reference to perturbation theory was just because in that field differentiability of singular subspaces can be a concern and preserving multiplicities can work around it. Unless you know that the class of matrices you are working with preserves multiplicities, what I mentioned is not going to be applicable.

The following is kind of a standard take on how you should react to finding yourself needing the solution of an ill-conditioned problem in numerical linear algebra, so I’ll apologize in advance if it’s unhelpful or it looks condescending because you already know it. Also I don’t know anything about what you are really trying to do here, other than the fact that gradients are involved in some possibly intermediate step. With that said, my own thoughts on finding myself in this sort of situation are typically as follows:

If I find myself trying to compute something (e.g. your gradients) that might not exist or that might be highly ill-conditioned, that tells me I should take a step back and ask if I really need to do that and if there is a better way to approach whatever larger problem I am trying to solve. If my final goal really is to compute gradients and the problem doesn’t have some special structure that preserves multiplicities of singular values, I’d probably accept something like your program as the best thing possible given the nature the problem. If my final goal is something else and the gradients are just an intermediate step, I’d start looking for different approaches to the problem to see if I could find a way to solve the problem by computing only things that are mathematically guaranteed to exist.

By the way, at the risk of contradicting myself about accepting the program, there is one thing that worries me, specifically the trace of U. (I assume that’s what the @ein macro is doing; I haven’t used that before). U is not uniquely determined by M. You can get different traces depending on arbitrary sign choices in the SVD code. If you want a trace of U, to make it unique you should probably impose some choice of signs, even in the case in which the singular values are distinct. I would not trust the svd routine to do that for you. Usually, for this reason, looking at individual elements of U is not meaningful; instead when multiplicities are all one, it is standard to look at the one-dimensional singular subspaces spanned by each of the columns. The subspaces do not depend on an arbitrary choice of sign. With higher multiplicities, you look at higher dimensional spaces. I don’t know why you want a trace or if there’s a way to formulate your problem in terms of subspaces or not. But even without multiplicities, using U[i,i] directly looks like trouble to me.

For my problem, it makes sense to add a small perturbation. I don’t know why, but my tutor told me so.I have no idea how to implement it. I don’t know the underlying structure of SVD in Julia language

My problem is to find the first and second derivatives of a function that contains an unavoidable SVD operation.

Actually, this is just one example that I gave, and it’s probably not quite appropriate. What I really want is to do some operation with U to get a number, so it doesn’t have to calculate the trace of U

I haven’t read the entire discussion, but for differentiating linear algebra routines, there is the package

which provides pure Julia implementations of everything, so that Zygote can differentiate it.

See for example: Does svd() not support forwarddiff? - #3 by maxkapur

This one I had tried, the result will also be nan:

using Zygote
using OMEinsum 
using GenericLinearAlgebra: svd
function f(T)
    @ein M[ b, e, c, d ] := ( T[ a, b, u, f ] * T[ a, c, u, h ] ) * ( T[ l, e, f, m ] * T[ l, d, h, m ] )
    d = size( T, 1 )
    M = reshape( M, ( d^2, d^2 ) )
    M = ( M + M' ) / 2
    U, S,_ = svd( M )
    @show S
    @ein Coef[] := U[ i, i] 
    return Coef[]
end

function test()

    x0=0.5
    T1 = [0.9988894589661916 2.4141401009672692e-8 -9.21073527124551e-18; 2.414140092879679e-8 1.0113486732236356e-13 -3.254292770737505e-24; -2.4058145031123452e-19 2.4387329449624486e-24 1.002665275075667e-13;;; 2.2369956667600393e-8 0.01177751073439025 0.01177685069366975; 0.011777510734396866 5.692640873553409e-10 2.8460655019662047e-10; 0.011776850693676363 2.846065512560657e-10 -1.9093252656828917e-14;;; 8.318087750543311e-19 -0.011777510733978307 -0.011776850693259231; 0.011777510733971711 9.483301236635998e-19 -2.846256431019167e-10; 0.01177685069325263 2.846256439393433e-10 -1.1084366881805116e-19;;;; 2.2369956656976694e-8 0.011777510734401848 -0.011776850693658138; 0.011777510734408457 5.692640873559015e-10 -2.846065504135412e-10; -0.011776850693664753 -2.8460655126146057e-10 -1.909302979277114e-14;;; 3.130556155863574e-7 -4.861245076481914e-7 -3.8609241904943518e-19; -4.861245076487369e-7 0.0002777279464551575 2.735873352730558e-16; -3.85943379257726e-19 2.7370274942050075e-16 -0.000277696818233557;;; -1.4626012574453234e-21 4.863882701148023e-7 -2.637401479213294e-10; -4.863882701147824e-7 -1.557657977123615e-16 -0.000277712381910125; 2.6374014790165455e-10 0.00027771238191012494 1.5574236952443137e-16;;;; -9.693798775119823e-18 -0.011777510733989914 0.011776850693247628; 0.011777510733983297 9.48331430058368e-19 2.846256428844345e-10; -0.011776850693241015 -2.846256439333898e-10 1.0067199819365289e-19;;; -6.894583731466444e-21 4.863882701145528e-7 2.637401484443907e-10; -4.863882701147811e-7 -1.5594644388542407e-16 0.00027771238191012494; -2.637401486727373e-10 -0.000277712381910125 1.5592068003439878e-16;;; 3.130556136009777e-7 -4.866520174529157e-7 -1.3689486660478041e-19; -4.866520174525782e-7 -0.00027772794648245336 -2.734954860940402e-16; -3.8511372770911903e-19 -2.7374346680089633e-16 0.0002776968182138848]
    T2 = [0.9988894593101187 2.2939233802901028e-8 -9.693961756608732e-18; 2.293923381308961e-8 9.987325892856526e-14 2.2543638292181353e-24; 8.316496036966675e-19 -2.9423333181213116e-24 9.934646468007591e-14;;; -5.406181172607965e-10 0.01177717890675867 -0.011777178906347725; 0.011777178897777966 5.409024037548475e-10 -2.7044258700019684e-10; -0.011777178897367017 -2.704425865497314e-10 -1.7230186046708423e-14;;; -2.0216711167529834e-19 -0.011777178897358671 0.011777178896947715; 0.011777178906339374 5.778757151237687e-20 2.704598168745984e-10; -0.011777178905928417 -2.704598170281596e-10 9.577391798957861e-20;;;; -5.406181981389911e-10 0.011777178906765277 0.011777178906341113; 0.011777178897784579 5.409024037551513e-10 2.704425867714562e-10; 0.01177717889736042 2.7044258656919056e-10 -1.7230395024849602e-14;;; 1.0040895017914501e-13 -1.2767290128062186e-11 9.48493733078004e-19; -1.276729011831356e-11 0.00027771229659764916 -1.559763444041006e-16; 9.484927452570197e-19 -1.557336558187481e-16 -0.0002777122965779568;;; -2.225974728205143e-22 6.393253237300295e-12 -6.374037831597291e-12; -6.393253246930552e-12 2.1176987405798918e-13 0.0002777122965779157; 6.374037831690397e-12 -0.00027771229657791565 -2.1176976089970796e-13;;;; 8.767751521417996e-18 -0.01177717889736528 -0.0117771788969411; 0.011777178906345992 5.778772549179366e-20 -2.7045981710303523e-10; 0.01177717890592182 2.7045981700839744e-10 -1.5242568034864584e-19;;; -9.109651133721204e-21 6.393252389487262e-12 6.37403678243116e-12; -6.3932521876009254e-12 2.1176995330461678e-13 -0.00027771229657791576; -6.374036994040965e-12 0.0002777122965779157 -2.1177010845828176e-13;;; 1.004089309946328e-13 -1.92155038291017e-14 1.0067294462752412e-19; -1.92153018792467e-14 -0.0002777122965778749 1.559594509160548e-16; -1.1084349947529846e-19 1.557346227870929e-16 0.000277712296558182]
    # @show f(T1)
    @show gradient(f,T1)[1]
    # @show f(T2)
    @show gradient(f,T2)[1]
end

test()

It gives the same thing:

S = [0.9966733066799104, 0.000553884336002314, 0.0005538843359628737, 3.078119489781907e-7, 9.517013661198881e-14, 8.233208632182651e-20, 3.152181524404255e-20, 2.21461089491521e-20, 3.244767819833425e-21]
(gradient(f, T1))[1] = [0.5930936292861588 7624.065189342431 1923.250385191218; -0.008112132803168928 230688.20549618005 -1.0052478316457001e6; 0.023404018301786112 -1.0023995820129112e6 229874.507193745;;; 10.553837932255085 -8.356714691702897e6 -8.366949201460425e6; 1197.341668354167 -98466.40538184528 -104976.6490006776; 1211.9274984744588 -97231.75609906396 -106199.55496457085;;; -10.554681766014166 8.346222273536322e6 8.345128778829346e6; 8082.5080766834435 -106516.22428477029 -96925.69189064114; 8067.390969392194 -107669.12179149309 -95761.76577622129;;;; 7.781678518198429 8.334538750588702e6 -8.366146786695122e6; -12726.386662478204 -57205.113172444835 62112.68877471877; 12706.25462049855 58335.116218700015 -60976.290836456545;;; -2107.7418497581 3400.377447668149 807.9788817186554; 162.73594169763243 255832.81028894515 365745.04965628346; -57.28095383751074 366084.9237695921 256456.64622777177;;; 2048.776221317851 -3289.9263969847116 -694.2366713737454; -144.70786066088436 366071.46541845205 256132.3438346258; 74.40043057047265 256152.1051356172 
365763.55382267665;;;; -7.769985794854151 -8.344364602271023e6 8.370074122969149e6; 3446.234865028609 -61114.81018706875 58203.677269715816; -3426.620530473898 60021.65478200047 -59289.78009630316;;; 2057.1865482553208 -3517.7005815112448 -695.9658722909976; -149.42500941330974 -365977.8189552523 -256186.14348111482; 44.380215997682626 -256095.37475010278 -365843.398443275;;; -1998.2209269612583 3398.7281161764204 580.0189728918299; 131.3971439572214 -256442.73089178288 -365903.2879046907; -61.49905283810004 -365922.9529728504 -255833.7777953479]
S = [0.9966733077096799, 0.00055388416539868, 0.00055388416539868, 3.0781175853684797e-7, 9.517001850668072e-14, 1.691437704129239e-19, 1.361519953025284e-19, 5.109999329154659e-21, 1.6347206747950515e-21]
(gradient(f, T2))[1] = [NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN;;; NaN NaN NaN; NaN NaN NaN; NaN NaN NaN]

I’ve simplified my example a little bit, so it’s more obvious that it’s the derivative of SVD that causes the nan value

using Zygote
using OMEinsum 
using LinearAlgebra: svd
# using GenericLinearAlgebra: svd
function f(m)
    U, S, V = svd( m )
    @show S
    return U
end

function test()

    m1 = [0.9966729985257432 2.4614026561418658e-8 -1.4662052991923572e-13 2.4614026561416352e-8 0.0002771116035001569 -0.00027709607348218437 1.4660214984262392e-13 0.0002770960734827305 -0.000277080544335815; 2.4614026561418658e-8 0.00027695768894773343 0.00027694216755605775 1.7359966208232817e-10 -2.561804158414186e-10 -3.80086587106461e-15 3.620550268592712e-21 -2.6300518540566453e-10 -6.842708763825336e-12; -1.4662052991923572e-13 0.00027694216755605775 0.0002769266470349557 -3.620883819887297e-21 -2.630090677981402e-10 6.8389895390441526e-12 1.7357959691740416e-10 -2.698334009860051e-10 8.1517453336514e-17; 2.4614026561416352e-8 1.7359966208232817e-10 -3.620883819887297e-21 0.000276957688948279 -2.5618041584193677e-10 2.6300518540057096e-10 -0.0002769421675560576 3.800865593734127e-15 -6.842708758427965e-12; 0.0002771116035001569 -2.561804158414186e-10 -2.630090677981402e-10 -2.5618041584193677e-10 1.5400881455237724e-7 -8.558299720130033e-11 2.630090677930291e-10 8.558299750480673e-11 -1.5399155220674908e-7; -0.00027709607348218437 -3.80086587106461e-15 6.8389895390441526e-12 2.6300518540057096e-10 -8.558299720130033e-11 1.5399155220689474e-7 -2.6983340097525596e-10 -1.5399155195713188e-7 8.557340475982849e-11; 1.4660214984262392e-13 3.620550268592712e-21 1.7357959691740416e-10 -0.0002769421675560576 2.630090677930291e-10 -2.6983340097525596e-10 0.00027692664703440976 6.838989534210779e-12 -8.151774000354929e-17; 0.0002770960734827305 -2.6300518540566453e-10 -2.698334009860051e-10 3.800865593734127e-15 8.558299750480673e-11 -1.5399155195713188e-7 6.838989534210779e-12 1.5399155220689506e-7 -8.557340506331604e-11; -0.000277080544335815 -6.842708763825336e-12 8.1517453336514e-17 -6.842708758427965e-12 -1.5399155220674908e-7 8.557340475982849e-11 -8.151774000354929e-17 -8.557340506331604e-11 1.5397429229547095e-7]
    m2 = [0.9966729995557029 2.2875616735540974e-8 -3.137162197523065e-15 2.2875616733427594e-8 0.00027709598854137304 0.00027709598853086783 3.11781788855719e-15 -0.00027709598853117894 -0.0002770959885211988; 2.2875616735540974e-8 0.00027694208270952886 -0.00027694208269933996 5.807173092847765e-16 1.2716258952361389e-11 3.5324433563055526e-15 7.155811638325043e-23 -6.3598895032823645e-12 -6.359901891756355e-12; -3.137162197523065e-15 -0.00027694208269933996 0.0002769420826896764 -7.200210654868093e-23 -6.356358798933422e-12 6.356367709385669e-12 5.567665781316518e-17 -1.0649687272234072e-17 1.7390145024158106e-18; 2.2875616733427594e-8 5.807173092847765e-16 -7.200210654868093e-23 0.0002769420827098397 1.2716258952368198e-11 6.3598894973125245e-12 0.0002769420826993398 -3.5324421942863478e-15 -6.359901896549566e-12; 0.00027709598854137304 1.2716258952361389e-11 -6.356358798933422e-12 1.2716258952368198e-11 1.5399145740185385e-7 8.557812185028287e-11 6.3563587941425324e-12 -8.557812202318426e-11 -1.5399145739049633e-7; 0.00027709598853086783 3.5324433563055526e-15 6.356367709385669e-12 6.3598894973125245e-12 8.557812185028287e-11 1.539914573904962e-7 -1.0660440531860812e-17 -1.5399145739035038e-7 -8.557812184413537e-11; 3.11781788855719e-15 7.155811638325043e-23 5.567665781316518e-17 0.0002769420826993398 6.3563587941425324e-12 -1.0660440531860812e-17 0.00027694208268936536 6.3563677153382516e-12 -1.73901715513418e-18; -0.00027709598853117894 -6.3598895032823645e-12 -1.0649687272234072e-17 -3.5324421942863478e-15 -8.557812202318426e-11 -1.5399145739035038e-7 6.3563677153382516e-12 1.5399145739049636e-7 8.557812201704097e-11; -0.0002770959885211988 -6.359901891756355e-12 1.7390145024158106e-18 -6.359901896549566e-12 -1.5399145739049633e-7 -8.557812184413537e-11 -1.73901715513418e-18 8.557812201704097e-11 1.5399145737943067e-7]
    @show jacobian(f,m1)
    @show jacobian(f,m2)
end

test()

On your proposal to perturb the matrix, that’s relatively easy to do and the NaN values aren’t really all that robust. I suppose depending on your application, maybe derivatives of a nearby matrix will be good enough. If you are using this for optimization, perhaps you still end up with a descent direction and a few steps with modified derivatives won’t impact ultimate convergence. Even something silly like:

  @show jacobian(f,m2 + eps(norm(m2,1)) * randn(eltype(m2), size(m2)))

got rid of the NaN values on this example. Of course you are still near a matrix for which the Jacobian does not exist and the derivatives are huge. There’s plenty of opportunity to end up with NaN again in your own code, depending on what you do with the derivatives. I would also not expect these “derivatives” to be particularly accurate, even for the perturbed matrix. It all looks pretty dubious to me. But if this is what you are being advised to try for your application, it’s not that hard to try it.

Interestingly, I first ran your example on Zygote 0.6.45 on Julia 1.7.0 using a computer I hadn’t updated for a while. I didn’t get NaN there, just huge derivatives. With the same version of Zygote and Julia 1.8, I do get the NaN values, same as you. I’m not sure what changed.

Finally, I’m new to Zygote and far from confident I know how the pieces are put together, but I think it depends on the code in ChainRule.jl at:

https://github.com/JuliaDiff/ChainRules.jl/blob/main/src/rulesets/LinearAlgebra/factorization.jl

This includes the line

F = T[i == j ? 1 : inv(@inbounds s[j]^2 - s[i]^2) for i = 1:k, j = 1:k]

which forms a matrix from the reciprocal differences of singular values. I’d guess that’s where the problem starts: F likely contains Inf values which I also guess are later multiplied by zero to give NaN. If you want something more principled than a random perturbation, you could try making your own rule that “regularizes” the difference of close singular values by substituting some smallish value when s[j]^2 - s[i]^2 evaluates to zero. I don’t know that there’s any theoretical justification for doing that, but it would change only the problematic computation and nothing else. So maybe that’s better. Keep in mind that I haven’t really tried to debug this (edit: not to imply that this looks like a bug to me) or inspect what’s happening internally in these functions. I’m just eye-balling the functions that I think are being used and looking for where close singular values might cause problems. I am fairly busy for a while, but I am now curious about how this all works and might play with it later this week.

I was going to do it at first, but I wasn’t sure if I was going to get the right number because it felt weird.

It was very strange indeed, and I also compared it on the two machines:

# machines 1, which get NAN:
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> using Pkg

julia> Pkg.status()
      Status `D:\Users\FYF\.julia\environments\v1.7\Project.toml`
  [c29ec348] AbstractDifferentiation v0.4.3
  [6e4b80f9] BenchmarkTools v1.3.1
  [d360d2e6] ChainRulesCore v1.15.3
  [f6369f11] ForwardDiff v0.10.32
  [14197337] GenericLinearAlgebra v0.3.1
  [57b37032] ImplicitDifferentiation v0.2.0
  [de52edbc] Integrals v3.1.2
  [033835bb] JLD2 v0.4.22
  [ebe7aa44] OMEinsum v0.7.1
  [429524aa] Optim v1.7.2
  [91a5bcdd] Plots v1.31.7
  [438e738f] PyCall v1.94.1
  [37e2e3b7] ReverseDiff v1.14.1
  [1ed8b502] SciMLSensitivity v7.6.3
  [276daf66] SpecialFunctions v2.1.7
  [24249f21] SymPy v1.1.7
  [9f7883ad] Tracker v0.2.21
  [e88e6eb3] Zygote v0.6.45
  [8bb1440f] DelimitedFiles

# machines 1, which don't get nan
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512)

julia> using Pkg

julia> Pkg.status()
      Status `~/.julia/environments/v1.7/Project.toml`
  [c29ec348] AbstractDifferentiation v0.4.3
  [6e4b80f9] BenchmarkTools v1.3.1
  [5789e2e9] FileIO v1.15.0
  [f6369f11] ForwardDiff v0.10.32
  [14197337] GenericLinearAlgebra v0.3.1
  [de52edbc] Integrals v3.1.2
  [033835bb] JLD2 v0.4.22
  [ebe7aa44] OMEinsum v0.7.1
  [1ed8b502] SciMLSensitivity v7.6.1
  [276daf66] SpecialFunctions v2.1.7
  [24249f21] SymPy v1.1.7
  [e88e6eb3] Zygote v0.6.45
  [8bb1440f] DelimitedFiles

I’m really confused that even though the Julia language environment is the same, it behaves differently in Linux and Windows

You’re amazing. That should be the problem, and I changed it to the following. It shouldn’t be potentially risky, right?

    F = T[i == j ? 1 : inv(@inbounds s[j]^2 - s[i]^2+T(1e-40)) for i = 1:k, j = 1:k]

You’re amazing. That should be the problem, and I changed it to the following. It shouldn’t be potentially risky, right?

    F = T[i == j ? 1 : inv(@inbounds s[j]^2 - s[i]^2+T(1e-40)) for i = 1:k, j = 1:k]

Thanks. I hope it helps. I would make sure your perturbation has the same sign as s[j]^2-s[i]^2. You could make that difference smaller in magnitude here if it is negative. But otherwise, that’s sort of what I had in mind. This feels little more precise, and loosely analogous to regularizing a nearly singular system using modification of singular values, but I’m not sure I could argue that it has any real advantages over perturbing the full matrix. It’s completely a made up ad-hoc idea. I can offer your money back if it doesn’t work, but that’s about it. :slightly_smiling_face:

In terms of risk, you are trying to get numbers from a problem that is close to a problem with no solution. It’s analogous to trying to invert a matrix that is nearly singular. The perturbation moves you over to work with a different matrix, but at best you get the derivatives for a nearby matrix. And those derivatives are going to be very sensitive to perturbations. As a numerical analyst, I would say that the textbook answer is that you started off with an inherently risky problem. But maybe this way you replace derivatives that are approaching \infty with some related large numbers for those derivatives. And maybe that will be good enough for your application. When trying to apply regularization to a nearly singular linear system, there is plenty of theory, but what works and what doesn’t has a lot to do with the origins of and nature of the specific problem. I don’t think you will find any broad theory for this; the test is whether this sort of modification of the derivatives allows you to solve whatever problem you are trying to solve. To me it doesn’t look like too wild a thing to try in an optimization setting, possibly with some tinkering with the perturbation magnitude. But that’s just a guess…