In this formulation we are trying to solve Ax=b for x, where each process has its own A matrix and b vector. We first average the b vectors, and then average the Hessian-vector product (z in my code) across all processes. This means that within each CG iteration, every process should be doing the same thing with the same values, up until it computes its own Hessian-vector product again. Your point about floating-point non-associativity is valid, but would two different processes really produce two different values for the dot product of the same vectors?
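For concreteness, here is a minimal NumPy sketch of the scheme described above (the names are mine, not from my actual code base): a list of matrices stands in for the per-process state, and `np.mean` stands in for the allreduce that averages the local Hessian-vector products.

```python
import numpy as np

def distributed_cg(A_list, b_list, iters=50):
    """Sketch of the distributed CG described above.  Each 'process' i
    holds its own A_i and b_i.  We average b once up front; inside the
    loop each process computes its local Hessian-vector product
    z_i = A_i @ p, and the z_i are averaged (the allreduce).  In exact
    arithmetic this is identical to running CG on (mean A, mean b)."""
    b = np.mean(b_list, axis=0)        # average b across processes
    x = np.zeros_like(b)
    r = b.copy()                       # residual, since x0 = 0
    p = r.copy()
    for _ in range(iters):
        # each process computes its own Hessian-vector product ...
        z_list = [A @ p for A in A_list]
        # ... which is then averaged across all processes
        z = np.mean(z_list, axis=0)
        alpha = (r @ r) / (p @ z)
        x = x + alpha * p
        r_new = r - alpha * z
        if np.linalg.norm(r_new) < 1e-12:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x
```

In this single-process simulation every "rank" trivially sees the same averaged z; in real MPI, the allreduce is the one step where reduction order can differ.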
More interestingly, the cumulative effect of this, within my larger code base, is that the distributed version of my CG can sometimes produce an x that has a negative inner product with the input b. This is bizarre because I can guarantee each process's A matrix is PSD, and I'm averaging the b vectors; every process ends up with the same x vector and the same negative inner product, so even if the algorithm is consistent across processes, it's consistently wrong. The alternative is to average all the A's and b's and run CG on the averaged system, which produces the correct value — but this is more expensive bandwidth-wise than the distributed version.
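A quick sanity check of why the negative inner product shouldn't happen in exact arithmetic (this is just illustrative, not my actual matrices): an average of PSD matrices is PSD, and for the exact solution x = A⁻¹b we get x·b = bᵀA⁻¹b ≥ 0, since A⁻¹ is PSD whenever A is positive definite.

```python
import numpy as np

rng = np.random.default_rng(1)
n, P = 6, 3
# each "process" holds its own PSD matrix A_i and vector b_i
A_list = []
for _ in range(P):
    M = rng.standard_normal((n, n))
    A_list.append(M.T @ M)             # M^T M is always PSD
b_list = [rng.standard_normal(n) for _ in range(P)]

A_bar = np.mean(A_list, axis=0)        # convex combination of PSD matrices: still PSD
b_bar = np.mean(b_list, axis=0)

# eigenvalues of the averaged matrix are nonnegative
assert np.all(np.linalg.eigvalsh(A_bar) >= -1e-10)

# exact solution: x.b = b^T A^{-1} b >= 0 because A^{-1} is also PSD
x = np.linalg.solve(A_bar, b_bar)
print(x @ b_bar >= 0)
```

So any x·b < 0 coming out of the solver means the iterates have drifted away from what CG on the averaged system would produce, not that the averaged problem itself is indefinite.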
- Is there a way to encourage consistency between parallel processes?
- Are floating-point operations really so sensitive that the math can break algorithmically, as in my second paragraph?
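On the second bullet, the answer to "would two processes produce different dot products on the same values" is yes, whenever the reduction order differs — floating-point addition is not associative. Two small self-contained demonstrations:

```python
# A classic float64 case: the same three numbers, grouped differently.
left = (0.1 + 0.2) + 0.3     # -> 0.6000000000000001
right = 0.1 + (0.2 + 0.3)    # -> 0.6
print(left == right)         # False

# The same effect with a reduction: identical terms summed forward vs.
# backward, as two different allreduce trees might order them.
v = [1e16, 1.0, -1e16, 1.0]
s_forward = 0.0
for term in v:
    s_forward += term        # 1e16 absorbs the first +1.0 -> ends at 1.0
s_backward = 0.0
for term in reversed(v):
    s_backward += term       # both +1.0s are absorbed -> ends at 0.0
print(s_forward, s_backward) # 1.0 0.0
```

The second example is deliberately extreme, but the mechanism is the same one that makes an MPI allreduce give last-bit-different results when its reduction tree changes between ranks or runs; in CG those last bits feed into alpha and beta and can compound over iterations.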
EDIT: As an additional thought, if floating point were that sensitive, wouldn't we also see inconsistencies in single-process CG similar to the parallel version? I'm still inclined to think that a weird calculation is happening somewhere, rather than floats breaking the algorithm…
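One resolution to this EDIT worth noting: a single process replays the identical instruction sequence every run, so its results are bitwise reproducible even though floats are "sensitive" — sensitivity only shows up when the operation *order* changes, which is exactly what a parallel reduction can do. A small sketch of the distinction:

```python
import numpy as np

rng = np.random.default_rng(3)
v = rng.standard_normal(10_000)

# The same operations in the same order are bitwise reproducible:
# repeating the identical reduction gives the identical float64 result.
s1 = float(np.sum(v))
s2 = float(np.sum(v))
print(s1 == s2)   # True -- single-process determinism

# Divergence only appears when the order changes, e.g. a permuted
# reduction (standing in for a different allreduce tree on another rank).
s3 = float(np.sum(v[rng.permutation(v.size)]))
# s3 may or may not equal s1; any difference is in the last few bits.
```

So a single-process CG being self-consistent doesn't rule out floats as the culprit in the parallel version — but it also doesn't prove them guilty, since a genuine bug in the averaging logic would look identical from the outside.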