In this formulation we are trying to solve Ax=b for x, where each process has its own A matrix and b vector. We first average the b vectors, and then average the Hessian-vector product (z in my code) across all processes. This means that within each CG iteration, every process should be doing the same thing with the same values, up until it computes its own Hessian-vector product again. Your point about floating-point non-associativity is valid, but would two different processes really produce two different values for the dot product of the same vectors?
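For concreteness, here is a minimal NumPy sketch of the scheme described above (the names are mine, not from my actual code base): a list of matrices stands in for the per-process state, and `np.mean` stands in for the allreduce that averages the local Hessian-vector products.

```python
import numpy as np

def distributed_cg(A_list, b_list, iters=50):
    """Sketch of the distributed CG described above.  Each 'process' i
    holds its own A_i and b_i.  We average b once up front; inside the
    loop each process computes its local Hessian-vector product
    z_i = A_i @ p, and the z_i are averaged (the allreduce).  In exact
    arithmetic this is identical to running CG on (mean A, mean b)."""
    b = np.mean(b_list, axis=0)        # average b across processes
    x = np.zeros_like(b)
    r = b.copy()                       # residual, since x0 = 0
    p = r.copy()
    for _ in range(iters):
        # each process computes its own Hessian-vector product ...
        z_list = [A @ p for A in A_list]
        # ... which is then averaged across all processes
        z = np.mean(z_list, axis=0)
        alpha = (r @ r) / (p @ z)
        x = x + alpha * p
        r_new = r - alpha * z
        if np.linalg.norm(r_new) < 1e-12:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x
```

In this single-process simulation every "rank" trivially sees the same averaged z; in real MPI, the allreduce is the one step where reduction order can differ.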
More interestingly, the cumulative effect of this, within my larger code base, is that the distributed version of my CG can sometimes produce an x that has a negative inner product with the input b. This is bizarre because I can guarantee each process's A matrix is PSD, and I'm averaging the b vectors; every process ends up with the same x vector and the same negative inner product, so even if the algorithm is consistent across processes, it's consistently wrong. The alternative is to average all the A's and b's and run CG on the averaged system, which produces the correct value — but this is more expensive bandwidth-wise than the distributed version.
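A quick sanity check of why the negative inner product shouldn't happen in exact arithmetic (this is just illustrative, not my actual matrices): an average of PSD matrices is PSD, and for the exact solution x = A⁻¹b we get x·b = bᵀA⁻¹b ≥ 0, since A⁻¹ is PSD whenever A is positive definite.

```python
import numpy as np

rng = np.random.default_rng(1)
n, P = 6, 3
# each "process" holds its own PSD matrix A_i and vector b_i
A_list = []
for _ in range(P):
    M = rng.standard_normal((n, n))
    A_list.append(M.T @ M)             # M^T M is always PSD
b_list = [rng.standard_normal(n) for _ in range(P)]

A_bar = np.mean(A_list, axis=0)        # convex combination of PSD matrices: still PSD
b_bar = np.mean(b_list, axis=0)

# eigenvalues of the averaged matrix are nonnegative
assert np.all(np.linalg.eigvalsh(A_bar) >= -1e-10)

# exact solution: x.b = b^T A^{-1} b >= 0 because A^{-1} is also PSD
x = np.linalg.solve(A_bar, b_bar)
print(x @ b_bar >= 0)
```

So any x·b < 0 coming out of the solver means the iterates have drifted away from what CG on the averaged system would produce, not that the averaged problem itself is indefinite.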
- Is there a way to encourage consistency between parallel processes?
- Are floating-point operations really so sensitive that the math can break algorithmically, as in my second paragraph?
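On the second bullet, the answer to "would two processes produce different dot products on the same values" is yes, whenever the reduction order differs — floating-point addition is not associative. Two small self-contained demonstrations:

```python
# A classic float64 case: the same three numbers, grouped differently.
left = (0.1 + 0.2) + 0.3     # -> 0.6000000000000001
right = 0.1 + (0.2 + 0.3)    # -> 0.6
print(left == right)         # False

# The same effect with a reduction: identical terms summed forward vs.
# backward, as two different allreduce trees might order them.
v = [1e16, 1.0, -1e16, 1.0]
s_forward = 0.0
for term in v:
    s_forward += term        # 1e16 absorbs the first +1.0 -> ends at 1.0
s_backward = 0.0
for term in reversed(v):
    s_backward += term       # both +1.0s are absorbed -> ends at 0.0
print(s_forward, s_backward) # 1.0 0.0
```

The second example is deliberately extreme, but the mechanism is the same one that makes an MPI allreduce give last-bit-different results when its reduction tree changes between ranks or runs; in CG those last bits feed into alpha and beta and can compound over iterations.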
EDIT: As an additional thought, if floating point were that sensitive, wouldn't we also see inconsistencies in single-process CG similar to the parallel version? I'm still inclined to think that a weird calculation is happening somewhere, rather than floats breaking the algorithm…
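One resolution to this EDIT worth noting: a single process replays the identical instruction sequence every run, so its results are bitwise reproducible even though floats are "sensitive" — sensitivity only shows up when the operation *order* changes, which is exactly what a parallel reduction can do. A small sketch of the distinction:

```python
import numpy as np

rng = np.random.default_rng(3)
v = rng.standard_normal(10_000)

# The same operations in the same order are bitwise reproducible:
# repeating the identical reduction gives the identical float64 result.
s1 = float(np.sum(v))
s2 = float(np.sum(v))
print(s1 == s2)   # True -- single-process determinism

# Divergence only appears when the order changes, e.g. a permuted
# reduction (standing in for a different allreduce tree on another rank).
s3 = float(np.sum(v[rng.permutation(v.size)]))
# s3 may or may not equal s1; any difference is in the last few bits.
```

So a single-process CG being self-consistent doesn't rule out floats as the culprit in the parallel version — but it also doesn't prove them guilty, since a genuine bug in the averaging logic would look identical from the outside.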