Simple addition using Transducers: single-threaded foldl and multi-threaded foldxt always calculate two (slightly) different answers

Simple addition using Transducers with single-threaded foldl and multi-threaded foldxt calculates two (slightly) different answers.

In other words: is the single-threaded foldl correct, the multi-threaded foldxt, or neither of them?

Does someone, like maybe @tkf :slightly_smiling_face:, know what the issue is, or how to address it?

I know Transducers is beta-ish, but it's really hard to imagine why results differ between single-threaded foldl and multi-threaded foldxt for the SAME data set. Any ideas?

Using Transducers, single-threaded foldl always returns slightly different values (e.g. by ±1e-10 to 1e-12) than multi-threaded foldxt, following the parallel-processing tutorial at Tutorial: Parallelism · Transducers.jl.

Issue in a nutshell / MWE

Maybe we need more TDD (regression) tests; a sketch of such a test is given after the REPL session below:

julia> xs = randn(10_000_000)
10000000-element Array{Float64,1}:

julia> maximum(xs)
4.953824504715016

julia> minimum(xs)
-5.157454219276268

## At the REPL command line (per julia>)

## Single-threaded sum and foldl agree to ALL 16 significant figures / 12 decimals here.

julia> sum(sin(x) for x in xs)
-3847.143374357468

julia> foldl(+, (sin(x) for x in xs))
-3847.143374357468

julia> foldl(+, Map(sin), xs)
-3847.143374357468

## Multi-threaded sums agree with the above to the 8th decimal, but they all give different answers from the 9th decimal on, i.e. beyond 1.0e-9

julia> foldxt(+, (sin(x) for x in xs))
-3847.1433743577104

julia> foldxd(+, (sin(x) for x in xs))
-3847.1433743577104

julia> foldr(+, (sin(x) for x in xs))
-3847.1433743578973

julia> foldxd(+, (sinpi(x) for x in xs))
-283.66017340538497

julia> foldl(+, (sinpi(x) for x in xs))
-283.6601734052002

julia> foldxt(+, (sinh(x) for x in xs))
-4193.185340837956

julia> foldxd(+, (sinh(x) for x in xs))
-4193.185340837956

julia> foldl(+, (sinh(x) for x in xs))
-4193.185340837231
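Here is a rough sketch of such a regression test; the tolerance (and the safety factor of 10) is only a guess at the allowable re-association error, not an established bound:

```julia
using Test, Transducers

xs = randn(10_000_000)
serial   = foldl(+,  Map(sin), xs)   # single-threaded reference
threaded = foldxt(+, Map(sin), xs)   # multi-threaded result

# Guessed tolerance: roughly sqrt(N) ulps at the magnitude of the answer,
# with a safety factor of 10 so the test isn't flaky.
@test isapprox(serial, threaded; atol = 10 * sqrt(length(xs)) * eps(abs(serial)))
```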

## Possibly related to the thread below about multi-threading changing results, but exactly how to address it?

Multi-threading changing results - #18 by tkf

## Multiple runs inside a .jl program (i.e. NOT in a REPL session) with just a few (10) randoms, e.g. xs = randn(10) ## DEBUG (a minimal sketch of such a script follows the DEBUG output below)

##3440-2 DEBUG r_multithread_foldxt - r_onethread_foldl := -4.440892098500626e-16
##3440-2 DEBUG r_multithread_foldxt - r_onethread_foldl := 0.0
##3440-2 DEBUG r_multithread_foldxt - r_onethread_foldl := 4.440892098500626e-16
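For reference, a minimal script along the lines of what produced those DEBUG lines might look like this (a sketch; the variable names just mirror the DEBUG output above):

```julia
# minimal_repro.jl -- start Julia with multiple threads, e.g. JULIA_NUM_THREADS=4
using Transducers

xs = randn(10)   # just a few (10) randoms, as in the DEBUG runs above

r_onethread_foldl    = foldl(+,  Map(sin), xs)   # single-threaded
r_multithread_foldxt = foldxt(+, Map(sin), xs)   # multi-threaded

println("DEBUG r_multithread_foldxt - r_onethread_foldl := ",
        r_multithread_foldxt - r_onethread_foldl)
```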

julia> eps()*10
2.220446049250313e-15

julia> eps()*2
4.440892098500626e-16

julia> eps()*1.0e+7
2.220446049250313e-9

julia> eps()*2.0e+7
4.440892098500626e-9

## NOTE / TIP: The calcs above using eps() are guesses at the maximum upper bounds of the error, and they agree with the empirical test results. But why don't the errors cancel out in the long run over many millions of items in the Vector/set/matrix? Maybe this points to an accumulative, systemic (not random) calculation error, e.g. dropping accuracy by calculating or accumulating with 32-bit (4-byte) reals instead of 64-bit (8-byte) reals??
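One quick way to test that 32-bit guess (a hypothetical check, not one of the runs above) would be to downcast the same data to Float32 and compare the serial sums:

```julia
using Transducers

xs   = randn(10_000_000)
xs32 = Float32.(xs)                    # same data, downcast to 32-bit

sum64 = foldl(+, Map(sin), xs)         # accumulates in Float64
sum32 = foldl(+, Map(sin), xs32)       # accumulates in Float32 throughout

# If a 32-bit accumulator were really involved, the discrepancy would look like
# this difference, which is many orders of magnitude larger than the ~1e-10
# parallel-vs-serial differences seen above.
println(sum64 - Float64(sum32))
```

Since the observed parallel-vs-serial differences are only around 1e-10, a 32-bit accumulator seems unlikely to be the cause.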

You can surround your code in triple backticks to format it as fixed-width.

```
# like this
```
2 Likes

I think this is around the expected error for different summation orders. Independent random errors scale proportionally to sqrt(N), so for this example we would expect an error of roughly sqrt(10000000) * eps(ans) ≈ 3000 * eps(ans), where ans is the answer of the calculation. (eps(ans), not eps(Float64), because floating-point precision is relative, not absolute.) Plugging this in, we see an expected error from a different summation order of about 1e-9, which is slightly higher than what you're seeing.
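For example, plugging in the serial result from the first post as a rough order-of-magnitude check:

```julia
ans_sin = -3847.143374357468             # serial foldl result from the post above
sqrt(10_000_000) * eps(abs(ans_sin))     # ≈ 1.4e-9, the expected scale of the re-association error
```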

6 Likes

Addition in IEEE math isn't associative.
(I think the OP knows this, but for future readers.)
I.e. (a+b)+c is not the same as a+(b+c).

Parallel addition works by acting as if it were,
which you can normally get away with.
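For example, with plain Float64:

```julia
(0.1 + 0.2) + 0.3   # 0.6000000000000001
0.1 + (0.2 + 0.3)   # 0.6
```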

I suspect nothing says errors for nonassociativity of addition have to cancel out.
If they do, then it is probably only under specific circumstances

You don't have a uniform sample of the number line, though you do have one symmetric about zero.
And you don't control the order in which the parallel algorithm accesses them, so you can't ensure that it groups them into groups with particular properties.

4 Likes

Thank you @Oscar_Smith :+1:, @oxinabox :+1: - all :sunglasses: good points!

So I think you might also be interested in running the Julia code below,
where I seem to have uncovered some curious and likely useful things
about using Nemo ball / interval arithmetic instead of IEEE floating-point math.

NOTE Nemo ball / interval arithmetic DOES get more accurate (i.e. the differences between parallel and serial calculations decrease) as the number of items in the Vector/set/matrix increases, which is what I expected per my earlier question (why don't errors cancel in the long run over many millions of items in the Vector/set/matrix?). However, as you noted, the IEEE floating-point errors steadily increase as the number of items increases (i.e. what I'm trying to improve in the Julia program below).

NOTE When int_dimsize = 100_000_000, the difference between parallel and serial additions of sin(random-number-x) appears to ALWAYS be -5.5289106626332796e-14, regardless of what the random numbers are.
So, quite unexpectedly, it appears that Nemo ball / interval arithmetic always returns a predictable/deterministic, and therefore calculable, difference between parallel and serial additions? This is so unexpected that it should be rechecked for errors or for a different interpretation, so please try out the code below.

```julia
# FILE: v0.7-examples-w_Transducers-using_Nemo.jl
println("")
println("##1 Start ! FILE: examples-w_Transducers-using_Nemo.jl Final example works accurately and about 1000x faster performance (than worst case), AND Parallel results EQUAL Serial to 12 Decimals ! So using Nemo ball / interval arithmetic (&& with arbitrary accuracy Arbs) appears to be an all around better method than IEEE floating point math !")
# Implement Simple Pipeline using Transducers ( NO GPU using FoldsCUDA GPU just yet but TODO later )
## TIP Buy some cheap insurance by creating a new environment.
# julia> activate v1.4-EXP3
# ..
using Transducers
# using Transducers Clojure like and so is way to GOAL Recomposable Functional Programming
## ** more syntax about parse out CONVERT Collection to Vector below
1:40 |> Partition(7) |> Filter(x -> prod(x) % 11 == 0) |> Cat() |> Scan(+) |> sum
# 4123

#### Begin TEST set ####
int_dimsize = 10_000_000
##43 DEBUG IEEE floating point r_multithread_foldxt - r_onethread_foldl:=
# 4.0245140553452075e-11
# 6.934897101018578e-12
##180-1 DEBUG using Nemo ball / interval arithmetic r_multithread_foldxt - r_onethread_foldl:=
# -2.724487302430134e-13
# -2.724487302430134e-13

## int_dimsize = 100_000_000
##43 DEBUG IEEE floating point r_multithread_foldxt - r_onethread_foldl:=
# -1.304215402342379e-9
# 2.6793713914230466e-9
##180-1 DEBUG using Nemo ball / interval arithmetic r_multithread_foldxt - r_onethread_foldl:=
# -5.5289106626332796e-14
# -5.5289106626332796e-14

# NOTE Some curious and likely useful things about using Nemo ball / interval arithmetic instead of IEEE floating point.
# # 	NOTE Nemo ball / interval arithmetic DOES get more accurate (IOW differences between parallel and serial calculations decrease) as the number of items in the Vector/set/matrix Increases - which is as I expected per my previous posit >> Why don’t errors cancel in the long run over many millions of items in the Vector/set/matrix ? ;
# 		however IEEE floating point errors steadily increase as the number of items increases. (i.e. what I'm trying to DEBUG here in this program)
# #
# 
# # 	NOTE per when int_dimsize = 100_000_000 the difference between parallel and serial calculations appears to ALWAYS be -5.5289106626332796e-14 Regardless of whatever Random- numbers I take the sin(Random- number-x)!
# So, quite unexpectedly, it appears that Nemo ball / interval arithmetic always returns a predictable/deterministic therefore calculable difference between parallel and serial calculations. This is so unexpected it should be rechecked.
# 
# ### END TEST set results ####

xs = randn(int_dimsize)

using BenchmarkTools
# ## Singled Threaded
# julia>
# NOTE WAS Transducers.Map(sin) -- DEBUG WAS -- @btime foldl(+, Transducers.Map(sin), xs)
# DEBUG WARNING: both Nemo and Transducers export "Map"; uses of it in module Main must be qualified
# LoadError: UndefVarError: Map not defined
# NOTE WAS Transducers.Map(sin) --
println("##21 ! NOTE next line works but - NOTE However, it is frustrating that Parallel results NOT equal Serial - SEE better methods farther below !")
# 5229.266346349295
@btime foldl(+, Transducers.Map(sin), xs)
r_onethread_foldl = foldl(+, Transducers.Map(sin), xs)
# 	108.623 ms (1 allocation: 16 bytes)
# 	-308.7946673338581
# DEBUG -- MethodError: no method matching AbstractAlgebra.Map(::typeof(sin)) -- @btime foldl(+, Nemo.Map(Nemo.sin), xs)

#
# ## Thread-based parallelism
# ## Just replace foldl with foldxt, to make use of multiple cores:
# julia>
# NOTE WAS Transducers.Map(sin) --
@btime foldxt(+, Transducers.Map(sin), xs)
r_multithread_foldxt = foldxt(+, Transducers.Map(sin), xs)
println("##43 DEBUG IEEE floating point r_multithread_foldxt - r_onethread_foldl:=")
println(r_multithread_foldxt - r_onethread_foldl)
# 7.860 ms (184 allocations: 13.67 KiB)
# 	-308.7946673340768
# DEBUG -- MethodError: no method matching AbstractAlgebra.Map(::typeof(sin)) --@btime foldxt(+, Nemo.Map(Nemo.sin), xs)

## NOTE However, it is frustrating that Parallel results NOT equal Serial
## NOTE Try to adapt and overcome IEEE limitations using Nemo ball / interval arithmetic
##
using Nemo
# Cool benchmarks using Nemo https://nemocas.org/benchmarks.html
# General help using Nemo https://nemocas.github.io/Nemo.jl/dev/arb/#Real-ball-functionality
##
# RR = RealField(64)
# # Real Field with 64 bits of precision and error bounds
# # Nemo.sin(RR(x) + CC("0"))
# # foldxt(+, Map(sin), xs)
# # julia>
# CC = ComplexField(64)
# Complex Field with 64 bits of precision and error bounds
# julia>
# NOTE WARNING this works but consumes roughly 1000x more CPU-Time !
# println("##50 ! NOTE WARNING next line works but consumes roughly 1000x more CPU-Time than better methods below !")
# @btime foldxt(+, (Nemo.sin(RR(x) + CC("0")) for x in xs ))
# [5229.26634635 +/- 2.13e-9]
# julia>
## NOTE However, it is frustrating that Parallel results NOT equal Serial
## NOTE Adapt and overcome IEEE limitations
# julia>
RR = RealField(64)
# Real Field with 64 bits of precision and error bounds

# .. nixed for brevity ..
# julia> c = RR(12)
# 12.000000000000000000
#
# julia> x = ball(a, b)
# [+/- 3.01]
#
# julia> y = ball(c, b)
# [1e+1 +/- 4.01]
#
# julia> mid = midpoint(x)
# 1.0000000000000000000
#
# julia>
# rad = radius(x)
# [2.0000000037252902985 +/- 3.81e-20]

# NOTE TIP
#	https://nemocas.github.io/Nemo.jl/dev/arb/#Random-generation
# 	Section: Random generation - at bottom of the page

# julia>
RR = RealField(128)
# Real Field with 128 bits of precision and error bounds
## NOTE Results when RR = RealField(128) are Parallel results EQUAL Serial to better than 12 Decimals ! >> PER >> ## ##160-1 DEBUG r_multithread_foldxt - r_onethread_foldl:=
# -2.724487302430134e-13
#
RR = RealField(64)
# -2.724487302430134e-13 <<  NOTE Same result as 128
RR = RealField(32)
# -2.724487302430134e-13 <<  NOTE Surprise Same result as 128 AND 64 Bit Arbs

##
## Test out various Arb randtype per https://nemocas.github.io/Nemo.jl/dev/arb/#Random-generation
# julia> a = rand(RR)
# [+/- 0.747]
#
# julia> b = rand(RR; randtype = :null_exact)
# [0.226218128795578887826714059196 +/- 3.11e-31]
#
# julia> c = rand(RR; randtype = :exact)
# -0.0002288818359375000000000000000000
#
# julia> d = rand(RR; randtype = :special)
# [-0.99999999999997157829056959599 +/- 3.36e-30]
#
# julia> b = rand(RR; randtype = :null_exact)
# [0.8248643500507181301 +/- 7.46e-20]

## Progress !
# julia> A = RR[rand(RR; randtype = :null_exact) 2.2 3.3;]
# [[0.9871983922730296429 +/- 8.31e-20]   [2.2000000000000001776 +/- 3.57e-20]   [3.2999999999999998224 +/- 3.57e-20]]
#
# 	foldxt(+, (x * y for x in 1:3, y in 1:3))
#
# julia> Matrix{ArbField}(undef, 10, 1)
# 10×1 Array{ArbField,2}:
#  #undef
## Maybe First attempt w RR = RealField(64) ## Real Field with 64 bits of precision and error bounds
arbmat = Matrix{arb}(undef, int_dimsize, 1)  # NOTE elements are arb balls; Matrix{ArbField} would hold fields, not values
## Below works, but changed to a fairer comparison, i.e. compare apples to apples
# for idx in eachindex(arbmat)
# 	arbmat[idx] = rand(RR; randtype = :null_exact)
# end
## NOTE Try to Compare Apples to Apples by
##	putting/converting xs values in/to arbmat
	# Example data from xs showing IEEE floating point conversion to Arb-itrary Accuracy numbers with error bars using Nemo.
	# julia> RR(xs[1])
	# [0.343814452 +/- 5.20e-10]
	# julia> RR(xs[2])
	# [-1.544338399 +/- 8.53e-10]
println("##140 NOTE Below tries to Compare Apples to Apples - as best as possible ? ..")
for idx in 1:int_dimsize
	arbmat[idx] = RR(xs[idx])
end
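## NOTE Alternative sketch (an assumption, not part of the run above): the same
## conversion can be written with broadcasting, giving a Vector{arb} instead of
## an int_dimsize x 1 Matrix:
# arbvec = RR.(xs)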

## ! Attempted to DEBUG putting Arb in Function was a NOGO. !
## ! Attempted to DEBUG putting Arb in Function was a NOGO. !
## function minmaxarbmat(arbmat) ## ERROR: UndefRefError: access to undefined reference
	# global arbmin = RR("0.0")
	# global arbmax = RR("0.0")
	# for idx in eachindex(arbmat)
	#    ## arbx = arbmat[idx]
	#    ## if (arbx < arbmin) arbmin = arbx end
	#    ## if (arbx > arbmax) arbmax = arbx end
	#    if (arbmat[idx] < arbmin) arbmin = arbmat[idx] end
	#    if (arbmat[idx] > arbmax) arbmax = arbmat[idx] end
	# end
	## 	return (arbmin,arbmax)
## end
## minmaxarbmat(arbmat)
## FAIL w ERROR: foldxt(+, (Nemo.sin(x) for x in arbmat ))
## FAIL w ERROR: UndefRefError: access to undefined reference

println("##150 ! NOTE next lines works with comparably fast performance, AND Parallel calculation results EQUALS Serial calculation results to better than 12 Decimals ! using Nemo ball / interval arithmetic (&& with arbitrary accuracy Arbs) appears to be an all around better method than IEEE floating point math !")
## ! NOTE TIP BELOW WORKS with Nemo Arb Array to 12 decimals Accuracy - Take THAT IEEE floating point math lol !
# julia>
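# NOTE eachindex(arbmat) yields the linear indices 1:int_dimsize, so the two
# generators below take sin of those integer indices, not of the arb values
# stored in arbmat.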
@btime foldl(+, (Nemo.sin(x) for x in eachindex(arbmat) ))
r_onethread_foldl = foldl(+, (Nemo.sin(x) for x in eachindex(arbmat) ))
# 1.4111883712180104
# julia>
@btime foldxt(+, (Nemo.sin(x) for x in eachindex(arbmat) ))
r_multithread_foldxt = foldxt(+, (Nemo.sin(x) for x in eachindex(arbmat) ))
# 1.4111883712180107
println("##180-1 DEBUG using Nemo ball / interval arithmetic r_multithread_foldxt - r_onethread_foldl:=")
println(r_multithread_foldxt - r_onethread_foldl)
## NOTE Results when RR = RealField(128) are Parallel results EQUAL Serial to better than 12 Decimals ! >> PER >> ## ##160-1 DEBUG r_multithread_foldxt - r_onethread_foldl:=
# -2.724487302430134e-13
# -2.724487302430134e-13 <<  NOTE 64 Bit Calculations Same result as 128
# -2.724487302430134e-13
# NOTE Using Nemo Ball Arithmetic 32 Bit Accuracy Calculations yield SAME CONSTANT result as 64 AND SAME as 128 Bit Arbs
##
println("##200 EOJ")
```