Why do I get an IOError when broadcasting higher-precision (bits >= 98) ArbFloat variables to workers?

I used the following code to test data allocation on distributed workers. Broadcasting an ArbFloat variable with a precision of 98 bits or more to the workers causes an IOError.
Code:

using ArbNumerics, Distributed
rmprocs(workers())
addprocs(1);

@everywhere dn = 90
@everywhere using ArbNumerics
@everywhere setworkingprecision(ArbFloat,bits=dn)

a = ArbFloat(pi)

@everywhere function check_data(x)
    println("Worker $(myid()): Data check: ", typeof(x), " with value: ", x)
end

for i = 1:10
    @everywhere setworkingprecision(ArbFloat,bits=dn+$i)
    a = ArbFloat(pi)
    try
        @everywhere check_data($a)
    catch e
        println("The error happened at bits = $(dn+i) ",e)
    end
end

Output and error:

Worker 1: Data check: ArbFloat{91} with value: 3.141592653589793238462643383279503
From worker 3: Worker 3: Data check: ArbFloat{91} with value: 3.141592653589793238462643383279503
Worker 1: Data check: ArbFloat{92} with value: 3.141592653589793238462643383279503
From worker 3: Worker 3: Data check: ArbFloat{92} with value: 3.141592653589793238462643383279503
Worker 1: Data check: ArbFloat{93} with value: 3.1415926535897932384626433832795029
From worker 3: Worker 3: Data check: ArbFloat{93} with value: 3.1415926535897932384626433832795029
Worker 1: Data check: ArbFloat{94} with value: 3.1415926535897932384626433832795029
From worker 3: Worker 3: Data check: ArbFloat{94} with value: 3.1415926535897932384626433832795029
Worker 1: Data check: ArbFloat{95} with value: 3.14159265358979323846264338327950288
From worker 3: Worker 3: Data check: ArbFloat{95} with value: 3.14159265358979323846264338327950288
Worker 1: Data check: ArbFloat{96} with value: 3.14159265358979323846264338327950288
From worker 3: Worker 3: Data check: ArbFloat{96} with value: 3.14159265358979323846264338327950288
Worker 1: Data check: ArbFloat{97} with value: 3.141592653589793238462643383279502884
From worker 3: Worker 3: Data check: ArbFloat{97} with value: 3.141592653589793238462643383279502884
Worker 1: Data check: ArbFloat{98} with value: 3.141592653589793238462643383279502884
From worker 3: Worker 3: Data check: ArbFloat{98} with value:
Worker 3 terminated.

The error happened at bits = 98

Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
[1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
@ Base .\stream.jl:410
[2] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base .\stream.jl:981
[3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base .\stream.jl:987
[4] unsafe_read
@ .\io.jl:891 [inlined]
[5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base .\io.jl:890
[6] read!
@ .\io.jl:895 [inlined]
[7] deserialize_hdr_raw
@ C:\Users\dell.julia\juliaup\julia-1.11.2+0.x64.w64.mingw32\share\julia\stdlib\v1.11\Distributed\src\messages.jl:167 [inlined]
[8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed C:\Users\dell.julia\juliaup\julia-1.11.2+0.x64.w64.mingw32\share\julia\stdlib\v1.11\Distributed\src\process_messages.jl:172
[9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed C:\Users\dell.julia\juliaup\julia-1.11.2+0.x64.w64.mingw32\share\julia\stdlib\v1.11\Distributed\src\process_messages.jl:133
[10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed C:\Users\dell.julia\juliaup\julia-1.11.2+0.x64.w64.mingw32\share\julia\stdlib\v1.11\Distributed\src\process_messages.jl:121
CompositeException(Any[ProcessExitedException(3)])
Worker 1: Data check: ArbFloat{99} with value: 3.1415926535897932384626433832795028842
Worker 1: Data check: ArbFloat{100} with value: 3.1415926535897932384626433832795028842

using ArbNumerics, Distributed
addprocs(2);
@everywhere dn = 90
@everywhere using ArbNumerics
@everywhere setworkingprecision(ArbFloat,bits=dn)
a = ArbFloat(pi)
@everywhere function check_data(x)
    println("Worker $(myid()): Data check: ", typeof(x), " with value: ", x)
end

@everywhere check_data($a)
for i = 1:10
    @everywhere setworkingprecision(ArbFloat,bits=dn+$i)
    a = ArbFloat(pi)
    try
        @everywhere check_data($a)
    catch e
        println("The error happened at bits = $(dn+i)! ",e)
    end
end

With BigFloat, no error occurs, so the problem seems to originate in the ArbNumerics package.

Hi, and welcome to the Julia community!

The Julia machinery concerning serialisation does not seem to be properly implemented in ArbNumerics.jl (which relies on ccalls). For example,

using Distributed, ArbNumerics

setworkingprecision(10_000)

filename = tempname()
Distributed.serialize(filename, ArbFloat(pi))
println("ArbFloat: ",  filesize(filename))  # 89

Distributed.serialize(filename, BigFloat(pi, precision=10_000))
println("BigFloat: ",  filesize(filename))  # 1324

So the roughly 10 000 bits required to represent ArbFloat(pi), which need at least 1250 bytes, are supposedly stored in only 89 bytes. The file cannot contain the full value, so something goes wrong here. In contrast, BigFloat uses a reasonable number of bytes (a bit more than 1250).

(BigFloat also relies on ccalls, but seeing it is a core part of Julia, I imagine proper care has indeed been taken to make it fit into the rest of the framework.)
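The BigFloat half of this comparison can also be checked in memory with only the standard library. This is a minimal sketch: the exact byte count may differ between Julia versions, so it only compares against the 1250-byte lower bound and confirms that the value survives a round trip.

```julia
using Serialization

x = BigFloat(pi, precision=10_000)

# Serialize into an in-memory buffer instead of a temporary file.
io = IOBuffer()
serialize(io, x)
nbytes = position(io)            # number of bytes the serializer wrote

# Read the value back from the start of the buffer.
seekstart(io)
y = deserialize(io)

println("BigFloat bytes: ", nbytes)
println("covers all mantissa bits: ", nbytes >= 10_000 ÷ 8)
println("round trip ok: ", y == x && precision(y) == 10_000)
```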


I’m not sure how ArbFloat works internally, but you can see that the dump output changes drastically at 98 bits and no longer represents the value itself:

julia> for nbits = 96:99
           setworkingprecision(ArbFloat, bits=nbits)
           dump(ArbFloat(pi))
       end
ArbFloat{96}
  exp: Int64 2
  size: UInt64 0x0000000000000004
  d1: UInt64 0xc4c6628b80dc1cd1
  d2: UInt64 0xc90fdaa22168c234
ArbFloat{97}
  exp: Int64 2
  size: UInt64 0x0000000000000004
  d1: UInt64 0xc4c6628b80dc1cd1
  d2: UInt64 0xc90fdaa22168c234
ArbFloat{98}
  exp: Int64 2
  size: UInt64 0x0000000000000006
  d1: UInt64 0x0000000000000003
  d2: UInt64 0x000001bce4e04ba0
ArbFloat{99}
  exp: Int64 2
  size: UInt64 0x0000000000000006
  d1: UInt64 0x0000000000000003
  d2: UInt64 0x000001bce4e04b80

julia> (big(0xc4c6628b80dc1cd1) + big(0xc90fdaa22168c234) << 64) / big(2)^(128 - 2)  # presumed interpretation of the ArbFloat{97} bits
3.141592653589793238462643383279502884195286358297445035858198682759360371345934
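One possible reading of this dump, assuming ArbFloat mirrors the layout of Arb's arf_struct (an assumption, not something confirmed in this thread): the size field holds the number of 64-bit limbs shifted left by one, with the sign in the lowest bit, and mantissas of up to two limbs are stored inline in d1 and d2, while longer ones keep an allocation count and a raw address there instead. The helper names below are hypothetical, and the sketch only decodes the values printed above.

```julia
# Assumed encoding (modelled on Arb's arf_struct, not verified against ArbNumerics):
#   size = (number_of_limbs << 1) | sign_bit
limb_count(size::UInt64) = Int(size >> 1)
is_negative(size::UInt64) = isodd(size)
stored_inline(size::UInt64) = limb_count(size) <= 2   # assumed inline capacity: two limbs

# Rebuild the value from two inline limbs: a mantissa in [1/2, 1) scaled by 2^exp.
function inline_value(exp::Int, d1::UInt64, d2::UInt64)
    setprecision(BigFloat, 256) do
        mantissa = (big(d1) + (big(d2) << 64)) / big(2)^128
        return mantissa * big(2.0)^exp
    end
end

println(limb_count(0x0000000000000004))     # size field of ArbFloat{96} and ArbFloat{97}
println(limb_count(0x0000000000000006))     # size field of ArbFloat{98} and ArbFloat{99}
println(stored_inline(0x0000000000000006))  # if false, d2 would be an address, not data

v = inline_value(2, 0xc4c6628b80dc1cd1, 0xc90fdaa22168c234)
println(abs(v - big(pi)) < big(2.0)^-120)   # do the inline limbs encode pi?
```

Under that reading, the 98-bit value is the first one whose mantissa no longer fits inline, which would explain why the dump stops showing the digits of pi at exactly that precision.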

Yes, the problem seems to be that ArbNumerics needs to overload Serialization.serialize for ArbFloat (similar to e.g. how BigInt is serialized).

Right now, it is using the default serialize method, which just writes the contents of the Julia struct. Since this probably contains a C pointer, that won’t work.

I filed an issue for further discussion of this: "implement Serialization serialize/deserialize" (Issue #77, JeffreySarnoff/ArbNumerics.jl on GitHub).
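To illustrate what such an overload could look like, here is a minimal sketch for a hypothetical PtrBacked type that keeps its data behind a C pointer. It is not ArbNumerics's code and says nothing about how the package will eventually solve this; it only follows the pattern the Serialization standard library uses for BigInt (write the type tag, then a pointer-free payload).

```julia
using Serialization

# Hypothetical stand-in for a struct whose data lives behind a C pointer.
mutable struct PtrBacked
    len::Int
    ptr::Ptr{UInt64}
end

# Allocate C memory, copy the words into it, and free it when the object is collected.
function PtrBacked(words::Vector{UInt64})
    p = convert(Ptr{UInt64}, Libc.malloc(sizeof(words)))
    GC.@preserve words unsafe_copyto!(p, pointer(words), length(words))
    x = PtrBacked(length(words), p)
    finalizer(v -> Libc.free(v.ptr), x)
    return x
end

# Copy the pointed-to words back into an ordinary Julia array.
function payload(x::PtrBacked)
    GC.@preserve x begin
        return [unsafe_load(x.ptr, i) for i in 1:x.len]
    end
end

# Send the data itself rather than the process-local address.
function Serialization.serialize(s::AbstractSerializer, x::PtrBacked)
    Serialization.serialize_type(s, PtrBacked)
    serialize(s, payload(x))
end

# Rebuild the object with a fresh allocation in the receiving process.
function Serialization.deserialize(s::AbstractSerializer, ::Type{PtrBacked})
    words = deserialize(s)::Vector{UInt64}
    return PtrBacked(words)
end

# Round trip through an in-memory buffer.
x = PtrBacked(UInt64[0xc4c6628b80dc1cd1, 0xc90fdaa22168c234, 0x0000000000000003])
io = IOBuffer()
serialize(io, x)
seekstart(io)
y = deserialize(io)
println(payload(y) == payload(x))   # same data
println(y.ptr != x.ptr)             # different allocation
```

The design choice is that only a plain Vector{UInt64} ever crosses the wire, so the receiving process never sees an address that is only valid in the sender.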


Thank you! We might get feedback from the developer after Christmas.

Jeff has responded in the distributed channel on the Julia Slack. He is on vacation and will not be back until mid-January.