Mean of vector of reshaped reinterpreted matrices gives completely wrong result

Ethan_Diao · July 14, 2023, 4:02am

The code is as follows. I have to create a new matrix from the reshaped reinterpreted matrix to get the right result.

vector_of_arr_1 = [reshape(reinterpret(UInt16, rand(UInt8, 3*3*2)), (3, 3)) for _ in 1:10];
vector_of_arr_2 = [reshape(rand(UInt8, 3*3), (3, 3)) for _ in 1:10];
vector_of_arr_3 = [reinterpret(UInt16, rand(UInt16, 8)) for _ in 1:10]
@assert all(mean(vector_of_arr_1) .== mean(arr->Matrix(arr), vector_of_arr_1)) # fail
@assert all(mean(vector_of_arr_2) .== mean(arr->Matrix(arr), vector_of_arr_2)) # success
@assert all(mean(vector_of_arr_3) .== mean(arr->Vector(arr), vector_of_arr_3)) # success

Ethan_Diao · July 14, 2023, 5:45am

Another example. Maybe mean was not supposed to be used with vector of arrays?

# a vector of [0xffff]
vector_of_arr_4 = [reinterpret(UInt16, fill(UInt8(255), 2)) for _ in 1:10]
# the following assertion fails, the lhs is [6552.6] while the rhs is [65535.0]
@assert all(mean(vector_of_arr_4) .== mean(arr->Vector(arr), vector_of_arr_4)) # fail

Benny · July 14, 2023, 6:01am

There are many ways to compute a mean, and in the one where everything is added before dividing by the number of elements, you’re hitting integer overflows. typemax(UInt16) == 65535, so any time an addition of real numbers go above that, the integer just cycles down.

For some reason, mean(arr->Matrix(arr), vector_of_arr_1) dodges the overflow issue, possibly by converting the values to the result Float64 before adding.

julia> fva1 = convert.(Matrix{Float64}, vector_of_arr_1); # hard to overflow
julia> sum(fva1) # /10 for mean
3×3 Matrix{Float64}:
 360946.0  404489.0  383626.0
 333488.0  332143.0  279101.0
 349868.0  304757.0  279399.0
julia> convert(Matrix{Float64}, sum(vector_of_arr_1)) # overflow!
3×3 Matrix{Float64}:
 33266.0  11273.0  55946.0
  5808.0   4463.0  16957.0
 22188.0  42613.0  17255.0
julia> convert(Matrix{Float64}, sum(arr->Matrix(arr), vector_of_arr_1)) # overflow!
3×3 Matrix{Float64}:
 33266.0  11273.0  55946.0
  5808.0   4463.0  16957.0
 22188.0  42613.0  17255.0
julia> convert(Matrix{Float64}, mean(vector_of_arr_1)) # overflow.
3×3 Matrix{Float64}:
 3326.6  1127.3  5594.6
  580.8   446.3  1695.7
 2218.8  4261.3  1725.5
julia> convert(Matrix{Float64}, mean(arr->Matrix(arr), vector_of_arr_1)) # no overflow?!
3×3 Matrix{Float64}:
 36094.6  40448.9  38362.6
 33348.8  33214.3  27910.1
 34986.8  30475.7  27939.9

The other lines’ mean all dodge the sum overflow issue:

julia> fva2 = convert.(Matrix{Float64}, vector_of_arr_2);
julia> fva3 = convert.(Vector{Float64}, vector_of_arr_3);
julia> all(mean(fva2) .== mean(vector_of_arr_2))
true
julia> all(mean(fva3) .== mean(vector_of_arr_3))
true
julia> all(sum(fva2) .== sum(vector_of_arr_2))
false
julia> all(sum(fva3) .== sum(vector_of_arr_3))
false

It probably is the different types dispatching to different mean methods.

julia> typeof(vector_of_arr_1)
Vector{ReshapedArray{UInt16, 2, ReinterpretArray{UInt16, 1, UInt8, Vector{UInt8}, false}, Tuple{}}} (alias for Array{Base.ReshapedArray{UInt16, 2, Base.ReinterpretArray{UInt16, 1, UInt8, Array{UInt8, 1}, false}, Tuple{}}, 1})
julia> typeof(vector_of_arr_2)
Vector{Matrix{UInt8}} (alias for Array{Array{UInt8, 2}, 1})
julia> typeof(vector_of_arr_3)
Vector{Vector{UInt16}} (alias for Array{Array{UInt16, 1}, 1})

RobertGregg · July 14, 2023, 6:16am

So this is the relevant part of Statistics.jl:

_mean_promote(x::T, y::S) where {T,S} = convert(promote_type(T, S), y)

# ::Dims is there to force specializing on Colon (as it is a Function)
function _mean(f, A::AbstractArray, dims::Dims=:) where Dims
    isempty(A) && return sum(f, A, dims=dims)/0
    if dims === (:)
        n = length(A)
    else
        n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
    end
    x1 = f(first(A)) / 1
    result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims) #problem here
    if dims === (:)
        return result / n
    else
        return result ./= n
    end
end

For some reason, when you call mean without the anonymous function the result variable correctly remains a UInt16, but when you add the arr->Matrix(arr) function, result is prematurely promoted to a Float64.

Ethan_Diao · July 14, 2023, 6:38am

Found the problem. So when the element type is ReinterpretedArray, the result type is inferred as AbstractVector, while when the element type is Vector{UInt16}, the result type is inferred as Vector{Float64}.