Squeeze/compress element type of array

DrChainsaw · September 22, 2020, 12:02pm

Is there any existing functionality that, given an array, returns a new array with the smallest eltype that can represent the data? For example:

julia> squeezeeltype([1,2,3])
3-element Array{UInt8,1}:
 0x01
 0x02
 0x03

It seems quite straightforward, if a bit messy, to implement a ‘good enough’ version for ints and maybe floats, e.g. by supplying a set of candidate types to check against (and a tolerance for floats).

My use case is basically dataframes from multiple files read using Distributed, and I’m hoping that by squeezing the element type I can return more data to the host without running out of memory.
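As a rough illustration of the potential saving, here is a small sketch using only Base (the array is made up for the example): the same 200 values take one eighth of the space once stored as UInt8 instead of Int64.

```julia
x = Int64.(1:200)        # 200 values stored as 8 bytes each
y = convert.(UInt8, x)   # the same values stored as 1 byte each

# sizeof on an Array reports the bytes used by its element data
println(sizeof(x), " bytes vs ", sizeof(y), " bytes")   # prints "1600 bytes vs 200 bytes"
```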

The fact that my searches turned up nothing makes me think that this might not be a meaningful thing to do, though.

xiaodai · September 22, 2020, 12:58pm

You can use type_compress from JDF.jl, e.g.

using JDF

x = type_compress([1,2,3])

DrChainsaw · September 23, 2020, 2:06pm

Thanks, that looks like more or less exactly what I wanted.

tomerarnon · September 23, 2020, 2:27pm

Since the JDF package is a bit large if this is all you want, you can also use something like this:

"""
    squeezeeltype(x; tol_kw...)

Return a collection with the narrowest element type that can hold
all of its elements. Only really meant to work on `Number` types.
Keyword arguments are passed to `isapprox`, which is used to
decide whether a floating-point value may be stored at lower precision.
"""
function squeezeeltype(x; tol_kw...)
    T = mapreduce(y->_mintype(y; tol_kw...), promote_type, x)
    convert.(T, x)
end

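function _mintype(x::AbstractFloat; tol_kw...)
    # NaN and ±Inf exist in every float type, so the smallest one will do
    (isnan(x) || isinf(x)) && return Float16
    # Take the first type that can hold x and round-trips within tolerance.
    # The default tolerance of isapprox depends on the precision of its
    # arguments, so pass rtol/atol explicitly for predictable results.
    for T in (Float16, Float32, Float64)
        abs(x) <= floatmax(T) &&
        isapprox(T(x), x; tol_kw...) &&
        return T
    end
    BigFloat
end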


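function _mintype(x::Integer; kw...)
    # Non-negative values (zero included) try the unsigned types first, so that
    # e.g. [0, 200] squeezes to UInt8 instead of promoting Int8/UInt8 to Int16.
    x >= zero(x) ?
    _mintype_int(x, (UInt8, UInt16, UInt32, UInt64, UInt128)) :
    _mintype_int(x, (Int8, Int16, Int32, Int64, Int128))
end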

function _mintype_int(x::Integer, Ts)
    for T in Ts
       typemin(T) <= x <= typemax(T) && return T
    end
    BigInt
end

# fallback
_mintype(x; kw...) = typeof(x)

julia> squeezeeltype([1, -3.0, 2e6, missing, NaN], rtol = 0.0001)
5-element Array{Union{Missing, Float32},1}:
     1.0f0
    -3.0f0
     2.0f6
      missing
 NaN32

julia> squeezeeltype([1, 2, 3])
3-element Array{UInt8,1}:
 0x01
 0x02
 0x03
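
One case the examples above don't exercise is an integer array that mixes negative values with large positive ones. The per-element types then combine as, for example, promote_type(Int8, UInt64), which is UInt64 in Julia, so the final convert throws on the negative element. A possible workaround, sketched here under the assumption that the input is a non-empty array of integers, is to pick the type from the array's extrema instead (squeezeint is a hypothetical name, not part of the code above):

```julia
# Hypothetical helper: choose one integer type from the array's extrema
# rather than promoting per-element types. Assumes a non-empty array.
function squeezeint(x::AbstractArray{<:Integer})
    lo, hi = extrema(x)
    # Any negative value forces a signed type; otherwise unsigned is enough
    candidates = lo < 0 ? (Int8, Int16, Int32, Int64, Int128) :
                          (UInt8, UInt16, UInt32, UInt64, UInt128)
    for T in candidates
        typemin(T) <= lo && hi <= typemax(T) && return convert.(T, x)
    end
    return convert.(BigInt, x)   # nothing fixed-width fits
end

squeezeint([-1, 2^40])   # Vector{Int64}: a signed type wide enough for both values
squeezeint([0, 200])     # Vector{UInt8}
```

This only handles integers; floats and missing values would still need the per-element approach.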

Notes:

This might be a prime use case for @nospecialize, but I’m really no expert.
You could argue that whole-valued floats should return an Integer type. You can slightly modify the AbstractFloat case to do this using isinteger(x).
You can also surely do the above (probably faster) using the bit representation of the numbers (i.e. checking which is the first nonzero byte with bit shifts, etc.), but that’s left as an exercise for the reader.
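
For what it's worth, the last two notes could be sketched roughly as follows. Both function names are hypothetical and neither is wired into squeezeeltype. The first maps whole-valued floats to a signed integer type and leaves everything else at its original float type. The second uses leading_zeros to count significant bits, and assumes a non-negative, fixed-width integer (so not BigInt).

```julia
# Hypothetical variant of the float case: whole-valued floats get an integer type.
# isinteger is false for NaN and ±Inf, so those fall through unchanged.
function mintype_roundfloat(x::AbstractFloat)
    if isinteger(x)
        for T in (Int8, Int16, Int32, Int64, Int128)
            typemin(T) <= x <= typemax(T) && return T
        end
        return BigInt
    end
    return typeof(x)   # non-integer values keep their float type in this sketch
end

# Hypothetical bit-based check for non-negative fixed-width integers.
function minuint_bits(x::Integer)
    x >= 0 || throw(ArgumentError("this sketch only handles non-negative values"))
    nbits = 8 * sizeof(x) - leading_zeros(x)   # number of significant bits
    nbits <= 8  ? UInt8  :
    nbits <= 16 ? UInt16 :
    nbits <= 32 ? UInt32 :
    nbits <= 64 ? UInt64 : UInt128
end

mintype_roundfloat(3.0)   # Int8
mintype_roundfloat(2.5)   # Float64
minuint_bits(256)         # UInt16
```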
