Squeeze/compress element type of array

Is there any existing functionality that, given an array, returns a new array with the smallest eltype that can represent the data? For example:

julia> squeezeeltype([1,2,3])
3-element Array{UInt8,1}:
 0x01
 0x02
 0x03

It seems quite straightforward, if a bit messy, to implement a ‘good enough’ version for ints and maybe floats, e.g. by supplying a set of candidate types to check against (and a tolerance for floats).

My use case is basically DataFrames built from multiple files read using Distributed, and I’m hoping that by squeezing the element type I can return more data to the host process without running out of memory.

The fact that my searches turned up nothing makes me think that this might not be a meaningful thing to do, though.

You can use type_compress from JDF.jl, e.g.:

using JDF

x = type_compress([1,2,3])

Thanks, that looks like more or less exactly what I wanted.

Since the JDF package is a bit large if this is all you want, you can also use something like this:

"""
    squeezeeltype(x; tol_kw...)

Return a collection whose element type is the narrowest one that can
hold all of its elements. Only really meant to work on `Number` types.
Keyword arguments are passed to `isapprox`, which is used to
decide whether a narrower floating point type is close enough.
"""
function squeezeeltype(x; tol_kw...)
    T = mapreduce(y->_mintype(y; tol_kw...), promote_type, x)
    convert.(T, x)
end

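function _mintype(x::AbstractFloat; tol_kw...)
    # NaN and ±Inf exist in every float type, so the smallest one will do
    (isnan(x) || isinf(x)) && return Float16
    for T in (Float16, Float32, Float64)
        # T must hold x without overflowing, and the value stored as T
        # must agree with x to within the isapprox tolerance
        abs(x) <= floatmax(T) &&
        isapprox(T(x), x; tol_kw...) &&
        return T
    end
    BigFloat
end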


function _mintype(x::Integer; kw...)
    # strictly positive values try the unsigned types;
    # zero and negative values try the signed ones
    x > zero(x) ?
    _mintype_int(x, (UInt8, UInt16, UInt32, UInt64, UInt128)) :
    _mintype_int(x, (Int8, Int16, Int32, Int64, Int128))
end

function _mintype_int(x::Integer, Ts)
    for T in Ts
       typemin(T) <= x <= typemax(T) && return T
    end
    BigInt
end

# fallback
_mintype(x; kw...) = typeof(x)

julia> squeezeeltype([1, -3.0, 2e6, missing, NaN], rtol = 0.0001)
5-element Array{Union{Missing, Float32},1}:
     1.0f0
    -3.0f0
     2.0f6
      missing
 NaN32
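
One caveat on the float path, as far as I can tell: when no keywords are passed, isapprox(T(x), x) falls back to a default relative tolerance taken from the less precise of the two types, which for Float16 is around sqrt(eps(Float16)) ≈ 0.03. That is quite lossy, so you probably want to pass rtol or atol explicitly as in the example above. If you want no loss at all, an exact round-trip check is an alternative. A minimal sketch, where mintype_exact is a hypothetical name and not part of the code above:

```julia
# Smallest float type that stores x exactly (no tolerance involved).
# `mintype_exact` is an illustrative helper, not part of the posted code.
function mintype_exact(x::AbstractFloat)
    isnan(x) && return Float16          # NaN exists in every float type
    for T in (Float16, Float32, Float64)
        # Out-of-range values become ±Inf in T, so the comparison fails
        # for them unless x itself is infinite.
        T(x) == x && return T
    end
    return typeof(x)                    # e.g. a BigFloat that needs its full precision
end
```

With this, values like 0.5 or 2e6 still shrink (to Float16 and Float32 respectively), while something like 0.1 stays Float64 because no narrower type stores it exactly.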

julia> squeezeeltype([1, 2, 3])
3-element Array{UInt8,1}:
 0x01
 0x02
 0x03
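
Something to watch out for with the elementwise approach: I believe an array that mixes negative and positive integers, such as [-1, 200], ends up with Int8 for one element and UInt8 for the other, and promote_type of those two cannot hold both values, so the final convert throws an InexactError. For plain integer arrays, a way around this is to pick the type from the extrema of the whole collection. A rough sketch under that assumption (squeeze_ints is a made-up name, and this handles integer arrays only, no missing values):

```julia
# Pick the narrowest integer type from the minimum and maximum of the array.
# `squeeze_ints` is a hypothetical helper, not part of the code above.
function squeeze_ints(x::AbstractArray{<:Integer})
    isempty(x) && return x              # nothing to inspect
    lo, hi = extrema(x)
    # Any negative value forces a signed type for the whole array.
    candidates = lo < 0 ? (Int8, Int16, Int32, Int64, Int128) :
                          (UInt8, UInt16, UInt32, UInt64, UInt128)
    for T in candidates
        typemin(T) <= lo && hi <= typemax(T) && return convert.(T, x)
    end
    return convert.(BigInt, x)          # too large for any fixed-width type
end
```

Here squeeze_ints([-1, 200]) comes back with eltype Int16 rather than erroring. On the memory side, sizeof(squeeze_ints(collect(1:100))) is 100 bytes, compared with 800 bytes for the same data stored as Int64.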

Notes:

  1. This might be a prime use case for @nospecialize, but I’m really no expert.
  2. You could make the argument that round floats should return an Integer type. You can slightly modify the AbstractFloat case to do this using isinteger(x).
  3. You can also surely do the above (probably faster) using the bit representation of the numbers (i.e. checking which is the first nonzero byte with bit shifts, etc.) but that’s left as an exercise for the reader :stuck_out_tongue_winking_eye:
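
For what it's worth, notes 2 and 3 could look roughly like the following. This is only a sketch: mintype_roundfloat and mintype_bits are hypothetical names, the bit-based version covers unsigned inputs only, and I have not benchmarked it against the loop over types.

```julia
const UNSIGNED_TYPES = (UInt8, UInt16, UInt32, UInt64, UInt128)
const SIGNED_TYPES = (Int8, Int16, Int32, Int64, Int128)

# Note 2 (sketch): an integer-valued float maps to an integer type.
# isinteger is false for NaN and ±Inf, so those keep their float type here.
function mintype_roundfloat(x::AbstractFloat)
    isinteger(x) || return typeof(x)
    for T in (x < 0 ? SIGNED_TYPES : UNSIGNED_TYPES)
        typemin(T) <= x <= typemax(T) && return T
    end
    return BigInt
end

# Note 3 (sketch): use the bit representation instead of range checks.
# The number of significant bits is the width minus the leading zeros.
function mintype_bits(x::Unsigned)
    nbits = 8 * sizeof(x) - leading_zeros(x)
    for T in UNSIGNED_TYPES
        nbits <= 8 * sizeof(T) && return T
    end
    return typeof(x)                    # wider than 128 bits, leave as is
end
```

For example, mintype_roundfloat(-3.0) gives Int8 while mintype_roundfloat(2.5) stays Float64, and mintype_bits(UInt64(256)) gives UInt16 because 256 needs nine bits.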