Convert collection (Array, DataFrame, ...) to concrete eltype

aplavin · August 29, 2019, 2:53pm

Suppose I have a collection, e.g. a DataFrame, whose eltype is Any but whose elements all have the same concrete type:

df = DataFrame(a=Any[1, 2, 3])

For further processing I need to make it type-stable, but I don’t see how to do that. Is there an obvious way I’m missing here?

Such a situation occurs when reading a “dirty” dataset with all kinds of wrong values and cleaning it afterwards.

kevbonham · August 29, 2019, 3:12pm

Perhaps not the most efficient, but:

julia> df = DataFrame(a=Any[1, 2, 3], b=Any[1., 2, 3])
3×2 DataFrame
│ Row │ a   │ b   │
│     │ Any │ Any │
├─────┼─────┼─────┤
│ 1   │ 1   │ 1.0 │
│ 2   │ 2   │ 2   │
│ 3   │ 3   │ 3   │

julia> for n in names(df)
           df[!,n] = [x for x in df[!,n]]
       end

julia> df
3×2 DataFrame
│ Row │ a     │ b    │
│     │ Int64 │ Real │
├─────┼───────┼──────┤
│ 1   │ 1     │ 1.0  │
│ 2   │ 2     │ 2    │
│ 3   │ 3     │ 3    │
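
The `Real` eltype for column `b` comes from how a comprehension picks its element type when the source eltype is abstract: the result is widened as elements are visited, so a mix of `Float64` and `Int64` values ends up under their common supertype. A minimal sketch on plain vectors, with no DataFrames needed (the variable names are only illustrative):

```julia
# An Any vector stores its values as-is: here Float64, Int64, Int64.
ints  = Any[1, 2, 3]
mixed = Any[1.0, 2, 3]

# Re-collecting through a comprehension infers the narrowest eltype
# that fits every value actually present.
narrow_ints  = [x for x in ints]
narrow_mixed = [x for x in mixed]

println(eltype(narrow_ints) == Int)     # true
println(eltype(narrow_mixed) == Real)   # true

# Broadcasting `identity` narrows the eltype in the same way.
println(eltype(identity.(ints)) == Int) # true
```

So a column only becomes concretely typed if its values really do share one concrete type; otherwise the loop above still leaves an abstract eltype such as `Real`.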

nilshg · August 29, 2019, 3:29pm

I might misunderstand but will this do?

julia> using DataFrames

julia> df = DataFrame(a=Any[1, 2, 3])

julia> df.a = Int64.(df.a)

julia> df
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

or maybe

eltype(df.a[1]).(df.a)

if you want it to be more generic (and can rely on that first value…)

aaowens · August 29, 2019, 6:56pm

I find comprehensions tend to solve this automatically for me:

julia> using DataFrames

julia> df = DataFrame(a=Any[1, 2, 3])
3×1 DataFrame
│ Row │ a   │
│     │ Any │
├─────┼─────┤
│ 1   │ 1   │
│ 2   │ 2   │
│ 3   │ 3   │

julia> [aa for aa in df.a]
3-element Array{Int64,1}:
 1
 2
 3

aplavin · August 29, 2019, 7:19pm

Thanks for the suggestions! For now, comprehensions seem like the best easy choice:

for n in names(df)
    df[!,n] = [x for x in df[!,n]]
end

Explicitly using the type of the first element, as in typeof(df.a[1]).(df.a), is definitely less general. (Note typeof instead of the suggested eltype, so that it also works when the elements are arrays.) E.g. it doesn’t work for Union{..., Nothing}, which is pretty common, or for other small unions, which comprehensions handle well.
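
A small sketch of both points on plain vectors (the names `v`, `w` and `first_type_ok` are made up for illustration):

```julia
v = Any[1, nothing, 3]

# Converting with the type of the first element fails on the `nothing` entry,
# because there is no method Int(::Nothing).
first_type_ok = try
    typeof(v[1]).(v)
    true
catch err
    err isa MethodError ? false : rethrow()
end
println(first_type_ok)                              # false

# The comprehension keeps both element types as a small union.
narrowed = [x for x in v]
println(eltype(narrowed) == Union{Nothing, Int})    # true

# typeof vs eltype when the elements are themselves arrays:
w = Any[[1, 2], [3, 4]]
println(typeof(w[1]) == Vector{Int})   # true: the type of the element itself
println(eltype(w[1]) == Int)           # true: the inner element type, not what we want
```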

For larger datasets where performance is important it would be better to have a helper function to skip columns which already have proper types. Unfortunately, I don’t think it’s possible to determine if the type is correct without checking all values anyway…

aaowens · August 29, 2019, 8:30pm

I wonder if this would be a nice feature to build into DataFrames. Something like narrowtypes!(df), which in its simplest form does your loop but could be made more efficient by skipping any column that already has a concrete type. Like this:

for n in names(df)
    isconcretetype(eltype(df[!, n])) && continue
    df[!,n] = [x for x in df[!,n]]
end
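
The same idea can be sketched without DataFrames. This is only an illustration: `narrow` is a hypothetical helper, and a NamedTuple of vectors stands in for the DataFrame columns.

```julia
# Hypothetical helper: re-collect a vector only when its eltype is not concrete.
narrow(v::AbstractVector) = isconcretetype(eltype(v)) ? v : [x for x in v]

# A NamedTuple of vectors stands in for the columns of a DataFrame.
cols = (a = Any[1, 2, 3], b = [1.0, 2.0, 3.0], c = Any[1, nothing, 3])

# `map` over a NamedTuple applies `narrow` to each column and keeps the names.
narrowed = map(narrow, cols)

println(map(eltype, narrowed))
println(narrowed.b === cols.b)   # true: the already-concrete column is returned as-is
```

Note that `isconcretetype` is false for small unions such as `Union{Int64, Missing}`, so columns with such eltypes would still be re-collected by this check.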
