Disable SentinelArrays for CSV.read

robsmith11 · February 8, 2021, 2:14pm

I’m reading a large CSV file of many Float64 values and then applying many possible transformations before doing a least squares fit.

I noticed that the column type is not a plain Vector{Float64}, despite there being no missing values:

julia> d = CSV.read("/data/m4.csv", NamedTuple)
julia> typeof(d)
NamedTuple{(:hu, :hc, :hf, :hs, :bu, :bs, :cs), NTuple{7, SentinelArrays.ChainedVector{Float64, Vector{Float64}}}}

If I manually convert the columns to plain Vector{Float64} (by the way, is there an easier way to do this?), I get a 28% speedup in my transformation code:

julia> typeof((bs=d.bs[:], bu=d.bu[:], cs=d.cs[:], hc=d.hc[:], hf=d.hf[:], hs=d.hs[:], hu=d.hu[:]))
NamedTuple{(:bs, :bu, :cs, :hc, :hf, :hs, :hu), NTuple{7, Vector{Float64}}}

Is there any way to tell CSV.read to not use SentinelArrays so I don’t have to do the conversion myself?

EDIT:
As a workaround, I’m using this:

d = JuliaDB.loadtable("/data/m4.csv").columns.columns
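On the question of an easier conversion than slicing every column with [:], one possible approach is to map collect over the NamedTuple. The following is a minimal sketch, not something taken from the thread at this point: the column names and values are made up, and the ChainedVector columns are built by hand as a stand-in for what CSV.read returned.

```julia
using SentinelArrays

# Stand-in for the CSV.read result (made-up data): a NamedTuple whose
# columns are ChainedVectors built from several chunks.
d = (hu = ChainedVector([[1.0, 2.0], [3.0]]),
     hc = ChainedVector([[4.0], [5.0, 6.0]]))

# map over a NamedTuple keeps the field names; collect copies each
# column into a plain Vector.
plain = map(collect, d)
```

Because collect copies the data, this costs one extra allocation per column, which is the same trade-off as the d.bs[:] version.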

dereksz · June 23, 2021, 8:45am

I hit the same issue, and have developed a very efficient - but also highly fragile - solution.

If you look at the layout of SentinelArray, there is a data field that holds the underlying vector, so I simply do fd[!,col] = fd[!,col].data. However, if CSV reads the file multithreaded, you actually end up with a set of SentinelVectors packaged within a ChainedVector. To handle both situations, it’s easiest to use multiple dispatch to make sure the “de_sentinelization” happens correctly:

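using DataFrames, SentinelArrays  # DataFrame; SentinelVector and ChainedVector

# _de_sentinelise digs into Sentinel & Chained Vectors and "pulls up" the Vector under the Sentinel
# NOTE: only safe when the column has no missing values, since .data stores sentinel values in their place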
function _de_sentinelise(v::AbstractVector)
    # @info "In AbstractVector with type $(typeof(v))"
    return v
end

function _de_sentinelise(v::SentinelVector{T}) where {T<:Base.AbstractFloat}
    # @info "In SentinelArray with type $(typeof(v))"
    return v.data
end

function _de_sentinelise(v::ChainedVector)
    # @info "In ChainedVector with type $(typeof(v))"
    a = v.arrays
    (eltype(a) <: SentinelVector) || return v
    n = Vector{typeof(a[1].data)}(undef, length(a))
    for i in eachindex(n)
        #= @inbounds =# n[i] = _de_sentinelise(a[i]) # premature optimisation - take the safety instead!
    end
    return ChainedVector(n)
end

function de_sentinelise!(df::DataFrame)
    for (i,col) in enumerate(eachcol(df))
        df[!,i] = _de_sentinelise(col)
    end
    return nothing
end

function de_sentinelise!(df::DataFrame, cols)
    for col in cols
        df[!,col] = _de_sentinelise(df[!,col])
    end
    return nothing
end

de_sentinelise!(df)

jondea · January 21, 2022, 10:30am

I recently came across the same problem. Thank you for the solution, @dereksz.

However, I managed to simplify it by using collect instead:

function de_sentinelise!(df::DataFrame)
    for (i,col) in enumerate(eachcol(df))
        df[!,i] = collect(col)
    end
    return df
end

sijo · January 21, 2022, 1:29pm

I think this is equivalent to mapcols!(collect, df)?

But I couldn’t find a simple way to make one of these data frames with SentinelArrays columns. It would be nice to have a MWE to produce one.
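One way to get such a data frame without a large CSV file might be to build a ChainedVector by hand and put it into a DataFrame without copying. This is only a sketch under that assumption: the data is made up, and the column is not an actual CSV.read result, just the same container type.

```julia
using DataFrames, SentinelArrays

# Made-up data: two chunks chained together, similar in spirit to what a
# multithreaded read produces (an assumption, not a real CSV.read result).
col = ChainedVector([[1.0, 2.0], [3.0, 4.0]])

# copycols = false keeps the ChainedVector as-is instead of copying it.
df = DataFrame([col], [:x]; copycols = false)
before = typeof(df.x)

# Replace every column with a plain Vector, in place.
mapcols!(collect, df)
after = typeof(df.x)
```

Comparing before and after shows whether the column type actually changed from a ChainedVector to a plain Vector{Float64}.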

jondea · January 21, 2022, 4:50pm

I’m not at my computer to check, but that sounds right, and very elegant!

Unfortunately, I can’t share the dataset, sorry, but it happened for me once the CSV got large enough (thousands of rows). The columns became chained sentinel arrays, and GLM complained.

bkamins · January 21, 2022, 5:40pm

This will be fixed soon.

In general, if you want to avoid SentinelVector (which often makes sense in my experience), set CSV.read to use a single thread when reading the data (it will then be a bit slower, but usually acceptably fast).
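A minimal sketch of the single-threaded read, assuming a CSV.jl version that accepts the ntasks keyword (older releases used threaded=false instead, so check the docs for the installed version). The file contents here are made up and written to a temporary path.

```julia
using CSV, DataFrames

# Write a small made-up CSV file to a temporary location.
path = tempname() * ".csv"
open(path, "w") do io
    println(io, "x,y")
    for i in 1:10
        println(io, i * 0.5, ",", i * 2.0)
    end
end

# ntasks = 1 asks CSV.jl to parse the file in a single task, so the
# columns are not split into chunks and chained together afterwards.
df = CSV.read(path, DataFrame; ntasks = 1)
```

Calling typeof on a column such as df.x afterwards is a quick way to confirm which container type the read produced.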

jondea · January 21, 2022, 7:06pm

Oh brilliant, thank you

andreasnoack · June 1, 2023, 12:52pm

Was this fixed? If so, I’m curious how it was fixed.

bkamins · June 1, 2023, 1:53pm

AFAICT it was fixed by @andreasnoack in https://github.com/JuliaStats/GLM.jl/pull/446 (and related PRs), but I have not tested this recently.
