Disable SentinelArrays for CSV.read

I’m reading a large CSV file containing only Float64 values and then applying many possible transformations before doing a least-squares fit.

I noticed that the column type is not a plain Vector{Float64}, despite there being no missing values:

julia> d = CSV.read("/data/m4.csv", NamedTuple)
julia> typeof(d)
NamedTuple{(:hu, :hc, :hf, :hs, :bu, :bs, :cs), NTuple{7, SentinelArrays.ChainedVector{Float64, Vector{Float64}}}}

If I manually convert to plain Vector{Float64} columns (by the way, is there an easier way to do this?), I get a 28% speed-up in my transformation code:

julia> typeof((bs=d.bs[:], bu=d.bu[:], cs=d.cs[:], hc=d.hc[:], hf=d.hf[:], hs=d.hs[:], hu=d.hu[:]))
NamedTuple{(:bs, :bu, :cs, :hc, :hf, :hs, :hu), NTuple{7, Vector{Float64}}}

Is there any way to tell CSV.read to not use SentinelArrays so I don’t have to do the conversion myself?
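
(One simpler conversion that should work, though I haven’t benchmarked it against the per-column indexing above: map collect over the NamedTuple, which keeps the field names and materialises each column as a plain Vector{Float64}.)

julia> d2 = map(collect, d);  # each field should now be a plain Vector{Float64}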

EDIT:
As a workaround, I’m using this:

d = JuliaDB.loadtable("/data/m4.csv").columns.columns

I hit the same issue, and have developed a very efficient - but also highly fragile - solution.

If you look at the layout of a SentinelArray, there is a data field that holds the underlying vector, so I simply do fd[!,col] = fd[!,col].data. However, if CSV is reading multithreaded, you actually end up with a set of SentinelArrays packaged within a ChainedVector. To handle both situations, it’s easiest to use multiple dispatch to make sure the “de_sentinelisation” happens correctly:

using DataFrames, SentinelArrays

# _de_sentinelise digs into Sentinel & Chained vectors and "pulls up" the Vector under the Sentinel
function _de_sentinelise(v::AbstractVector)
    # @info "In AbstractVector with type $(typeof(v))"
    return v
end

function _de_sentinelise(v::SentinelVector{T}) where {T<:AbstractFloat}
    # @info "In SentinelVector with type $(typeof(v))"
    return v.data
end

function _de_sentinelise(v::ChainedVector)
    # @info "In ChainedVector with type $(typeof(v))"
    a = v.arrays
    (eltype(a) <: SentinelVector) || return v
    n = Vector{typeof(a[1].data)}(undef, length(a))
    for i in eachindex(n)
        #= @inbounds =# n[i] = _de_sentinelise(a[i]) # premature optimisation - take the safety instead!
    end
    return ChainedVector(n)
end

function de_sentinelise!(df::DataFrame)
    for (i,col) in enumerate(eachcol(df))
        df[!,i] = _de_sentinelise(col)
    end
    return nothing
end

function de_sentinelise!(df::DataFrame, cols)
    for col in cols
        df[!,col] = _de_sentinelise(df[!,col])
    end
    return nothing
end

de_sentinelise!(df)

I recently came across the same problem, thank you for the solution @dereksz

However, I managed to simplify it by using collect instead

function de_sentinelise!(df::DataFrame)
    for (i,col) in enumerate(eachcol(df))
        df[!,i] = collect(col)
    end
    return df
end

I think this is equivalent to mapcols!(collect, df)?

But I couldn’t find a simple way to make one of these data frames with SentinelArrays columns. It would be nice to have an MWE to produce one.
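
(For testing, something like the following should construct one by hand; this is a sketch assuming the SentinelVector{T}(undef, n) and ChainedVector constructors from SentinelArrays, and it may not match the exact layout CSV.jl produces:)

using DataFrames, SentinelArrays

# build two sentinel-wrapped chunks and chain them, as a multithreaded read would
sv1 = SentinelVector{Float64}(undef, 2); sv1 .= [1.0, 2.0]
sv2 = SentinelVector{Float64}(undef, 2); sv2 .= [3.0, 4.0]

# copycols=false keeps the ChainedVector as the column instead of copying it
df = DataFrame(x = ChainedVector([sv1, sv2]), copycols = false)

mapcols!(collect, df)   # columns should now be plain Vector{Float64}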


I’m not at my computer to check, but that sounds right, and very elegant!

Unfortunately I can’t share the dataset, sorry, but it happened for me when the CSV got suitably large (thousands of rows). The columns became chained sentinel arrays, and GLM complained.

This will be fixed soon.

In general, if you want to avoid SentinelVector columns (which in my experience often makes sense), tell CSV.read to use a single thread when reading the data in (it will then be a bit slower, but usually acceptably fast).
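
(Concretely, something like this; a sketch, and note that the keyword depends on your CSV.jl version, ntasks in newer releases vs. threaded in older ones:)

using CSV, DataFrames

# reading on a single task avoids the per-thread chunks that get wrapped in a ChainedVector
df = CSV.read("/data/m4.csv", DataFrame; ntasks = 1)
# on older CSV.jl versions: CSV.read("/data/m4.csv", DataFrame; threaded = false)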


Oh brilliant, thank you :smile:

Was this fixed? If so, I’m curious how it was fixed.

AFAICT it was fixed by @andreasnoack in https://github.com/JuliaStats/GLM.jl/pull/446 (and related PRs) :smiley:, but I have not tested this myself.