Sparse DataFrame

DataFrame(sparse_matrix) gives me:

ArgumentError: a 'SparseMatrixCSC{Float64,Int64}' is not a table; see `?Tables.table` for ways to treat an AbstractMatrix as a table

I can’t find documentation on how to load a sparse matrix into a DataFrame without inflating it into a dense matrix (which I’d rather avoid).

What do you want your DataFrame to look like?

Zeros where the matrix doesn’t have entries – but without occupying 30GB of RAM

I was about to reply that this probably won’t be possible and to ask about your use case (30GB implies quite a lot of rows and/or columns, and sparse data of that size is probably a pain to handle outside the SparseArrays infrastructure). But then I thought: this is Julia after all, so why wouldn’t two packages just work™?

You can think of a DataFrame as a collection of columns, with each column being a (named) Julia vector (any AbstractVector will do), so you can construct a DataFrame directly from the columns of a sparse matrix:

julia> using DataFrames, SparseArrays

julia> I = [1, 4, 3, 10_000_000]; J = [4, 7, 200_000, 9]; V = [1, 2, -5, 6];

julia> S = sparse(I, J, V);

julia> df = DataFrame(["x$i" => c for (i, c) ∈ enumerate(eachcol(S))]...)
10000000×200000 DataFrame
(...)

And just to show that I’m not simply using a machine with 2TB of RAM:

julia> rand(0:5, 10_000_000, 200_000)
ERROR: OutOfMemoryError()
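
For a sense of scale, here is a rough back-of-the-envelope sketch of the two footprints. It assumes 8-byte Int64 elements (the default on 64-bit Julia) and uses Base.summarysize as an approximate measure of what the sparse matrix occupies. The names dense_bytes and sparse_bytes are just illustrative.

```julia
using SparseArrays

m, n = 10_000_000, 200_000

# Dense storage needs one 8-byte Int64 per cell, whether it is zero or not.
dense_bytes = m * n * sizeof(Int64)

# CSC storage needs one pointer per column plus a row index and a value
# per stored entry, so the column count dominates when almost nothing is stored.
S = sparse([1, 4, 3, m], [4, 7, n, 9], [1, 2, -5, 6])
sparse_bytes = Base.summarysize(S)

println("dense:  ", dense_bytes / 1e12, " TB")
println("sparse: ", sparse_bytes / 1e6, " MB")
```

The dense version comes out at about 16 TB, which is why the rand call above runs out of memory, while the sparse matrix stays in the megabyte range.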

Here’s an example wrapping an existing sparse matrix in a data frame:

julia> using DataFrames, SparseArrays, LinearAlgebra  # LinearAlgebra provides I

julia> A = sparse(2I(4))
4×4 SparseMatrixCSC{Int64, Int64} with 4 stored entries:
 2  ⋅  ⋅  ⋅
 ⋅  2  ⋅  ⋅
 ⋅  ⋅  2  ⋅
 ⋅  ⋅  ⋅  2

julia> df = DataFrame([eachcol(A)...], :auto, copycols=false)
4×4 DataFrame
 Row │ x1     x2     x3     x4    
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     2      0      0      0
   2 │     0      2      0      0
   3 │     0      0      2      0
   4 │     0      0      0      2

julia> df.x1
4-element view(::SparseMatrixCSC{Int64, Int64}, :, 1) with eltype Int64:
 2
 0
 0
 0

You can remove copycols=false to have the DataFrame constructor automatically copy each column to a new SparseVector.
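
To make the difference concrete, here is a small sketch (the names wrapped and copied are just for illustration). With copycols=false the columns are views into A, so writing to a column also changes the matrix. With the default, each column is an independent copy.

```julia
using DataFrames, SparseArrays, LinearAlgebra

A = sparse(2I(4))
cols = [eachcol(A)...]

wrapped = DataFrame(cols, :auto, copycols=false)  # columns are views into A
copied  = DataFrame(cols, :auto)                  # columns are copied on construction

wrapped.x1[2] = 7   # writes through the view, so A[2, 1] changes too
copied.x2[1] = 9    # only the copy changes, A[1, 2] stays zero

println(A[2, 1])
println(A[1, 2])
```

So the wrapping approach costs no extra memory, but anything that mutates the data frame’s columns also mutates the original matrix.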

This is exactly the way to do it. The only cost is that some functions defined in DataFrames.jl will not work correctly as they assume that you can resize the vectors.

Hi! What do you mean by “not work correctly”: should I expect exceptions or sneaky silent bugs?

Exceptions. For example, if you call push! or append!, you will get an exception because the vector is not resizable.

Cross-posted from Slack: here is one instance where it doesn’t work.

julia> using DataFramesMeta, SparseArrays;

julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     0
   3 │     0
   4 │     0
   5 │     1

julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
  deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
  deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
  deleteat!(::BitVector, ::Any) at bitarray.jl:981
  ...
Stacktrace:
 [1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
 [2] delete!(df::DataFrame, inds::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
 [3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
 [4] top-level scope
   @ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681

So no, it won’t be a sneaky bug.
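
If you hit this, two possible workarounds are sketched below. Neither comes from the DataFrames.jl docs, and the names kept and dense are only illustrative. The first selects rows by index, which builds a new data frame instead of deleting rows in place. The second converts the column to a dense Vector first, which only makes sense when the column fits in memory.

```julia
using DataFrames, SparseArrays

df = DataFrame(x = sparse([1, 0, 0, 0, 1]))

# Option 1: index the rows you want to keep. This creates a new DataFrame
# and does not call deleteat! on the sparse column.
kept = df[findall(==(1), df.x), :]

# Option 2: densify the column, after which the in-place subset! can
# resize it as usual.
dense = DataFrame(x = Vector(df.x))
subset!(dense, :x => ByRow(==(1)))

println(nrow(kept))
println(nrow(dense))
```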

Just referencing the GitHub issue https://github.com/JuliaData/CategoricalArrays.jl/issues/374