Sparse DataFrame

alexlenail · July 5, 2021, 6:13pm

DataFrame(sparse_matrix) gives me:

ArgumentError: a 'SparseMatrixCSC{Float64,Int64}' is not a table; see `?Tables.table` for ways to treat an AbstractMatrix as a table

I can’t find documentation about how to load a sparse matrix into a DataFrame, without inflating my matrix (which I’d rather avoid).

nilshg · July 5, 2021, 6:46pm

What do you want your DataFrame to look like?

alexlenail · July 5, 2021, 6:53pm

Zeros where the matrix doesn’t have entries – but without occupying 30GB of RAM

nilshg · July 6, 2021, 2:04am

I was going to reply that this probably won’t be possible and ask about your use case (30GB implies a lot of rows and/or columns, which, if sparse, are probably a pain to handle outside the SparseArrays infrastructure). But then I thought: this is Julia after all, so why wouldn’t two packages just work together?

You can think of a DataFrame as a collection of columns, each column being a (named) Julia vector (any AbstractVector works), so you can construct a DataFrame from the columns of a sparse matrix:

julia> using DataFrames, SparseArrays

julia> I = [1, 4, 3, 10_000_000]; J = [4, 7, 200_000, 9]; V = [1, 2, -5, 6];

julia> S = sparse(I, J, V);

julia> df = DataFrame(["x$i" => c for (i, c) ∈ enumerate(eachcol(S))]...)
10000000×200000 DataFrame
(...)

And just to show that I’m not using a machine with 2TB of RAM:

julia> rand(0:5, 10_000_000, 200_000)
ERROR: OutOfMemoryError()

sijo · July 6, 2021, 6:22am

Here’s an example wrapping an existing sparse matrix in a data frame:

julia> using DataFrames, SparseArrays, LinearAlgebra

julia> A = sparse(2I(4))
4×4 SparseMatrixCSC{Int64, Int64} with 4 stored entries:
 2  ⋅  ⋅  ⋅
 ⋅  2  ⋅  ⋅
 ⋅  ⋅  2  ⋅
 ⋅  ⋅  ⋅  2

julia> df = DataFrame([eachcol(A)...], :auto, copycols=false)
4×4 DataFrame
 Row │ x1     x2     x3     x4    
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     2      0      0      0
   2 │     0      2      0      0
   3 │     0      0      2      0
   4 │     0      0      0      2

julia> df.x1
4-element view(::SparseMatrixCSC{Int64, Int64}, :, 1) with eltype Int64:
 2
 0
 0
 0

You can remove copycols=false to have the DataFrame constructor automatically copy each column to a new SparseVector.
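
A minimal sketch of what that difference means in practice, assuming the same 4×4 matrix as above (the names `df_view` and `df_copy` are only for illustration): with `copycols=false` the columns are views into `A`, so later changes to `A` show up in the data frame, while the default constructor takes its own copies.

```julia
using DataFrames, SparseArrays, LinearAlgebra

A = sparse(2I(4))

# Columns are views into A: no copy, memory is shared with the matrix.
df_view = DataFrame([eachcol(A)...], :auto, copycols=false)

# Default behaviour: each column is copied when the frame is built.
df_copy = DataFrame([eachcol(A)...], :auto)

A[1, 1] = 10  # overwrite an already-stored entry of the matrix

println(df_view.x1[1])     # follows the change made to A
println(df_copy.x1[1])     # keeps the value from construction time
println(typeof(df_copy.x1))
```

Which one you want depends on whether the matrix stays alive and unchanged for as long as the data frame is in use.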

bkamins · July 6, 2021, 7:02am

This is exactly the way to do it. The only cost is that some functions defined in DataFrames.jl will not work correctly as they assume that you can resize the vectors.

gdalle · November 3, 2021, 12:51pm

Hi! What do you mean by “not work correctly”: should I expect exceptions or sneaky silent bugs?

bkamins · November 3, 2021, 1:09pm

Exceptions. If you e.g. do push! or append! you will get an exception that the vector is not resizable.
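
A small sketch of that failure mode, using a toy one-column frame. The workaround at the end (replacing the column with a dense `Vector`) is an assumption added here for illustration, not something recommended in this thread, and it gives up the memory savings for that column:

```julia
using DataFrames, SparseArrays

df = DataFrame(x = sparse([1, 0, 0, 0, 1]))

# Adding a row needs a resizable column, which SparseVector is not,
# so this is expected to throw rather than silently misbehave.
threw = try
    push!(df, (x = 1,))
    false
catch err
    println("push! failed with: ", typeof(err))
    true
end

# Possible workaround (assumption): densify the column first.
df.x = Vector(df.x)
push!(df, (x = 1,))
println(nrow(df))
```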

pdeffebach · November 3, 2021, 1:17pm

Cross-posting from Slack: here is one instance where it doesn’t work.

julia> using DataFramesMeta, SparseArrays;

julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     0
   3 │     0
   4 │     0
   5 │     1

julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
  deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
  deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
  deleteat!(::BitVector, ::Any) at bitarray.jl:981
  ...
Stacktrace:
 [1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
 [2] delete!(df::DataFrame, inds::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
 [3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
 [4] top-level scope
   @ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681

So no, it won’t be a sneaky bug.

gdalle · November 3, 2021, 1:39pm

Just referencing the GitHub issue https://github.com/JuliaData/CategoricalArrays.jl/issues/374
