DataFrames deleterows! broken?

I’m still trying to put together an MWE. I’m running into a weird error, shown below, where df is a DataFrame and d is a Vector{Int64}. Any ideas?

julia> deleterows!(df, d)
ERROR: MethodError: no method matching deleteat!(::Base.ReshapedArray{Union{Missing, Float64},1,Array{Union{Missing, Float64},2},Tuple{}}, ::Array{Int64,1})
Closest candidates are:
  deleteat!(::Array{T,1} where T, ::AbstractArray{T,1} where T) at array.jl:1213
  deleteat!(::Array{T,1} where T, ::Any) at array.jl:1212
  deleteat!(::BitArray{1}, ::Any) at bitarray.jl:940
  ...
Stacktrace:
 [1] (::getfield(DataFrames, Symbol("##72#73")){Array{Int64,1}})(::Base.ReshapedArray{Union{Missing, Float64},1,Array{Union{Missing, Float64},2},Tuple{}}) at /home/tkwong/.julia/packages/DataFrames/IKMvt/src/dataframe/dataframe.jl:871
 [2] foreach(::getfield(DataFrames, Symbol("##72#73")){Array{Int64,1}}, ::Array{AbstractArray{T,1} where T,1}) at ./abstractarray.jl:1866
 [3] deleterows!(::DataFrame, ::Array{Int64,1}) at /home/tkwong/.julia/packages/DataFrames/IKMvt/src/dataframe/dataframe.jl:871
 [4] top-level scope at none:0

There’s nothing fancy about the DataFrame:

julia> describe(df)
17×8 DataFrame
│ Row │ variable │ mean        │ min          │ median       │ max        │ nunique │ nmissing │ eltype   │
│     │ String   │ Union…      │ Any          │ Union…       │ Any        │ Union…  │ Union…   │ DataType │
├─────┼──────────┼─────────────┼──────────────┼──────────────┼────────────┼─────────┼──────────┼──────────┤
│ 1   │ var1     │ 12.9394     │ 0.112803     │ 4.57057      │ 43.3651    │         │          │ Float64  │
│ 2   │ var2     │ 6.62466     │ -2.28851     │ 1.00144      │ 35.3775    │         │ 0        │ Float64  │
│ 3   │ var3     │             │ abc          │              │ xyz        │ 14      │ 0        │ String   │
│ 4   │ var4     │ 0.0862132   │ -0.0201804   │ 0.00865615   │ 0.502727   │         │ 0        │ Float64  │
│ 5   │ var5     │ 2.52696     │ -5.56837     │ 0.280422     │ 20.2763    │         │ 0        │ Float64  │
│ 6   │ var6     │ 2.38489     │ -3.1457      │ 0.254247     │ 16.6263    │         │ 0        │ Float64  │
│ 7   │ var7     │ -0.0034382  │ -0.0190981   │ -0.000636895 │ 0.00132188 │         │ 0        │ Float64  │
│ 8   │ var8     │ 0.0945996   │ -0.0         │ 0.0          │ 1.32439    │         │ 0        │ Float64  │
│ 9   │ var9     │ 0.0568503   │ -0.127484    │ 0.00637622   │ 0.370084   │         │ 0        │ Float64  │
│ 10  │ var10    │ 0.42912     │ -1.01617     │ 0.0407236    │ 2.7647     │         │ 0        │ Float64  │
│ 11  │ var11    │ -9.80372e-5 │ -0.00307758  │ -7.38922e-6  │ 0.00319956 │         │ 0        │ Float64  │
│ 12  │ var12    │ 0.00869214  │ -0.000195758 │ 0.00181751   │ 0.0535332  │         │ 0        │ Float64  │
│ 13  │ var13    │ 0.0357346   │ -0.00427665  │ 0.00455445   │ 0.190761   │         │ 0        │ Float64  │
│ 14  │ var14    │ 0.192752    │ -0.295851    │ 0.0143688    │ 0.794363   │         │ 0        │ Float64  │
│ 15  │ var15    │ 0.0         │ -0.0         │ 0.0          │ 0.0        │         │ 0        │ Float64  │
│ 16  │ var16    │ -0.163465   │ -8.84644     │ -0.0122285   │ 16.7318    │         │ 0        │ Float64  │
│ 17  │ var17    │ 0.975848    │ -8.84644     │ 0.0591109    │ 14.6835    │         │ 0        │ Float64  │

and d = [12]

The problem seems to be that you store non-standard vectors in the data frame. More precisely, at least one column cannot be resized in place, and deleterows! tries to do exactly that.

  1. Can you post the output of running foreach(col -> println(col[1], ":\t", typeof(col[2])), eachcol(df, true)) on it?
  2. In the meantime, something like this should work, since it builds a new data frame instead of resizing the columns: df[.!in.(axes(df, 1), (d,)), :] or df[setdiff(axes(df, 1), d), :]
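To illustrate the two row selections in item 2 without depending on a particular DataFrames version, here is a minimal Base-only sketch. The Matrix m is only a stand-in for the data frame (an assumption for illustration), and the names rows, keep_mask and keep_idx are made up for this example.

```julia
# Base-only sketch; a Matrix stands in for the DataFrame (assumption).
rows = 1:5                      # plays the role of axes(df, 1)
d = [2, 4]                      # row numbers to drop

keep_mask = .!in.(rows, (d,))   # Boolean mask: true for rows to keep
keep_idx  = setdiff(rows, d)    # the same selection as integer indices

m = reshape(collect(1.0:10.0), 5, 2)
kept_by_mask = m[keep_mask, :]  # new 3×2 matrix; m itself is untouched
kept_by_idx  = m[keep_idx, :]   # same rows, selected by index
```

Both forms allocate a new object rather than shrinking the existing columns, which is why they sidestep the deleteat! error.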

Looks like var2 may be the issue?

julia> foreach(col -> println(col[1], ":\t", typeof(col[2])), eachcol(df, true))
var1:   Array{Float64,1}
var2:   Base.ReshapedArray{Union{Missing, Float64},1,Array{Union{Missing, Float64},2},Tuple{}}
var3:   Array{Union{Missing, String},1}
var4:   Array{Union{Missing, Float64},1}
var5:   Array{Union{Missing, Float64},1}
var6:   Array{Union{Missing, Float64},1}
var7:   Array{Union{Missing, Float64},1}
var8:   Array{Union{Missing, Float64},1}
var9:   Array{Union{Missing, Float64},1}
var10:  Array{Union{Missing, Float64},1}
var11:  Array{Union{Missing, Float64},1}
var12:  Array{Union{Missing, Float64},1}
var13:  Array{Union{Missing, Float64},1}
var14:  Array{Union{Missing, Float64},1}
var15:  Array{Union{Missing, Float64},1}
var16:  Array{Union{Missing, Float64},1}
var17:  Array{Union{Missing, Float64},1}

Yes, :var2 cannot be resized. Where did you get this column from? Writing df.var2 = collect(df.var2) or df.var2 = df.var2[:] should fix it (or use the other methods I gave above).
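A minimal Base-only sketch of why collect (or [:]) helps. A view is used here as a stand-in for the ReshapedArray column, on the assumption that both are AbstractVectors without a deleteat! method; the variable names are hypothetical.

```julia
# A view stands in for the non-resizable column (assumption for illustration).
m = Union{Missing, Float64}[1.0 2.0; 3.0 4.0; 5.0 6.0]
col = view(m, :, 1)

# deleteat! has no method for a view, so this is expected to throw.
failed = try
    deleteat!(col, [2])
    false
catch err
    err isa MethodError
end

v = collect(col)      # independent Vector copy of the column
deleteat!(v, [2])     # resizing the copy works

w = col[:]            # indexing with a colon also returns a Vector copy
```

Both collect and [:] copy the data, so the parent matrix m is not affected by the deletion.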

It’s a bit complicated and not really my code. Let me try my best to describe it.

  1. Start with a 14x15 data frame (call it d1) where the first column is a string column and the rest are Union{Missing,Float64} columns.
  2. d2 = convert(Matrix, d1[2:end]) so d2 is a 14x14 matrix of type Array{Union{Missing, Float64},2}
  3. d3 = sum(d2, dims = 2) so d3 is a 14x1 array like this: Union{Missing, Float64}[1.20698; 35.3775; 33.3884...]
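The shape produced by step 3 can be checked on a small matrix. This is only a sketch with made-up numbers, not the original data.

```julia
# Small stand-in for d2 (values are made up for illustration).
d2 = Union{Missing, Float64}[1.0 2.0; 3.0 4.0; 5.0 6.0]

# Summing along dims = 2 collapses the columns but keeps the dimension,
# so the result is a 3×1 matrix rather than a 3-element vector.
d3 = sum(d2, dims = 2)

shape = size(d3)             # (3, 1)
is_vector = d3 isa AbstractVector
```

Because the reduced dimension is kept with length 1, d3 is still two-dimensional and needs one more step before it can serve as a column.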

Here comes the interesting part. The idea is to insert d3 as a column into another data frame, e.g. insertcols!(df, 1, x = d3). But if I do that, I end up with a different exception:

ERROR: ArgumentError: setindex!(::DataFrame, ...) only broadcasts scalars, not arrays
Stacktrace:
 [1] upgrade_scalar(::DataFrame, ::Array{Union{Missing, Float64},2}) at /home/tkwong/.julia/packages/DataFrames/IKMvt/src/dataframe/dataframe.jl:411
 [2] #insertcols!#70(::Bool, ::Function, ::DataFrame, ::Int64, ::Pair{Symbol,Array{Union{Missing, Float64},2}}) at /home/tkwong/.julia/packages/DataFrames/IKMvt/src/dataframe/dataframe.jl:752
 [3] (::getfield(DataFrames, Symbol("#kw##insertcols!")))(::NamedTuple{(:makeunique,),Tuple{Bool}}, ::typeof(insertcols!), ::DataFrame, ::Int64, ::Pair{Symbol,Array{Union{Missing, Float64},2}}) at ./none:0
 [4] #insertcols!#71(::Bool, ::Base.Iterators.Pairs{Symbol,Array{Union{Missing, Float64},2},Tuple{Symbol},NamedTuple{(:CTESD,),Tuple{Array{Union{Missing, Float64},2}}}}, ::Function, ::DataFrame, ::Int64) at /home/tkwong/.julia/packages/DataFrames/IKMvt/src/dataframe/dataframe.jl:757

So that may be why the original coder added another step after step 3 above:
  4. d4 = vec(d3)

and then we got the exception in the original post.

It would probably be better to use dropdims than vec (although on Julia 1.1 vec should produce a vector, not a reshaped vector, so this is a bit strange). Anyway, d3 is a matrix, not a vector, so it will not fit into the DataFrame as is, and you have to convert it to a Vector if you later want to resize it.

Thanks @bkamins, I was able to work around the original problem with a collect. I am just thinking about how to do it properly.

I think collect is OK here

Is this really the solution?

This looks like an overcomplicated voodoo programming style in a language that is trying to be as simple as Python.

Isn’t the real problem in the vec function, or a missing method for deleteat!?

"Isn’t the real problem in the vec function?"

vec works as expected - it transforms a matrix to a vector.

"...or a missing method for deleteat!?"

I think deleteat! should not modify views so this is also working as expected.

If you look at the documentation you see that:

  • vec has a contract to return an AbstractVector
  • collect has a contract to return an Array

In the problem described here, collect is used to change an AbstractVector into a Vector. vec will not do that: it does not transform AbstractVectors but leaves them as they are. collect, on the other hand, does exactly what is needed: it takes some vector (a view) that is already present in the data frame and turns it into a Vector.
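The difference between the two contracts can be seen in a small Base-only sketch. A view is used as the example AbstractVector here; the variable names are hypothetical.

```julia
m = [1.0 2.0; 3.0 4.0]
col = view(m, :, 1)          # an AbstractVector that is not a Vector

# vec returns an AbstractVector input unchanged.
same_object = vec(col) === col

# collect allocates a new Array with the same elements.
copied = collect(col)
is_plain_vector = copied isa Vector{Float64}

# The copy is independent of the parent matrix.
copied[1] = 100.0
parent_unchanged = m[1, 1] == 1.0
```

So vec only guarantees "some AbstractVector", while collect guarantees a fresh Array that can be resized on its own.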

However, I agree that it would have been better to call vec(d3) or dropdims(d3, dims=2) before putting the column into the data frame (the latter is safer because it throws an error if d3 does not have exactly one column); then there would have been no problem in the first place.
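A short sketch of the safety difference between dropdims and vec, using made-up matrices. The extra collect at the end is my own addition (an assumption, not something stated above) to guarantee an independent Vector regardless of what dropdims returns on a given Julia version.

```julia
d3 = reshape([1.0, 2.0, 3.0], 3, 1)     # 3×1 matrix, like the row sums

# dropdims removes the singleton dimension; collect then makes a plain copy.
v = collect(dropdims(d3, dims = 2))

# On a matrix with more than one column, dropdims refuses to proceed ...
wide = [1.0 2.0; 3.0 4.0]
caught = try
    dropdims(wide, dims = 2)
    false
catch err
    err isa ArgumentError
end

# ... while vec silently flattens all four elements into one vector.
flattened_length = length(vec(wide))
```

With dropdims, a matrix that unexpectedly has several columns fails loudly instead of producing a column of the wrong length.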

Just to add: in many situations Vector will work in place of collect, but unfortunately not always. E.g.:

julia> Vector((i for i in 1:10))
ERROR: MethodError: no method matching Array{T,1} where   ...
Stacktrace:
 [1] top-level scope at none:0

julia> collect((i for i in 1:10))
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

Also, regarding the earlier step, you could have written sum.(eachrow(d1[2:end])) to get a vector in the first place (this might be slower for huge data frames, but in normal cases it will work).
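The same idea can be sketched on a plain Matrix with Base's eachrow, without relying on the data frame API. The matrix m below is a made-up stand-in for d1[2:end].

```julia
# Stand-in for the numeric part of d1 (values are made up).
m = [1.0 2.0; 3.0 4.0; 5.0 6.0]

# Broadcasting sum over the rows yields a Vector directly,
# so no vec, dropdims or collect step is needed afterwards.
row_sums = sum.(eachrow(m))

# Same numbers as the dims-based reduction, but one-dimensional.
matches = row_sums == vec(sum(m, dims = 2))
```

Since row_sums is a freshly allocated Vector, it can be resized later without any conversion.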