DataFrame join error

I have

julia> timeuse = join(people[:, [:idhh, :idpers, :gender]], timeuse;
       on = [:idhh, :idpers], kind = :inner)
ERROR: DimensionMismatch("destination must have length equal to sums of concatenated vectors")
Stacktrace:
 [1] vcat_copyto!(::Array{Int64,1}, ::SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false}, ::Vararg{SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false},N} where N) at /home/tamas/.julia/packages/LazyArrays/14GOk/src/lazyconcat.jl:158
 [2] copyto!(::Array{Int64,1}, ::SubArray{Int64,1,LazyArrays.ApplyArray{Int64,1,typeof(vcat),NTuple{8,Array{Int64,1}}},Tuple{Array{Int64,1}},false}) at /home/tamas/.julia/packages/LazyArrays/14GOk/src/lazyconcat.jl:579
 [3] #compose_joined_table#286(::Bool, ::typeof(DataFrames.compose_joined_table), ::DataFrames.DataFrameJoiner{DataFrame,DataFrame}, ::Symbol, ::DataFrames.RowIndexMap, ::DataFrames.RowIndexMap, ::DataFrames.RowIndexMap, ::DataFrames.RowIndexMap) at /home/tamas/.julia/packages/DataFrames/uPgZV/src/abstractdataframe/join.jl:106
 [4] #compose_joined_table at ./none:0 [inlined]
 [5] #join#294(::Array{Symbol,1}, ::Symbol, ::Bool, ::Nothing, ::Tuple{Bool,Bool}, ::typeof(join), ::DataFrame, ::DataFrame) at /home/tamas/.julia/packages/DataFrames/uPgZV/src/abstractdataframe/join.jl:364
 [6] (::Base.var"#kw##join")(::NamedTuple{(:on, :kind),Tuple{Array{Symbol,1},Symbol}}, ::typeof(join), ::DataFrame, ::DataFrame) at ./none:0
 [7] top-level scope at REPL[136]:1

while

timeuse = join(people[1:size(people, 1), [:idhh, :idpers, :gender]], timeuse;
               on = [:idhh, :idpers], kind = :inner)

works fine (the only difference is : vs 1:size(people, 1)). I have not been able to produce an MWE, and I am sorry but I am not able to share the data (it is confidential). The : version worked fine in 0.19.

I did not think we had changed this between 0.19 and 0.20, but the reason is that df[:, cols] uses copy to copy columns, while df[rows, cols] uses vector subsetting, i.e. for each column c it does c[rows].

For columns that are a Vector the two should be equivalent, but for custom column types they might not be. For instance, your columns might be arrays from https://github.com/JuliaArrays/LazyArrays.jl produced by CSV.jl (this is a known bug that should be fixed in CSV.jl, see https://github.com/JuliaData/CSV.jl/issues/539). Is this the reason?
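
To illustrate the difference between copy and c[rows], here is a minimal sketch in plain Julia. ChunkedVector is a hypothetical column type invented for this example (it is not the LazyArrays.jl type), and I am assuming a custom type whose copy keeps the wrapper; whether the actual lazy columns behave this way would need checking.

```julia
# Hypothetical lazy column type: presents several chunks as one vector.
struct ChunkedVector{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
end

Base.size(v::ChunkedVector) = (mapreduce(length, +, v.chunks; init = 0),)
Base.IndexStyle(::Type{<:ChunkedVector}) = IndexLinear()

function Base.getindex(v::ChunkedVector, i::Int)
    @boundscheck checkbounds(v, i)
    for chunk in v.chunks
        i <= length(chunk) && return chunk[i]
        i -= length(chunk)
    end
    throw(BoundsError(v, i))  # not reached after the bounds check
end

# Assumption for this sketch: copy keeps the wrapper type.
Base.copy(v::ChunkedVector) = ChunkedVector(map(copy, v.chunks))

col = ChunkedVector([[1, 2, 3], [4, 5]])

copied = copy(col)            # per-column operation behind df[:, cols]
subset = col[1:length(col)]   # per-column operation behind df[rows, cols]

println(typeof(copied))       # still the custom wrapper type
println(typeof(subset))       # a plain Vector, via the generic fallback
```

Both results hold the same values, but only the second one is guaranteed to be an ordinary Vector, so any code downstream that mishandles the custom type is hit only by the : version.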

Thanks. I was using CSV.File from the beginning. Indeed

 DataFrame(CSV.File("some_path.csv"; threaded = false); copycols = true)

fixes the issue.

What is more worrying is that unless I apply both options above, join results in data corruption (for some columns, values are exchanged for the same join key in random order). I tried making an MWE, but for some reason it only kicks in if I read the data from CSV and it is large enough.

It seems that the problem is with how arrays from LazyArrays.jl work, probably within compose_joined_table. The code in this part has not been touched for over 2 years, so something might have gotten out of sync. @quinnj will probably fix CSV.jl soon, so this should not surface in common usage, but it would be great to have an MWE, as probably either LazyArrays.jl or DataFrames.jl needs fixing in general. Thank you for reporting this.

@Tamas_Papp - if you are unable to share the file (or some other file that reproduces the error; I have tried to create one and was unable to), can you please put some value-tracing code before line https://github.com/JuliaData/DataFrames.jl/blob/master/src/abstractdataframe/join.jl#L106?
It should show the type and size of cols[i], and the type, size and contents of view(col, all_orig_left_ixs), so we can check whether they are what should be expected. Thank you!
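
Something along these lines would do for the tracing. This is only a sketch: trace_copy is a hypothetical helper (not part of DataFrames.jl), and the values below are stand-ins so that it runs on its own; inside join.jl you would pass the real cols[i] and view(col, all_orig_left_ixs) instead.

```julia
# Hypothetical helper (not part of DataFrames.jl): print what a copy would see.
function trace_copy(dest::AbstractVector, src::AbstractVector)
    println("dest: ", typeof(dest), ", size ", size(dest))
    println("src:  ", typeof(src), ", size ", size(src))
    println("src contents: ", collect(src))
    # A length mismatch is what the DimensionMismatch message complains about.
    return length(dest) == length(src)
end

# Stand-ins for the join internals, for illustration only.
col = [10, 20, 30, 40]
all_orig_left_ixs = [1, 3, 4]
dest = similar(col, length(all_orig_left_ixs))   # plays the role of cols[i]

lengths_match = trace_copy(dest, view(col, all_orig_left_ixs))
println("lengths match: ", lengths_match)
```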

I also can’t reproduce @Tamas_Papp’s issue. My guess would be that it only occurs for a very specific array type. Could you tell us which version of LazyArrays you are using?

It would be nice if we could split off the conversation about this error to another thread.

Thanks for moving the topic, I will dig into this.