I’ve got two datasets that would look like this in a dataframe (see later for MWE):
julia> df1 = mat2df(mat1, colnames1, id1)
8×5 DataFrame
Row │ id col1 col2 col3 col4
│ String Int64 Int64 Int64 Int64
─────┼────────────────────────────────────
1 │ a 32 49 96 63
2 │ b 55 75 65 23
3 │ c 22 58 100 12
4 │ d 90 73 75 61
5 │ e 35 0 11 67
6 │ f 39 20 49 76
7 │ g 96 44 57 59
8 │ h 80 68 25 36
julia> df2 = mat2df(mat2, colnames2, id2)
4×7 DataFrame
Row │ id col1 col2 col3 col6 col4 col5
│ String Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────────────
1 │ d 1 6 9 5 9 5
2 │ b 10 0 3 3 7 10
3 │ c 6 7 8 2 4 8
4 │ a 7 1 2 10 9 5
I need to update (in various ways depending on the column) only some of the rows/columns of df1 based on the values in df2. The dataframe is so easy to look at and I love the succinct dot notation (i.e., df1.col1[idx] vs df1[idx, "col1"] vs df1[idx, 2]), but it is far too slow.
So I tried out AxisArrays, and it allows me to index by name while being ~10x faster. However, there was still no dot notation.
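For reference, dot notation over a plain matrix could in principle come from a thin wrapper that overloads Base.getproperty. The sketch below is only an illustration: NamedCols, namedcols and colindex are made-up names (not part of AxisArrays or any other package), it uses Base only, and its speed has not been measured here.

```julia
# Hypothetical wrapper: a matrix plus a NamedTuple mapping column names to indices.
struct NamedCols{M<:AbstractMatrix, NT<:NamedTuple}
    data::M
    cols::NT
end

# Build the wrapper from a matrix and a vector of column names (no copy is made).
function namedcols(data::AbstractMatrix, names)
    syms = Tuple(Symbol.(names))
    return NamedCols(data, NamedTuple{syms}(Tuple(1:length(syms))))
end

# Resolve a column given as a Symbol, a String, or an integer position.
colindex(nc::NamedCols, name::Symbol) = getfield(getfield(nc, :cols), name)
colindex(nc::NamedCols, name::AbstractString) = colindex(nc, Symbol(name))
colindex(nc::NamedCols, j::Integer) = j

# nc.col1 returns a view of that column, so writes go through to the matrix.
Base.getproperty(nc::NamedCols, name::Symbol) =
    view(getfield(nc, :data), :, colindex(nc, name))

# nc[rows, "col1"] and nc[rows, 1] both index the underlying matrix.
Base.getindex(nc::NamedCols, rows, col) =
    getfield(nc, :data)[rows, colindex(nc, col)]

mat = [1 2; 3 4; 5 6]
nc = namedcols(mat, ["col1", "col2"])
nc.col2[[1, 3]] .+= 10   # dot notation, updates `mat` in place
nc[2, "col2"]            # string indexing
nc[2, 1]                 # positional indexing
```

Because getproperty is overloaded, the fields themselves have to be read with getfield inside the methods. Whether the name lookup is free at run time depends on the compiler being able to treat the column name as a constant, which is an assumption to check with @btime.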
Then I tried StaticArrays, and it was somewhat faster than an AxisArray (afaict, I still needed to use a mutable MArray here), but it doesn’t allow any indexing by name. Further, it seems I need to hardcode the dimensions (I can’t pass them as a function argument?)
I.e., compare the updateSA and updateSA2 functions below. The only difference is MMatrix{sz1[1], sz1[2]}(mat1) vs MMatrix{8, 4}(mat1). The first option is as slow as a dataframe.
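One common pattern for this kind of problem is a function barrier: pass the sizes as Val types so that they become type parameters inside an inner function. A minimal sketch follows, where update_with_sizes and add_rows! are hypothetical names that do not appear in the MWE. Building Val(size(mat1, 1)) from a runtime value would still cost a dynamic dispatch at the call site and a fresh compilation for each distinct size, so this is not guaranteed to recover the hardcoded speed.

```julia
using StaticArrays

# Kernel: receives concretely typed static matrices, so it can be specialized.
function add_rows!(dat1::MMatrix, dat2::SMatrix, idxr, col1, col2)
    for (k, r) in enumerate(idxr)
        dat1[r, col1] += dat2[k, col2]
    end
    return dat1
end

# Barrier: R1, C1, R2, C2 are type parameters here, i.e. compile-time constants.
function update_with_sizes(mat1, mat2, idxr,
                           ::Val{R1}, ::Val{C1}, ::Val{R2}, ::Val{C2}) where {R1, C1, R2, C2}
    dat1 = MMatrix{R1, C1}(mat1)
    dat2 = SMatrix{R2, C2}(mat2)
    return add_rows!(dat1, dat2, idxr, 1, 1)
end

m1 = collect(reshape(1:32, 8, 4))
m2 = collect(reshape(1:24, 4, 6))
rows = [4, 2, 3, 1]   # row of m1 that each row of m2 updates

# Sizes given as literals, so the Val types are known at compile time.
res = update_with_sizes(m1, m2, rows, Val(8), Val(4), Val(4), Val(6))
```

The design choice is to keep the type-unstable part (turning runtime sizes into types) outside the function that does the work, so only one call is dispatched dynamically.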
Here’s the MWE (I’m not worried about the copy(mat1), that only mattered for the benchmark):
MWE
using AxisArrays, DataFrames, StaticArrays, Random, BenchmarkTools
function mat2df(mat, colnames, id)
    df = DataFrame(mat, colnames, copycols = false)
    insertcols!(df, 1, "id" => id, copycols = false)
    return df
end
function updateDF(mat1, id1, colnames1, mat2, id2, colnames2)
    dat1 = mat2df(mat1, colnames1, id1)
    dat2 = mat2df(mat2, colnames2, id2)
    idxr = indexin(dat2.id, dat1.id)
    upcols = @view names(dat1)[[2; 4]]
    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += dat2[!, upcols[i]]
    end
    dat1.col2[idxr] += dat2.col2 .> 0
    dat1.col4[idxr] .= dat2.col4
    return dat1
end
function updateAA(mat1, id1, colnames1, mat2, id2, colnames2)
    dat1 = AxisArray(mat1, Axis{:row}(eachindex(id1)), Axis{:col}(colnames1))
    dat2 = AxisArray(mat2, Axis{:row}(eachindex(id2)), Axis{:col}(colnames2))
    idxr = indexin(id2, id1)
    upcols = dat1.axes[2][[1; 3]]
    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += @view dat2[:, upcols[i]]
    end
    dat1[idxr, "col2"] += @views dat2[:, "col2"] .> 0
    dat1[idxr, "col4"] = @view dat2[:, "col4"]
    return dat1
end
function updateSA(mat1, id1, colnames1, mat2, id2, colnames2)
    sz1 = size(mat1)
    sz2 = size(mat2)
    dat1 = MMatrix{sz1[1], sz1[2]}(mat1)
    dat2 = SMatrix{sz2[1], sz2[2]}(mat2)
    idxr = indexin(id2, id1)
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @views idxcol2[upcols1]
    for i in eachindex(upcols1)
        dat1[idxr, upcols1[i]] += dat2[:, upcols2[i]]
    end
    dat1[idxr, 2] += dat2[:, idxcol2[2]] .> 0
    dat1[idxr, 4] = dat2[:, idxcol2[4]]
    return dat1
end
function updateSA2(mat1, id1, colnames1, mat2, id2, colnames2)
    dat1 = MMatrix{8, 4}(mat1)
    dat2 = SMatrix{4, 6}(mat2)
    idxr = indexin(id2, id1)
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @views idxcol2[upcols1]
    for i in eachindex(upcols1)
        dat1[idxr, upcols1[i]] += dat2[:, upcols2[i]]
    end
    dat1[idxr, 2] += dat2[:, idxcol2[2]] .> 0
    dat1[idxr, 4] = dat2[:, idxcol2[4]]
    return dat1
end
Random.seed!(1234)
mat1 = rand(0:100, 8, 4);
mat2 = rand(0:10, 4, 6);
id1 = string.(collect('a':'h'));
id2 = string.(collect('a':'d')[[4; 2; 3; 1]]);
colnames1 = "col" .* string.(1:4);
colnames2 = "col" .* string.(1:6)[[1:3; 6; 4:5]];
@btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
res1 = updateDF(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res2 = updateAA(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res3 = updateSA(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res4 = updateSA2(copy(mat1), id1, colnames1, mat2, id2, colnames2);
Matrix(res1[!, 2:end]) == res2 == res3 == res4
Benchmarks:
julia> @btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
8.031 μs (101 allocations: 7.12 KiB)
julia> @btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
1.813 μs (16 allocations: 1.73 KiB)
julia> @btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
8.441 μs (68 allocations: 4.70 KiB)
julia> @btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
1.647 μs (16 allocations: 2.17 KiB)
- Is there a way to get index, string, and dot notation (like dataframe) with near the speed of StaticArray? Or is this tradeoff just inherent to having multiple indexing methods?
- Is there some way to pass a variable for the dimensions of a StaticArray without slowing it down 10x?
- Is MMatrix the right solution for the problem in the MWE?
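
As a point of comparison for the last question, the same update can be written as an explicit loop over a plain Matrix, with no wrapper type at all. The sketch below is a hypothetical baseline (update_plain! is not one of the functions benchmarked above, and no timing is claimed for it). It assumes every column of mat1 has a match in colnames2, as in the MWE.

```julia
# Hypothetical baseline: update a plain Matrix in place with explicit loops.
function update_plain!(mat1, id1, colnames1, mat2, id2, colnames2)
    idxr = indexin(id2, id1)              # row of mat1 matching each row of mat2
    idxc = indexin(colnames1, colnames2)  # column of mat2 matching each column of mat1
    for (k, r) in enumerate(idxr)
        r === nothing && continue         # skip ids that do not occur in id1
        mat1[r, 1] += mat2[k, idxc[1]]
        mat1[r, 3] += mat2[k, idxc[3]]
        mat1[r, 2] += mat2[k, idxc[2]] > 0   # add 1 when the col2 value is positive
        mat1[r, 4]  = mat2[k, idxc[4]]       # overwrite col4
    end
    return mat1
end

a = [10 20 30 40; 50 60 70 80]
b = [1 0 3 99 4]
out = update_plain!(a, ["a", "b"], ["col1", "col2", "col3", "col4"],
                    b, ["b"], ["col1", "col2", "col3", "col6", "col4"])
```

It applies the same four column rules as the update functions in the MWE (add for col1 and col3, add an indicator for col2, overwrite col4), so it could be dropped into the same @btime comparison.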
Thanks for any help.