I’ve got two datasets that would look like this in a dataframe (see later for MWE):
julia> df1 = mat2df(mat1, colnames1, id1)
8×5 DataFrame
 Row │ id      col1   col2   col3   col4
     │ String  Int64  Int64  Int64  Int64
─────┼────────────────────────────────────
   1 │ a          32     49     96     63
   2 │ b          55     75     65     23
   3 │ c          22     58    100     12
   4 │ d          90     73     75     61
   5 │ e          35      0     11     67
   6 │ f          39     20     49     76
   7 │ g          96     44     57     59
   8 │ h          80     68     25     36
julia> df2 = mat2df(mat2, colnames2, id2)
4×7 DataFrame
 Row │ id      col1   col2   col3   col6   col4   col5
     │ String  Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────
   1 │ d           1      6      9      5      9      5
   2 │ b          10      0      3      3      7     10
   3 │ c           6      7      8      2      4      8
   4 │ a           7      1      2     10      9      5
I need to update (in various ways depending on the column) only some of the rows/columns of df1 based on the values in df2. The dataframe is so easy to look at and I love the succinct dot notation (i.e., df1.col1[idx] vs df1[idx, "col1"] vs df1[idx, 2]), but it is far too slow.
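For reference, a minimal sketch of the three notations on a toy table (the `df` and `idx` here are made up for illustration and are not the benchmark data). All three pick out the same values:

```julia
using DataFrames

# Toy table, unrelated to the benchmark data.
df = DataFrame(id = ["a", "b", "c"], col1 = [32, 55, 22])
idx = [1, 3]

by_dot  = df.col1[idx]       # property access, then index the column vector
by_name = df[idx, "col1"]    # row indices plus column name
by_pos  = df[idx, 2]         # row indices plus column position

println(by_dot == by_name == by_pos)
```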
So I tried out AxisArrays, which lets me index by name while being ~10x faster. However, there is still no dot notation.
Then I tried StaticArrays, and it was somewhat faster than an AxisArray (afaict, I still needed to use a mutable MArray here), but it doesn’t allow any indexing by name. Further, it seems I need to hardcode the dimensions (I can’t pass them as function arguments?)
I.e., compare the updateSA and updateSA2 functions below. The only difference is how the dimensions are supplied: MMatrix{sz1[1], sz1[2]}(mat1) vs MMatrix{8, 4}(mat1) (and likewise for the SMatrix). The first option is as slow as the DataFrame version.
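A minimal sketch of that difference, separate from the MWE. The helper names `runtime_size`, `literal_size`, and `val_size` are hypothetical, and the sketch assumes the slowdown comes from the compiler not knowing the array type when the size is only available at run time. It checks whether a concrete return type can be inferred in each case. Passing the dimensions as `Val`s is one common Julia pattern for moving sizes into the type domain; it is shown here only as an illustration and is not benchmarked.

```julia
using StaticArrays

# Hypothetical helpers, not part of the MWE.
# Size read at run time: the resulting MMatrix type depends on values,
# so the compiler cannot know it in advance.
runtime_size(mat) = MMatrix{size(mat, 1), size(mat, 2)}(mat)

# Size written as literals: the dimensions are part of the source code.
literal_size(mat) = MMatrix{8, 4}(mat)

# Size passed through Val: the dimensions travel as type parameters.
val_size(mat, ::Val{R}, ::Val{C}) where {R, C} = MMatrix{R, C}(mat)

mat = rand(0:100, 8, 4)

# Base.return_types reports the inferred return type for an argument signature.
rt_runtime = only(Base.return_types(runtime_size, (Matrix{Int},)))
rt_literal = only(Base.return_types(literal_size, (Matrix{Int},)))
rt_val     = only(Base.return_types(val_size, (Matrix{Int}, Val{8}, Val{4})))

# A concrete inferred type means the compiler can specialize the code that follows.
println(isconcretetype(rt_runtime))
println(isconcretetype(rt_literal))
println(isconcretetype(rt_val))
```

Note that `Val(size(mat, 1))` built from a run-time value is itself a dynamic dispatch, so this pattern probably only helps if the sizes are known as constants somewhere up the call chain, or if the dispatch cost is paid once at a function barrier.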
Here’s the MWE (I’m not worried about the copy(mat1), that only mattered for the benchmark):
using AxisArrays, DataFrames, StaticArrays, Random, BenchmarkTools
function mat2df(mat, colnames, id)
    # Wrap the matrix columns without copying, then prepend the id column
    df = DataFrame(mat, colnames, copycols = false)
    insertcols!(df, 1, "id" => id, copycols = false)
    return df
end

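function updateDF(mat1, id1, colnames1, mat2, id2, colnames2)
    dat1 = mat2df(mat1, colnames1, id1)
    dat2 = mat2df(mat2, colnames2, id2)
    # Rows of dat1 that correspond to each row of dat2
    idxr = indexin(dat2.id, dat1.id)
    # col1 and col3 (column 1 of the dataframe is id)
    upcols = @view names(dat1)[[2; 4]]
    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += dat2[!, upcols[i]]
    end
    dat1.col2[idxr] += dat2.col2 .> 0
    dat1.col4[idxr] .= dat2.col4
    # Return the updated table; the comparison at the end indexes into it
    return dat1
end
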
function updateAA(mat1, id1, colnames1, mat2, id2, colnames2)
    dat1 = AxisArray(mat1, Axis{:row}(eachindex(id1)), Axis{:col}(colnames1))
    dat2 = AxisArray(mat2, Axis{:row}(eachindex(id2)), Axis{:col}(colnames2))
    # Rows of dat1 that correspond to each row of dat2
    idxr = indexin(id2, id1)
    # col1 and col3, looked up by name on the column axis
    upcols = dat1.axes[2][[1; 3]]
    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += @view dat2[:, upcols[i]]
    end
    dat1[idxr, "col2"] += @views dat2[:, "col2"] .> 0
    dat1[idxr, "col4"] = @view dat2[:, "col4"]
    # Return the updated array; the comparison at the end uses it
    return dat1
end

function updateSA(mat1, id1, colnames1, mat2, id2, colnames2)
    # Dimensions are only known at run time here
    sz1 = size(mat1)
    sz2 = size(mat2)
    dat1 = MMatrix{sz1[1], sz1[2]}(mat1)
    dat2 = SMatrix{sz2[1], sz2[2]}(mat2)
    idxr = indexin(id2, id1)
    # Position in mat2 of each column of mat1
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @views idxcol2[upcols1]
    for i in eachindex(upcols1)
        dat1[idxr, upcols1[i]] += dat2[:, upcols2[i]]
    end
    dat1[idxr, 2] += dat2[:, idxcol2[2]] .> 0
    dat1[idxr, 4] = dat2[:, idxcol2[4]]
    return dat1
end

function updateSA2(mat1, id1, colnames1, mat2, id2, colnames2)
    # Same as updateSA, but with the dimensions hardcoded
    dat1 = MMatrix{8, 4}(mat1)
    dat2 = SMatrix{4, 6}(mat2)
    idxr = indexin(id2, id1)
    # Position in mat2 of each column of mat1
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @views idxcol2[upcols1]
    for i in eachindex(upcols1)
        dat1[idxr, upcols1[i]] += dat2[:, upcols2[i]]
    end
    dat1[idxr, 2] += dat2[:, idxcol2[2]] .> 0
    dat1[idxr, 4] = dat2[:, idxcol2[4]]
    return dat1
end

mat1 = rand(0:100, 8, 4);
mat2 = rand(0:10, 4, 6);
id1 = string.(collect('a':'h'));
id2 = string.(collect('a':'d')[[4; 2; 3; 1]]);
colnames1 = "col" .* string.(1:4);
colnames2 = "col" .* string.(1:6)[[1:3; 6; 4:5]];
@btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
res1 = updateDF(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res2 = updateAA(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res3 = updateSA(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res4 = updateSA2(copy(mat1), id1, colnames1, mat2, id2, colnames2);
Matrix(res1[!, 2:end]) == res2 == res3 == res4
julia> @btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
8.031 μs (101 allocations: 7.12 KiB)
julia> @btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
1.813 μs (16 allocations: 1.73 KiB)
julia> @btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
8.441 μs (68 allocations: 4.70 KiB)
julia> @btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
1.647 μs (16 allocations: 2.17 KiB)
Is there a way to get integer, string, and dot-notation indexing (like a DataFrame) at close to the speed of a StaticArray? Or is this tradeoff just inherent to supporting multiple indexing methods?
Is there some way to pass a variable for the dimensions of a StaticArray without slowing it down 10x?
Is MMatrix the right solution for the problem in the MWE?
Thanks for any help.