Performance vs notation tradeoff: DataFrame vs AxisArray vs StaticArray vs?

Tetrakai · August 17, 2024, 10:33pm

I’ve got two datasets that would look like this in a dataframe (see later for MWE):

julia> df1 = mat2df(mat1, colnames1, id1)
8×5 DataFrame
 Row │ id      col1   col2   col3   col4  
     │ String  Int64  Int64  Int64  Int64 
─────┼────────────────────────────────────
   1 │ a          32     49     96     63
   2 │ b          55     75     65     23
   3 │ c          22     58    100     12
   4 │ d          90     73     75     61
   5 │ e          35      0     11     67
   6 │ f          39     20     49     76
   7 │ g          96     44     57     59
   8 │ h          80     68     25     36

julia> df2 = mat2df(mat2, colnames2, id2)
4×7 DataFrame
 Row │ id      col1   col2   col3   col6   col4   col5  
     │ String  Int64  Int64  Int64  Int64  Int64  Int64 
─────┼──────────────────────────────────────────────────
   1 │ d           1      6      9      5      9      5
   2 │ b          10      0      3      3      7     10
   3 │ c           6      7      8      2      4      8
   4 │ a           7      1      2     10      9      5

I need to update (in various ways depending on the column) only some of the rows/columns of df1 based on the values in df2. The dataframe is so easy to look at and I love the succinct dot notation (ie, df1.col1[idx] vs df1[idx, "col1"] vs df1[idx, 2]), but it is far too slow.

So I tried out AxisArrays, and it allows me to index by name while being ~10x faster. However, there was still no dot notation.

Then I tried StaticArrays, and it was somewhat faster than an AxisArray (afaict, I still needed to use a mutable MArray here), but doesn’t allow any indexing by name. Further, it seems I need to hardcode the dimensions (I can’t pass it as a function argument?)

Ie, compare updateSA vs updateSA2 functions below. The only difference is: MMatrix{sz1[1], sz1[2]}(mat1) vs MMatrix{8, 4}(mat1). The first option is slow as a dataframe.

Here’s the MWE (I’m not worried about the copy(mat1), that only mattered for the benchmark):

MWE

using AxisArrays, DataFrames, StaticArrays, Random, BenchmarkTools

function mat2df(mat, colnames, id)
    df = DataFrame(mat, colnames, copycols = false)
    insertcols!(df, 1, "id" => id, copycols = false)

    return df
end


function updateDF(mat1, id1, colnames1, mat2, id2, colnames2)

    dat1 = mat2df(mat1, colnames1, id1)
    dat2 = mat2df(mat2, colnames2, id2)

    idxr   = indexin(dat2.id, dat1.id)
    upcols = @view names(dat1)[[2; 4]]

    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += dat2[!, upcols[i]]
    end

    dat1.col2[idxr] += dat2.col2 .> 0
    dat1.col4[idxr] .= dat2.col4

    return(dat1)
end


function updateAA(mat1, id1, colnames1, mat2, id2, colnames2)

    dat1 = AxisArray(mat1, Axis{:row}(eachindex(id1)), Axis{:col}(colnames1))
    dat2 = AxisArray(mat2, Axis{:row}(eachindex(id2)), Axis{:col}(colnames2))

    idxr   = indexin(id2, id1)
    upcols = dat1.axes[2][[1; 3]]

    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += @view dat2[:, upcols[i]]
    end

    dat1[idxr, "col2"] += @views dat2[:, "col2"] .> 0
    dat1[idxr, "col4"] = @view dat2[:, "col4"]

    return(dat1)
end


function updateSA(mat1, id1, colnames1, mat2, id2, colnames2)
    sz1 = size(mat1)
    sz2 = size(mat2)
    dat1 = MMatrix{sz1[1], sz1[2]}(mat1)
    dat2 = SMatrix{sz2[1], sz2[2]}(mat2)

    idxr    = indexin(id2, id1)
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @views idxcol2[upcols1]

    for i in eachindex(upcols1)
        dat1[idxr, upcols1[i]] += dat2[:, upcols2[i]]
    end

    dat1[idxr, 2] += dat2[:, idxcol2[2]] .> 0
    dat1[idxr, 4] = dat2[:, idxcol2[4]]

    return dat1
end


function updateSA2(mat1, id1, colnames1, mat2, id2, colnames2)

    dat1 = MMatrix{8, 4}(mat1)
    dat2 = SMatrix{4, 6}(mat2)

    idxr    = indexin(id2, id1)
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @views idxcol2[upcols1]

    for i in eachindex(upcols1)
        dat1[idxr, upcols1[i]] += dat2[:, upcols2[i]]
    end

    dat1[idxr, 2] += dat2[:, idxcol2[2]] .> 0
    dat1[idxr, 4] = dat2[:, idxcol2[4]]

    return dat1
end



Random.seed!(1234)
mat1 = rand(0:100, 8, 4);
mat2 = rand(0:10, 4, 6);

id1 = string.(collect('a':'h'));
id2 = string.(collect('a':'d')[[4; 2; 3; 1]]);

colnames1 = "col" .* string.(1:4);
colnames2 = "col" .* string.(1:6)[[1:3; 6; 4:5]];


@btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
@btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);


res1 = updateDF(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res2 = updateAA(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res3 = updateSA(copy(mat1), id1, colnames1, mat2, id2, colnames2);
res4 = updateSA2(copy(mat1), id1, colnames1, mat2, id2, colnames2);
Matrix(res1[!, 2:end]) == res2 == res3 == res4

Benchmarks:

julia> @btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  8.031 μs (101 allocations: 7.12 KiB)

julia> @btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  1.813 μs (16 allocations: 1.73 KiB)

julia> @btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  8.441 μs (68 allocations: 4.70 KiB)

julia> @btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  1.647 μs (16 allocations: 2.17 KiB)

Is there a way to get index, string, and dot notation (like dataframe) with near the speed of StaticArray? Or is this tradeoff just inherent to having multiple indexing methods?
Is there some way to pass a variable for the dimensions of a StaticArray without slowing it down 10x?
Is MMatrix the right solution for the problem in the MWE?

Thanks for any help.

eldee · August 19, 2024, 9:37am

For reference, let me first put the timings I get for your MWE here:

julia> @btime updateDF(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  5.683 μs (101 allocations: 7.12 KiB)

julia> @btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  1.150 μs (16 allocations: 1.73 KiB)

julia> @btime updateSA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  5.683 μs (68 allocations: 4.70 KiB)

julia> @btime updateSA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  1.100 μs (16 allocations: 2.17 KiB)

I don’t see any theoretical reason why not. In the end this is just syntax. If you’re okay with dat1."col1" and type piracy, you could use

function Base.getproperty(am::AxisMatrix, s::String)
    return @view am[:, s]  # EDIT: added @view
end

function updateAA2(mat1, id1, colnames1, mat2, id2, colnames2)

    dat1 = AxisArray(mat1, Axis{:row}(eachindex(id1)), Axis{:col}(colnames1))
    dat2 = AxisArray(mat2, Axis{:row}(eachindex(id2)), Axis{:col}(colnames2))

    idxr   = indexin(id2, id1)
    upcols = dat1.axes[2][[1; 3]]

    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += @view dat2[:, upcols[i]]
    end

    dat1."col2"[idxr] += dat2."col2" .> 0
    dat1."col4"[idxr] = dat2."col4"

    return(dat1)
end

@btime updateAA2(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
#  1.390 μs (25 allocations: 2.16 KiB)
#  ( 1.200 μs (21 allocations: 2.30 KiB) without the `@view` in `getproperty` but yields incorrect results )

which has a ~~minimal~~ small performance impact. Be warned that type piracy should be avoided, though. If this is just for a personal project, and you’re fine with the potential (future) risks, then this should be okay. But if e.g. in a newer version AxisArrays would implement getproperty(..., ::String) itself, your code and AxisArrays itself would likely break. A safer option would be to create your own wrapper struct, but of course this requires more effort on your part.

[ EDIT: The AxisArray documentation states that “Indexing returns a view into the original data.”, but that does not actually seem to be the case. So we explicitly need to add views. ]

Below, I give some ways to get rid of the quotes, i.e. just write dat1.col2, but this gets a bit more complicated.

dat1.col2 without quotes

You could use the same approach of implementing getproperty, but with a Symbol to use dat1.col2 instead of dat1."col2".

getproperty, without metaprogramming

Without metaprogramming, I could not get to the same level of performance:

function Base.getproperty(am::AxisMatrix, symb::Symbol)
    if symb === :data
        return getfield(am, :data)
    elseif symb === :axes
        return getfield(am, :axes)
    else
        return @view am[:, string(symb)]   # EDIT: needs @view for the same reason as above
        # You could also first check if string(symb) in getfield(am, :axes)[2].val
        # and return getfield(am, symb) if not. This will be a bit slower, though.
    end
end

function updateAA3(mat1, id1, colnames1, mat2, id2, colnames2)
    dat1 = AxisArray(mat1, Axis{:row}(eachindex(id1)), Axis{:col}(colnames1))
    dat2 = AxisArray(mat2, Axis{:row}(eachindex(id2)), Axis{:col}(colnames2))

    idxr   = indexin(id2, id1)
    upcols = dat1.axes[2][[1; 3]]

    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += @view dat2[:, upcols[i]]
    end

    dat1.col2[idxr] += dat2.col2 .> 0
    dat1.col4[idxr] = dat2.col4

    return(dat1)
end

@btime updateAA3(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  5.367 μs (81 allocations: 3.79 KiB)

Note this is even more dangerous type piracy, as we are now explicitly relying on an AxisMatrix having precisely the fields data and axes. A slightly safer option would be to use

function Base.getproperty(am::AxisMatrix, symb::Symbol)
    str_symb = string(symb)    
    if str_symb in getfield(am, :axes)[2].val
        return @view am[:, str_symb]  # (EDIT: @view)
    else
        return getfield(am, symb) 
    end
end

@btime updateAA(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
#  16.700 μs (218 allocations: 7.36 KiB)

@btime updateAA3(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
#  19.900 μs (269 allocations: 8.76 KiB)

but this will slow down every getproperty call, also the normal ones (e.g. am.data).

The problem in the first approach seems to be that the conversion from a Symbol to a String takes too long.

getproperty, with metaprogramming

We can basically cache the string(:col1) etc. results. As long as you don’t have too many columns, and use them sufficiently often (to offset the longer compilation time), this should be fine. Though metaprogramming also comes with a bit of a warning. But in contrast to type piracy, there’s definitely a time and place for metaprogramming (e.g. AxisArrays relies on it).

function Base.getproperty(am::AxisMatrix, symb::Symbol)
    return Base.getproperty(am, Val(symb))
end

Base.getproperty(am::AxisMatrix, ::Val{:axes}) = getfield(am, :axes)
Base.getproperty(am::AxisMatrix, ::Val{:data}) = getfield(am, :data)

@generated function Base.getproperty(am::AxisMatrix, ::Val{symb}) where symb
    symb_str = string(symb)
    return :(view(am, :, $symb_str))  # (EDIT: view)
end

@btime updateAA3(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  1.340 μs (25 allocations: 2.16 KiB)

Using a macro

Since we basically just want to rewrite dat1.col2[4] to dat1.col2[4, "col2"], we can use a macro for this. In this manner, we won’t have to commit any type piracy.

function di!(ex)  # di from dot indexing
    if (ex.head === :ref
        && ex.args[1].head === :.
        && ex.args[1].args[2] isa QuoteNode
        && ex.args[1].args[2].value isa Symbol)
        # am.col1[4] -> am[4, "col1"]

        symbol_string = string(ex.args[1].args[2].value)
        matrix_symbol = ex.args[1].args[1]

        ex.args[1] = matrix_symbol
        push!(ex.args, symbol_string)

    elseif (ex.head === :.
        && ex.args[1] isa Symbol
        && ex.args[2] isa QuoteNode
        && ex.args[2].value isa Symbol)
        # am.col1 -> am[:, "col1"]

        symbol_string = string(ex.args[2].value)

        ex.head = :ref
        ex.args[2] = Symbol(":")
        push!(ex.args, symbol_string)
    end
end

macro di(ex)
    di!(ex)
    return esc(ex)
end

function updateAA4(mat1, id1, colnames1, mat2, id2, colnames2)

    dat1 = AxisArray(mat1, Axis{:row}(eachindex(id1)), Axis{:col}(colnames1))
    dat2 = AxisArray(mat2, Axis{:row}(eachindex(id2)), Axis{:col}(colnames2))

    idxr   = indexin(id2, id1)
    upcols = dat1.axes[2][[1; 3]]

    for i in eachindex(upcols)
        dat1[idxr, upcols[i]] += @view dat2[:, upcols[i]]
    end

    @di(dat1.col2[idxr]) += @di(dat2.col2) .> 0
    @di(dat1.col4[idxr]) = @di(dat2.col4)

    return(dat1)
end

@btime updateAA4(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
  1.080 μs (18 allocations: 1.92 KiB)

I restarted Julia before this @btime, i.e. the previous getproperty implementation no longer applies. You could probably write di! in a more elegant fashion (I’m not claiming to be an expert in any of this).

[ EDIT: no view is needed here, as we are just rewriting the code (no function returns involved). ]

Seeing as the point of a StaticArray is that the size is known to the compiler (i.e. before runtime), the short answer is no. If there are only a few possible sizes, it could make sense to use Val, though.

Because of the previous reason, I would again say no. If you don’t value the .col1 or [:, "col1"] indexing too much (like in updateSA; or if you are prepared to use something like a Dict to translate from column_name::String to column_index::Int), I’d just use a simple Matrix.

function updateM(mat1, id1, colnames1, mat2, id2, colnames2)  # (Note: better to write updateM!)
    idxr    = indexin(id2, id1)
    idxcol2 = indexin(colnames1, colnames2)
    upcols1 = [1; 3]
    upcols2 = @view idxcol2[upcols1]

    for i in eachindex(upcols1)
        mat1[idxr, upcols1[i]] += @view mat2[:, upcols2[i]]
    end

    mat1[idxr, 2] += @view(mat2[:, idxcol2[2]]) .> 0
    mat1[idxr, 4] = @view mat2[:, idxcol2[4]]

    return mat1
end

@btime updateM(copy($mat1), $id1, $colnames1, $mat2, $id2, $colnames2);
#  985.714 ns (20 allocations: 2.28 KiB)

[ EDIT: added @views for better performance. ]

Tetrakai · August 20, 2024, 4:47pm

Thanks, that is all very helpful.

I found it was possible to pass global constants as the dimensions to a StaticArray without losing performance. Maybe I could use the maximum number of rows (which is known at compile time) then pad my data with zeros or whatever. I am taking a subset of the rows anyway, so the above benchmark should apply.

Topic		Replies	Views
Unclear sentence in Performance Tips doc page New to Julia documentation	5	521	June 5, 2020
Slow compiling of many AxisArrays Performance	0	233	September 22, 2021
The fate of DimensionalArrays / AxisArrays in Julia, and which to actually use Specific Domains	7	3321	May 19, 2022
Performance difference between two code New to Julia	3	779	January 18, 2017
DataFrames Complexity Data dataframes	5	545	September 23, 2020

Performance vs notation tradeoff: DataFrame vs AxisArray vs StaticArray vs?

Related topics