How would I divide each row of a Julia DataFrame by that row's maximum and return a new DataFrame?

Hi,

I’ve been looking around and messing with map, eachrow and different things but I haven’t been able to figure it out.

Essentially, if I have an N x M dataframe, I want to return a new N x M dataframe in which each value is the corresponding value of the old dataframe divided by the maximum of the row it sits in.

For a one-row dataframe of [2, 4, 6] it should return [0.33, 0.67, 1].

In my use case, though, this has to be applied to a dataframe with many rows.

Probably not the most efficient solution:

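using DataFrames

df = DataFrame(a = rand(1:10, 3), b = rand(1:10, 3), c = rand(1:10, 3))
# empty Float64 dataframe with the same columns (the old DataFrame(Float64, 0, 3) constructor was removed in later DataFrames.jl releases)
dfn = DataFrame(a = Float64[], b = Float64[], c = Float64[])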
for r in eachrow(df)
    m = collect(r) ./ maximum(r)
    push!(dfn, m)
end

Have you tried

df ./ maximum.(eachrow(df))
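A quick check of this one-liner on the example from the question, as a minimal sketch assuming DataFrames.jl 1.x. The column names and the second row are made up for illustration:

```julia
using DataFrames

# First row holds the values from the question; the second row is illustrative.
df = DataFrame(a = [2, 1], b = [4, 5], c = [6, 10])

# One maximum per row.
rowmax = maximum.(eachrow(df))

# Broadcasting the DataFrame against a vector of length nrow(df)
# divides every row by its own maximum and returns a new DataFrame.
dfn = df ./ rowmax
```

The original `df` is left untouched, and the columns of `dfn` come out as Float64 because of the division.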

Maybe it is possible to use a Matrix instead of a DataFrame? A rectangular table filled with values of the same type is, well, a matrix.

Yeah, eachrow on the dataframe is taking too long, damn. How would I do this with a Matrix? There isn’t an eachrow method.

There is an eachrow function for matrices. What version of Julia are you using?
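For reference, eachrow for arrays was added in Julia 1.1, so it is missing on older versions. A minimal sketch of the matrix variant, with illustrative variable names and values:

```julia
m = [2.0 4.0 6.0;
     1.0 5.0 10.0]

# eachrow(m) iterates over views of the rows (needs Julia 1.1 or later).
rowmax = [maximum(row) for row in eachrow(m)]

# A 2×3 matrix broadcast against a length-2 vector divides row i by rowmax[i].
normalized = m ./ rowmax
```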

The eachrow version seems slow. Try the second version below:

julia> foo(df) = df ./ maximum.(eachrow(df));

julia> bar(df) = df ./ [maximum(df[i,:]) for i in 1:size(df,1)];

Performance test:

julia> df = DataFrame([Symbol("c$i") => rand(1000) for i in 1:100]...);

julia> size(df)
(1000, 100)

julia> @btime foo($df);
  5.139 s (55639946 allocations: 1003.91 MiB)

julia> @btime bar($df);
  41.489 ms (555946 allocations: 10.83 MiB)

I suppose it can be golf coded, but generally it can be something like this

m = rand(1000, 100)
function baz!(m)
    for i in axes(m, 1)
        @views m[i, :] .= m[i, :] ./ maximum(m[i, :])
    end
end

@btime baz!($m) # 386.500 μs (3000 allocations: 140.63 KiB)

function fasterbaz!(m)
    m ./= maximum(m; dims = 2)
end
@btime fasterbaz!($m);
  181.505 μs (22 allocations: 8.52 KiB)

Interestingly, my non-allocating version is slower, probably because it’s accessing m’s memory in non-optimal order:

function fasterbaz2!(m)
    ncols = size(m, 2)
    for i in axes(m, 1)
        maxi = maximum(m[i, j] for j in 1:ncols)
        for j in 1:ncols
            m[i, j] /= maxi
        end
    end
    m
end
@btime fasterbaz2!($m);
  338.469 μs (0 allocations: 0 bytes)

It can be written as

function baa!(m)
    for i in axes(m, 1)
        mval = -Inf
        for j in axes(m, 2)
            mval = mval < m[i, j] ? m[i, j] : mval
        end
        for j in axes(m, 2)
            m[i, j] /= mval
        end
    end
end

@btime baa!($m)   # 202.932 μs (0 allocations: 0 bytes)

which is still slower than the allocating version.

But you gave me an idea


function baa2!(m)
    maxi = Vector{Float64}(undef, size(m, 1))
    @inbounds for i in axes(m, 1)
        mval = -Inf
        for j in axes(m, 2)
            mval = mval < m[i, j] ? m[i, j] : mval
        end
        maxi[i] = mval
    end
    @inbounds for j in axes(m, 2)
        for i in axes(m, 1)
            m[i, j] /= maxi[i]
        end
    end
end

@btime baa2!($m)  # 123.239 μs (1 allocation: 7.94 KiB)

I got one more, although it’s getting a bit ridiculous syntax-wise :)

function fasterbaz3!(m)
    nrows, ncols = size(m)

    maximums = m[:, 1] # copying the first col saves one col in the first iteration haha

    # inner loop runs over rows (down each column), which matches Julia's column-major memory layout
    @inbounds for j in 2:ncols, i in 1:nrows
        maximums[i] = max(maximums[i], m[i, j])
    end

    @inbounds for j in 1:ncols, i in 1:nrows
        m[i, j] /= maximums[i]
    end
end
@btime fasterbaz3!($m);
  113.390 μs (1 allocation: 7.94 KiB)

haha nice exactly the same moment

ok ok very last one! let’s use the fact that multiplications are faster than divisions…

function fasterbaz4!(m)
    nrows, ncols = size(m)

    maximums = m[:, 1]

    @inbounds for j in 2:ncols, i in 1:nrows
        maximums[i] = max(maximums[i], m[i, j])
    end

    # replace each maximum with its reciprocal, ready for the multiplication below
    maximums .= 1 ./ maximums

    @inbounds for j in 1:ncols, i in 1:nrows
        m[i, j] *= maximums[i]
    end
end
@btime fasterbaz4!($m);
  79.004 μs (1 allocation: 7.94 KiB)
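One caveat with the reciprocal trick: multiplying by a precomputed 1/max is not guaranteed to be bit-identical to dividing by max, so it is safer to compare the two with ≈ than with ==. A small sketch using broadcasting rather than the hand-written loops above (variable names are illustrative):

```julia
m = rand(1000, 100)

# Row-wise maxima as a 1000×1 matrix.
rowmax = maximum(m; dims = 2)

by_division   = m ./ rowmax
by_reciprocal = m .* (1 ./ rowmax)

# Equal up to floating-point rounding.
same = by_division ≈ by_reciprocal
```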

This is so so so cool!!!

And now my turn


function baa3!(m)
    maxi = m[:, 1]
    ncol = size(m, 2)
    @inbounds for j in 2:ncol
        for i in axes(m, 1)
            maxi[i] = maxi[i] < m[i, j] ? m[i, j] : maxi[i]
        end
    end

    maxi .= 1 ./ maxi
    @inbounds for j in axes(m, 2)
        for i in axes(m, 1)
            m[i, j] *= maxi[i]
        end
    end
end

@btime baa3!($m) # 34.706 μs (1 allocation: 7.94 KiB)

Try changing ? : to ifelse
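A sketch of what that suggestion might look like when applied to baa3!. The name baa3_ifelse! is made up here, and no timing is claimed for it:

```julia
function baa3_ifelse!(m)
    maxi = m[:, 1]
    @inbounds for j in 2:size(m, 2)
        for i in axes(m, 1)
            # ifelse is an ordinary function: both values are evaluated and one
            # is selected, which may make the loop easier to vectorize.
            maxi[i] = ifelse(maxi[i] < m[i, j], m[i, j], maxi[i])
        end
    end

    maxi .= 1 ./ maxi
    @inbounds for j in axes(m, 2)
        for i in axes(m, 1)
            m[i, j] *= maxi[i]
        end
    end
    return m
end

m = [2.0 4.0 6.0;
     1.0 5.0 10.0]
baa3_ifelse!(m)
```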

Wow! I’m shocked at that performance difference. There was definitely going to be a penalty for iterating on a DataFrame but… wow.

Something is going wrong with that maximum broadcast.

julia> foo2(df) = df ./ [maximum(row) for row in eachrow(df)]
foo2 (generic function with 1 method)

julia> @btime foo2($df)
  29.195 ms (554947 allocations: 10.81 MiB)

That being said, it’s a lot better with Tables.rows instead of eachrow.

julia> foo2(df) = df ./ [maximum(row) for row in Tables.rows(df)]
foo2 (generic function with 1 method)

julia> @btime foo2($df)
  12.237 ms (9063 allocations: 985.81 KiB)
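To tie this back to the original question, the matrix approach can also return a new dataframe. A minimal sketch, assuming DataFrames.jl 1.x and that every column is numeric; normalize_rows is a hypothetical helper name, not something from DataFrames:

```julia
using DataFrames

# Hypothetical helper: returns a new DataFrame and leaves `df` untouched.
function normalize_rows(df::AbstractDataFrame)
    m = Matrix{Float64}(df)         # copies the data into a Float64 matrix
    m ./= maximum(m; dims = 2)      # divide each row by its maximum, in place on the copy
    return DataFrame(m, names(df))  # rebuild with the original column names
end

df = DataFrame(a = [2, 1], b = [4, 5], c = [6, 10])
dfn = normalize_rows(df)
```

The conversion costs one copy of the data, but the row-wise work then runs on a plain matrix rather than on dataframe rows.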