How would I divide each row of a Julia dataframe by that row's maximum and return a new dataframe?

Hi,

I’ve been looking around and messing with map, eachrow and different things but I haven’t been able to figure it out.

Essentially, if I have an N x M dataframe, I want to return a new N x M dataframe in which each value is the corresponding value of the old dataframe divided by the maximum of the row it sits in.

For a one-row dataframe of [2, 4, 6] it should return [0.33, 0.67, 1].

In my use case, though, this has to be applied to a dataframe with many rows.

Probably not the most efficient solution:

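using DataFrames

df = DataFrame(a = rand(1:10, 3), b = rand(1:10, 3), c = rand(1:10, 3))
# create an empty Float64 dataframe with the same column names as df
dfn = DataFrame([name => Float64[] for name in names(df)])
for r in eachrow(df)
    m = collect(r) ./ maximum(r)  # divide the row by its own maximum
    push!(dfn, m)                 # append the normalized row
end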

Have you tried

df ./ maximum.(eachrow(df))

Maybe it is possible to use a Matrix instead of a DataFrame? A rectangular table filled with values of the same type is, well, a matrix.

Yeah, eachrow on the dataframe is taking too long, damn. How would I do this with a Matrix? There isn't an eachrow method.

There is an eachrow function for matrices. What version of Julia are you using?
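
For reference, a minimal sketch of the same idea on a plain matrix, using only Base Julia. The values are made up for illustration, and it assumes Julia 1.1 or later, where eachrow is available for matrices:

```julia
# Small example matrix; the first row is the [2, 4, 6] case from the question.
m = [2 4 6; 1 5 10]

# One maximum per row, collected into a vector of length size(m, 1).
rowmax = maximum.(eachrow(m))

# A length-N vector broadcasts against an N x M matrix along the rows,
# so every entry is divided by the maximum of its own row.
normalized = m ./ rowmax
```

The first row of normalized comes out as roughly [0.33, 0.67, 1.0], matching the example in the question.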

The eachrow version seems slow. Try the second version below:

julia> foo(df) = df ./ maximum.(eachrow(df));

julia> bar(df) = df ./ [maximum(df[i,:]) for i in 1:size(df,1)];

Performance test:

julia> df = DataFrame([Symbol("c$i") => rand(1000) for i in 1:100]...);

julia> size(df)
(1000, 100)

julia> @btime foo($df);
  5.139 s (55639946 allocations: 1003.91 MiB)

julia> @btime bar($df);
  41.489 ms (555946 allocations: 10.83 MiB)

I suppose it could be code-golfed, but in general it can look something like this:

m = rand(1000, 100)
function baz!(m)
    for i in axes(m, 1)
        @views m[i, :] .= m[i, :] ./ maximum(m[i, :])
    end
end

@btime baz!($m) # 386.500 μs (3000 allocations: 140.63 KiB)

function fasterbaz!(m)
    m ./= maximum(m; dims = 2)
end
@btime fasterbaz!($m);
  181.505 μs (22 allocations: 8.52 KiB)

Interestingly, my non-allocating version is slower, probably because it’s accessing m’s memory in non-optimal order:

function fasterbaz2!(m)
    ncols = size(m, 2)
    for i in axes(m, 1)
        maxi = maximum(m[i, j] for j in 1:ncols)
        for j in 1:ncols
            m[i, j] /= maxi
        end
    end
    m
end
@btime fasterbaz2!($m);
  338.469 μs (0 allocations: 0 bytes)
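
To make the memory-order point concrete, here is a small sketch. Julia stores matrices column by column, so the loop whose inner index is the row index touches adjacent memory. The two helper names are hypothetical and exist only to show the two traversal orders; no timings are claimed here.

```julia
# Column-major storage: flattening walks down each column in turn.
@assert vec([1 2; 3 4]) == [1, 3, 2, 4]

# Inner loop over j: consecutive accesses are size(m, 1) elements apart.
function sum_rowwise(m)
    s = zero(eltype(m))
    for i in axes(m, 1), j in axes(m, 2)  # j is the innermost loop
        s += m[i, j]
    end
    return s
end

# Inner loop over i: consecutive accesses are adjacent in memory.
function sum_colwise(m)
    s = zero(eltype(m))
    for j in axes(m, 2), i in axes(m, 1)  # i is the innermost loop
        s += m[i, j]
    end
    return s
end

m = rand(1000, 100)
sum_rowwise(m) ≈ sum_colwise(m)  # same result, different access pattern
```

Benchmarking both with @btime on your own machine should show whether the access order matters for your matrix sizes.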

It can be written as

function baa!(m)
    for i in axes(m, 1)
        mval = -Inf
        for j in axes(m, 2)
            mval = mval < m[i, j] ? m[i, j] : mval
        end
        for j in axes(m, 2)
            m[i, j] /= mval
        end
    end
end

@btime baa!($m)   # 202.932 μs (0 allocations: 0 bytes)

which is still slower than the allocating version.

But you gave me an idea:

function baa2!(m)
    maxi = Vector{Float64}(undef, size(m, 1))
    @inbounds for i in axes(m, 1)
        mval = -Inf
        for j in axes(m, 2)
            mval = mval < m[i, j] ? m[i, j] : mval
        end
        maxi[i] = mval
    end
    @inbounds for j in axes(m, 2)
        for i in axes(m, 1)
            m[i, j] /= maxi[i]
        end
    end
end

@btime baa2!($m)  # 123.239 μs (1 allocation: 7.94 KiB)

I've got one more, although it's getting a bit ridiculous syntax-wise 🙂

function fasterbaz3!(m)
    nrows, ncols = size(m)

    maximums = m[:, 1] # copying the first col saves one col in the first iteration haha

    # iterate down the rows first which matches julia's memory layout
    @inbounds for j in 2:ncols, i in 1:nrows
        maximums[i] = max(maximums[i], m[i, j])
    end

    @inbounds for j in 1:ncols, i in 1:nrows
        m[i, j] /= maximums[i]
    end
end
@btime fasterbaz3!($m);
  113.390 μs (1 allocation: 7.94 KiB)

Haha, nice, we posted at exactly the same moment.

ok ok very last one! let’s use the fact that multiplications are faster than divisions…

function fasterbaz4!(m)
    nrows, ncols = size(m)

    maximums = m[:, 1]

    @inbounds for j in 2:ncols, i in 1:nrows
        maximums[i] = max(maximums[i], m[i, j])
    end

    # now maximums are actually their inverse for multiplication below
    maximums .= 1 ./ maximums

    @inbounds for j in 1:ncols, i in 1:nrows
        m[i, j] *= maximums[i]
    end
end
@btime fasterbaz4!($m);
  79.004 μs (1 allocation: 7.94 KiB)
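
One caveat worth keeping in mind with the reciprocal trick: in floating point, multiplying by 1/x is not guaranteed to give bit-identical results to dividing by x, since the reciprocal is rounded first. A quick sketch of how one might check that the two agree within tolerance (the variable names are just for illustration):

```julia
m = rand(1000, 100)
rowmax = vec(maximum(m; dims = 2))  # one maximum per row

divided    = m ./ rowmax          # direct division
multiplied = m .* (1 ./ rowmax)   # multiply by the precomputed reciprocal

# The two agree to within floating-point tolerance; exact equality
# (divided == multiplied) is not guaranteed.
divided ≈ multiplied
```

For normalization like this the difference is usually irrelevant, but it can matter if you later compare results with ==.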

This is so so so cool!!!

And now my turn:

function baa3!(m)
    maxi = m[:, 1]
    ncol = size(m, 2)
    @inbounds for j in 2:ncol
        for i in axes(m, 1)
            maxi[i] = maxi[i] < m[i, j] ? m[i, j] : maxi[i]
        end
    end

    maxi .= 1 ./ maxi
    @inbounds for j in axes(m, 2)
        for i in axes(m, 1)
            m[i, j] *= maxi[i]
        end
    end
end

@btime baa3!($m) # 34.706 μs (1 allocation: 7.94 KiB)

Try changing ? : to ifelse
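
A sketch of what that change could look like, applied to baa3! from above. The name baa3_ifelse! is made up here, and whether it is actually faster is something to measure with @btime rather than assume. Unlike ? :, ifelse evaluates both branches, which is harmless here since both are plain array reads.

```julia
function baa3_ifelse!(m)
    maxi = m[:, 1]  # running row maxima, seeded with the first column
    @inbounds for j in 2:size(m, 2)
        for i in axes(m, 1)
            # ifelse is an ordinary function: both values are evaluated,
            # then one is selected, so there is no branch in the loop body
            maxi[i] = ifelse(maxi[i] < m[i, j], m[i, j], maxi[i])
        end
    end

    maxi .= 1 ./ maxi  # store reciprocals so the loop below can multiply
    @inbounds for j in axes(m, 2)
        for i in axes(m, 1)
            m[i, j] *= maxi[i]
        end
    end
    return m
end

a = [2.0 4.0 6.0; 1.0 5.0 10.0]
baa3_ifelse!(a)  # a is normalized in place
```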

Wow! I’m shocked at that performance difference. There was definitely going to be a penalty for iterating on a DataFrame but… wow.

Something is going wrong with that maximum broadcast.

julia> foo2(df) = df ./ [maximum(row) for row in eachrow(df)]
foo2 (generic function with 1 method)

julia> @btime foo2($df)
  29.195 ms (554947 allocations: 10.81 MiB)

That being said, it’s a lot better with Tables.rows instead of eachrow.

julia> foo2(df) = df ./ [maximum(row) for row in Tables.rows(df)]
foo2 (generic function with 1 method)

julia> @btime foo2($df)
  12.237 ms (9063 allocations: 985.81 KiB)
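
Putting the thread together for the original question, one possible approach is to go through a Matrix and rebuild the dataframe afterwards. This is only a sketch: normalize_rows is a hypothetical helper name, not part of DataFrames.jl, and it assumes every column is numeric.

```julia
using DataFrames

# Hypothetical helper: returns a new dataframe and leaves df untouched.
# Assumes all columns are numeric so that Matrix(df) is a numeric matrix.
function normalize_rows(df::AbstractDataFrame)
    m = Matrix(df)                    # copy the data into a plain matrix
    m = m ./ maximum(m; dims = 2)     # divide each row by its own maximum
    return DataFrame(m, names(df))    # rebuild with the original column names
end

df = DataFrame(a = [2, 1], b = [4, 5], c = [6, 10])
dfn = normalize_rows(df)
```

The first row of dfn is roughly (0.33, 0.67, 1.0), as in the example from the question. Any of the faster in-place matrix functions above could be swapped in for the broadcast line if the matrix has a Float64 element type.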