How to speed up this DataFrame operation

amrods · March 18, 2021, 11:55pm

I have a DataFrame and I would like to store the data in an Array. This constructs the DataFrame:

using DataFrames
using DataFramesMeta
using StatsBase

n = 9^2*2^2*6^2*42

const df = crossjoin(
                DataFrame(am=1:9),
                DataFrame(af=1:9),
                DataFrame(em=1:2),
                DataFrame(ef=1:2),
                DataFrame(km=1:6),
                DataFrame(kf=1:6),
                DataFrame(t=1:42)
                )
insertcols!(df, :prob => rand(n))
insertcols!(df, :w => 1)
for i in 1:1000
    push!(df, (rand(1:9), rand(1:9), rand(1:2), rand(1:2), rand(1:6), rand(1:6), rand(1:42), rand(), sample(0:1)))
end

This stores the data in an Array:

const Pdata = Array{Float64}(undef, (9, 2, 9, 2, 6, 6, 42))
# this loop takes a long time
for idx in CartesianIndices(Pdata)
    am, em, af, ef, km, kf, t = Tuple(idx)
    df1 = @where(df, :t .== t, :am .== am, :af .== af, :em .== em, :ef .== ef, :km .== km, :kf .== kf)
    if size(df1, 1) == 1 && df1[1, :w] > 0
        Pdata[idx] = df[1, :prob]
    elseif size(df1, 1) == 0 || df1[1, :w] == 0
        Pdata[idx] = 0.0
    end
end

Is there a way to speed up that loop? or is there another way to achieve the “transfer” of data from a DataFrame to an Array?

jzr · March 19, 2021, 3:14am

I think you might be looking for a groupby/combine operation.

amrods · March 19, 2021, 4:49am

can you provide an example?

pdeffebach · March 19, 2021, 5:18am

In general

Never write performance-critical code in global scope. See the performance tips here.
When writing performance-critical code in DataFrames, write a function and input the column directly, so Julia can infer types. For example,

    elseif size(df1, 1) == 0 || df1[1, :w] == 0

Julia can’t optimize something like this because it can’t figure out what type df[1, :w] is.

Also, that @where call is very expensive. You are creating a new data frame almost 500,000 times!

It’s not clear exactly what you are doing in that loop, and I think you might have some errors. You could easily have sub-dataframes with size(df, 1) > 1, and definitely have no sub-datafames with size 0. So I modified the function to do something slightly different. Hopefully you can adapt it to your exact use case.

using DataFrames, Chain, DataFramesMeta
df = @chain begin
    Iterators.product(1:9, 1:2, 1:9, 1:2, 1:6, 1:6, 1:42)
    collect
    vec
    DataFrame
    rename!([:am, :em, :af, :ef, :km, :kf, :t])
end;

n = nrow(df)
insertcols!(df, :prob => rand(n))
insertcols!(df, :w => 1)
for i in 1:1000
	# There was a typo on this line, I fixed it. 
    push!(df, (rand(1:9), rand(1:2), rand(1:9), rand(1:2), rand(1:6), rand(1:6), rand(1:42), rand(), sample(0:1)))
end

function make_array(df, var_to_fill)
	Pdata = Array{Float64}(undef, (9, 2, 9, 2, 6, 6, 42))
    @eachrow df begin
    	Pdata[:am, :em, :af, :ef, :km, :kf, :t] = cols(var_to_fill)
    end

    return Pdata
end

@chain df begin 
	groupby(Not([:prob, :w]))
	@combine newprob =  
		if length(:prob) == 1 && first(:w) > 0
			return first(:prob)
		else 
			return 0.0
		end
	make_array(:newprob)
end;

EDIT: This would actually make an excellent benchmark against other data libraries.

amrods · March 19, 2021, 5:28am

what package is that @chain coming from?

pdeffebach · March 19, 2021, 5:29am

Chain.jl, sorry. I added it to the top.

amrods · March 19, 2021, 5:35am

This seems to work. I have to study it more carefully, especially with that @chain magic. Thank you!

amrods · March 19, 2021, 5:38am

One thing that confuses me is this line Pdata[:am, :em, :af, :ef, :km, :kf, :t], especially after you define Pdata as an Array. It seems you are accessing the array with symbols.

pdeffebach · March 19, 2021, 5:41am

Yeah, that’s an easy thing to misinterpret. Check out the docs for @eachrow here. It’s metaprogramming. Just like in @where, the Symbols you see are actually columns, in @eachrow the Symbols are actually the values of columns at a given row.

amrods · March 19, 2021, 5:49am

Oh I see, that is fantastic metaprogramming-fu!

amrods · March 19, 2021, 6:05am

Sorry to ask another question. In this block:

@combine newprob =  
		if length(:prob) == 1 && first(:w) > 0
			return first(:prob)
		else 
			return 0.0
		end

I couldn’t find that use of @combine in the documentation [Introduction · DataFramesMeta Documentation]. Is that undocumented? Also confusing me is that you are using return as if there is a function somewhere, but it seems to be an if block. Can you please clarify that?

pdeffebach · March 19, 2021, 6:11am

No it’s not undocumented. It’s just that I spread the expression out over multiple lines. I probably should have written

@combine newprob =  begin
	if length(:prob) == 1 && first(:w) > 0
		return first(:prob)
	else 
		return 0.0
	end
end

Also confusing me is that you are using return as if there is a function somewhere, but it seems to be an if block. Can you please clarify that?

Indeed! Good catch. I copied and pasted from a function I had written earlier before I realized I could just write it out. This an implementation detail that stems from the way DataFramesMeta works.

@combine(df, y = f(:x))

creates the expression

combine(df, :x => (x -> f(x)) => :y)

See how the f(:x) part gets transformed into an anonymous function? So if you have a return in there, it will behave just as a return in a function.

Maybe we should disallow that because it’s confusing. But it probably has it’s benefits, like allowing early returns for faster performance.

Topic		Replies	Views
Help with performance tuning this dataframe aggregation Performance	10	722	September 23, 2018
DataFrame transformation is so slow, what am I doing wrong? Performance compilation , dataframes	17	315	May 19, 2024
Need for speed: looping over subdataframes to construct lags Performance question , dataframes	6	345	March 18, 2023
Help improving the speed of a DataFrames operation Performance performance , query , dataframes	37	1177	February 22, 2024
Can this be made faster? Performance dataframes	5	541	March 19, 2022

How to speed up this DataFrame operation

Related topics