How to speed up this DataFrame operation

I have a DataFrame and I would like to store the data in an Array. This constructs the DataFrame:

using DataFrames
using DataFramesMeta
using StatsBase

n = 9^2*2^2*6^2*42

const df = crossjoin(
                DataFrame(am=1:9),
                DataFrame(af=1:9),
                DataFrame(em=1:2),
                DataFrame(ef=1:2),
                DataFrame(km=1:6),
                DataFrame(kf=1:6),
                DataFrame(t=1:42)
                )
insertcols!(df, :prob => rand(n))
insertcols!(df, :w => 1)
for i in 1:1000
    push!(df, (rand(1:9), rand(1:9), rand(1:2), rand(1:2), rand(1:6), rand(1:6), rand(1:42), rand(), sample(0:1)))
end

This stores the data in an Array:

const Pdata = Array{Float64}(undef, (9, 2, 9, 2, 6, 6, 42))
# this loop takes a long time
for idx in CartesianIndices(Pdata)
    am, em, af, ef, km, kf, t = Tuple(idx)
    df1 = @where(df, :t .== t, :am .== am, :af .== af, :em .== em, :ef .== ef, :km .== km, :kf .== kf)
    if size(df1, 1) == 1 && df1[1, :w] > 0
        Pdata[idx] = df[1, :prob]
    elseif size(df1, 1) == 0 || df1[1, :w] == 0
        Pdata[idx] = 0.0
    end
end

Is there a way to speed up that loop? or is there another way to achieve the “transfer” of data from a DataFrame to an Array?

I think you might be looking for a groupby/combine operation.

can you provide an example?

In general

  1. Never write performance-critical code in global scope. See the performance tips here.
  2. When writing performance-critical code in DataFrames, write a function and input the column directly, so Julia can infer types. For example,
    elseif size(df1, 1) == 0 || df1[1, :w] == 0

Julia can’t optimize something like this because it can’t figure out what type df[1, :w] is.

Also, that @where call is very expensive. You are creating a new data frame almost 500,000 times!

It’s not clear exactly what you are doing in that loop, and I think you might have some errors. You could easily have sub-dataframes with size(df, 1) > 1, and definitely have no sub-datafames with size 0. So I modified the function to do something slightly different. Hopefully you can adapt it to your exact use case.

using DataFrames, Chain, DataFramesMeta
df = @chain begin
    Iterators.product(1:9, 1:2, 1:9, 1:2, 1:6, 1:6, 1:42)
    collect
    vec
    DataFrame
    rename!([:am, :em, :af, :ef, :km, :kf, :t])
end;

n = nrow(df)
insertcols!(df, :prob => rand(n))
insertcols!(df, :w => 1)
for i in 1:1000
	# There was a typo on this line, I fixed it. 
    push!(df, (rand(1:9), rand(1:2), rand(1:9), rand(1:2), rand(1:6), rand(1:6), rand(1:42), rand(), sample(0:1)))
end

function make_array(df, var_to_fill)
	Pdata = Array{Float64}(undef, (9, 2, 9, 2, 6, 6, 42))
    @eachrow df begin
    	Pdata[:am, :em, :af, :ef, :km, :kf, :t] = cols(var_to_fill)
    end

    return Pdata
end

@chain df begin 
	groupby(Not([:prob, :w]))
	@combine newprob =  
		if length(:prob) == 1 && first(:w) > 0
			return first(:prob)
		else 
			return 0.0
		end
	make_array(:newprob)
end;

EDIT: This would actually make an excellent benchmark against other data libraries.

4 Likes

what package is that @chain coming from?

Chain.jl, sorry. I added it to the top.

This seems to work. I have to study it more carefully, especially with that @chain magic. Thank you!

1 Like

One thing that confuses me is this line Pdata[:am, :em, :af, :ef, :km, :kf, :t], especially after you define Pdata as an Array. It seems you are accessing the array with symbols.

Yeah, that’s an easy thing to misinterpret. Check out the docs for @eachrow here. It’s metaprogramming. Just like in @where, the Symbols you see are actually columns, in @eachrow the Symbols are actually the values of columns at a given row.

2 Likes

Oh I see, that is fantastic metaprogramming-fu!

Sorry to ask another question. In this block:

@combine newprob =  
		if length(:prob) == 1 && first(:w) > 0
			return first(:prob)
		else 
			return 0.0
		end

I couldn’t find that use of @combine in the documentation [Introduction · DataFramesMeta Documentation]. Is that undocumented? Also confusing me is that you are using return as if there is a function somewhere, but it seems to be an if block. Can you please clarify that?

No it’s not undocumented. It’s just that I spread the expression out over multiple lines. I probably should have written

@combine newprob =  begin
	if length(:prob) == 1 && first(:w) > 0
		return first(:prob)
	else 
		return 0.0
	end
end

Also confusing me is that you are using return as if there is a function somewhere, but it seems to be an if block. Can you please clarify that?

Indeed! Good catch. I copied and pasted from a function I had written earlier before I realized I could just write it out. This an implementation detail that stems from the way DataFramesMeta works.

@combine(df, y = f(:x))

creates the expression

combine(df, :x => (x -> f(x)) => :y)

See how the f(:x) part gets transformed into an anonymous function? So if you have a return in there, it will behave just as a return in a function.

Maybe we should disallow that because it’s confusing. But it probably has it’s benefits, like allowing early returns for faster performance.

1 Like