Using closures for performance gain when handling datasets and making repeated function calls

Hello everyone,

I’m in a situation where I’m estimating parameters of a model using a dataset. This requires optimization, and my question is about how I can leverage closures to speed up the process.

I’d be curious to get feedback on whether the following workflow is efficient. The code is a cooked-up example to illustrate the different types of functions I need, the use of closures, and the repeated calls to a function for optimization purposes. The specific data operations shown here are arbitrary - my project isn’t just multiplying stuff by 10 and 20! - but the order of the operations needs to stay the same.

```julia
# df is the dataset, which in reality is a .csv file I have to import
df = collect(1:10)

function fn1(dataset)
    data_1 = dataset .* 10 # some arbitrary data manipulation

    function fn2()
        data_2 = data_1 .* 20 # another arbitrary data manipulation

        # data work that requires a loop
        function looping_fn()
            result_2 = repeat([0], 10)
            for i = 1:3
                result_2 = result_2 .+ data_2 .+ i
            end
            return result_2
        end

        result = looping_fn()
    end

    # mimics having to call fn2() many times during optimization
    final_result = [fn2() for _ in 1:40]
end

res = fn1(df) # this gives the final result
```

Here, df is the dataset I’ll use. fn1() is the function that does all the estimation, so calling fn1(df) gives me the estimates. Within fn1(), I need to call an objective function fn2(), which has to be minimized. Within fn2(), there is further algebraic manipulation, some of which involves loops.

The last step of fn1() involves optimizing the objective function, which requires calling fn2() many times; here I’ve illustrated this by calling fn2() 40 times.

Are there any suggestions for how I can alter the workflow to speed up the process? I try to use closures because, after reading some of the posts here, it seems they are a way of significantly speeding things up.

Something that bothers me is this. I put the loop inside looping_fn() because I saw that loops have their own scope, and to extract values from them I wrap them in functions (see: Error: variable inside a loop). But every time I call fn2(), looping_fn() has to be declared anew; doesn’t that slow things down? Since looping_fn() is a closure, I get an error when I declare it outside fn1(). (I can attach a code snippet showing this if it helps.) Am I missing something about loops here?
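For concreteness, here’s a minimal sketch (not my actual code) of the scoping behavior I mean: at the top level of a script, assigning to a variable inside a `for` loop creates a loop-local binding, whereas inside a function the loop can update an enclosing local freely.

```julia
# At the top level of a script, this triggers a soft-scope warning (or an
# error in older Julia versions), because `total` inside the loop is treated
# as a new local:
#
#   total = 0
#   for i = 1:3
#       total = total + i
#   end
#
# Wrapped in a function, the same loop just works:
function sum_loop()
    total = 0
    for i = 1:3
        total += i # updates the enclosing local without any keyword
    end
    return total
end

sum_loop() # returns 6
```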

It will be easier and clearer if you just define your functions outside, independently, and pass the data they act on as parameters.

It appears that you are using the closures to avoid global variables while defining functions without parameters. Just pass the parameters.
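For instance (a minimal sketch with hypothetical names), instead of a zero-argument closure that captures the data, make the data an argument:

```julia
# A plain function that receives the data it acts on:
objective(data) = sum(data .* 20)

data = collect(1:10) .* 10
vals = [objective(data) for _ in 1:3] # each call passes the data explicitly
```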


If you really just want nested scopes, let blocks are the way to go: Essentials · The Julia Language.
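For example, a minimal sketch:

```julia
# `let` introduces a new local scope without defining a function:
result = let
    x = 10       # `x` exists only inside this block
    y = x * 2
    x + y        # the value of the last expression is the block's value
end

result # 30
```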


I agree with leandromartinez98-san.

The following `Rev1` is almost the same as the original, but with a few improvements.

• Make the size of `result_2` the same as `data_2`.
• With the `@.` macro, add dots to all operators (including `=`) in the update of `result_2`.
```julia
# df is the dataset, which in reality is a .csv file I have to import
df = collect(1:10)

module Rev1

function fn1(dataset)
    data_1 = dataset .* 10 # some arbitrary data manipulation

    function fn2()
        data_2 = data_1 .* 20 # another arbitrary data manipulation

        # data work that requires a loop
        function looping_fn()
            result_2 = zero(data_2) # result_2 should be of same size as data_2
            for i = 1:3
                @. result_2 += data_2 + i # more dots for performance!
            end
            return result_2
        end

        result = looping_fn()
    end

    # mimics having to call fn2() many times during optimization
    final_result = [fn2() for _ in 1:40]
end

end

res1 = Rev1.fn1(df) # this gives the final result
```

In the following `Rev2`, the functions `fn2` and `looping_fn` are moved out of `fn1`.

```julia
module Rev2

function fn1(dataset)
    data_1 = dataset .* 10 # some arbitrary data manipulation

    # mimics having to call fn2() many times during optimization
    final_result = [fn2(data_1) for _ in 1:40]
end

function fn2(data_1)
    data_2 = data_1 .* 20 # another arbitrary data manipulation
    result = looping_fn(data_2)
end

# data work that requires a loop
function looping_fn(data_2)
    result_2 = zero(data_2) # result_2 should be of same size as data_2
    for i = 1:3
        @. result_2 += data_2 + i # more dots for performance!
    end
    return result_2
end

end

res2 = Rev2.fn1(df) # this gives the final result
```

The performance of the two is the same.

```julia
using BenchmarkTools
df = collect(1:10^3)
@show Rev1.fn1(df) == Rev2.fn1(df)
@btime Rev1.fn1($df)
@btime Rev2.fn1($df);
```
```
Rev1.fn1(df) == Rev2.fn1(df) = true
  52.200 μs (82 allocations: 643.33 KiB)
  52.200 μs (82 allocations: 643.33 KiB)
```

I prefer `Rev2` to `Rev1`, because `Rev2` is simply more readable. I think it is often better to avoid devices that are not strictly necessary and to write the code so that all the information a function needs is passed as arguments, which results in more readable code.

When debugging, I often want to run each function (including `fn2` and `looping_fn`) individually for testing. `Rev2` is also better than `Rev1` in this respect.

I’m not sure of the actual details in this example, but reducing memory allocation by pre-allocating buffers can greatly improve performance in many cases. (See Performance Tips · The Julia Language.)
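As a sketch of what that could look like here (assuming a buffer can be reused across the repeated calls; the `looping_fn!` name follows the Julia convention for mutating functions and is hypothetical):

```julia
# Allocate the output buffer once and fill it in place on every call:
function looping_fn!(result_2, data_2)
    fill!(result_2, zero(eltype(result_2)))
    for i = 1:3
        @. result_2 += data_2 + i # no new array allocated per call
    end
    return result_2
end

data_2 = collect(1:10) .* 200
buffer = zero(data_2)        # one allocation, outside the hot loop
for _ in 1:40                # repeated calls reuse the same memory
    looping_fn!(buffer, data_2)
end
```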
