Using closures for performance gain when handling datasets and making repeated function calls

Hello everyone,

I’m in a situation where I’m estimating parameters of a model using a dataset. This requires optimization, and my question is about how I can leverage closures to speed up the process.

I’d be curious to get feedback on whether the following workflow is efficient. The code is a cooked-up example to illustrate the different types of functions I need, the use of closures, and the repeated calls to a function for optimization purposes. The specific data operations shown here are arbitrary - my project isn’t just multiplying stuff by 10 and 20! - but the order of the operations needs to stay the same.

# df is the dataset, which in reality is a .csv file I have to import
df = collect(1:10)

function fn1(dataset)
    data_1 = dataset .* 10 # some arbitrary data manipulation

    function fn2()
        data_2 = data_1 .* 20 # another arbitrary data manipulation

        # data work that requires loop
        function looping_fn()
            result_2 = repeat([0], 10)
            for i = 1:3
                result_2 = result_2 .+ data_2 .+ i
            end
            return result_2
        end

        result = looping_fn()
    end

    # mimics having to call fn2() many times during optimization
    final_result = [fn2() for _ in 1:40]
end

res = fn1(df) # this gives the final result

Here, df is the dataset I’ll use. fn1() is the function that does all the estimation, so calling fn1(df) gives me the estimates. Within fn1(), I need to call an objective function fn2(), which has to be minimized. Within fn2(), there is further algebraic manipulation, part of which requires a loop.

The last step of fn1() optimizes the objective function, which requires calling fn2() many times; here I’ve illustrated that by calling fn2() 40 times.

Are there any suggestions for how I can alter the workflow to speed up the process? I’m trying to use closures because, from some of the posts here, they seem to be a way of significantly speeding things up.

Something that bothers me is this: I put the loop inside looping_fn() because I read that loops have their own scope, and that to extract values from them you have to wrap them in functions (Error: variable inside a loop). But every time I call fn2(), looping_fn() has to be declared anew; doesn’t that slow things down? Since looping_fn() is a closure, I get an error when I declare it outside fn1(). (I can attach a code snippet showing this if it helps.) Am I missing something about loops here?

It will be easier and clearer if you just define your functions outside, independently, and pass the data they act on as parameters.

It appears that you are using closures to avoid global variables while defining functions without parameters. Just pass the parameters.
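
A minimal sketch of the contrast, with a hypothetical objective() standing in for fn2():

# closure style: data_1 is captured implicitly from the enclosing scope
function make_objective(data_1)
    objective() = sum(data_1 .* 20)
    return objective
end

# parameter style: data_1 is passed explicitly at each call site
objective(data_1) = sum(data_1 .* 20)

In the second form, every input is visible where the function is called, and the function can be tested on its own.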

5 Likes

If you really just want nested scopes, let blocks are the way to go: Essentials · The Julia Language.
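
A minimal sketch of that idea, assuming the loop from the example only needs to produce result_2 (the let block replaces looping_fn entirely):

df = collect(1:10)
data_2 = df .* 10 .* 20

# a let block gives the loop a nested scope without declaring a closure
result = let
    result_2 = zero(data_2) # zero vector of the same size as data_2
    for i = 1:3
        result_2 = result_2 .+ data_2 .+ i
    end
    result_2 # a let block evaluates to its last expression
end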

1 Like

I agree with leandromartinez98-san.

The following Rev1 is almost the same as the original, but with a few improvements.

  • Make the size of result_2 the same as data_2.
  • With the @. macro, add dots to all operators (including =) in the update of result_2 (an expansion sketch follows the code below).
# df is the dataset, which in reality is a .csv file I have to import
df = collect(1:10)

module Rev1

function fn1(dataset)
    data_1 = dataset .* 10 # some arbitrary data manipulation

    function fn2()
        data_2 = data_1 .* 20 # another arbitrary data manipulation
    
        # data work that requires loop
        function looping_fn()
            result_2 = zero(data_2) # result_2 should be of same size as data_2
            for i = 1:3
                @. result_2 += data_2 + i # more dots for performance!
            end
            return result_2
        end

        result = looping_fn()
    end

    # mimics having to call fn2() many times during optimization
    final_result = [fn2() for _ in 1:40]
end

end

res1 = Rev1.fn1(df) # this gives the final result
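
For reference, the @. macro dots every call and the assignment, so the update line above is equivalent (up to association) to the fused, in-place broadcast:

result_2 .= result_2 .+ data_2 .+ i # writes in place, no new array allocated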

In the following Rev2, the functions fn2 and looping_fn are moved out of fn1.

module Rev2

function fn1(dataset)
    data_1 = dataset .* 10 # some arbitrary data manipulation
    
    # mimics having to call fn2() many times during optimization
    final_result = [fn2(data_1) for _ in 1:40] 
end

function fn2(data_1)
    data_2 = data_1 .* 20 # another arbitrary data manipulation
    result = looping_fn(data_2)
end 

# data work that requires loop
function looping_fn(data_2)
    result_2 = zero(data_2) # result_2 should be of same size as data_2
    for i = 1:3
        @. result_2 += data_2 + i # more dots for performance!
    end 
    return result_2
end

end

res2 = Rev2.fn1(df) # this gives the final result

The performance of the two is the same.

using BenchmarkTools
df = collect(1:10^3)
@show Rev1.fn1(df) == Rev2.fn1(df)
@btime Rev1.fn1($df)
@btime Rev2.fn1($df);
Rev1.fn1(df) == Rev2.fn1(df) = true
  52.200 μs (82 allocations: 643.33 KiB)
  52.200 μs (82 allocations: 643.33 KiB)

I prefer Rev2 to Rev1 because Rev2 is simply more readable. It is often better to avoid unnecessary contrivances and write the code so that all the information a function needs is passed as arguments; that results in more readable code.

When debugging, I often want to execute each function (including fn2 and looping_fn) individually for testing. Rev2 is also better than Rev1 in this respect.
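
For example, with Rev2 each piece can be called directly on a small input:

Rev2.looping_fn([10, 20]) # [36, 66]
Rev2.fn2([1, 2])          # [66, 126]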

I’m not sure of the actual details in this example, but reducing memory allocation by pre-allocation can greatly improve performance in many cases. (See Performance Tips · The Julia Language)
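
As a rough sketch of that idea (a hypothetical Rev3, not code from this thread): allocate the buffers once in fn1 and let an in-place looping_fn! write into them; the copy is only needed because each call’s result is kept here.

module Rev3

# data work that requires loop, writing into a preallocated buffer
function looping_fn!(result_2, data_2)
    fill!(result_2, 0) # reset the buffer instead of allocating a new one
    for i = 1:3
        @. result_2 += data_2 + i
    end
    return result_2
end

function fn2!(result_2, data_2, data_1)
    @. data_2 = data_1 * 20 # reuse the data_2 buffer as well
    return looping_fn!(result_2, data_2)
end

function fn1(dataset)
    data_1 = dataset .* 10
    data_2 = similar(data_1)   # allocated once...
    result_2 = similar(data_1)
    # ...and reused on every call; copy because the buffers are overwritten
    final_result = [copy(fn2!(result_2, data_2, data_1)) for _ in 1:40]
end

end

res3 = Rev3.fn1(df) # same values as Rev1 and Rev2

In a real objective function that returns a scalar, the copy would not be needed at all.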

3 Likes